For this week's homework, we're using a Hugging Face Trainer wrapper to finetune and evaluate a PyTorch model (because the PyTorch-based Trainer supports a gradient_accumulation_steps argument to accumulate gradients across training batches). [Perhaps someday Keras' model.fit() will also support a gradient_accumulation_steps argument.]
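For reference, here is a minimal sketch of how that argument might be wired up; the output_dir, batch size, and accumulation step count below are placeholder values, and model, train_dataset, and eval_dataset are assumed to already be defined.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="imdb-finetune",        # placeholder output directory
    num_train_epochs=2,
    per_device_train_batch_size=8,     # placeholder per-device batch size
    gradient_accumulation_steps=4,     # accumulate gradients over 4 batches (effective batch size 32)
)
trainer = Trainer(
    model=model,                       # assumes model, train_dataset, and eval_dataset already exist
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()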
The console output for this week's Kaggle task includes print(model) output for the PyTorch model.
For my model, the console output can be found here: https://www.cross-entropy.net/ML530/imdb-console.txt
When I look at the "RobertaEmbeddings", I see the following entry for "word_embeddings":
(word_embeddings): Embedding(50265, 1024, padding_idx=1)
This means our vocabulary consists of 50,265 tokens, and each embedding consists of 1,024 features per token.
So the total number of parameters for "word_embeddings" is 51,471,360 (50,265 tokens * 1,024 features per token); and these parameters occupy 205,885,440 bytes of memory (51,471,360 parameters * 4 bytes per float32 parameter).
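As a sanity check, you can pull those numbers straight out of the model object. Here's a minimal sketch, assuming the console output above comes from roberta-large (which matches the 50,265 x 1,024 embedding shown):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)
embedding = model.base_model.embeddings.word_embeddings                # Embedding(50265, 1024, padding_idx=1)
parameter_count = embedding.num_embeddings * embedding.embedding_dim   # 50265 * 1024 = 51,471,360
memory_bytes = parameter_count * 4                                     # float32 = 4 bytes per parameter
print(parameter_count, memory_bytes)                                   # 51471360 205885440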
For the model you choose ...
a) Which model did you choose?
b) What is the size of the vocabulary?
c) How many features are used for each token?
d) How many parameters are there for the word embeddings?
e) How much memory is consumed to store these parameters? [Assume float32 representation for the parameters.]
Below are descriptions for each of the 15 models, trained (finetuned for two epochs) and evaluated on an Nvidia A6000 GPU.
Accuracy ranged from 90.392% for google/electra-small-discriminator to 97.056% for microsoft/deberta-v3-large; and elapsed time ranged from 13m12.750s for google/electra-small-discriminator to 797m18.356s for microsoft/deberta-v3-large.
Note that DeBERTa performs additional matrix multiplications for projecting relative position embeddings (for queries and keys), used for deriving the attention matrix of a transformer block; see equation 4 of https://arxiv.org/abs/2006.03654.
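For concreteness, the disentangled attention score from that paper looks roughly like the following (written from memory, so defer to the paper for the exact form and indexing); the extra matrix multiplications are the Q^r = P W_{q,r} and K^r = P W_{k,r} projections of the relative position embeddings P:

\tilde{A}_{i,j} = Q_i^c (K_j^c)^\top + Q_i^c (K_{\delta(i,j)}^r)^\top + K_j^c (Q_{\delta(j,i)}^r)^\top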
model_name | seq_len | download_size | val_accuracy | val_rank | vocab_size | d_model | d_ff | layers | parameters | elapsed_time |
---|---|---|---|---|---|---|---|---|---|---|
distilbert-base-cased | 512 | 263M | 0.91248 | 14 | 28996 | 768 | 3072 | 6 | 65,783,042 | 16m47.330s |
bert-base-cased | 512 | 436M | 0.92608 | 12 | 28996 | 768 | 3072 | 12 | 108,311,810 | 32m6.888s |
bert-large-cased | 512 | 1.34G | 0.94632 | 9 | 28996 | 1024 | 4096 | 24 | 333,581,314 | 95m13.883s |
distilbert-base-uncased | 512 | 268M | 0.92216 | 13 | 30522 | 768 | 3072 | 6 | 66,955,010 | 16m51.575s |
bert-base-uncased | 512 | 440M | 0.93224 | 11 | 30522 | 768 | 3072 | 12 | 109,483,778 | 32m7.832s |
bert-large-uncased | 512 | 1.34G | 0.94784 | 8 | 30522 | 1024 | 4096 | 24 | 335,143,938 | 95m7.914s |
distilroberta-base | 512 | 331M | 0.93488 | 10 | 50265 | 768 | 3072 | 6 | 82,119,938 | 17m14.605s |
roberta-base | 512 | 501M | 0.95104 | 6 | 50265 | 768 | 3072 | 12 | 124,647,170 | 33m11.707s |
roberta-large | 512 | 1.43G | 0.96560 | 4 | 50265 | 1024 | 4096 | 24 | 355,361,794 | 95m53.424s |
google/electra-small-discriminator | 512 | 54.2M | 0.90392 | 15 | 30522 | 256 | 1024 | 12 | 13,549,314 | 13m12.750s |
google/electra-base-discriminator | 512 | 440M | 0.95376 | 5 | 30522 | 768 | 3072 | 12 | 109,483,778 | 32m10.185s |
google/electra-large-discriminator | 512 | 1.34G | 0.96648 | 2 | 30522 | 1024 | 4096 | 24 | 335,143,938 | 95m53.773s |
microsoft/deberta-v3-small | 1536 | 286M | 0.94952 | 7 | 128100 | 768 | 3072 | 6 | 141,896,450 | 166m33.176s |
microsoft/deberta-v3-base | 1536 | 371M | 0.96616 | 3 | 128100 | 768 | 3072 | 12 | 184,423,682 | 314m32.645s |
microsoft/deberta-v3-large | 1536 | 874M | 0.97056 | 1 | 128100 | 1024 | 4096 | 24 | 435,063,810 | 797m18.356s |
If you'd like to print out parameter names and sizes, you can use something like this ...
# print each parameter tensor's name and shape
for name, parameters in model.named_parameters():
    print(name + ": " + str(list(parameters.size())))
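And to reproduce the totals in the "parameters" column (which include the classification head), summing the element counts should work; a minimal sketch, assuming model is already loaded:

total_parameters = sum(parameters.numel() for parameters in model.parameters())
print(total_parameters)   # e.g. matches the 109,483,778 reported above for bert-base-uncased with a 2-label head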