For this week's homework, we're using a Hugging Face Trainer wrapper to finetune and evaluate a PyTorch model (because the PyTorch-based Trainer supports a gradient_accumulation_steps argument to accumulate gradients across training batches). [Perhaps someday Keras' model.fit() will also support a gradient_accumulation_steps argument.]
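For reference, here is a minimal sketch of how that argument might be wired up; the output_dir, batch size, and accumulation step count below are placeholder values, and model, train_dataset, and eval_dataset are assumed to already be defined.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="imdb-finetune",        # placeholder output directory
    num_train_epochs=2,
    per_device_train_batch_size=8,     # placeholder per-device batch size
    gradient_accumulation_steps=4,     # accumulate gradients over 4 batches (effective batch size 32)
)
trainer = Trainer(
    model=model,                       # assumes model, train_dataset, and eval_dataset already exist
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()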
The console output for this week's Kaggle task includes print(model) output for the PyTorch model.
For my model, the console output can be found here: https://www.cross-entropy.net/ML530/imdb-console.txt
When I look at the "RobertaEmbeddings", I see the following entry for "word_embeddings":
(word_embeddings): Embedding(50265, 1024, padding_idx=1)
This means our vocabulary consists of 50,265 tokens, and each embedding consists of 1,024 features per token.
So the total number of parameters for "word_embeddings" is 51,471,360 (50,265 tokens * 1,024 features per token); and these parameters occupy 205,885,440 bytes of memory (51,471,360 parameters * 4 bytes per float32 parameter).
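As a sanity check, you can pull those numbers straight out of the model object. Here's a minimal sketch, assuming the console output above comes from roberta-large (which matches the 50,265 x 1,024 embedding shown):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)
embedding = model.base_model.embeddings.word_embeddings                # Embedding(50265, 1024, padding_idx=1)
parameter_count = embedding.num_embeddings * embedding.embedding_dim   # 50265 * 1024 = 51,471,360
memory_bytes = parameter_count * 4                                     # float32 = 4 bytes per parameter
print(parameter_count, memory_bytes)                                   # 51471360 205885440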
For the model you choose ...
a) Which model did you choose?
b) What is the size of the vocabulary?
c) How many features are used for each token?
d) How many parameters are there for the word embeddings?
e) How much memory is consumed to store these parameters? [Assume float32 representation for the parameters.]
Below are descriptions for each of the 15 models, trained (finetuned for two epochs) and evaluated on an Nvidia A6000 GPU.
Accuracy ranged from 90.392% for google/electra-small-discriminator to 97.056% for microsoft/deberta-v3-large; and elapsed time ranged from 13m12.750s for google/electra-small-discriminator to 797m18.356s for microsoft/deberta-v3-large.
Note that DeBERTa performs additional matrix multiplications for projecting relative position embeddings (for queries and keys), used for deriving the attention matrix of a transformer block; see equation 4 of https://arxiv.org/abs/2006.03654.
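For concreteness, the disentangled attention score from that paper looks roughly like the following (written from memory, so defer to the paper for the exact form and indexing); the extra matrix multiplications are the Q^r = P W_{q,r} and K^r = P W_{k,r} projections of the relative position embeddings P:

\tilde{A}_{i,j} = Q_i^c (K_j^c)^\top + Q_i^c (K_{\delta(i,j)}^r)^\top + K_j^c (Q_{\delta(j,i)}^r)^\top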
model_name | seq_len | download_size | val_accuracy | val_rank | vocab_size | d_model | d_ff | layers | parameters | elapsed_time |
---|---|---|---|---|---|---|---|---|---|---|
distilbert-base-cased | 512 | 263M | 0.91248 | 14 | 28996 | 768 | 3072 | 6 | 65,783,042 | 16m47.330s |
bert-base-cased | 512 | 436M | 0.92608 | 12 | 28996 | 768 | 3072 | 12 | 108,311,810 | 32m6.888s |
bert-large-cased | 512 | 1.34G | 0.94632 | 9 | 28996 | 1024 | 4096 | 24 | 333,581,314 | 95m13.883s |
distilbert-base-uncased | 512 | 268M | 0.92216 | 13 | 30522 | 768 | 3072 | 6 | 66,955,010 | 16m51.575s |
bert-base-uncased | 512 | 440M | 0.93224 | 11 | 30522 | 768 | 3072 | 12 | 109,483,778 | 32m7.832s |
bert-large-uncased | 512 | 1.34G | 0.94784 | 8 | 30522 | 1024 | 4096 | 24 | 335,143,938 | 95m7.914s |
distilroberta-base | 512 | 331M | 0.93488 | 10 | 50265 | 768 | 3072 | 6 | 82,119,938 | 17m14.605s |
roberta-base | 512 | 501M | 0.95104 | 6 | 50265 | 768 | 3072 | 12 | 124,647,170 | 33m11.707s |
roberta-large | 512 | 1.43G | 0.96560 | 4 | 50265 | 1024 | 4096 | 24 | 355,361,794 | 95m53.424s |
google/electra-small-discriminator | 512 | 54.2M | 0.90392 | 15 | 30522 | 256 | 1024 | 12 | 13,549,314 | 13m12.750s |
google/electra-base-discriminator | 512 | 440M | 0.95376 | 5 | 30522 | 768 | 3072 | 12 | 109,483,778 | 32m10.185s |
google/electra-large-discriminator | 512 | 1.34G | 0.96648 | 2 | 30522 | 1024 | 4096 | 24 | 335,143,938 | 95m53.773s |
microsoft/deberta-v3-small | 1536 | 286M | 0.94952 | 7 | 128100 | 768 | 3072 | 6 | 141,896,450 | 166m33.176s |
microsoft/deberta-v3-base | 1536 | 371M | 0.96616 | 3 | 128100 | 768 | 3072 | 12 | 184,423,682 | 314m32.645s |
microsoft/deberta-v3-large | 1536 | 874M | 0.97056 | 1 | 128100 | 1024 | 4096 | 24 | 435,063,810 | 797m18.356s |
If you'd like to print out parameter names and sizes, you can use something like this ...
# print each parameter tensor's name and shape
for name, parameters in model.named_parameters():
    print(name + ": " + str(list(parameters.size())))
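And to reproduce the totals in the "parameters" column (which include the classification head), summing the element counts should work; a minimal sketch, assuming model is already loaded:

total_parameters = sum(parameters.numel() for parameters in model.parameters())
print(total_parameters)   # e.g. matches the 109,483,778 reported above for bert-base-uncased with a 2-label head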