For this week's homework, we're using a Hugging Face Trainer wrapper to finetune and evaluate a PyTorch model (because the Trainer supports a gradient_accumulation_steps argument to accumulate gradients across training batches). [Perhaps someday Keras' model.fit() will also support a gradient_accumulation_steps argument.]
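To see why gradient accumulation is useful, here is a minimal framework-free sketch of the idea (not the Trainer's internals; the gradient values and learning rate are made up for illustration): per-micro-batch gradients are averaged and the weight update is applied only once every accumulation_steps micro-batches, emulating a larger effective batch size.

```python
def sgd_with_accumulation(grads, accumulation_steps, lr=0.1, w=0.0):
    """One SGD update per `accumulation_steps` micro-batch gradients."""
    accum = 0.0
    for step, g in enumerate(grads, start=1):
        accum += g / accumulation_steps   # average gradients across micro-batches
        if step % accumulation_steps == 0:
            w -= lr * accum               # optimizer.step() equivalent
            accum = 0.0                   # optimizer.zero_grad() equivalent
    return w

# Four accumulated micro-batch gradients give the same update as one
# full-batch gradient equal to their mean:
assert sgd_with_accumulation([1.0, 2.0, 3.0, 2.0], 4) == sgd_with_accumulation([2.0], 1)
```

The payoff is memory: only one micro-batch of activations is resident at a time, while the optimizer sees the statistics of the larger batch.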

The console output for this week's Kaggle task includes print(model) output for the PyTorch model.

For my model, the console output can be found here: https://www.cross-entropy.net/ML530/imdb-console.txt

When I look at the "RobertaEmbeddings", I see the following entry for "word_embeddings":

(word_embeddings): Embedding(50265, 1024, padding_idx=1)

This means our vocabulary consists of 50,265 tokens, and each embedding consists of 1,024 features per token.

So the total number of parameters for "word_embeddings" is 51,471,360 (50,265 tokens * 1,024 features per token); and these parameters occupy 205,885,440 bytes of memory (51,471,360 parameters * 4 bytes per float32 parameter).
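A quick sanity check of that arithmetic in Python:

```python
vocab_size = 50265        # tokens in the RoBERTa vocabulary
d_model = 1024            # embedding features per token
bytes_per_float32 = 4

n_parameters = vocab_size * d_model
memory_bytes = n_parameters * bytes_per_float32

assert n_parameters == 51_471_360
assert memory_bytes == 205_885_440
print(f"{n_parameters:,} parameters, {memory_bytes:,} bytes")
```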

For the model you choose ...

a) Which model did you choose?

b) What is the size of the vocabulary?

c) How many features are used for each token?

d) How many parameters are there for the word embeddings?

e) How much memory is consumed to store these parameters? [Assume float32 representation for the parameters.]
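If you'd rather compute (b) through (e) directly from the print(model) output, something like this hypothetical helper works; the regex assumes the standard PyTorch repr for nn.Embedding, as in the console line quoted above.

```python
import re

def embedding_stats(line):
    """Parse '(word_embeddings): Embedding(vocab, dim, ...)' from print(model) output."""
    vocab_size, d_model = map(int, re.search(r"Embedding\((\d+), (\d+)", line).groups())
    n_parameters = vocab_size * d_model
    return {
        "vocab_size": vocab_size,            # (b)
        "features_per_token": d_model,       # (c)
        "parameters": n_parameters,          # (d)
        "memory_bytes": n_parameters * 4,    # (e) assuming float32, 4 bytes each
    }

stats = embedding_stats("(word_embeddings): Embedding(50265, 1024, padding_idx=1)")
assert stats["parameters"] == 51_471_360
assert stats["memory_bytes"] == 205_885_440
```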

 

Below are descriptions for each of the 15 models, trained (finetuned for two epochs) and evaluated on an Nvidia A6000 GPU.

Accuracy ranged from 90.392% for google/electra-small-discriminator to 97.056% for microsoft/deberta-v3-large; and elapsed time ranged from 13m12.750s for google/electra-small-discriminator to 797m18.356s for microsoft/deberta-v3-large.

Note that DeBERTa performs additional matrix multiplications for projecting relative position embeddings (for queries and keys), used for deriving the attention matrix of a transformer block; see equation 4 of https://arxiv.org/abs/2006.03654.
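Paraphrasing equation 4 of that paper: the disentangled attention score between positions i and j sums content-to-content, content-to-position, and position-to-content terms, where Q^c and K^c are content projections, Q^r and K^r are relative-position projections, and delta(i, j) is the bucketed relative distance:

```latex
\tilde{A}_{i,j}
  = \underbrace{Q_i^{c} {K_j^{c}}^{\top}}_{\text{content-to-content}}
  + \underbrace{Q_i^{c} {K_{\delta(i,j)}^{r}}^{\top}}_{\text{content-to-position}}
  + \underbrace{K_j^{c} {Q_{\delta(j,i)}^{r}}^{\top}}_{\text{position-to-content}}
```

The two extra terms are the additional matrix multiplications mentioned above, which is part of why the DeBERTa rows below have much larger elapsed times.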

model_name                          seq_len  download_size  val_accuracy  val_rank  vocab_size  d_model  d_ff  layers   parameters  elapsed_time
distilbert-base-cased                   512           263M       0.91248        14       28996      768  3072       6   65,783,042    16m47.330s
bert-base-cased                         512           436M       0.92608        12       28996      768  3072      12  108,311,810     32m6.888s
bert-large-cased                        512          1.34G       0.94632         9       28996     1024  4096      24  333,581,314    95m13.883s
distilbert-base-uncased                 512           268M       0.92216        13       30522      768  3072       6   66,955,010    16m51.575s
bert-base-uncased                       512           440M       0.93224        11       30522      768  3072      12  109,483,778     32m7.832s
bert-large-uncased                      512          1.34G       0.94784         8       30522     1024  4096      24  335,143,938     95m7.914s
distilroberta-base                      512           331M       0.93488        10       50265      768  3072       6   82,119,938    17m14.605s
roberta-base                            512           501M       0.95104         6       50265      768  3072      12  124,647,170    33m11.707s
roberta-large                           512          1.43G       0.96560         4       50265     1024  4096      24  355,361,794    95m53.424s
google/electra-small-discriminator      512          54.2M       0.90392        15       30522      256  1024      12   13,549,314    13m12.750s
google/electra-base-discriminator       512           440M       0.95376         5       30522      768  3072      12  109,483,778    32m10.185s
google/electra-large-discriminator      512          1.34G       0.96648         2       30522     1024  4096      24  335,143,938    95m53.773s
microsoft/deberta-v3-small             1536           286M       0.94952         7      128100      768  3072       6  141,896,450   166m33.176s
microsoft/deberta-v3-base              1536           371M       0.96616         3      128100      768  3072      12  184,423,682   314m32.645s
microsoft/deberta-v3-large             1536           874M       0.97056         1      128100     1024  4096      24  435,063,810   797m18.356s

 

If you'd like to print out parameter names and sizes, you can use something like this ...

    # iterate over all trainable parameter tensors and show their shapes
    for name, parameters in model.named_parameters():
        print(f"{name}: {list(parameters.size())}")