dadebarr@dadebarr-lambda:~/imdb$ bash loop.sh
Downloading: 100% 29.0/29.0 [00:00<00:00, 18.9kB/s]
Downloading: 100% 411/411 [00:00<00:00, 269kB/s]
Downloading: 100% 213k/213k [00:00<00:00, 331kB/s]
Downloading: 100% 436k/436k [00:00<00:00, 1.06MB/s]
max_length for distilbert-base-cased: 512
Downloading: 100% 263M/263M [00:32<00:00, 8.10MB/s]
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 65783042
 50% 3125/6250 [06:30<06:32, 7.96it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.29711106419563293, 'eval_accuracy': 0.9008, 'eval_runtime': 58.06, 'eval_samples_per_second': 215.295, 'eval_steps_per_second': 107.647, 'epoch': 1.0}
{'loss': 0.3103, 'learning_rate': 0.0, 'epoch': 2.0}
100% 6250/6250 [13:59<00:00, 8.00it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.30013400316238403, 'eval_accuracy': 0.91248, 'eval_runtime': 58.034, 'eval_samples_per_second': 215.391, 'eval_steps_per_second': 107.696, 'epoch': 2.0}
100% 6250/6250 [14:57<00:00, 8.00it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 898.2418, 'train_samples_per_second': 55.664, 'train_steps_per_second': 6.958, 'train_loss': 0.3103233984375, 'epoch': 2.0}
100% 6250/6250 [14:58<00:00, 6.96it/s]
***** Running Prediction *****
  Num examples = 12500
  Batch size = 2
100% 6250/6250 [00:57<00:00, 108.10it/s]
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        [... blocks (1) through (5) repeat the structure of block (0) ...]
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
total parameters: 65783042

real    16m47.330s
user    17m35.278s
sys     0m14.145s
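Neither loop.sh nor the Python script it drives appears in this log, but the settings the Trainer prints (per-device batch size 2, gradient accumulation 4, 6250 optimization steps, per-epoch evaluation on 12,500 reviews, output directory ./output) are enough to sketch one plausible reconstruction. Everything below is an assumption rather than the author's actual code: the file name run_imdb.py, the use of the Hugging Face datasets library, the 12,500-review evaluation subset, and the exact TrainingArguments.

    # run_imdb.py -- hypothetical reconstruction, not the script actually used.
    # Fine-tunes one pretrained checkpoint on IMDB sentiment and reports accuracy,
    # mirroring the hyperparameters visible in the log above.
    import sys
    import numpy as np
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = sys.argv[1] if len(sys.argv) > 1 else "distilbert-base-cased"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"max_length for {model_name}: {tokenizer.model_max_length}")

    # IMDB ships 25,000 labeled training reviews and 25,000 test reviews; the
    # evaluation count of 12,500 in the log suggests half of the test split is used.
    raw = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True)

    train_ds = raw["train"].map(tokenize, batched=True)
    eval_ds = raw["test"].shuffle(seed=42).select(range(12500)).map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

    args = TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=2,   # "Instantaneous batch size per device = 2"
        per_device_eval_batch_size=2,    # "Batch size = 2" during evaluation and prediction
        gradient_accumulation_steps=4,   # effective train batch size of 8
        max_steps=6250,                  # overrides num_train_epochs, as the log warns
        evaluation_strategy="epoch",     # evaluations fire at steps 3125 and 6250
        logging_strategy="epoch",
        save_strategy="epoch",           # assumed; the log shows a save at checkpoint-6250
        save_total_limit=1,
    )

    trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                      eval_dataset=eval_ds, tokenizer=tokenizer,
                      compute_metrics=compute_metrics)
    trainer.train()
    trainer.predict(eval_ds)

    print(model)                     # produces the nested architecture dumps shown in this log
    print("total parameters:", sum(p.numel() for p in model.parameters()))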
Downloading: 100% 29.0/29.0 [00:00<00:00, 18.8kB/s]
Downloading: 100% 570/570 [00:00<00:00, 372kB/s]
Downloading: 100% 213k/213k [00:00<00:00, 855kB/s]
Downloading: 100% 436k/436k [00:00<00:00, 1.10MB/s]
max_length for bert-base-cased: 512
Downloading: 100% 436M/436M [00:53<00:00, 8.22MB/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 108311810
 50% 3125/6250 [12:35<12:37, 4.12it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.2788660526275635, 'eval_accuracy': 0.91696, 'eval_runtime': 114.2169, 'eval_samples_per_second': 109.441, 'eval_steps_per_second': 54.72, 'epoch': 1.0}
{'loss': 0.2836, 'learning_rate': 0.0, 'epoch': 2.0}
100% 6250/6250 [27:05<00:00, 4.14it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.29229456186294556, 'eval_accuracy': 0.92608, 'eval_runtime': 114.2024, 'eval_samples_per_second': 109.455, 'eval_steps_per_second': 54.727, 'epoch': 2.0}
100% 6250/6250 [29:00<00:00, 4.14it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1741.4847, 'train_samples_per_second': 28.711, 'train_steps_per_second': 3.589, 'train_loss': 0.28356998046875, 'epoch': 2.0}
100% 6250/6250 [29:01<00:00, 3.59it/s]
***** Running Prediction *****
  Num examples = 12500
  Batch size = 2
100% 6250/6250 [01:53<00:00, 54.87it/s]
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        [... layers (1) through (11) repeat the structure of layer (0) ...]
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
total parameters: 108311810

real    32m6.888s
user    31m41.469s
sys     0m32.220s
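Every run shares the same training header, and the step count follows directly from those settings; a quick sanity check in plain Python, with the values copied from the log:

    # Effective batch size and optimization steps implied by the training header.
    train_examples = 25000
    per_device_batch = 2      # "Instantaneous batch size per device"
    grad_accum_steps = 4      # "Gradient Accumulation steps"
    num_epochs = 2

    effective_batch = per_device_batch * grad_accum_steps   # 8    -> "Total train batch size"
    steps_per_epoch = train_examples // effective_batch      # 3125, where the epoch-1 evaluation runs
    total_steps = steps_per_epoch * num_epochs                # 6250 -> "Total optimization steps"
    print(effective_batch, steps_per_epoch, total_steps)      # 8 3125 6250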
Downloading: 100% 29.0/29.0 [00:00<00:00, 18.9kB/s]
Downloading: 100% 762/762 [00:00<00:00, 490kB/s]
Downloading: 100% 213k/213k [00:00<00:00, 796kB/s]
Downloading: 100% 436k/436k [00:00<00:00, 1.07MB/s]
max_length for bert-large-cased: 512
Downloading: 100% 1.34G/1.34G [02:45<00:00, 8.09MB/s]
Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 333581314
 50% 3125/6250 [37:25<37:30, 1.39it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.2440584897994995, 'eval_accuracy': 0.93728, 'eval_runtime': 344.3729, 'eval_samples_per_second': 36.298, 'eval_steps_per_second': 18.149, 'epoch': 1.0}
{'loss': 0.2483, 'learning_rate': 0.0, 'epoch': 2.0}
100% 6250/6250 [1:20:35<00:00, 1.39it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.24118176102638245, 'eval_accuracy': 0.94632, 'eval_runtime': 344.1469, 'eval_samples_per_second': 36.322, 'eval_steps_per_second': 18.161, 'epoch': 2.0}
100% 6250/6250 [1:26:19<00:00, 1.39it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 5184.0562, 'train_samples_per_second': 9.645, 'train_steps_per_second': 1.206, 'train_loss': 0.2482890234375, 'epoch': 2.0}
100% 6250/6250 [1:26:24<00:00, 1.21it/s]
***** Running Prediction *****
  Num examples = 12500
  Batch size = 2
100% 6250/6250 [05:44<00:00, 18.17it/s]
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=1024, out_features=4096, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=4096, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        [... layers (1) through (23) repeat the structure of layer (0) ...]
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=1024, out_features=1024, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=1024, out_features=2, bias=True)
)
total parameters: 333581314

real    95m13.883s
user    73m42.497s
sys     20m3.572s
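The "total parameters" figure printed after each architecture dump (which matches "Number of trainable parameters" in the training header, since nothing is frozen) can be reproduced with a one-line sum over the model's tensors. A small sketch; the helper name is an assumption:

    import torch
    from transformers import AutoModelForSequenceClassification

    def count_parameters(model: torch.nn.Module) -> int:
        # Sum of element counts across every parameter tensor. No layers are
        # frozen in these runs, so this equals the trainable-parameter count.
        return sum(p.numel() for p in model.parameters())

    # Usage sketch (bert-large-cased is the run shown just above):
    model = AutoModelForSequenceClassification.from_pretrained("bert-large-cased", num_labels=2)
    print(model)                                           # nested module dump, as in the log
    print("total parameters:", count_parameters(model))   # 333581314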
Downloading: 100% 28.0/28.0 [00:00<00:00, 34.1kB/s]
Downloading: 100% 483/483 [00:00<00:00, 607kB/s]
Downloading: 100% 232k/232k [00:00<00:00, 765kB/s]
Downloading: 100% 466k/466k [00:00<00:00, 1.16MB/s]
max_length for distilbert-base-uncased: 512
Downloading: 100% 268M/268M [00:34<00:00, 7.72MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 66955010
 50% 3125/6250 [06:31<06:32, 7.96it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.2894309461116791, 'eval_accuracy': 0.9092, 'eval_runtime': 58.1303, 'eval_samples_per_second': 215.034, 'eval_steps_per_second': 107.517, 'epoch': 1.0}
{'loss': 0.2879, 'learning_rate': 0.0, 'epoch': 2.0}
100% 6250/6250 [14:01<00:00, 7.98it/s]
***** Running Evaluation *****
  Num examples = 12500
  Batch size = 2
{'eval_loss': 0.27573397755622864, 'eval_accuracy': 0.92216, 'eval_runtime': 58.1263, 'eval_samples_per_second': 215.049, 'eval_steps_per_second': 107.524, 'epoch': 2.0}
100% 6250/6250 [14:59<00:00, 7.98it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 900.9175, 'train_samples_per_second': 55.499, 'train_steps_per_second': 6.937, 'train_loss': 0.28786982421875, 'epoch': 2.0}
100% 6250/6250 [15:00<00:00, 6.94it/s]
***** Running Prediction *****
  Num examples = 12500
  Batch size = 2
100% 6250/6250 [00:57<00:00, 107.82it/s]
DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        [... blocks (1) through (5) repeat the structure of block (0) ...]
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
total parameters: 66955010

real    16m51.575s
user    16m54.182s
sys     0m15.694s
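Each run leaves its final weights in ./output/checkpoint-6250, and since every run shares the same output directory, the checkpoint is apparently overwritten by the next model. A minimal sketch of reloading that checkpoint for inference; the review text is made up, and loading the tokenizer by model name rather than from the checkpoint directory is an assumption, since the log only shows config.json and pytorch_model.bin being written:

    from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

    checkpoint = "./output/checkpoint-6250"   # path from the "Saving model checkpoint" lines above
    # Assumption: the tokenizer was not saved into the checkpoint directory,
    # so reload it from the original model name (distilbert-base-uncased for the run just above).
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

    classify = pipeline("text-classification", model=model, tokenizer=tokenizer)
    print(classify("A surprisingly moving film with terrific performances."))
    # e.g. [{'label': 'LABEL_1', 'score': 0.99...}] -- labels stay generic unless id2label was set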
100%|███████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:01<00:00, 453kB/s] max_length for bert-base-uncased: 512 Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 440M/440M [00:54<00:00, 8.12MB/s] Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias'] - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. max_steps is given, it will override any value given in num_train_epochs ***** Running training ***** Num examples = 25000 Num Epochs = 2 Instantaneous batch size per device = 2 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 4 Total optimization steps = 6250 Number of trainable parameters = 109483778 50%|███████████████████████████████████████████████ | 3125/6250 [12:34<12:36, 4.13it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.28203877806663513, 'eval_accuracy': 0.92392, 'eval_runtime': 114.052, 'eval_samples_per_second': 109.599, 'eval_steps_per_second': 54.8, 'epoch': 1.0} {'loss': 0.2776, 'learning_rate': 0.0, 'epoch': 2.0} 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [27:03<00:00, 4.14it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.2618100643157959, 'eval_accuracy': 0.93224, 'eval_runtime': 114.1224, 'eval_samples_per_second': 109.531, 'eval_steps_per_second': 54.766, 'epoch': 2.0} 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [28:58<00:00, 4.14it/sSaving model checkpoint to ./output/checkpoint-6250 Configuration saved in ./output/checkpoint-6250/config.json Model weights saved in ./output/checkpoint-6250/pytorch_model.bin Training completed. 
Do not forget to share your model on huggingface.co/models =) {'train_runtime': 1739.4785, 'train_samples_per_second': 28.744, 'train_steps_per_second': 3.593, 'train_loss': 0.27759447265625, 'epoch': 2.0} 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [28:59<00:00, 3.59it/s] ***** Running Prediction ***** Num examples = 12500 Batch size = 2 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [01:53<00:00, 54.87it/s] BertForSequenceClassification( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 768, padding_idx=0) (position_embeddings): Embedding(512, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (1): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (2): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (3): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, 
bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (4): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (5): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (6): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (7): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): 
Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (8): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (9): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (10): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (11): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): 
Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) ) (dropout): Dropout(p=0.1, inplace=False) (classifier): Linear(in_features=768, out_features=2, bias=True) ) total parameters: 109483778 real 32m7.832s user 32m17.892s sys 0m33.239s Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 28.0/28.0 [00:00<00:00, 18.4kB/s] Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████| 571/571 [00:00<00:00, 374kB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████| 232k/232k [00:00<00:00, 605kB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 466k/466k [00:00<00:00, 1.14MB/s] max_length for bert-large-uncased: 512 Downloading: 100%|████████████████████████████████████████████████████████████████████████████████| 1.34G/1.34G [02:38<00:00, 8.47MB/s] Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias'] - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. max_steps is given, it will override any value given in num_train_epochs ***** Running training ***** Num examples = 25000 Num Epochs = 2 Instantaneous batch size per device = 2 Total train batch size (w. 
parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 4 Total optimization steps = 6250 Number of trainable parameters = 335143938 50%|███████████████████████████████████████████████ | 3125/6250 [37:25<37:26, 1.39it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.2191108614206314, 'eval_accuracy': 0.94288, 'eval_runtime': 344.3037, 'eval_samples_per_second': 36.305, 'eval_steps_per_second': 18.153, 'epoch': 1.0} {'loss': 0.2396, 'learning_rate': 0.0, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [1:20:35<00:00, 1.39it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.23103132843971252, 'eval_accuracy': 0.94784, 'eval_runtime': 344.0986, 'eval_samples_per_second': 36.327, 'eval_steps_per_second': 18.163, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [1:26:19<00:00, 1.39it/sSaving model checkpoint to ./output/checkpoint-6250 Configuration saved in ./output/checkpoint-6250/config.json Model weights saved in ./output/checkpoint-6250/pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) {'train_runtime': 5183.9722, 'train_samples_per_second': 9.645, 'train_steps_per_second': 1.206, 'train_loss': 0.23961296875, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [1:26:23<00:00, 1.21it/s] ***** Running Prediction ***** Num examples = 12500 Batch size = 2 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [05:43<00:00, 18.19it/s] BertForSequenceClassification( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(30522, 1024, padding_idx=0) (position_embeddings): Embedding(512, 1024) (token_type_embeddings): Embedding(2, 1024) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (1): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): 
Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (2): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (3): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (4): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (5): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, 
out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (6): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (7): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (8): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (9): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (10): BertLayer( 
(attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (11): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (12): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (13): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (14): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, 
out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (15): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (16): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (17): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (18): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( 
(dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (19): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (20): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (21): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (22): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, 
inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (23): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=1024, out_features=1024, bias=True) (activation): Tanh() ) ) (dropout): Dropout(p=0.1, inplace=False) (classifier): Linear(in_features=1024, out_features=2, bias=True) ) total parameters: 335143938 real 95m7.914s user 74m23.751s sys 20m4.157s Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████| 480/480 [00:00<00:00, 309kB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.46MB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.12MB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 1.44MB/s] max_length for distilroberta-base: 512 Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 331M/331M [00:40<00:00, 8.27MB/s] Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias'] - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
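[Editorial note: every run in this transcript uses the same Trainer configuration: inputs truncated to the tokenizer's max_length of 512, a per-device batch size of 2 with 4 gradient-accumulation steps (effective batch size 8), 6250 optimization steps over 2 epochs of the 25,000 training reviews, an evaluation pass over 12,500 held-out reviews at the end of each epoch, then a prediction pass, a dump of the model, and its total parameter count. The sketch below is a minimal reconstruction of such a driver, not the actual script that produced this log: the run helper name, the use of datasets.load_dataset("imdb"), and the choice of a shuffled 12,500-review slice of the test set as the evaluation split are assumptions; only the hyperparameters visible in the log are taken from it.]

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def run(checkpoint):
    # Tokenizer and its max_length, as reported above ("max_length for <checkpoint>: 512").
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    print(f"max_length for {checkpoint}: {tokenizer.model_max_length}")

    # IMDB: 25,000 training reviews; an assumed 12,500-review slice of the test set for evaluation.
    raw = load_dataset("imdb")
    def tok(batch):
        return tokenizer(batch["text"], truncation=True)
    train_ds = raw["train"].map(tok, batched=True)
    eval_ds = raw["test"].shuffle(seed=42).select(range(12500)).map(tok, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    def accuracy(eval_pred):
        preds = np.argmax(eval_pred.predictions, axis=-1)
        return {"accuracy": float((preds == eval_pred.label_ids).mean())}

    args = TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=2,   # "Instantaneous batch size per device = 2"
        per_device_eval_batch_size=2,    # "Batch size = 2" during evaluation and prediction
        gradient_accumulation_steps=4,   # "Gradient Accumulation steps = 4" -> total batch size 8
        num_train_epochs=2,
        max_steps=6250,                  # "max_steps is given, it will override any value given in num_train_epochs"
        evaluation_strategy="epoch",
        save_steps=6250,                 # single checkpoint at ./output/checkpoint-6250
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds,
                      tokenizer=tokenizer, compute_metrics=accuracy)
    trainer.train()
    trainer.predict(eval_ds)

    # Architecture dump and parameter count, matching the "total parameters: ..." lines above.
    print(model)
    print("total parameters:", sum(p.numel() for p in model.parameters()))

[Hyperparameters not visible in the log (learning rate, scheduler, seed) are left at Trainer defaults in this sketch; the logged learning_rate of 0.0 at the final step is consistent with the default linear-decay schedule reaching zero at max_steps.]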
max_steps is given, it will override any value given in num_train_epochs ***** Running training ***** Num examples = 25000 Num Epochs = 2 Instantaneous batch size per device = 2 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 4 Total optimization steps = 6250 Number of trainable parameters = 82119938 50%|███████████████████████████████████████████████ | 3125/6250 [06:38<06:41, 7.78it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.2646738290786743, 'eval_accuracy': 0.92848, 'eval_runtime': 57.7938, 'eval_samples_per_second': 216.286, 'eval_steps_per_second': 108.143, 'epoch': 1.0} {'loss': 0.3028, 'learning_rate': 0.0, 'epoch': 2.0} 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [14:16<00:00, 7.81it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.28093457221984863, 'eval_accuracy': 0.93488, 'eval_runtime': 57.7811, 'eval_samples_per_second': 216.334, 'eval_steps_per_second': 108.167, 'epoch': 2.0} 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [15:14<00:00, 7.81it/sSaving model checkpoint to ./output/checkpoint-6250 Configuration saved in ./output/checkpoint-6250/config.json Model weights saved in ./output/checkpoint-6250/pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) {'train_runtime': 915.911, 'train_samples_per_second': 54.59, 'train_steps_per_second': 6.824, 'train_loss': 0.30278287109375, 'epoch': 2.0} 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [15:15<00:00, 6.82it/s] ***** Running Prediction ***** Num examples = 12500 Batch size = 2 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [00:57<00:00, 108.59it/s] RobertaForSequenceClassification( (roberta): RobertaModel( (embeddings): RobertaEmbeddings( (word_embeddings): Embedding(50265, 768, padding_idx=1) (position_embeddings): Embedding(514, 768, padding_idx=1) (token_type_embeddings): Embedding(1, 768) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): RobertaEncoder( (layer): ModuleList( (0): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): RobertaSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (1): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): 
RobertaSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (2): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): RobertaSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (3): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): RobertaSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (4): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): RobertaSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (5): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): RobertaSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, 
elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) ) (classifier): RobertaClassificationHead( (dense): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) (out_proj): Linear(in_features=768, out_features=2, bias=True) ) ) total parameters: 82119938 real 17m14.605s user 18m48.423s sys 0m16.505s Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████| 481/481 [00:00<00:00, 312kB/s] Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 976kB/s] Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.01MB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:01<00:00, 1.21MB/s] max_length for roberta-base: 512 Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 501M/501M [01:39<00:00, 5.04MB/s] Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight'] - This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. max_steps is given, it will override any value given in num_train_epochs ***** Running training ***** Num examples = 25000 Num Epochs = 2 Instantaneous batch size per device = 2 Total train batch size (w. 
parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 124647170
 50%|██████████          | 3125/6250 [12:45<12:47, 4.07it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.23669573664665222, 'eval_accuracy': 0.94672, 'eval_runtime': 112.5479, 'eval_samples_per_second': 111.064, 'eval_steps_per_second': 55.532, 'epoch': 1.0}
{'loss': 0.26, 'learning_rate': 0.0, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [27:24<00:00, 4.08it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.23625726997852325, 'eval_accuracy': 0.95104, 'eval_runtime': 112.5581, 'eval_samples_per_second': 111.054, 'eval_steps_per_second': 55.527, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [29:17<00:00, 4.08it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1758.9127, 'train_samples_per_second': 28.427, 'train_steps_per_second': 3.553, 'train_loss': 0.25999873046875, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [29:18<00:00, 3.55it/s]
***** Running Prediction *****  Num examples = 12500  Batch size = 2
100%|████████████████████| 6250/6250 [01:52<00:00, 55.69it/s]
RobertaForSequenceClassification( (roberta): RobertaModel( (embeddings): RobertaEmbeddings( (word_embeddings): Embedding(50265, 768, padding_idx=1) (position_embeddings): Embedding(514, 768, padding_idx=1) (token_type_embeddings): Embedding(1, 768) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): RobertaEncoder( (layer): ModuleList( (0): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): RobertaSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) [... (1)-(11): eleven further RobertaLayer blocks identical to (0), omitted here ...] ) ) ) (classifier): RobertaClassificationHead( (dense): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) (out_proj): Linear(in_features=768, out_features=2, bias=True) ) )
total parameters: 124647170
real    33m11.707s
user    32m31.526s
sys     0m44.743s
Downloading: 100%|████████████████████| 482/482 [00:00<00:00, 315kB/s]
Downloading: 100%|████████████████████| 899k/899k [00:00<00:00, 1.23MB/s]
Downloading: 100%|████████████████████| 456k/456k [00:00<00:00, 967kB/s]
Downloading: 100%|████████████████████| 1.36M/1.36M [00:01<00:00, 1.05MB/s]
max_length for roberta-large: 512
Downloading: 100%|████████████████████| 1.43G/1.43G [02:55<00:00, 8.11MB/s]
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 355361794
 50%|██████████          | 3125/6250 [37:43<37:49, 1.38it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.19495820999145508, 'eval_accuracy': 0.95824, 'eval_runtime': 340.6143, 'eval_samples_per_second': 36.698, 'eval_steps_per_second': 18.349, 'epoch': 1.0}
{'loss': 0.1871, 'learning_rate': 0.0, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [1:21:09<00:00, 1.38it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.17903216183185577, 'eval_accuracy': 0.9656, 'eval_runtime': 340.7035, 'eval_samples_per_second': 36.689, 'eval_steps_per_second': 18.344, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [1:26:50<00:00, 1.38it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
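For reference, every run in this log is consistent with one short Hugging Face Trainer script that loop.sh invokes once per checkpoint name (apparently under time, given the real/user/sys lines after each run). The sketch below is a reconstruction from the settings that are actually echoed in the output: 512-token truncation, an instantaneous batch size of 2 per device, 4 gradient-accumulation steps, max_steps=6250 alongside num_train_epochs=2, per-epoch evaluation, a final prediction pass over 12,500 held-out examples, and a printed parameter count. The file name, the way the 25,000 IMDB test reviews are split into the 12,500-example evaluation and prediction sets, and the checkpoint-saving arguments are assumptions, not the author's actual code.

# finetune_imdb.py - hypothetical reconstruction of the per-model script behind loop.sh
import sys

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = sys.argv[1]  # e.g. "roberta-large", passed in by loop.sh

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
print(f"max_length for {checkpoint}: {tokenizer.model_max_length}")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

imdb = load_dataset("imdb")
train_ds = imdb["train"].map(tokenize, batched=True)  # 25,000 labelled reviews
# The log evaluates on 12,500 examples and predicts on another 12,500, i.e. the
# 25,000-review IMDB test set split in half; the exact split used is a guess.
halves = imdb["test"].map(tokenize, batched=True).train_test_split(test_size=0.5, seed=0)
eval_ds, predict_ds = halves["train"], halves["test"]

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=2,   # "Instantaneous batch size per device = 2"
    per_device_eval_batch_size=2,    # "Batch size = 2" during evaluation
    gradient_accumulation_steps=4,   # effective batch size 2 x 4 = 8
    num_train_epochs=2,
    max_steps=6250,                  # triggers the "max_steps is given" notice above
    evaluation_strategy="epoch",
    save_steps=6250,                 # only checkpoint-6250 appears in the log; a guess
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,             # enables dynamic padding of the tokenized batches
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.predict(predict_ds)

print(model)
print("total parameters:", sum(p.numel() for p in model.parameters()))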
{'train_runtime': 5214.7364, 'train_samples_per_second': 9.588, 'train_steps_per_second': 1.199, 'train_loss': 0.1871165625, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [1:26:54<00:00, 1.20it/s]
***** Running Prediction *****  Num examples = 12500  Batch size = 2
100%|████████████████████| 6250/6250 [05:40<00:00, 18.37it/s]
RobertaForSequenceClassification( (roberta): RobertaModel( (embeddings): RobertaEmbeddings( (word_embeddings): Embedding(50265, 1024, padding_idx=1) (position_embeddings): Embedding(514, 1024, padding_idx=1) (token_type_embeddings): Embedding(1, 1024) (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): RobertaEncoder( (layer): ModuleList( (0): RobertaLayer( (attention): RobertaAttention( (self): RobertaSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): RobertaSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): RobertaIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): RobertaOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) [... (1)-(23): twenty-three further RobertaLayer blocks identical to (0), omitted here ...] ) ) ) (classifier): RobertaClassificationHead( (dense): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) (out_proj): Linear(in_features=1024, out_features=2, bias=True) ) )
total parameters: 355361794
real    95m53.424s
user    75m0.324s
sys     20m0.814s
Downloading: 100%|████████████████████| 29.0/29.0 [00:00<00:00, 18.8kB/s]
Downloading: 100%|████████████████████| 665/665 [00:00<00:00, 427kB/s]
Downloading: 100%|████████████████████| 232k/232k [00:00<00:00, 785kB/s]
Downloading: 100%|████████████████████| 466k/466k [00:00<00:00, 1.08MB/s]
max_length for google/electra-small-discriminator: 512
Downloading: 100%|████████████████████| 54.2M/54.2M [00:06<00:00, 7.80MB/s]
Some weights of the model checkpoint at google/electra-small-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-discriminator and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 13549314
 50%|██████████          | 3125/6250 [05:32<05:35, 9.32it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.2849559187889099, 'eval_accuracy': 0.90224, 'eval_runtime': 34.7946, 'eval_samples_per_second': 359.251, 'eval_steps_per_second': 179.625, 'epoch': 1.0}
{'loss': 0.3616, 'learning_rate': 0.0, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [11:37<00:00, 9.37it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.293412446975708, 'eval_accuracy': 0.90392, 'eval_runtime': 34.7769, 'eval_samples_per_second': 359.434, 'eval_steps_per_second': 179.717, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [12:12<00:00, 9.37it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 732.4955, 'train_samples_per_second': 68.26, 'train_steps_per_second': 8.532, 'train_loss': 0.3616459375, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [12:12<00:00, 8.53it/s]
***** Running Prediction *****  Num examples = 12500  Batch size = 2
100%|████████████████████| 6250/6250 [00:34<00:00, 180.13it/s]
ElectraForSequenceClassification( (electra): ElectraModel( (embeddings): ElectraEmbeddings( (word_embeddings): Embedding(30522, 128, padding_idx=0) (position_embeddings): Embedding(512, 128) (token_type_embeddings): Embedding(2, 128) (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (embeddings_project): Linear(in_features=128, out_features=256, bias=True) (encoder): ElectraEncoder( (layer): ModuleList( (0): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=256, out_features=256, bias=True) (key): Linear(in_features=256, out_features=256, bias=True) (value): Linear(in_features=256, out_features=256, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=256, out_features=256, bias=True) (LayerNorm): LayerNorm((256,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=256, out_features=1024, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=1024, out_features=256, bias=True) (LayerNorm): LayerNorm((256,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) [... (1)-(11): eleven further ElectraLayer blocks identical to (0), omitted here ...] ) ) ) (classifier): ElectraClassificationHead( (dense): Linear(in_features=256, out_features=256, bias=True) (dropout): Dropout(p=0.1, inplace=False) (out_proj): Linear(in_features=256, out_features=2, bias=True) ) )
total parameters: 13549314
real    13m12.750s
user    14m16.334s
sys     0m12.866s
Downloading: 100%|████████████████████| 27.0/27.0 [00:00<00:00, 17.3kB/s]
Downloading: 100%|████████████████████| 666/666 [00:00<00:00, 424kB/s]
Downloading: 100%|████████████████████| 232k/232k [00:00<00:00, 734kB/s]
Downloading: 100%|████████████████████| 466k/466k [00:00<00:00, 948kB/s]
max_length for google/electra-base-discriminator: 512
Downloading: 100%|████████████████████| 440M/440M [00:54<00:00, 8.14MB/s]
Some weights of the model checkpoint at google/electra-base-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 6250
  Number of trainable parameters = 109483778
 50%|██████████          | 3125/6250 [12:36<12:38, 4.12it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.20373468101024628, 'eval_accuracy': 0.94696, 'eval_runtime': 114.0711, 'eval_samples_per_second': 109.581, 'eval_steps_per_second': 54.79, 'epoch': 1.0}
{'loss': 0.2303, 'learning_rate': 0.0, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [27:07<00:00, 4.13it/s]
***** Running Evaluation *****  Num examples = 12500  Batch size = 2
{'eval_loss': 0.19785547256469727, 'eval_accuracy': 0.95376, 'eval_runtime': 114.0943, 'eval_samples_per_second': 109.558, 'eval_steps_per_second': 54.779, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [29:01<00:00, 4.13it/s]
Saving model checkpoint to ./output/checkpoint-6250
Configuration saved in ./output/checkpoint-6250/config.json
Model weights saved in ./output/checkpoint-6250/pytorch_model.bin
Training completed. Do not forget to share your model on huggingface.co/models =)
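The step counts in these training banners follow directly from the logged batch settings: an instantaneous batch size of 2 per device with 4 gradient-accumulation steps gives the reported effective batch size of 2 x 4 = 8 examples per optimizer step, so one pass over the 25,000 training examples takes 25,000 / 8 = 3,125 optimization steps, and the two epochs give the 6,250 total optimization steps shown for every model. That is also why the first evaluation always fires at step 3,125 (epoch 1.0) and the second at step 6,250 (epoch 2.0); the {'learning_rate': 0.0} in each final loss entry is consistent with the Trainer's default linear learning-rate decay reaching zero on the last step.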
{'train_runtime': 1742.8334, 'train_samples_per_second': 28.689, 'train_steps_per_second': 3.586, 'train_loss': 0.23026556640625, 'epoch': 2.0}
100%|████████████████████| 6250/6250 [29:02<00:00, 3.59it/s]
***** Running Prediction *****  Num examples = 12500  Batch size = 2
100%|████████████████████| 6250/6250 [01:53<00:00, 55.01it/s]
ElectraForSequenceClassification( (electra): ElectraModel( (embeddings): ElectraEmbeddings( (word_embeddings): Embedding(30522, 768, padding_idx=0) (position_embeddings): Embedding(512, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): ElectraEncoder( (layer): ModuleList( (0): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) [... (1)-(11): eleven further ElectraLayer blocks identical to (0), omitted here ...] ) ) ) (classifier): ElectraClassificationHead( (dense): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) (out_proj): Linear(in_features=768, out_features=2, bias=True) ) )
total parameters: 109483778
real    32m10.185s
user    32m21.785s
sys     0m31.784s
Downloading: 100%|████████████████████| 27.0/27.0 [00:00<00:00, 17.9kB/s]
Downloading: 100%|████████████████████| 668/668 [00:00<00:00, 411kB/s]
Downloading: 100%|████████████████████| 232k/232k [00:00<00:00, 720kB/s]
Downloading: 100%|████████████████████| 466k/466k [00:00<00:00, 944kB/s]
max_length for google/electra-large-discriminator: 512
Downloading: 100%|████████████████████| 1.34G/1.34G [03:22<00:00, 6.64MB/s]
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs
***** Running training *****
  Num examples = 25000
  Num Epochs = 2
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 4 Total optimization steps = 6250 Number of trainable parameters = 335143938 50%|███████████████████████████████████████████████ | 3125/6250 [37:27<37:29, 1.39it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.16850042343139648, 'eval_accuracy': 0.95968, 'eval_runtime': 343.5047, 'eval_samples_per_second': 36.39, 'eval_steps_per_second': 18.195, 'epoch': 1.0} {'loss': 0.1734, 'learning_rate': 0.0, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [1:20:38<00:00, 1.39it/s]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.15813788771629333, 'eval_accuracy': 0.96648, 'eval_runtime': 343.8246, 'eval_samples_per_second': 36.356, 'eval_steps_per_second': 18.178, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [1:26:22<00:00, 1.39it/sSaving model checkpoint to ./output/checkpoint-6250 Configuration saved in ./output/checkpoint-6250/config.json Model weights saved in ./output/checkpoint-6250/pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) {'train_runtime': 5186.7011, 'train_samples_per_second': 9.64, 'train_steps_per_second': 1.205, 'train_loss': 0.17338658203125, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [1:26:26<00:00, 1.21it/s] ***** Running Prediction ***** Num examples = 12500 Batch size = 2 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [05:42<00:00, 18.22it/s] ElectraForSequenceClassification( (electra): ElectraModel( (embeddings): ElectraEmbeddings( (word_embeddings): Embedding(30522, 1024, padding_idx=0) (position_embeddings): Embedding(512, 1024) (token_type_embeddings): Embedding(2, 1024) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): ElectraEncoder( (layer): ModuleList( (0): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (1): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) 
(intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (2): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (3): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (4): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (5): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, 
bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (6): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (7): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (8): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (9): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): 
Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (10): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (11): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (12): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (13): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), 
eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (14): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (15): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (16): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (17): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (18): 
ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (19): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (20): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (21): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (22): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): 
Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (23): ElectraLayer( (attention): ElectraAttention( (self): ElectraSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): ElectraSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): ElectraIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) (intermediate_act_fn): GELUActivation() ) (output): ElectraOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) ) (classifier): ElectraClassificationHead( (dense): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) (out_proj): Linear(in_features=1024, out_features=2, bias=True) ) ) total parameters: 335143938 real 95m53.773s user 74m44.349s sys 19m43.768s Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 52.0/52.0 [00:00<00:00, 33.9kB/s] Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████| 578/578 [00:00<00:00, 375kB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████| 2.46M/2.46M [00:00<00:00, 6.90MB/s] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. /home/dadebarr/.local/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text. warnings.warn( Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
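The max_length value printed just below for microsoft/deberta-v3-small looks broken but is expected: the DeBERTa-v3 tokenizer configs do not set model_max_length, so transformers falls back to its very-large-integer sentinel, int(1e30), which prints as 1000000000000000019884624838656. A minimal sketch of the lookup and the usual workaround, assuming the script simply reads tokenizer.model_max_length (the 512 cap matches the limit the script printed for the other checkpoints):

    from transformers import AutoTokenizer

    name = "microsoft/deberta-v3-small"
    tokenizer = AutoTokenizer.from_pretrained(name)

    # DeBERTa-v3 checkpoints leave model_max_length unset, so this prints the
    # int(1e30) sentinel rather than a usable sequence-length limit.
    print(f"max_length for {name}: {tokenizer.model_max_length}")

    # Common workaround: cap the value before tokenizing.
    max_length = min(tokenizer.model_max_length, 512)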
max_length for microsoft/deberta-v3-small: 1000000000000000019884624838656 Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 286M/286M [00:33<00:00, 8.43MB/s] Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.classifier.weight', 'lm_predictions.lm_head.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.bias'] - This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'classifier.bias', 'pooler.dense.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. max_steps is given, it will override any value given in num_train_epochs ***** Running training ***** Num examples = 25000 Num Epochs = 2 Instantaneous batch size per device = 2 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 4 Total optimization steps = 6250 Number of trainable parameters = 141896450 50%|█████████████████████████████████████████████ | 3125/6250 [1:00:31<1:00:30, 1.16s/it]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.2002577781677246, 'eval_accuracy': 0.94528, 'eval_runtime': 888.2538, 'eval_samples_per_second': 14.073, 'eval_steps_per_second': 7.036, 'epoch': 1.0} {'loss': 0.2409, 'learning_rate': 0.0, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [2:15:54<00:00, 1.16s/it]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.20250140130519867, 'eval_accuracy': 0.94952, 'eval_runtime': 892.709, 'eval_samples_per_second': 14.002, 'eval_steps_per_second': 7.001, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [2:30:47<00:00, 1.16s/itSaving model checkpoint to ./output/checkpoint-6250 Configuration saved in ./output/checkpoint-6250/config.json Model weights saved in ./output/checkpoint-6250/pytorch_model.bin Training completed. 
Do not forget to share your model on huggingface.co/models =) {'train_runtime': 9049.8038, 'train_samples_per_second': 5.525, 'train_steps_per_second': 0.691, 'train_loss': 0.2409121484375, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [2:30:49<00:00, 1.45s/it] ***** Running Prediction ***** Num examples = 12500 Batch size = 2 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [14:40<00:00, 7.10it/s] DebertaV2ForSequenceClassification( (deberta): DebertaV2Model( (embeddings): DebertaV2Embeddings( (word_embeddings): Embedding(128100, 768, padding_idx=0) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) (encoder): DebertaV2Encoder( (layer): ModuleList( (0): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (1): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (2): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (3): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( 
(query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (4): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (5): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) ) (rel_embeddings): Embedding(512, 768) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) ) ) (pooler): ContextPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (dropout): StableDropout() ) (classifier): Linear(in_features=768, out_features=2, bias=True) (dropout): StableDropout() ) total parameters: 141896450 real 166m33.176s user 459m8.851s sys 53m23.312s Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 52.0/52.0 [00:00<00:00, 34.0kB/s] Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████| 579/579 [00:00<00:00, 400kB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████| 2.46M/2.46M [00:00<00:00, 6.58MB/s] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
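Each run above finishes by printing the full module tree and a total-parameters line (141896450 for deberta-v3-small). A minimal sketch of how such a count is typically produced, assuming the script just sums numel() over the model's parameters (loop.sh itself is not shown):

    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-small", num_labels=2
    )

    print(model)  # the DebertaV2ForSequenceClassification(...) tree shown above

    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("total parameters:", total)
    # With nothing frozen, this matches the Trainer's "Number of trainable
    # parameters" line (141896450 here).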
/home/dadebarr/.local/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text. warnings.warn( Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. max_length for microsoft/deberta-v3-base: 1000000000000000019884624838656 Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 371M/371M [00:33<00:00, 11.2MB/s] Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.LayerNorm.bias', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.LayerNorm.weight'] - This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['pooler.dense.bias', 'classifier.bias', 'classifier.weight', 'pooler.dense.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. max_steps is given, it will override any value given in num_train_epochs ***** Running training ***** Num examples = 25000 Num Epochs = 2 Instantaneous batch size per device = 2 Total train batch size (w. 
parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 4 Total optimization steps = 6250 Number of trainable parameters = 184423682 50%|█████████████████████████████████████████████ | 3125/6250 [1:55:23<1:55:35, 2.22s/it]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.15911158919334412, 'eval_accuracy': 0.96216, 'eval_runtime': 1651.8218, 'eval_samples_per_second': 7.567, 'eval_steps_per_second': 3.784, 'epoch': 1.0} {'loss': 0.1948, 'learning_rate': 0.0, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [4:18:24<00:00, 2.22s/it]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.1609278917312622, 'eval_accuracy': 0.96616, 'eval_runtime': 1652.1755, 'eval_samples_per_second': 7.566, 'eval_steps_per_second': 3.783, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [4:45:57<00:00, 2.22s/itSaving model checkpoint to ./output/checkpoint-6250 Configuration saved in ./output/checkpoint-6250/config.json Model weights saved in ./output/checkpoint-6250/pytorch_model.bin Training completed. Do not forget to share your model on huggingface.co/models =) {'train_runtime': 17159.7207, 'train_samples_per_second': 2.914, 'train_steps_per_second': 0.364, 'train_loss': 0.19477109375, 'epoch': 2.0} 100%|████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [4:45:59<00:00, 2.75s/it] ***** Running Prediction ***** Num examples = 12500 Batch size = 2 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [27:29<00:00, 3.79it/s] DebertaV2ForSequenceClassification( (deberta): DebertaV2Model( (embeddings): DebertaV2Embeddings( (word_embeddings): Embedding(128100, 768, padding_idx=0) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) (encoder): DebertaV2Encoder( (layer): ModuleList( (0): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (1): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, 
out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (2): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (3): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (4): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (5): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() 
) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (6): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (7): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (8): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (9): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, 
out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (10): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (11): DebertaV2Layer( (attention): DebertaV2Attention( (self): DisentangledSelfAttention( (query_proj): Linear(in_features=768, out_features=768, bias=True) (key_proj): Linear(in_features=768, out_features=768, bias=True) (value_proj): Linear(in_features=768, out_features=768, bias=True) (pos_dropout): StableDropout() (dropout): StableDropout() ) (output): DebertaV2SelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) (intermediate): DebertaV2Intermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) (intermediate_act_fn): GELUActivation() ) (output): DebertaV2Output( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) (dropout): StableDropout() ) ) ) (rel_embeddings): Embedding(512, 768) (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True) ) ) (pooler): ContextPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (dropout): StableDropout() ) (classifier): Linear(in_features=768, out_features=2, bias=True) (dropout): StableDropout() ) total parameters: 184423682 real 314m32.645s user 598m10.645s sys 65m54.987s Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 52.0/52.0 [00:00<00:00, 33.9kB/s] Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████| 580/580 [00:00<00:00, 377kB/s] Downloading: 100%|████████████████████████████████████████████████████████████████████████████████| 2.46M/2.46M [00:00<00:00, 7.32MB/s] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. /home/dadebarr/.local/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text. warnings.warn( Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
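Every block above reports the same training setup: 25000 train examples, per-device batch size 2, gradient accumulation 4 on a single device (effective batch 8), so 25000 examples x 2 epochs / 8 = 6250 optimization steps, with evaluation on 12500 examples at steps 3125 (epoch 1.0) and 6250 (epoch 2.0). A plausible reconstruction of that Trainer configuration, under stated assumptions (the IMDB loading, the 12500-example evaluation subset, and the evaluation/save flags are guesses; only the batch size, accumulation, step count and output directory are taken from the log):

    import numpy as np
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    name = "microsoft/deberta-v3-large"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    raw = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    train_dataset = raw["train"].map(tokenize, batched=True)   # 25000 examples
    # The log evaluates on 12500 examples (half of the IMDB test split); how the
    # script picks that half is not shown, so this subset is an assumption.
    eval_dataset = raw["test"].shuffle(seed=42).select(range(12500)).map(tokenize, batched=True)

    def compute_metrics(eval_pred):
        preds = np.argmax(eval_pred.predictions, axis=-1)
        return {"accuracy": (preds == eval_pred.label_ids).mean()}

    args = TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=6250,                   # overrides num_train_epochs, as the log warns
        evaluation_strategy="epoch",
        save_steps=6250,                  # one final checkpoint-6250, matching the log
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    predictions = trainer.predict(eval_dataset)

    print(model)
    print("total parameters:", sum(p.numel() for p in model.parameters()))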
max_length for microsoft/deberta-v3-large: 1000000000000000019884624838656 Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████| 874M/874M [01:45<00:00, 8.26MB/s] Some weights of the model checkpoint at microsoft/deberta-v3-large were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.dense.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight'] - This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.bias', 'classifier.bias', 'pooler.dense.weight', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. max_steps is given, it will override any value given in num_train_epochs ***** Running training ***** Num examples = 25000 Num Epochs = 2 Instantaneous batch size per device = 2 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 4 Total optimization steps = 6250 Number of trainable parameters = 435063810 50%|█████████████████████████████████████████████ | 3125/6250 [4:58:13<4:57:55, 5.72s/it]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.16256800293922424, 'eval_accuracy': 0.96752, 'eval_runtime': 3972.059, 'eval_samples_per_second': 3.147, 'eval_steps_per_second': 1.573, 'epoch': 1.0} {'loss': 0.1596, 'learning_rate': 0.0, 'epoch': 2.0} 100%|███████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [11:02:36<00:00, 5.72s/it]***** Running Evaluation ***** Num examples = 12500 Batch size = 2 {'eval_loss': 0.1460186094045639, 'eval_accuracy': 0.97056, 'eval_runtime': 3970.6365, 'eval_samples_per_second': 3.148, 'eval_steps_per_second': 1.574, 'epoch': 2.0} 100%|███████████████████████████████████████████████████████████████████████████████████████████| 6250/6250 [12:08:47<00:00, 5.72s/itSaving model checkpoint to ./output/checkpoint-6250 Configuration saved in ./output/checkpoint-6250/config.json Model weights saved in ./output/checkpoint-6250/pytorch_model.bin Training completed. 
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 43732.9584, 'train_samples_per_second': 1.143, 'train_steps_per_second': 0.143, 'train_loss': 0.159613408203125, 'epoch': 2.0}
100%|██████████████████████████████████████████████| 6250/6250 [12:08:52<00:00, 7.00s/it]
***** Running Prediction *****
  Num examples = 12500
  Batch size = 2
100%|██████████████████████████████████████████████| 6250/6250 [1:06:07<00:00, 1.58it/s]
DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 1024, padding_idx=0)
      (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-23): 24 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (key_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (value_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
              (dropout): StableDropout()
            )
          )
          (intermediate): DebertaV2Intermediate(
            (dense): Linear(in_features=1024, out_features=4096, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): DebertaV2Output(
            (dense): Linear(in_features=4096, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
            (dropout): StableDropout()
          )
        )
      )
      (rel_embeddings): Embedding(512, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
    )
  )
  (pooler): ContextPooler(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (dropout): StableDropout()
  )
  (classifier): Linear(in_features=1024, out_features=2, bias=True)
  (dropout): StableDropout()
)
total parameters: 435063810

real    797m18.356s
user    1022m54.561s
sys     106m59.397s
dadebarr@dadebarr-lambda:~/imdb$
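
A note on the first line of the deberta-v3-large run: the absurdly large max_length is not a parsing error. When a checkpoint's tokenizer configuration does not record a maximum input length, tokenizer.model_max_length falls back to the transformers sentinel int(1e30), which prints as 1000000000000000019884624838656. The sketch below shows one way to cap it before tokenizing; the 512 cap is an assumption, since the transcript does not show what value loop.sh actually used.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
    # Prints 1000000000000000019884624838656 == int(1e30), the "no limit" sentinel.
    print(tokenizer.model_max_length)
    # Cap manually when the sentinel is returned (512 is an assumed choice, not taken from the log).
    max_length = min(tokenizer.model_max_length, 512)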
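
For orientation, here is a minimal sketch of the kind of fine-tuning script loop.sh could be driving to produce a run like the one above. Only the values that actually appear in the log (the model name, per-device batch size 2, gradient accumulation 4, 6250 optimization steps over 2 epochs, ./output, 25,000 training and 12,500 evaluation/prediction examples) are taken from the transcript; the file name imdb.py, the 50/50 split of the IMDB test set, the truncation length, and the logging/saving settings are assumptions, not the author's actual code.

    # imdb.py -- a hypothetical reconstruction, not the author's actual script.
    import sys

    import numpy as np
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = sys.argv[1] if len(sys.argv) > 1 else "microsoft/deberta-v3-large"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"max_length for {model_name}: {tokenizer.model_max_length}")
    max_length = min(tokenizer.model_max_length, 512)  # assumed cap (see the note above)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=max_length)

    imdb = load_dataset("imdb").map(tokenize, batched=True)
    # 25,000 training examples; the 12,500-example evaluation and prediction sets in the
    # log suggest the 25,000-example test split was halved (an assumption).
    halves = imdb["test"].train_test_split(test_size=0.5, seed=42)
    train_ds, valid_ds, test_ds = imdb["train"], halves["train"], halves["test"]

    def compute_metrics(eval_pred):
        preds = np.argmax(eval_pred.predictions, axis=-1)
        return {"accuracy": float((preds == eval_pred.label_ids).mean())}

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    args = TrainingArguments(
        output_dir="./output",            # log: "Saving model checkpoint to ./output/checkpoint-6250"
        per_device_train_batch_size=2,    # log: "Instantaneous batch size per device = 2"
        per_device_eval_batch_size=2,     # log: evaluation "Batch size = 2"
        gradient_accumulation_steps=4,    # log: "Gradient Accumulation steps = 4"
        num_train_epochs=2,               # log: "Num Epochs = 2"
        max_steps=6250,                   # log: "Total optimization steps = 6250" (overrides num_train_epochs)
        evaluation_strategy="epoch",      # matches the per-epoch evaluations in the log
        logging_strategy="epoch",
        save_steps=6250,                  # assumed: the log shows a single checkpoint at step 6250
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=valid_ds,
        tokenizer=tokenizer,              # enables dynamic padding via the default collator
        compute_metrics=compute_metrics,
    )
    trainer.train()
    predictions = trainer.predict(test_ds)

    # Mirrors the tail of the log: the module tree and the parameter count.
    print(model)
    print("total parameters:", sum(p.numel() for p in model.parameters()))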
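
The prediction pass at the end of the run produces logits for the 12,500 held-out reviews, but the transcript does not print an accuracy for them; the final evaluation accuracy for this deberta-v3-large run was 0.97056. Assuming the predictions object from the sketch above, the held-out accuracy could be computed the same way as the eval metric:

    import numpy as np

    # `predictions` is the PredictionOutput returned by trainer.predict(test_ds) in the sketch above.
    test_accuracy = (np.argmax(predictions.predictions, axis=-1) == predictions.label_ids).mean()
    print(f"test accuracy: {test_accuracy:.5f}")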