Internet Movie Database (IMDB) Data: Long Short-Term Memory (LSTM)

In [1]:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=100000)  # 100000 exceeds the 88584-word vocabulary, so no words are filtered out here
print(train_data.shape)
print(train_labels.shape)
print(test_data.shape)
print(test_labels.shape)
word_to_index = imdb.get_word_index()
print(len(word_to_index))
index_to_word = dict([(value, key) for (key, value) in word_to_index.items()])
encoded_review = train_data[0]
# stored indices are offset by 3 to leave room for the reserved values noted below
decoded_review = ' '.join([index_to_word.get(i - 3, '?') for i in encoded_review])
print(encoded_review)
print(decoded_review)
# 0: padding
# 1: start of sequence
# 2: unknown
Using CNTK backend
(25000,)
(25000,)
(25000,)
(25000,)
88584
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
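The cell above decodes stored indices back into words; going the other direction can be handy for scoring new text later. Below is a minimal sketch (not part of the original notebook) of a hypothetical encode_review helper that applies the same offset-by-3 convention; it assumes the input text has already been lowercased and stripped of punctuation the way the Keras IMDB data was prepared.

def encode_review(text, word_to_index, num_words=100000):
    indices = [1]                       # 1 marks the start of a sequence
    for word in text.lower().split():
        rank = word_to_index.get(word)
        # stored indices are the word's frequency rank plus 3, skipping the reserved values
        if rank is not None and rank + 3 < num_words:
            indices.append(rank + 3)
        else:
            indices.append(2)           # 2 marks an unknown / out-of-vocabulary word
    return indices

# should reproduce the first few indices of encoded_review above
print(encode_review('this film was just brilliant', word_to_index))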
In [2]:
'''Trains an LSTM on the IMDB sentiment classification task.
The dataset is actually too small for an LSTM to offer any advantage
over simpler, much faster methods such as TF-IDF + LogReg
(a minimal sketch of that baseline follows this cell).
Notes:

- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.

- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''
from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
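The docstring above claims a TF-IDF + LogReg pipeline is competitive with, and much faster than, the LSTM on this dataset. Here is a minimal sketch of that baseline, assuming scikit-learn is installed; the bx_*/by_* names exist only for this comparison, and the integer word indices are joined into strings so TfidfVectorizer can treat each index as a token.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# reload the data with the same vocabulary cap used for the LSTM below
(bx_train, by_train), (bx_test, by_test) = imdb.load_data(num_words=20000)

def to_token_strings(sequences):
    # represent each review as a space-separated string of word indices
    return [' '.join(str(i) for i in seq) for seq in sequences]

vectorizer = TfidfVectorizer(token_pattern=r'\S+')
tfidf_train = vectorizer.fit_transform(to_token_strings(bx_train))
tfidf_test = vectorizer.transform(to_token_strings(bx_test))

baseline = LogisticRegression()
baseline.fit(tfidf_train, by_train)
print('TF-IDF + LogReg test accuracy:', baseline.score(tfidf_test, by_test))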
In [3]:
max_features = 20000
maxlen = 80  # keep only the last maxlen words of each review (pad_sequences truncates from the front by default), drawn from the top max_features most common words
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)
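As a quick sanity check (not in the original notebook), this is what pad_sequences does with its default settings: sequences shorter than maxlen are zero-padded on the left, and longer ones keep only their last maxlen entries, which is why the comment above describes the reviews as truncated from the front.

from keras.preprocessing import sequence

example = [[11, 12, 13], [1, 2, 3, 4, 5, 6, 7]]
print(sequence.pad_sequences(example, maxlen=5))
# expected output, shape (2, 5):
# [[ 0  0 11 12 13]
#  [ 3  4  5  6  7]]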
In [4]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))                    # map each word index to a 128-dimensional vector
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))   # 128 LSTM units with input and recurrent dropout
model.add(Dense(1, activation='sigmoid'))                  # single sigmoid unit for binary sentiment

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
/home/dadebarr/anaconda3/lib/python3.5/site-packages/Keras-2.0.5-py3.5.egg/keras/backend/cntk_backend.py:989: UserWarning: Warning: CNTK backend does not support collapse of batch axis with inferred dimension. The reshape did not take place.
25000/25000 [==============================] - 59s - loss: 0.4565 - acc: 0.7849 - val_loss: 0.3740 - val_acc: 0.8377
Epoch 2/15
25000/25000 [==============================] - 59s - loss: 0.2903 - acc: 0.8820 - val_loss: 0.3729 - val_acc: 0.8329
Epoch 3/15
25000/25000 [==============================] - 59s - loss: 0.2071 - acc: 0.9194 - val_loss: 0.4200 - val_acc: 0.8345
Epoch 4/15
25000/25000 [==============================] - 59s - loss: 0.1492 - acc: 0.9457 - val_loss: 0.5069 - val_acc: 0.8241
Epoch 5/15
25000/25000 [==============================] - 59s - loss: 0.1061 - acc: 0.9624 - val_loss: 0.6315 - val_acc: 0.8273
Epoch 6/15
25000/25000 [==============================] - 59s - loss: 0.0742 - acc: 0.9741 - val_loss: 0.6869 - val_acc: 0.8209
Epoch 7/15
25000/25000 [==============================] - 59s - loss: 0.0550 - acc: 0.9813 - val_loss: 0.7332 - val_acc: 0.8156
Epoch 8/15
25000/25000 [==============================] - 59s - loss: 0.0429 - acc: 0.9858 - val_loss: 0.7720 - val_acc: 0.8146
Epoch 9/15
25000/25000 [==============================] - 58s - loss: 0.0361 - acc: 0.9880 - val_loss: 0.7016 - val_acc: 0.8129
Epoch 10/15
25000/25000 [==============================] - 59s - loss: 0.0294 - acc: 0.9902 - val_loss: 0.8774 - val_acc: 0.8129
Epoch 11/15
25000/25000 [==============================] - 59s - loss: 0.0212 - acc: 0.9926 - val_loss: 0.9228 - val_acc: 0.8125
Epoch 12/15
25000/25000 [==============================] - 58s - loss: 0.0173 - acc: 0.9943 - val_loss: 1.1015 - val_acc: 0.8150
Epoch 13/15
25000/25000 [==============================] - 59s - loss: 0.0162 - acc: 0.9954 - val_loss: 1.0285 - val_acc: 0.8173
Epoch 14/15
25000/25000 [==============================] - 59s - loss: 0.0114 - acc: 0.9966 - val_loss: 1.0289 - val_acc: 0.8092
Epoch 15/15
25000/25000 [==============================] - 59s - loss: 0.0110 - acc: 0.9969 - val_loss: 1.0519 - val_acc: 0.8138
25000/25000 [==============================] - 16s    
Test score: 1.05194123522
Test accuracy: 0.81384
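The training log shows the validation loss bottoming out around epoch 2 and climbing afterwards while training accuracy approaches 1.0, a classic overfitting pattern. One possible mitigation, sketched below but not run in this notebook, is to stop training once the monitored validation loss stops improving; the model2 name is used only so the architecture above is retrained from scratch, and in practice the monitored split would be held out from the training data rather than reusing the test set.

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2)  # stop after 2 epochs without val_loss improvement

model2 = Sequential()  # fresh copy of the architecture above, so training restarts from scratch
model2.add(Embedding(max_features, 128))
model2.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model2.fit(x_train, y_train,
           batch_size=batch_size,
           epochs=15,
           validation_data=(x_test, y_test),  # mirrors the cell above; a separate held-out split would be preferable
           callbacks=[early_stop])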
In [ ]: