Suppose we use the following code to train the 2-layer model1, then use the parameters of model1 to create the 1-layer model2.

```python
from tensorflow.keras import datasets, models, layers, initializers
import numpy as np

# Load MNIST and flatten each 28x28 image into a 784-dimensional vector scaled to [0, 1].
(trnX, trnY), (tstX, tstY) = datasets.mnist.load_data()
trnX = trnX.reshape(trnX.shape[0], trnX.shape[1] * trnX.shape[2]).astype('float32') / 255
tstX = tstX.reshape(tstX.shape[0], tstX.shape[1] * tstX.shape[2]).astype('float32') / 255

model1 = models.Sequential([
    layers.Dense(512, input_shape=(trnX.shape[1],)),
    layers.Dense(10, activation='softmax')
])
model1.compile(optimizer='rmsprop',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
model1.fit(trnX, trnY, validation_split=0.1, epochs=10)

W1, b1 = model1.layers[0].get_weights()  # weight matrix (784, 512) and bias vector (512,) for layer 1
W2, b2 = model1.layers[1].get_weights()  # weight matrix (512, 10) and bias vector (10,) for layer 2

# Collapse the two layers into one: W has shape (784, 10), b has shape (1, 10).
W = np.matmul(W1, W2)
b = np.matmul(np.expand_dims(b1, axis=0), W2) + b2

model2 = models.Sequential([
    layers.Dense(10, activation='softmax',
                 kernel_initializer=initializers.Constant(W),
                 bias_initializer=initializers.Constant(b))
])
model2.compile(optimizer='rmsprop',
               loss='sparse_categorical_crossentropy',
               metrics=['accuracy'])
```

a) Will model2 produce the same predictions as model1?

Yes; model2's predictions will be essentially identical to model1's, up to floating-point rounding.

b) If so, what is the bug that makes it possible for the 1-layer model to replace a 2-layer model?

We forgot to specify an activation function for the first layer of model1, so it uses the default linear (identity) activation. Since a Keras Dense layer computes y = xW + b, model1 therefore computes softmax((xW1 + b1)W2 + b2) = softmax(x(W1W2) + (b1W2 + b2)), which is exactly what model2 computes with W = W1W2 and b = b1W2 + b2. In general, any pair of adjacent Dense() layers can be replaced by a single Dense() layer whenever the first of the two has a linear activation function.
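As a quick sanity check, the following sketch (assuming the code above has just been run) compares the two models' outputs on the test set; the first call to predict() builds model2 and applies the constant initializers:

```python
# Compare the two models' predicted class probabilities on the test set.
p1 = model1.predict(tstX)
p2 = model2.predict(tstX)

print(np.allclose(p1, p2, atol=1e-5))                           # expected: True
print(np.mean(np.argmax(p1, axis=1) == np.argmax(p2, axis=1)))  # expected: 1.0
```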
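To see why the linear activation is essential, here is a minimal numpy sketch (the random input x is hypothetical) showing that the collapse identity holds for a linear first layer but fails once a nonlinearity such as ReLU is inserted; comparing pre-softmax logits suffices because softmax is applied elementwise to both:

```python
# Hypothetical batch of 5 flattened inputs, reusing W1, b1, W2, b2 from above.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 784)).astype('float32')

two_layer_linear = (x @ W1 + b1) @ W2 + b2         # linear first layer
collapsed        = x @ (W1 @ W2) + (b1 @ W2 + b2)  # single equivalent layer
print(np.allclose(two_layer_linear, collapsed, atol=1e-3))  # True

two_layer_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2  # ReLU first layer
print(np.allclose(two_layer_relu, collapsed, atol=1e-3))   # False in general
```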