Problem description:

I am using the LSTM language model implemented in https://github.com/wojzaremba/lstm

It uses the following lstm function

local function lstm(x, prev_c, prev_h)
  -- Calculate all four gates in one go
  local i2h = nn.Linear(params.rnn_size, 4 * params.rnn_size)(x)
  local h2h = nn.Linear(params.rnn_size, 4 * params.rnn_size)(prev_h)
  local gates = nn.CAddTable()({i2h, h2h})

  -- Reshape to (batch_size, n_gates, hid_size),
  -- then slice the n_gates dimension, i.e. dimension 2
  local reshaped_gates = nn.Reshape(4, params.rnn_size)(gates)
  local sliced_gates = nn.SplitTable(2)(reshaped_gates)

  -- Use SelectTable to fetch each gate and apply the nonlinearity
  local in_gate = nn.Sigmoid()(nn.SelectTable(1)(sliced_gates))
  local in_transform = nn.Tanh()(nn.SelectTable(2)(sliced_gates))
  local forget_gate = nn.Sigmoid()(nn.SelectTable(3)(sliced_gates))
  local out_gate = nn.Sigmoid()(nn.SelectTable(4)(sliced_gates))

  local next_c = nn.CAddTable()({
    nn.CMulTable()({forget_gate, prev_c}),
    nn.CMulTable()({in_gate, in_transform})
  })
  local next_h = nn.CMulTable()({out_gate, nn.Tanh()(next_c)})

  return next_c, next_h
end
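For reference, this is how I read the reshape/split step on a concrete tensor (a standalone sketch, not part of the repo; batch size 2 and rnn_size 3 are made-up values):

require 'nn'
-- pretend pre-activation gates of shape (batch_size, 4 * rnn_size) = (2, 12)
local gates = torch.range(1, 2 * 4 * 3):resize(2, 12)
-- reshape to (batch_size, 4, rnn_size), then split along dimension 2
local reshaped = nn.Reshape(4, 3):forward(gates)
local sliced = nn.SplitTable(2):forward(reshaped)
-- sliced is a table of 4 tensors of shape (2, 3):
-- sliced[1] -> in_gate, sliced[2] -> in_transform,
-- sliced[3] -> forget_gate, sliced[4] -> out_gate (all pre-nonlinearity)
print(#sliced, sliced[1]:size())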

This lstm function is used in the following network (I removed the softmax and criterion layers and added them separately elsewhere in the code):

local function create_network()
  local x = nn.Identity()()
  local prev_s = nn.Identity()()
  local i = {[0] = x}
  local next_s = {}
  local split = {prev_s:split(2 * params.layers)}
  for layer_idx = 1, params.layers do
    local prev_c = split[2 * layer_idx - 1]
    local prev_h = split[2 * layer_idx]
    local dropped = nn.Dropout(params.dropout)(i[layer_idx - 1])
    local next_c, next_h = lstm(dropped, prev_c, prev_h)
    table.insert(next_s, next_c)
    table.insert(next_s, next_h)
    i[layer_idx] = next_h
  end
  local res = nn.Identity()(i[params.layers])
  local module = nn.gModule({x, prev_s},
                            {res, nn.Identity()(next_s)})
  return module
end

The above network returns the output of the network and the state of the lstm layers to be used in the next iteration. For a 2-layer lstm network, the states are saved in a table in the order {cell_1, output_1, cell_2, output_2}. The network output (res) and output_2 are the same.
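For concreteness, a single step of my usage looks roughly like this (a sketch with placeholder hyper-parameters and zero tensors; batch_size is my own placeholder, and I assume x already has size params.rnn_size since the embedding/softmax layers are handled elsewhere):

require 'nngraph'
-- placeholder hyper-parameters just for this sketch
params = {rnn_size = 200, layers = 2, dropout = 0.5, batch_size = 20}
local network = create_network()
local x = torch.zeros(params.batch_size, params.rnn_size)
local prev_s = {}
for j = 1, 2 * params.layers do
  prev_s[j] = torch.zeros(params.batch_size, params.rnn_size)
end
local out = network:forward({x, prev_s})
local res, next_s = out[1], out[2]
-- next_s = {cell_1, output_1, cell_2, output_2}; res holds the same values as next_s[4]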

I have two questions:

(1) When I apply forward and backward propagation to this network, how are the gradients of the states arranged? Do they have the same order as the above table, or are they reversed, like this: {grad_cell_2, grad_output_2, grad_cell_1, grad_output_1}?

I initially thought they would be in the same order as the output table, but I have reason to suspect that the order is reversed (based on some tests where I manually set the gradients for each iteration). I don't know for sure, though, and I don't know how to debug this code to find out exactly what is going on.
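The kind of test I have been running looks roughly like this (continuing the sketch above, so network, x and prev_s are the placeholders defined there; the constants are arbitrary and only meant to make each state slot distinguishable):

-- give each state slot of the gradOutput a distinct constant, so that permuting
-- the entries of grad_s visibly changes the resulting gradients if the order matters
local grad_res = torch.zeros(params.batch_size, params.rnn_size)
local grad_s = {}
for j = 1, 2 * params.layers do
  grad_s[j] = torch.Tensor(params.batch_size, params.rnn_size):fill(j)
end
network:forward({x, prev_s})
local grad_inputs = network:backward({x, prev_s}, {grad_res, grad_s})
-- grad_inputs[2] is the gradient w.r.t. prev_s, one tensor per state entry
for j, g in ipairs(grad_inputs[2]) do print(j, g:norm()) end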

(2) In the backward step, if I only know the gradients for the output (which is the same as the last entry in the state table), should I pass them as the gradient for the output (res), for the state table (next_s), or for both? I would expect that passing the gradient to the output only, or to the last entry of the state table only, would give exactly the same result, since the output is simply the last entry in that table. However, I get different results when I try it both ways.
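Concretely, these are the two backward calls I am comparing (again continuing the sketch above; grad_out stands in for the gradient I actually know and zero is an all-zeros placeholder of the same size):

local grad_out = torch.randn(params.batch_size, params.rnn_size)  -- stand-in for the known output gradient
local zero = torch.zeros(params.batch_size, params.rnn_size)
-- (a) pass the known gradient through the output node res only
network:forward({x, prev_s})
local grad_a = network:backward({x, prev_s}, {grad_out, {zero, zero, zero, zero}})
-- (b) pass the same gradient through the last state entry (output_2) only
network:forward({x, prev_s})
local grad_b = network:backward({x, prev_s}, {zero, {zero, zero, zero, grad_out}})
-- I expected grad_a and grad_b to be identical, but they are not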
