5002 Recurrent Neural Network Memo

Difference with Neural Network

Unlike a feedforward network, the output of an RNN cell is passed back into the network at the next time step as an internal variable (hidden state).

Multilayer

An RNN can have multiple layers, each with multiple memory units.

Basic RNN

The basic RNN is simple: it uses its internal variable $s_t$ directly as the output variable $y_t$. The state is computed from the current input and the previous state:

$s_t = \tanh(W [x_t, s_{t-1}] + b)$

$y_t = s_t$
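Below is a minimal sketch of one basic RNN step in NumPy, written directly from the two formulas above. The function name `rnn_step`, the dimensions, and the random weights are illustrative assumptions, not part of the memo.

```python
import numpy as np

# A minimal sketch of one basic RNN step (names/shapes are illustrative).
def rnn_step(x_t, s_prev, W, b):
    """s_t = tanh(W [x_t, s_{t-1}] + b); the state is also the output y_t."""
    concat = np.concatenate([x_t, s_prev])  # [x_t, s_{t-1}]
    s_t = np.tanh(W @ concat + b)
    return s_t  # y_t = s_t

# Example: input size 3, state size 4, a sequence of 5 inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3 + 4))
b = np.zeros(4)
s = np.zeros(4)
for x in rng.standard_normal((5, 3)):
    s = rnn_step(x, s, W, b)
```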

LSTM

An LSTM cell has these components:

  • an internal variable to store memory
  • a forget feature to discard some portion of the internal variable
  • an input feature to decide the portion and strength of the input
  • an output feature to decide the portion and strength of the output

Some significant differences

First, the previous output $y_{t-1}$ and the previous internal state $s_{t-1}$ are fed back in as inputs, too.

Components and Framework

Components:

  • Forget gate (portion of memory to keep)
  • Input gate (portion of input to take in)
  • Input activation gate (strength/content of the input)
  • New internal state: the input contribution (input gate combined with input activation gate) plus the retained memory (forget gate combined with previous internal state)
  • Output gate (portion of output)
  • Final output (multiplication of the tanh of the internal state and the output gate)

Formulas:

  • Forget gate: $f_t = \sigma(W_f [x_t, y_{t-1}] + b_f)$
  • Input gate: $I_t = \sigma(W_i [x_t, y_{t-1}] + b_i)$
  • Input activation gate: $a_t = \tanh(W_a [x_t, y_{t-1}] + b_a)$
  • New internal state: $s_t = f_t \times s_{t-1} + I_t \times a_t$
  • Output gate: $O_t = \sigma(W_o [x_t, y_{t-1}] + b_o)$
  • Final output: $y_t = O_t \times \tanh(s_t)$
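The formulas above map directly onto a short NumPy sketch of one LSTM step. The helper names (`lstm_step`, `sigmoid`) and the weight shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step following the formulas above (shapes are illustrative).
def lstm_step(x_t, y_prev, s_prev, Wf, bf, Wi, bi, Wa, ba, Wo, bo):
    concat = np.concatenate([x_t, y_prev])  # [x_t, y_{t-1}]
    f_t = sigmoid(Wf @ concat + bf)         # forget gate
    i_t = sigmoid(Wi @ concat + bi)         # input gate
    a_t = np.tanh(Wa @ concat + ba)         # input activation gate
    s_t = f_t * s_prev + i_t * a_t          # new internal state
    o_t = sigmoid(Wo @ concat + bo)         # output gate
    y_t = o_t * np.tanh(s_t)                # final output
    return y_t, s_t
```

Note that both the internal state $s_t$ and the output $y_t$ are carried to the next step, matching the significant difference noted earlier.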

Gated Recurrent Unit

Advantages

  • shorter training time due to the simpler architecture
  • fewer data points needed to capture the properties
  • no internal variable

Key difference

There is no internal variable here; the previous prediction $y_{t-1}$ plays that role.

Component

  • Reset gate: uses the previous prediction as the memory and decides what portion of that memory to keep
  • Input activation gate: the same input activation gate as in the LSTM, but applied to the input together with the reset-scaled previous prediction
  • Output: combines a portion of the previous prediction with a portion of the processed input (the ratio comes from the update gate)

Summary

  • reset component
  • input activation component
  • update component
  • final output component

Formulas

  • reset gate: $r_t = \sigma(W_r [x_t, y_{t-1}] + b_r)$
  • input activation gate: $a_t = \tanh(W_a [x_t, r_t \times y_{t-1}] + b_a)$
  • update gate: $u_t = \sigma(W_u [x_t, y_{t-1}] + b_u)$
  • final output: $y_t = (1-u_t) y_{t-1} + u_t a_t$
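For comparison with the LSTM sketch, here is one GRU step in NumPy following the four formulas above; the names and shapes are again illustrative assumptions. The only carried state is the prediction $y_t$ itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step following the formulas above (shapes are illustrative).
def gru_step(x_t, y_prev, Wr, br, Wa, ba, Wu, bu):
    concat = np.concatenate([x_t, y_prev])          # [x_t, y_{t-1}]
    r_t = sigmoid(Wr @ concat + br)                 # reset gate
    concat_r = np.concatenate([x_t, r_t * y_prev])  # [x_t, r_t * y_{t-1}]
    a_t = np.tanh(Wa @ concat_r + ba)               # input activation gate
    u_t = sigmoid(Wu @ concat + bu)                 # update gate
    y_t = (1 - u_t) * y_prev + u_t * a_t            # final output
    return y_t
```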