[Deep Learning - Model Series] Recurrent Neural Networks (RNN)

Introduction

"RNN" is an umbrella term for two kinds of artificial neural networks: the recurrent neural network, which recurses over time, and the recursive neural network, which recurses over structure. In a recurrent neural network the connections between neurons form cycles along the time dimension, whereas a recursive neural network applies a similar network structure recursively to build a more complex deep network. RNN usually refers to the recurrent (time-recursive) kind. A plain recurrent network suffers from gradients that explode or vanish exponentially as the recursion deepens (the vanishing gradient problem) and therefore has difficulty capturing long-term temporal dependencies; combining it with variants such as LSTM solves this problem well. [1][2] -- Wikipedia

  • recurrent
  • recursive
  • feedforward

History and background

Before RNN

  • Memoryless models for sequences
  • HMM

RNN Overview

  • In the narrow sense, RNN refers to the vanilla RNN.
  • In the broad sense, LSTM, GRU, and the like all fall under the RNN framework.

This article uses RNN in the broad sense.

RNN comes in many forms

That figure sums it up well: most applications can be fit into this framework. For concrete applications, see karpathy.

The core:

out, hidden = lstm(input, hidden)  # PyTorch's abstraction of one RNN step

Terminology:

  • hidden: also called cell, hidden_state, or cell_state; it is the key to forgetting
  • out: also called output
  • in a stacked RNN, the output of a layer is also called hidden, since it is fed to the layer above


Most basic RNN:

output = new_state = act(W * input + U * state + B)

https://www.quora.com/How-is-the-hidden-state-h-different-from-the-memory-c-in-an-LSTM-cell

A vanilla RNN has no separate cell, so hidden = cell = out. In an LSTM, the cell state c and the hidden state h are kept separate (see the Quora link above).
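
As a concrete illustration of the basic step above, here is a minimal NumPy sketch; the sizes, the tanh activation, and the rnn_step helper are assumptions for illustration, not code from the original post.

import numpy as np

hidden_size, input_size = 4, 3
W = np.random.randn(hidden_size, input_size)   # input-to-hidden weights
U = np.random.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
B = np.zeros(hidden_size)                      # bias
act = np.tanh                                  # activation function (assumed)

def rnn_step(x, state):
    # one step of the most basic RNN: the output and the new state are the same vector
    new_state = act(W @ x + U @ state + B)
    return new_state, new_state

state = np.zeros(hidden_size)
for x in np.random.randn(5, input_size):       # a toy sequence of length 5
    out, state = rnn_step(x, state)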

High-level abstractions of RNN

An abstraction is not an implementation; it is an API. Work from the whole down to the parts: treat the RNN as a black box first, and look into the concrete implementation only when you need to.

Keras's RNN abstraction

keras.layers.RNN(cell, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False)
# return_sequences: whether to return the output for every time step (otherwise only the last output)
# return_state: whether to also return the final state(s) in addition to the output
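
As a quick check of what these flags return, a small sketch (not from the original post) using the LSTM layer, which takes the same flags; batch size 2, sequence length 5, feature size 8, and 16 units are arbitrary assumptions:

import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

x = Input(shape=(5, 8))                          # (timesteps, features); the batch dim is implicit
seq, h, c = LSTM(16, return_sequences=True, return_state=True)(x)
model = Model(x, [seq, h, c])

seq_val, h_val, c_val = model.predict(np.zeros((2, 5, 8)))
print(seq_val.shape, h_val.shape, c_val.shape)   # (2, 5, 16) (2, 16) (2, 16)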

Application example -- LSTM for binary classification

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(num_words, 128))  # input: the whole sequence of word indices
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))  # output: the output of the last cell
model.add(Dense(1, activation='sigmoid'))  # binary classification

This is the many-to-one pattern in the figure above.

  • About static graphs: the sequence length must be fixed (80 in this example); see the training sketch below.
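
A hedged sketch of how this model might be trained; the IMDB data, num_words = 20000, and the maxlen = 80 padding mirror the standard Keras IMDB example and are assumptions, not code from the original post:

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

num_words, maxlen = 20000, 80                    # vocabulary size and the fixed sequence length
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words)
x_train = pad_sequences(x_train, maxlen=maxlen)  # pad/truncate every review to 80 tokens
x_test = pad_sequences(x_test, maxlen=maxlen)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_test, y_test))
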
Application example -- LSTM-based seq2seq

from keras.layers import Input, LSTM, Dense

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
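
To make the graph above trainable, one would typically wrap it in a Model and compile it. A short sketch following the standard Keras seq2seq recipe; the fit call and data arrays are assumptions, not part of the original post:

from keras.models import Model

# the training model maps [encoder_inputs, decoder_inputs] to decoder_outputs
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# model.fit([encoder_input_data, decoder_input_data], decoder_target_data, ...)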

Application example -- LSTM-based attention seq2seq

Keras abstracts over the entire sequence, because Keras is a wrapper over the static graphs of TensorFlow and Theano.

PyTorch's RNN abstraction

Application example -- LSTM

# for a sequence of inputs
for input in inputs:
    # step through the sequence one element at a time
    out, hidden = lstm(input, hidden)

output, (h_n, c_n) = lstm(input, (h_0, c_0))

PyTorch builds a dynamic graph, so the graph unrolls step by step along with the input sequence; the sequence length does not have to be fixed in advance.
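
A self-contained PyTorch sketch combining the two calling styles above; the input size 3, hidden size 4, and the toy sequence of length 5 are assumptions for illustration:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=4)         # expects input of shape (seq_len, batch, input_size)
inputs = [torch.randn(1, 1, 3) for _ in range(5)]   # a toy sequence of length 5
hidden = (torch.zeros(1, 1, 4), torch.zeros(1, 1, 4))  # (h_0, c_0)

# step through the sequence one element at a time (dynamic graph: the length can vary)
for input in inputs:
    out, hidden = lstm(input, hidden)

# or feed the whole sequence at once
seq = torch.cat(inputs, dim=0)                      # shape (5, 1, 3)
h_0, c_0 = torch.zeros(1, 1, 4), torch.zeros(1, 1, 4)
output, (h_n, c_n) = lstm(seq, (h_0, c_0))          # output: (5, 1, 4); h_n, c_n: (1, 1, 4)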

Source code

TensorFlow's abstraction

Example -- LSTM-based language model

lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
output, state = lstm(words, state)  # the inputs and outputs here are symbolic: data enters via tf.placeholder, and the LSTM parameters are tf.Variable
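
A hedged TF 1.x sketch of unrolling this cell over a fixed number of steps, in the spirit of the classic PTB tutorial; the placeholder shape, the sizes, and the variable-scope handling are assumptions, not code from the original post (tf.contrib exists only in TensorFlow 1.x):

import tensorflow as tf  # TensorFlow 1.x

lstm_size, batch_size, num_steps = 128, 32, 20
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
state = lstm.zero_state(batch_size, tf.float32)

# one embedded word per time step; for simplicity the embedding size equals lstm_size here
words = tf.placeholder(tf.float32, [num_steps, batch_size, lstm_size])

outputs = []
with tf.variable_scope("RNN"):
    for t in range(num_steps):
        if t > 0:
            tf.get_variable_scope().reuse_variables()  # share the cell's weights across steps
        output, state = lstm(words[t], state)
        outputs.append(output)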

BasicLSTMCell source code

RNN-based variants (mainstream variations)

cell

cascade rnn

char-rnn + word rnn (Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation)
char-cnn + word rnn (Exploring the Limits of Language Modeling)

sequence labeling

Part-of-speech Tagging

attention

Uncategorized

recurrent highway network

Framework summary

By architecture, sequence modeling generally falls into:

  1. Encoder architecture
    • Sequence Classification
    • Sequence Labeling/Prediction
  2. Deep Encoder architecture
  3. Encoder-Decoder architecture
    An architecture in which both the input and the output are sequences; also called seq2seq.
    • encoder:
      • CNN: images usually use a CNN; text can also be encoded with a CNN
      • RNN:
    • decoder:
      • simple decoder: usually an LSTM is used as the decoder. The encoder encodes the "meaning" of the input sequence into a single vector of fixed dimensionality, and another deep LSTM then decodes the target sequence from that vector. Only the last output of the encoder is used; this last output is sometimes called the context vector, as it encodes context from the entire input sequence.
        Drawback: a single vector (the context vector) is under too much pressure to represent the whole input sequence (it carries the burden of encoding the entire sentence).
        Fix: use the entire sequence of encoder outputs instead.
        1. A simple way is to average the sequence of vectors, known more grandly as mean pooling.
        2. Take a linearly weighted sum of the sequence of vectors. The difficulty: where do the weights come from? The sequence has variable length, so you cannot just learn a fixed fully connected parameter matrix W as in a DNN. This is where attention, and later self-attention, came in.
      • attention decoder: LSTM + attention (see the sketch after this list)
        The decoder decides which parts of the source sentence to pay attention to.
        This relieves the encoder from the burden of having to encode all the information in the source sentence into a fixed-length vector.
        Attention allows the decoder network to "focus" on a different part of the encoder's outputs for every step of the decoder's own outputs. First we calculate a set of attention weights.
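
To make the mean-pooling versus attention contrast concrete, a minimal PyTorch sketch; the dot-product score and all sizes are illustrative assumptions, not tied to any specific paper:

import torch
import torch.nn.functional as F

seq_len, hidden = 7, 16
encoder_outputs = torch.randn(seq_len, hidden)   # one vector per input position
decoder_state = torch.randn(hidden)              # current decoder hidden state

# 1. mean pooling: every position gets the same weight 1/seq_len
context_mean = encoder_outputs.mean(dim=0)

# 2. attention: weights come from comparing the decoder state with each encoder output,
#    so they adapt to variable-length sequences (here a simple dot-product score)
scores = encoder_outputs @ decoder_state         # (seq_len,)
attn_weights = F.softmax(scores, dim=0)          # normalize to sum to 1
context_attn = attn_weights @ encoder_outputs    # weighted sum, shape (hidden,)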

TODO: a table of the inputs and outputs for each architecture.

References