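The weakness of seq2seq: no matter how long the preceding context is or how much information it carries, it is ultimately compressed into a single vector of a few hundred dimensions. The longer the context, the more information the final state vector loses.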

The core idea of attention-based models: during decoding, the model can draw on all of the information in the context, not just the final state.
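
A minimal sketch of that idea, assuming dot-product scoring and made-up toy vectors (the names `attention_context`, `encoder_states` and the numbers are illustrative, not taken from any particular paper). It contrasts the fixed last state with a context vector that is recomputed from all encoder states at each decoding step:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Weight every encoder state by its dot-product similarity to the decoder state."""
    scores = [sum(d * h for d, h in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states)) for i in range(dim)]
    return context, weights

# Toy encoder states for a 3-token source sentence (made-up numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Plain seq2seq: the decoder only ever sees the last state.
fixed_context = encoder_states[-1]

# Attention: a fresh context is built from ALL states at every decoding step.
context_a, weights_a = attention_context([4.0, 0.0], encoder_states)
context_b, weights_b = attention_context([0.0, 4.0], encoder_states)
```

Two different decoder states produce two different weight distributions over the same source, whereas `fixed_context` stays the same for the whole decode.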

Varieties of attention

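In 2014, the Google DeepMind team published "Recurrent Models of Visual Attention", which applied an attention mechanism to an RNN model for image classification.
Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate", used an attention-like mechanism to perform translation and alignment jointly in machine translation. Their work is regarded as the first to apply attention to NLP.
Attention was then widely adopted across NLP tasks built on RNN, CNN and other neural network models.
In 2017, the Google machine translation team published "Attention Is All You Need", which makes heavy use of self-attention to learn text representations.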

What attention is, and where it comes from

See Google's Transformer.

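An attention function can essentially be described as a mapping from a query and a set of key-value pairs to an output: the output is a weighted sum of the values, where each weight is computed from the query and the corresponding key.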
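
The query/key/value view can be sketched as follows. This assumes scaled dot-product scoring as used in the Transformer; the function name `attend` and the toy keys and values are hypothetical, chosen only to make the lookup behaviour visible:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Map one query and a set of key-value pairs to an output vector."""
    scale = math.sqrt(len(query))  # scaling by sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    out_dim = len(values[0])
    # The output is the weighted sum of the values.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(out_dim)]

# Two key-value pairs; keys and values may have different dimensions.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0, 0.0], [0.0, 20.0, 0.0]]

# A query close to the first key mostly retrieves the first value.
output = attend([8.0, 0.0], keys, values)
```

In effect this is a soft dictionary lookup: instead of returning the single value whose key matches, it blends all values according to how well each key matches the query.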

What is self-attention? The figure below shows self-attention.

It can be understood as attention without a target, or as attention in which the sequence treats itself as the target.
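
A rough sketch of "the sequence is its own target": the same sequence supplies the queries, the keys and the values. As a simplifying assumption, the learned projection matrices that the Transformer applies before scoring are left out, and the token vectors are made up:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(sequence):
    """Every position attends over the whole sequence, itself included."""
    dim = len(sequence[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for query in sequence:  # each position takes a turn as the query
        # Keys come from the same sequence...
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in sequence]
        weights = softmax(scores)
        # ...and so do the values.
        outputs.append([sum(w * v[i] for w, v in zip(weights, sequence)) for i in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors
outputs, weights = self_attention(tokens)
```

`weights` is a 3x3 matrix here: row i says how much position i draws from every position in the same sequence.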

Reference: A Structured Self-attentive Sentence Embedding

Self-Attention with Relative Position Representations

Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

Distance-based Self-Attention Network for Natural Language Inference

Hierarchical Attention Networks for Document Classification

It uses attention at both the word level and the sentence level.

a word sequence encoder
- Uses a GRU. The paper "Document Modeling with Gated Recurrent Neural Network…" reports that GRU outperforms LSTM for text classification.
- A CNN can also be used.
a word-level attention layer
a sentence encoder
a sentence-level attention layer
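
The two attention layers in this stack share one form: project each hidden state through a tanh layer, score it against a learned context vector, softmax the scores, and take the weighted sum. Below is a minimal sketch of that pooling step applied at both levels. Several things are assumptions made for brevity: the GRU encoders are omitted (the toy vectors stand in for their outputs), one set of projection parameters is reused for both levels where a real model would learn separate ones, and all numbers and names (`attention_pool`, `word_context`, `sentence_context`) are made up:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, weight, bias, context):
    """Collapse a list of hidden vectors into one vector via context-vector attention."""
    scores = []
    for h in hidden_states:
        # u = tanh(W h + b): a hidden representation of h
        u = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
             for row, b_i in zip(weight, bias)]
        # importance of h = similarity between u and the context vector
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, context)))
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(a * h[i] for a, h in zip(alphas, hidden_states)) for i in range(dim)]
    return pooled, alphas

# Made-up parameters; in a trained model these are learned (and differ per level).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
word_context = [1.0, 0.0]
sentence_context = [0.0, 1.0]

# A document of two sentences; each inner vector stands in for a word encoder output.
document = [
    [[2.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [1.0, 0.0], [0.5, 0.5]],
]

# Word-level attention: one vector per sentence.
sentence_vectors = [attention_pool(words, W, b, word_context)[0] for words in document]
# Sentence-level attention: one vector for the document (sentence encoder skipped).
doc_vector, sentence_alphas = attention_pool(sentence_vectors, W, b, sentence_context)
```

The resulting `doc_vector` is what a classifier layer would consume, and the attention weights at each level are the quantities one would plot to see which words and sentences the model relied on.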

Visualization analysis