Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.
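To make the linear-chain case concrete, below is a minimal illustrative sketch (not the authors' implementation) of segmentation-style structured attention: a binary linear-chain CRF over source positions whose per-position marginals p(z_i = 1) replace the independent softmax attention weights, with marginals computed by forward-backward in log space so the layer stays differentiable. All names, shapes, and the use of PyTorch here are assumptions for illustration.

```python
# Illustrative sketch of structured (segmentation) attention with a binary
# linear-chain CRF; hypothetical names and shapes, not the paper's code.
import torch

def crf_segmentation_marginals(unary, transition):
    """
    unary:      (n, 2) log-potentials for each source position being in
                state 0 (ignored) or 1 (attended), e.g. from query-key scores.
    transition: (2, 2) log-potentials for adjacent states z_{i-1} -> z_i.
    Returns:    (n,) marginals p(z_i = 1), via forward-backward in log space;
                differentiable through autograd.
    """
    n = unary.size(0)
    # Forward recursion: alpha[i][s] = log-sum over prefixes ending in state s.
    alpha = [unary[0]]
    for i in range(1, n):
        alpha.append(unary[i] +
                     torch.logsumexp(alpha[-1].unsqueeze(1) + transition, dim=0))
    # Backward recursion: beta[i][s] = log-sum over suffixes starting in state s.
    beta = [torch.zeros(2) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = torch.logsumexp(
            transition + (unary[i + 1] + beta[i + 1]).unsqueeze(0), dim=1)
    log_Z = torch.logsumexp(alpha[-1], dim=0)  # log partition function
    marginals = torch.stack(
        [alpha[i] + beta[i] - log_Z for i in range(n)]).exp()  # (n, 2)
    return marginals[:, 1]

def structured_context(values, unary, transition):
    """Context vector: expectation of the selected values under the CRF,
    analogous to soft attention but with structured marginals as weights."""
    p = crf_segmentation_marginals(unary, transition)      # (n,)
    return (p.unsqueeze(1) * values).sum(dim=0)            # (d,)
```

With independent positions (transition set to zeros) this reduces to position-wise sigmoid-style selection; non-zero transition scores let the layer prefer contiguous segments, which is the structural bias the abstract refers to.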