Applying Unsupervised Pretraining to Language Generation: Semantic Parsing + ELMo

8 minute read


For those who haven’t heard it yet, NLP’s ImageNet moment has arrived; approaches such as ULMFiT, ELMo, OpenAI GPT, and BERT have gained significant traction in the community in the last year by using the unsupervised pretraining of language models to achieve significant improvements above prior state-of-the-art results on a diverse set of language understanding tasks (including classification, commonsense reasoning, and coreference resolution, among others) and datasets. (For more on unsupervised pretraining and the motivations behind it, read the blog post about NLP’s ImageNet moment I have linked above.)

Having worked with sequence transduction models such as the classic sequence-to-sequence LSTM and the Transformer for the past year, I began to wonder if these techniques could be applied to language generation tasks, such as machine translation, summarization, semantic parsing, etc. I found this paper from Google, published “way” back in 2016, which applied a very similar technique to machine translation: it trained two language models, one for the source language and one for the target language, connected them with an encoder-decoder (sequence-to-sequence) LSTM network, and fine-tuned the entire model. It set state-of-the-art results on the benchmark WMT dataset at the time but has since been surpassed by the Transformer and its variations. Immediately, it seemed to me that a new state-of-the-art machine translation model could be built by utilizing the same approach but with a Transformer for the language model (as in BERT and OpenAI GPT) and encoder-decoder layers. Unfortunately, I do not have access to enough compute to compete with SOTA NMT models, which can take weeks to train even with 8 GPUs (I’m stuck with one GPU, maybe two), so I decided to instead try a similar approach for semantic parsing (and leave the unsupervised pretraining + Transformer for NMT idea as a free, possibly SOTA-setting idea for the NMT folks!).

Using AllenNLP and ELMo for Semantic Parsing

I decided to use the AllenNLP library (which sits on top of PyTorch) for this project. I have been using the framework since the beginning of the summer and have really enjoyed using it as a general deep learning framework, but AllenNLP has more NLP-specific functionality and includes ELMo, OpenAI GPT, and BERT as built-in modules. Further, AllenNLP models are built to be hyper-modular and the framework is particularly well-suited for experimenting: swapping pieces of models in and out (such as normal vs. ELMo embeddings) only requires changing a few lines of a JSON configuration file. I have uploaded all the data, configuration files, and scripts needed to reproduce my experiments to a GitHub repository.

My approach was simple: replace the randomly initialized embeddings in the bidirectional LSTM encoder with pretrained ELMo features (leaving the embeddings for the decoder randomly initialized). I used the exact same LSTM sequence-to-sequence architecture across experiments, except the ELMo version provided 1124-dimensional pretrained embeddings as input to the encoder as opposed to randomly initialized 100-dimensional embeddings, and I varied training settings slightly across datasets. Specifics may be found in the configuration files in the published repository.

Aside: The Difficulty with Semantic Parsing

Reproducibility and comparability of experimental results is a big concern in the broader deep learning community, but nowhere have I seen it be such a problem as in semantic parsing. Last year, while trying to reproduce Dong and Lapata’s seq2tree architecture, I realized that other papers (such as Jia & Liang’s paper that added data recombination and a copy mechanism to the seq2seq architecture) used different versions of the same dataset: some used lambda calculus to represent the logical output, while some used Prolog. Further, some had anonymized variable names to standard forms (such as changing specific names like “Dallas” to a token representing its type, such as “city_0”), while others left them as is. This problem has been more specifically addressed in more recent papers such as “Improving Text-to-SQL Evaluation Methodology,” which additionally points out that “the current division of data into training and test sets measures robustness to variations in the way questions are asked, but only partially tests how well systems generalize to new queries.”

For these experiments, I used the same versions of ATIS, GEO, and JOBS as Dong and Lapata in an attempt to have a comparable result. However, problems lie beyond the version of the datasets used; these semantic parsing datasets are very small (a few hundred to a few thousand examples) and are thus particularly sensitive to changes in seeding, random initialization, regularization, model implementation, etc. I found this out the hard way last year when I ended up having to write a line-for-line, op-for-op conversion of Dong and Lapata’s lua/torch code to python/PyTorch in order to get within even two percentage points of their reported accuracies. All of this makes it particularly difficult to interpret results on these datasets, but I nonetheless believe my experiments show some promising results.

Results and Analysis

S2S + attention75.262.558.6
S2S + attention + ELMo79.071.174.3
S2S + attention (Dong and Lapata)84.284.687.1

The above table shows the sequence-level accuracies of the baseline sequence-to-sequence model as well as the version with incorporated ELMo features. I have also included the reported sequence-to-sequence results from Dong and Lapata’s paper for comparison; I was not able to reproduce their results for my baseline. There are a few possible explanations for this discrepancy. The foremost is likely my use of AllenNLP’s implementation of the sequence-to-sequence model, which is described as being a “simple” sequence-to-sequence model. AllenNLP has more extensively tested models for other tasks in NLP but not for sequence transduction tasks that require encoder-decoder architectures. For example, it only uses dropout within the LSTM encoder and not after embedding layers, between the encoder and decoder, or before the projection layer, as other implementations have done. This makes the model harder to regularize, which is paramount to good performance on small datasets such as the three semantic parsing datasets used here.

Nonetheless, I believe my results give strong empirical evidence in favor of the effectiveness of utilizing pretrained models for semantic parsing. Although none of my models were able to match the baseline reported by Dong and Lapata, the models with ELMo strongly outperformed my own baseline, which is the more relevant comparison, as differences in implementations and training are controlled for. The ELMo models see absolute gains in performance of 3.8%, 8.6%, and 15.7% on ATIS, GEO, and JOBS, respectively. This supports the trend reported in the literature of pretraining being most important for smaller datasets, as ATIS is the largest dataset, while JOBS is the smallest. These results indicate that incorporating features that have been pretrained as a language model into a sequence-to-sequence model is a simple way to significantly increase performance; the baseline models were tuned on the validation set to optimize performance, but the ELMo models used exactly the same architecture and training settings with the addition of ELMo features, no further tuning, and noticeably improved performance across the board.


These results certainly aren’t publication-worthy, but I think they adequately demonstrate the promise of improving performance on language generation tasks via unsupervised pretraining. Moreover, I wanted to write up these results and publish my code in case someone else might find my work useful for further exploration. Personally, I want to experiment some more and test out other pretrained options such as OpenAI GPT and BERT to see how they compare to ELMo, but I admit that my exploration has been somewhat hampered by both AllenNLP’s seq2seq implementation and my choice of task in semantic parsing as a result of limited compute. I think the use of pretrained models will become an increasingly important part of NLP, in both language generation and language understanding tasks, and I’m very excited to see where the research community takes this emerging trend next.