Latent Tree Models Learned on Word Embeddings, Part 2

In my previous post, I introduced the idea of learning a tree structure on word embedding data using an agglomerative clustering algorithm and applying the learned modified latent tree model (MLTM) as a feature extractor on text data. I examined the power of the extracted features on a simple text classification task using news article texts, and found that the MLTM features provided strong classification accuracy when used as inputs to a simple MLP classifier.

Extending this idea, I’ve been experimenting with augmenting a standard neural machine translation (NMT) model with these features. Machine translation is the task of automatically translating a sentence (or any set of text, really) from one natural language to another (e.g. English to French). While deep learning has made great progress on this task, more improvement will be necessary to get automatic machine translators to match expert human ability.

Recurrent Neural Networks

Until a few years ago, recurrent neural networks (RNNs) were generally considered the best tool for machine translation (along with convolutional neural networks). They’ve been replaced by a newer architecture known as the Transformer, which does away with recurrence altogether and outperforms RNNs on standard benchmarks. For now, though, I’ve decided to stick with an RNN to test my MLTM feature set.

When I began diving into deep learning for natural language processing (NLP), one of the books I bought was Natural Language Processing with PyTorch by Rao & McMahan. It has some nice Python example code, including the full logic required to run an NMT simulation. A Jupyter notebook based on the material in chapter 8 can be found here, which I’ll be using as a template for my simulations. The ANN architecture is an Encoder-Decoder GRU RNN with an Attention mechanism.
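In outline, the model pairs a GRU encoder with a GRU decoder. The sketch below is a simplified stand-in for the chapter 8 architecture (it omits bidirectionality, attention, and sampling), and all class and argument names are my own rather than Rao & McMahan’s.

```python
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Embed the source tokens and encode the whole sequence with a GRU."""
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)

    def forward(self, source_tokens):
        embedded = self.embedding(source_tokens)    # (batch, src_len, emb_size)
        states, final_hidden = self.gru(embedded)   # per-step states feed the attention mechanism
        return states, final_hidden

class GRUDecoderStep(nn.Module):
    """One decoding step: consume the previous target token, update the hidden state, predict."""
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.gru_cell = nn.GRUCell(emb_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):
        hidden = self.gru_cell(self.embedding(prev_token), hidden)
        return self.classifier(hidden), hidden
```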

Attention

Earlier RNN models proved capable of translating short sentences with good accuracy, but their performance fell off quickly when faced with longer strings of text. This is due to the logic of the encoder-decoder structure: the final hidden state from the encoder is used as the initial input to the decoder. As the encoder steps forward over the elements of a sequence, the information stored in the hidden state from the first words in a sentence decays.

The solution to this problem is a mechanism known as Attention. Built into the decoder, it computes a context vector at each step using the current decoder hidden state and all of the encoder hidden states from the sequence. You can think of the context vector as a representation of the most relevant encoder hidden state(s). The context vector is concatenated to the target input vector or hidden state and fed into the next layer. You can find a deeper explanation of the Attention mechanism here.
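Conceptually, the step looks something like the dot-product variant below. Rao & McMahan use their own attention function in the notebook, so treat this purely as an illustration of the idea.

```python
import torch
import torch.nn.functional as F

def attention_context(decoder_hidden, encoder_states):
    """Weighted sum of encoder hidden states, weighted by similarity to the decoder state.

    decoder_hidden: (batch, hidden)          current decoder hidden state
    encoder_states: (batch, src_len, hidden) all encoder hidden states
    """
    # Score each encoder state against the current decoder state (dot-product attention).
    scores = torch.bmm(encoder_states, decoder_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)
    # The context vector summarizes the most relevant encoder state(s).
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)        # (batch, hidden)
    return context, weights
```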

Attention + MLTM

My initial idea for incorporating the MLTM features into an NMT model is to concatenate the features to the context vector within the decoder, or rather, to concatenate a projected version of the features. The projection is done by adding a Linear layer in PyTorch to change the dimensionality of the MLTM features to match the context vector. One thing I find a bit awkward about this approach is that the MLTM features are static – they don’t change over each step of the sequence like the other values. It’s easy to implement, however, and doesn’t add much processing cost.
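A minimal sketch of that projection-and-concatenation step is below, assuming the decoder already produces a context vector at each step. The class name, argument names, and dropout placement are mine and only illustrate the approach (dropout on this projection comes up again in the results).

```python
import torch
import torch.nn as nn

class MLTMContextAugmenter(nn.Module):
    """Project the static MLTM binary features and append them to the attention context."""
    def __init__(self, num_mltm_features, context_size, dropout_p=0.0):
        super().__init__()
        self.projection = nn.Linear(num_mltm_features, context_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, context_vector, mltm_features):
        # The MLTM features describe the whole source sentence, so the same vector
        # is reused at every decoder step, unlike the context vector.
        projected = self.dropout(self.projection(mltm_features.float()))
        return torch.cat([context_vector, projected], dim=-1)
```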

Simulations

I compare the baseline NMT model from Rao & McMahan to my augmented version, using their example code and accompanying English/French data. All code and some of the necessary data can be found here.

I generally used the same parameters as found in Rao & McMahan’s example, though I reduced the number of epochs from 100 to 50 to save time (from the logs, the validation loss appears to saturate around 20-30 epochs).

(Image: parameters for the simulation with no tree pruning.)

Dataset

The data consists of pairs of English/French sentences, sourced from the Rao & McMahan repository. Because I’m running my simulations on cheap hardware, I follow their example and slice out a very narrow subset of the data: all English sentences begin with “i am,” “he is,” “she is,” “they are,” “you are,” or “we are.” My subset is actually narrower than theirs, only 3,375 pairs in total, because I drop sentences that begin with the corresponding contractions (e.g. “he’s”).
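The filtering looks roughly like the snippet below (illustrative only; the actual loading code is in the linked repository, and the example pairs here are made up).

```python
# Keep only pairs whose English side starts with one of the allowed prefixes,
# excluding contracted forms such as "he's" or "i'm".
PREFIXES = ("i am ", "he is ", "she is ", "they are ", "you are ", "we are ")

raw_pairs = [
    ("I am happy.", "Je suis heureux."),
    ("He's tired.", "Il est fatigué."),   # dropped: contraction
]

pairs = [(en, fr) for en, fr in raw_pairs
         if en.lower().startswith(PREFIXES)]
```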

Results

Scoring machine translation is difficult, as many sentences with equivalent meanings can be phrased in different ways, and not all words translate directly across languages. Currently, the most popular scoring method is BLEU, but my understanding is that it has a number of flaws, and may be replaced with something better in the future. To keep things simple, for now I’m using the word accuracy computation function provided by Rao & McMahan as the performance metric.
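I won’t reproduce their function here, but a token-level word accuracy can be computed along roughly these lines (an illustration of the idea, not the exact metric code from the book).

```python
def word_accuracy(predicted_tokens, target_tokens, mask_token="<MASK>"):
    """Fraction of non-masked target positions where the predicted word matches exactly."""
    matches, total = 0, 0
    for pred, target in zip(predicted_tokens, target_tokens):
        if target == mask_token:
            continue  # ignore padding/mask positions
        total += 1
        matches += int(pred == target)
    return matches / max(total, 1)

print(word_accuracy(["je", "suis", "ici"], ["je", "suis", "là"]))  # 0.666...
```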

Table I shows word accuracy for various values of a pruning parameter on the MLTM, both with and without dropout on the linear transformation from the MLTM binary features to the context augmentation vector.

Min Descendants    | MLTM Features | Dropout | Accuracy
Baseline NMT model | N/A           | N/A     | 43.25%
32                 | 57            | None    | 49.23%
16                 | 123           | None    | 45.68%
8                  | 251           | None    | 46.02%
4                  | 501           | None    | 46.18%
None               | 1672          | None    | 49.65%
32                 | 57            | 0.2     | 47.53%
16                 | 123           | 0.2     | 47.33%
8                  | 251           | 0.2     | 46.80%
4                  | 501           | 0.2     | 47.78%
None               | 1672          | 0.2     | 51.19%

Table I. Word accuracy for the baseline model and the MLTM-augmented models at each pruning/dropout setting.

While the results are a bit noisy, dropout is generally beneficial for the MLTM feature projection (it hurts only for the smallest feature set), and the best performance comes from the model with no pruning. Every parameter combination outperforms the baseline, with the best performer providing an almost 8% absolute improvement in word accuracy.

Going Forward

While the preliminary results for an MLTM-augmented NMT model look good, they come from a very limited dataset. Furthermore, the base NMT model is no longer considered state-of-the-art. I’ll be looking to expand the dataset and compare against a Transformer, the current state-of-the-art architecture. I also want to look at alternative performance metrics such as BLEU.

Among ways to further improve the model, I have a couple other ideas:

  • The modified latent tree model could be extended to a modified latent random forest model, in which each tree uses a randomized bootstrap sampling of the dimensions of the word embeddings (a quick sketch of this sampling follows the list).
  • As I mentioned in part 1, the MLTM architecture is similar to an MLP with fixed weights and step function activations. I’d like to try injecting the tree itself into the neural model and allowing the training procedure to optimize the weights. This could lead to a “neural latent tree model” (NLTM).
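As a rough sketch of the first idea, each tree in the forest would be learned on its own bootstrap sample of embedding dimensions, something like the following (purely illustrative; nothing here is implemented yet).

```python
import numpy as np

def sample_embedding_dimensions(embedding_dim, n_trees, seed=0):
    """For each tree, draw a bootstrap sample (with replacement) of embedding dimensions."""
    rng = np.random.default_rng(seed)
    return [rng.choice(embedding_dim, size=embedding_dim, replace=True)
            for _ in range(n_trees)]

# Each tree would then be built from word embeddings restricted to its sampled dimensions.
dims_per_tree = sample_embedding_dimensions(embedding_dim=300, n_trees=10)
```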
