Latent Tree Models Learned on Word Embeddings, Part 3

Continuing my exploratory studies on latent tree models within NLP (Part 1, Part 2), I’ve run a few more simulations on a simplified English->French translation dataset. In my most recent post, I looked at a few different feature sets and hyperparameters to compare a simple RNN neural machine translation (NMT) model with an augmented version that has modified latent tree model (MLTM) features injected into the context vector of the decoder module. For this post, I’ve expanded the dataset somewhat (it’s still a very small, narrow set) and looked at the effect of using pretrained word embeddings in the model.
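As a quick refresher on the augmentation (covered in more detail in the earlier posts), the MLTM features for the current decoding step are combined with the attention context before it is fed into the decoder. The sketch below only illustrates that idea under the assumption that the combination is a simple concatenation; the names context, mltm_feats, and the dimensions are hypothetical, not the exact code from the repository.

```python
import torch

batch_size, hidden_size, mltm_dim = 32, 64, 16  # hypothetical dimensions

# Attention context vector produced by the decoder's attention module.
context = torch.randn(batch_size, hidden_size)

# Precomputed MLTM feature vector for the current decoding step
# (assumption: looked up from the latent tree model from the earlier posts).
mltm_feats = torch.randn(batch_size, mltm_dim)

# Inject the MLTM features by concatenating them onto the context vector,
# which the decoder then consumes as usual.
augmented_context = torch.cat([context, mltm_feats], dim=1)
```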

Simulations

The methods I’m using here are similar to those in my previous post. I’m now using the full simplified dataset from Chapter 8 of Rao & McMahan, and I’m reporting results with the more standard BLEU score (BLEU-4, an unweighted average of unigram through 4-gram precisions). While playing around with these tests, I found that the random initialization of weights within the models had a larger impact on the final results than I expected, so I’ve run a set of tests with different random number seeds to look at the distribution of scores. I believe it’s common practice to use pretrained word embeddings in these kinds of models, while still allowing the embedding weights to be fine-tuned on the data, so I’ve run two sets of tests to compare performance with pretrained embeddings against starting from random initializations.
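For reference, here’s a minimal sketch of how the seeded runs and the BLEU-4 scoring could be wired up with NLTK’s corpus_bleu (uniform 1- through 4-gram weights). The train_and_evaluate call is a hypothetical stand-in for the actual training/evaluation loop in the linked repository.

```python
import random

import numpy as np
import torch
from nltk.translate.bleu_score import corpus_bleu


def set_seed(seed):
    """Seed all RNGs so the random weight initialization is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def bleu4(references, hypotheses):
    """Corpus BLEU-4 with an unweighted average of 1- through 4-gram precisions."""
    # references: one list of reference token lists per sentence;
    # hypotheses: one predicted token list per sentence.
    return 100 * corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))


seeds = [1337, 3224, 861, 9449, 19624, 19112, 14409, 12554, 14775, 6003]
for seed in seeds:
    set_seed(seed)
    # train_and_evaluate(seed) is a stand-in for the real training/evaluation
    # loop in the linked repository; it would return (references, hypotheses)
    # for scoring with bleu4().
    ...
```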

All code and data for these simulations can be found here. I’ve generally used the same hyperparameters as in Rao & McMahan, but I’ve changed the embedding dimensions to match the dimension of the GloVe data (50d) I’ve been using.
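As an illustration of the embedding setup, below is a rough sketch of how a 50d GloVe file could be used to initialize a trainable PyTorch embedding layer. The file name, the vocab mapping, and the random fallback for out-of-vocabulary words are assumptions for the sketch, not the exact code from the repository.

```python
import numpy as np
import torch
import torch.nn as nn


def glove_embedding_matrix(glove_path, vocab, dim=50):
    """Build a (vocab_size x dim) matrix from a GloVe text file.

    Words missing from GloVe keep a small random initialization, and the
    resulting embedding layer stays trainable so it can be fine-tuned.
    """
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in vocab:
                matrix[vocab[word]] = np.asarray(values, dtype=np.float32)
    return torch.from_numpy(matrix)


# vocab is a hypothetical {token: index} mapping built from the training data;
# "glove.6B.50d.txt" is the standard 50d GloVe download.
# embedding = nn.Embedding.from_pretrained(
#     glove_embedding_matrix("glove.6B.50d.txt", vocab), freeze=False)
```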

Results

Table I compares BLEU scores for the baseline RNN-GRU model with attention and for my MLTM-feature-augmented model, with both using randomly initialized word embedding weights. Table II shows the results for the same models initialized with pretrained GloVe embedding weights, the same embeddings I used to generate the MLTM. I was concerned that the performance improvement I saw from my MLTM feature enhancement might have been due solely to the introduction of word embeddings trained on a much larger dataset, but I still see a significant improvement in translation performance for the MLTM model even when both models make use of pretrained embeddings.

RNG Seed    Baseline BLEU    MLTM BLEU
1337        48.39            57.24
3224        50.86            57.91
861         49.52            57.52
9449        50.88            57.71
19624       50.64            57.81
19112       47.68            59.41
14409       45.79            57.67
12554       51.51            56.78
14775       46.18            57.91
6003        51.36            58.22
Average     49.28            57.82
Table I. BLEU scores without pretrained embeddings.

RNG Seed    Baseline BLEU    MLTM BLEU
1337        55.07            57.91
3224        54.37            60.84
861         52.95            60.89
9449        51.29            60.11
19624       55.86            60.70
19112       55.34            60.33
14409       56.20            59.87
12554       44.68            58.74
14775       54.53            60.24
6003        49.70            59.81
Average     53.00            59.94
Table II. BLEU scores with pretrained embeddings.

Note that these BLEU scores are much higher than you would normally see for machine translation; this is primarily because the dataset is limited to sentences of a very specific format.

Going Forward

I mentioned some ideas for improving my feature enhancements in the previous post, but for now my plan is to move the current model to larger datasets with longer, more complex sentences. I’ve run many of the initial tests for the MLTM features on a Lenovo laptop with a dual-core Intel i3 CPU (I’m clearly doing serious ML research here 🤣), but I’ve recently received a major hardware upgrade, which I will talk about a little in my next post…
