Latent Tree Models Learned on Word Embeddings, Part 3

Continuing my exploratory studies on latent tree models within NLP (Part 1, Part 2), I’ve run a few more simulations on a simplified English->French translation dataset. In my most recent post, I looked at a few different feature sets and hyperparameters to compare a simple RNN neural machine translation (NMT) model with an augmented version that has modified latent tree model (MLTM) features injected into the context vector of the decoder module. For this post, I’ve expanded the dataset somewhat (it’s still a very small, narrow set) and looked at the effect of using pretrained word embeddings in the model.

Simulations

The methods I’m using here are similar to those in my previous post. I’m now using the full simplified dataset from Chapter 8 of Rao & McMahan, and I’m reporting results with the more standard BLEU score (BLEU-4, an unweighted average over unigram through 4-gram precisions). As I played around with these tests, I found that the random initialization of weights within the models had a larger impact on the final results than I expected, so I’ve run a set of tests with different random number seeds to look at the distribution of scores. I believe it’s common practice to use pretrained word embeddings in these kinds of models, while still allowing the embedding weights to be fine-tuned on the data, so I’ve run two sets of tests comparing pretrained embeddings against random initialization.
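For concreteness, here’s roughly how that flavor of BLEU-4 can be computed with NLTK (a sketch with made-up example sentences; the smoothing choice is my own assumption, not necessarily what the evaluation code uses):

```python
# BLEU-4 with uniform n-gram weights (unweighted average of 1- through 4-gram precisions).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["ils", "sont", "contents", "."]]]  # one list of reference translations per sentence
hypotheses = [["ils", "sont", "heureux", "."]]     # model outputs, tokenized

score = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,  # keeps short sentences from scoring zero
)
print(f"BLEU-4: {100 * score:.2f}")
```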

All code and data for these simulations can be found here. I’ve generally used the same hyperparameters as found in Rao & McMahan, but I’ve changed the embedding dimensions to match the dimension of the GloVe data (50d) I’ve been using.
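For reference, loading pretrained vectors while keeping them trainable is a one-liner in PyTorch (a sketch; the variable names and sizes are illustrative, and the GloVe matrix would normally be read from the 50d GloVe file rather than randomly generated):

```python
import numpy as np
import torch
import torch.nn as nn

vocab_size, embedding_dim = 3000, 50  # 50 matches the GloVe vectors used for the MLTM

# Stand-in for a matrix filled from the GloVe file, one row per vocabulary word
# (out-of-vocabulary words would get randomly initialized rows).
glove_matrix = np.random.normal(scale=0.1, size=(vocab_size, embedding_dim)).astype("float32")

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(glove_matrix),
    freeze=False,  # leave the weights trainable so they can be fine-tuned on the data
)
```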

Results

Table I compares BLEU scores for the baseline RNN-GRU model with attention and for my MLTM-feature-augmented model, both using randomly initialized word embedding weights. Table II shows the results for the models with pretrained embedding weights from the GloVe embeddings – the same data I used to generate the MLTM. I was concerned that the performance improvement I saw from my MLTM feature enhancement might have been due solely to the introduction of word embeddings from a much larger dataset, but I still see a significant improvement in translation performance for the MLTM model even when both models make use of pretrained embeddings.

| RNG Seed | Baseline BLEU | MLTM BLEU |
|----------|---------------|-----------|
| 1337     | 48.39         | 57.24     |
| 3224     | 50.86         | 57.91     |
| 861      | 49.52         | 57.52     |
| 9449     | 50.88         | 57.71     |
| 19624    | 50.64         | 57.81     |
| 19112    | 47.68         | 59.41     |
| 14409    | 45.79         | 57.67     |
| 12554    | 51.51         | 56.78     |
| 14775    | 46.18         | 57.91     |
| 6003     | 51.36         | 58.22     |
| Average  | 49.28         | 57.82     |
Table I. BLEU scores without pretrained embeddings.

| RNG Seed | Baseline BLEU | MLTM BLEU |
|----------|---------------|-----------|
| 1337     | 55.07         | 57.91     |
| 3224     | 54.37         | 60.84     |
| 861      | 52.95         | 60.89     |
| 9449     | 51.29         | 60.11     |
| 19624    | 55.86         | 60.70     |
| 19112    | 55.34         | 60.33     |
| 14409    | 56.20         | 59.87     |
| 12554    | 44.68         | 58.74     |
| 14775    | 54.53         | 60.24     |
| 6003     | 49.70         | 59.81     |
| Average  | 53.00         | 59.94     |
Table II. BLEU scores with pretrained embeddings.

Note that these BLEU scores are much higher than you would normally see; this is primarily because the dataset is limited to sentences of a very specific format.

Going Forward

I mentioned some ideas for improving my feature enhancements in the previous post, but for now my plan is to move the current model to larger datasets with longer, more complex sentences. I’ve run many of the initial tests for the MLTM features on a Lenovo laptop with a dual-core Intel i3 CPU (I’m clearly doing serious ML research here 🤣), but I’ve recently received a major hardware upgrade, which I will talk about a little in my next post…

Latent Tree Models Learned on Word Embeddings, Part 2

In my previous post, I introduced the idea of learning a tree structure on word embedding data using an agglomerative clustering algorithm and applying the learned modified latent tree model (MLTM) as a feature extractor on text data. I examined the power of the extracted features on a simple text classification task using news article texts, and found the MLTM features provided for strong classification accuracy when used as inputs to a simple MLP classifier.

Extending this idea, I’ve been experimenting with augmenting a standard neural machine translator (NMT) with these features. Machine translation is the task of automatically translating a sentence (or any set of text, really) from one natural language to another (e.g. English to French). While deep learning has made great progress on this task, more improvement will be necessary to get automatic machine translators to match expert human ability.

Recurrent Neural Networks

Until a few years ago, recurrent neural networks (RNNs) were generally considered the best tool for machine translation (along with convolutional neural networks). They’ve since been replaced by a newer architecture known as the Transformer, which does away with recurrence altogether and outperforms RNNs on standard benchmarks. For now, though, I’ve decided to stick with an RNN to test my MLTM feature set.

When I began diving into deep learning for natural language processing (NLP), one of the books I bought was Natural Language Processing with PyTorch by Rao & McMahan. It has some nice Python example code, including the full logic required to run an NMT simulation. A Jupyter notebook based on the material in Chapter 8 can be found here, which I’ll be using as a template for my simulations. The ANN architecture is an Encoder-Decoder GRU RNN with an Attention mechanism.

Attention

Earlier RNN models proved capable of translating short sentences with good accuracy, but their performance fell off quickly when faced with longer strings of text. This is due to the logic of the encoder-decoder structure: the final hidden state from the encoder is used as the initial input to the decoder. As the encoder steps forward over the elements of a sequence, the information stored in the hidden state from the first words in a sentence decays.

The solution to this problem is a mechanism known as Attention. It’s built inside the decoder: after each step, a context vector is computed from the current decoder hidden state and all of the encoder hidden states from the sequence. You can think of the context vector as a representation of the most relevant encoder hidden state(s). The context vector is concatenated to the target input vector or hidden state and fed into the next layer. You can find a deeper explanation of the Attention mechanism here.
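As a rough sketch of the idea (simple dot-product scoring for illustration – the attention function in the notebook may be parameterized differently):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_hidden, encoder_states):
    """Toy dot-product attention.

    decoder_hidden: (batch, hidden)          current decoder hidden state
    encoder_states: (batch, seq_len, hidden) all encoder hidden states for the sequence
    """
    # score each encoder hidden state against the decoder hidden state
    scores = torch.bmm(encoder_states, decoder_hidden.unsqueeze(2)).squeeze(2)  # (batch, seq_len)
    weights = F.softmax(scores, dim=1)
    # the context vector is the attention-weighted sum of encoder hidden states
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)        # (batch, hidden)
    return context, weights
```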

Attention + MLTM

My initial idea for incorporating the MLTM features into an NMT model is to concatenate the features to the context vector within the decoder – or rather, to concatenate a projected version of the features. The projection is done by adding a Linear layer in PyTorch to change the dimensionality of the MLTM features to match the context vector. One thing I find a bit awkward about this approach is that the MLTM features are static – they don’t change over each frame of the sequence like the other values. It’s easy to implement, however, and doesn’t add much processing cost.
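In code, the augmentation boils down to something like this (a sketch with my own naming; the real decoder also handles the GRU step, the output classifier, and so on):

```python
import torch
import torch.nn as nn

class MLTMContextAugmenter(nn.Module):
    """Project the static MLTM features and concatenate them to the attention context vector."""

    def __init__(self, mltm_dim, context_dim, dropout_p=0.2):
        super().__init__()
        self.project = nn.Linear(mltm_dim, context_dim)  # match the context vector dimension
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, context, mltm_features):
        # context:       (batch, context_dim) attention context at the current decoder step
        # mltm_features: (batch, mltm_dim) binary MLTM features, fixed for the whole sentence
        projected = self.dropout(self.project(mltm_features.float()))
        return torch.cat([context, projected], dim=1)    # passed on to the next decoder layer
```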

Simulations

I compare the baseline NMT model from Rao & McMahan to my augmented version, using their example code and accompanying English/French data. All code and some of the necessary data can be found here.

I generally used the same parameters as found in Rao & McMahan’s example, though I reduced the number of epochs from 100 to 50 to save time (from the logs, the validation loss appears to saturate around 20-30 epochs).

Parameters for simulation with no tree pruning

Dataset

The data consists of pairs of English/French sentences, sourced from the Rao & McMahan repository. Because I’m running my simulations on cheap hardware, I’ve followed their example and sliced out a very narrow subset of the data: all English sentences begin with “i am,” “he is,” “she is,” “they are,” “you are,” or “we are.” My subset is actually narrower than theirs, consisting of only 3375 pairs total, because I ignore sentences that begin with the matching contractions (e.g. “he’s”).
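The filtering itself is simple; roughly (a sketch with toy data, not the actual preprocessing script):

```python
# Keep only English sentences that start with one of the allowed, non-contracted prefixes.
PREFIXES = ("i am ", "he is ", "she is ", "they are ", "you are ", "we are ")

all_pairs = [
    ("i am happy .", "je suis content ."),
    ("he's tired .", "il est fatigue ."),  # dropped: uses a contraction
]

pairs = [(en, fr) for en, fr in all_pairs if en.lower().startswith(PREFIXES)]
```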

Results

Scoring machine translation is difficult, as many sentences with equivalent meanings can be phrased in different ways, and not all words translate directly across languages. Currently, the most popular scoring method is BLEU, but my understanding is that it has a number of flaws, and may be replaced with something better in the future. To keep things simple, for now I’m using the word accuracy computation function provided by Rao & McMahan as the performance metric.

Table I shows word accuracy for various values of a pruning parameter on the MLTM, both with and without dropout on the linear transformation from the MLTM binary features to the context augmentation vector.

| Min Descendants    | MLTM Features | Dropout | Word Accuracy |
|--------------------|---------------|---------|---------------|
| Baseline NMT model | N/A           | N/A     | 43.25%        |
| 32                 | 57            | None    | 49.23%        |
| 16                 | 123           | None    | 45.68%        |
| 8                  | 251           | None    | 46.02%        |
| 4                  | 501           | None    | 46.18%        |
| None               | 1672          | None    | 49.65%        |
| 32                 | 57            | 0.2     | 47.53%        |
| 16                 | 123           | 0.2     | 47.33%        |
| 8                  | 251           | 0.2     | 46.80%        |
| 4                  | 501           | 0.2     | 47.78%        |
| None               | 1672          | 0.2     | 51.19%        |
Table I.

While the results are a bit noisy, dropout is generally beneficial for the MLTM feature projection (unhelpful for the smallest number of features), and the best performance comes from the model with no pruning. Every parameter combination outperforms the baseline, with the best performer providing an almost 8% absolute improvement.

Going Forward

While the preliminary results for an MLTM-augmented NMT model look good, they come from a very limited dataset. Furthermore, the base NMT model is no longer considered state-of-the-art. I’ll be looking to expand the dataset and compare against a Transformer, the current state-of-the-art architecture. I also want to look at alternative performance metrics such as BLEU.

Among ways to further improve the model, I have a couple other ideas:

  • The modified latent tree model could be extended to a modified latent random forest model, in which each tree uses a randomized bootstrap sampling of the dimensions of the word embeddings.
  • As I mentioned in part 1, the MLTM architecture is similar to a MLP with fixed weights and step function activations. I’d like to try injecting the tree itself into the neural model and allowing the training procedure to optimize the weights. This could lead to a “neural latent tree model” (NLTM).

Latent Tree Models Learned on Word Embeddings, Part 1

Back in June of 2008, I successfully defended my doctoral dissertation and received my PhD in Electrical and Computer Engineering from Marquette University. My field of research in grad school was automatic speech recognition, with an emphasis on ASR under noisy conditions. After I graduated, I started working for Crabel Capital Management. In my first months, I tried using some of the machine learning (ML) approaches I learned in college to develop automated trading strategies, but I never succeeded in making that work. I’m still with Crabel, and up until recently, I’d almost completely abandoned the field of machine learning.

I’ve become interested in doing ML research again, and have acquainted myself with PyTorch, Facebook’s deep learning toolkit for Python. While there’s been a ton of progress in ML since I left it more than a decade ago, I don’t believe deep neural networks – not by themselves, anyway – will lead to artificial general intelligence (AGI), or even robust AI. As I’ve been reading papers and articles on ML/AI, I came across a concept known as latent tree models (LTMs) that I found pretty interesting, and I decided to play around with this model on basic datasets.

Latent Tree Models

In a latent tree model, the leaves of the tree correspond to observed variables, such as words in English, and the non-leaf nodes correspond to latent variables, which can be interpreted as higher-level concepts. In some of my recent work at Crabel, I’ve used the sklearn toolkit to perform agglomerative clustering on data, and one of my first thoughts when I read about LTMs was that building a tree via agglomerative clustering on word embeddings seemed like a natural fit. Word embeddings are numerical feature vectors assigned to words within a dictionary, and are typically learned through some form of deep learning.

I was unable to find any papers that took this exact approach, though I did find one that looked at LTMs on text data using word co-occurrences. Here’s a good survey paper on LTMs for anyone interested.

I wrote some Python code to learn the structure of an LTM using word embeddings from Stanford NLP’s GloVe project, which is open-source and freely available. You can find my code and the data referenced below here on GitHub.

A standard LTM is a particular type of Bayesian network (a causal tree), where each edge has a link matrix of probabilities associated with it. The code I’ve written modifies the standard form to ignore the link matrices, and simply treat every parent node (latent variable) as a binary random variable (proposition), using a form of OR function on its children, according to the rule:

β_parent = 1 if β_c = 1 for any child c of the node; otherwise β_parent = 0

In contrast, in a typical causal tree, within the upward message-passing algorithm, the likelihood values are multiplied, which would be more analogous to an AND function. Another way to view my logic is to see it as an ANN with a tree structure, where the inputs are the words, all weights are fixed at 1, the bias for each node is fixed at -0.5, and the activation function is the step function originally seen in perceptrons. The output of the network is every hidden node, which forms the latent feature vector. I refer to this model as a modified LTM (MLTM) in the code.

Simulations

To test my approach, I decided to start with a simple text classification task. In this task, we have a dataset of text with class labels for various categories. The goal is to train a classifier with labeled data to be able to determine the correct class for unlabeled test data.

Dataset

The dataset I chose contains brief news articles from AG News, labeled with one of four news topics. I performed some filtering and pre-processing to clean it up before running it through my classification code.

Methods

The comparison baseline classifier I used is a naive Bayes classifier. It’s an extremely simple method, but it works quite well on this type of task – many early spam filters were built on naive Bayes.
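For context, a baseline along these lines takes only a few lines of scikit-learn (a sketch with toy data; I’m not claiming it matches the exact baseline configuration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the news article texts and their topic labels.
texts = ["stocks rally after strong earnings report", "home team wins the championship game"]
labels = ["business", "sports"]

baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(texts, labels)
print(baseline.predict(["quarterly earnings beat estimates"]))
```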

To build the MLTM, I use the AgglomerativeClustering class from the sklearn package. Here’s some sample code:
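(A minimal sketch of the kind of call I mean, assuming scikit-learn’s distance_threshold trick to force computation of the full merge tree; the exact parameter values in the repository may differ.)

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# One 50-d GloVe vector per vocabulary word (random values stand in here).
n_words = 1000
embeddings = np.random.randn(n_words, 50)

clusterer = AgglomerativeClustering(
    n_clusters=None,         # don't cut the dendrogram at a fixed number of clusters
    distance_threshold=0.0,  # with n_clusters=None, this forces the full merge tree
)
clusterer.fit(embeddings)

# children_ has shape (n_words - 1, 2): one row per non-leaf (latent) node,
# giving the two child nodes merged at that step.
children = clusterer.children_
```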

The parameters shown will cause the clustering algorithm to build a full binary tree with the number of non-leaf nodes (latent variables) equal to the number of leaf nodes (words) minus one. After the MLTM is built, I extract binary feature vectors from a set of text by setting the leaf node values to True (1) if the corresponding word appears in the text:
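A sketch of that extraction step (a hypothetical helper of my own, reusing the children array from the clustering sketch above):

```python
import numpy as np

def extract_mltm_features(tokens, word_to_leaf, children, n_words):
    """OR the binary leaf values upward through the tree and return the latent-node values."""
    n_nodes = n_words + len(children)      # leaf nodes plus latent (non-leaf) nodes
    beta = np.zeros(n_nodes, dtype=np.uint8)

    # Leaf nodes: 1 if the corresponding word appears in the text.
    for tok in tokens:
        if tok in word_to_leaf:
            beta[word_to_leaf[tok]] = 1

    # Latent nodes: OR of their two children; children_ lists merges bottom-up,
    # so a single forward pass visits every child before its parent.
    for i, (left, right) in enumerate(children):
        beta[n_words + i] = beta[left] | beta[right]

    return beta[n_words:]                  # the binary MLTM feature vector
```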

The beta values are propagated upward according to the first equation. All beta values from non-leaf nodes are extracted from the tree to form the binary feature vector.

My experimental MLTM classifier consists of an MLP with two hidden layers, implemented with PyTorch, that takes the MLTM binary feature vectors as inputs. I also added an option to the MLTM to prune the tree by requiring every parent node to have a minimum number of descendant leaf nodes, grouping observed variables together as needed.
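The classifier itself is nothing fancy; something along these lines (layer sizes are illustrative, not the exact ones I used):

```python
import torch.nn as nn

class MLTMClassifier(nn.Module):
    """Two-hidden-layer MLP over the binary MLTM feature vector."""

    def __init__(self, n_features, n_classes, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),  # logits over the four news topics
        )

    def forward(self, x):
        return self.net(x.float())
```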

Results

The table below shows the test (out-of-sample) classification accuracy for the baseline and the MLTM-MLP classifier for various values of the pruning variable, min_descendants.

| Min Descendants | Classification Accuracy |
|-----------------|-------------------------|
| Naive Bayes     | 88.32%                  |
| 256             | 80.34%                  |
| 128             | 83.30%                  |
| 64              | 85.19%                  |
| 32              | 86.83%                  |
| 16              | 87.72%                  |
| 8               | 88.59%                  |
| 4               | 89.21%                  |
| None            | 89.89%                  |

This is a four-class classification problem, so a random classifier should give an accuracy of around 25%. The MLTM-MLP classifier is able to outperform naive Bayes when the pruning parameter is no higher than 8. I find it interesting that my experimental classifier still performs reasonably well even when I prune the tree down by requiring every latent variable have a minimum of 256 descendant leaf nodes.

Going Forward

Text classification is a classic example of an NLP task, but it’s not terribly challenging or interesting at this point, at least not when the data is relatively clean. Going forward, I’ll be looking at using the MLTM in conjunction with recurrent neural networks (RNN) for machine translation tasks (e.g. translating English sentences to French).

Comments on P-values

Noah Smith, a popular economics blogger, recently posted a rebuttal to criticism of the use of p-values in hypothesis testing. While he makes a few good points on why p-values and significance testing have value, I think his post fails to address a couple of major issues.

First, he states that good science involves replication of results. This is absolutely true, and is, in my opinion, the best antidote for many of the issues related to significance testing. But from my experience in academia (I was an engineering grad student from 2003-2008), the real problem isn’t a lack of good scientists; it’s the poor system of incentives. This extends from the management of journals to the peer review process to the tenure system itself.

Because journals are reluctant to publish negative (non-significant) results, if dozens of independent groups perform similar studies but only one of them shows significance, that single positive result may be the only one published. In fact, the groups that found non-significance will probably not even attempt to publish their work, and no one will have any reason to believe that the lone positive result is false. In this case, no individual researcher has to do anything wrong for the field to reach a bad conclusion.

Also, the tenure system requires that professors continually publish papers in respected journals, which requires doing original work and finding interesting, previously unknown effects. Replicating studies that others have already accepted as legitimate (whether your own or not) gets you no closer to tenure.

The other major problem with p-values is the way they’re interpreted. The common perception is that a p-value of 0.05 means there’s a 95% chance the effect is real (non-random). But the p-value actually represents p(x|h0), where x is the data and h0 is the null hypothesis. What the researcher wants to know is p(h0|x). The first value (what you have) tells you the probability of observing the data you found, assuming that the null hypothesis is true. But you want to know the probability of the null hypothesis, given the data.
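For reference, the two quantities are related by Bayes’ theorem (standard form, written in the same notation):

$$p(h_0 \mid x) = \frac{p(x \mid h_0)\, p(h_0)}{p(x)}$$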

Bayes’ theorem could be used to convert from the term you have to the one you want if you knew a prior probability for the null hypothesis, p(h0), and the marginal probability of the data, p(x). Unfortunately, there’s no clean way to pin down these values. However, this paper does a nice job of setting bounds on the value of p(h0|x), depending on the form of the distribution of the data. An interesting result from this work is that for many types of tests, simply subtracting 1 from the t-stat will give you a decent approximation.