Latent Tree Models Learned on Word Embeddings, Part 1

Back in June of 2008, I successfully defended my doctoral dissertation and received my PhD in Electrical and Computer Engineering from Marquette University. My field of research in grad school was automatic speech recognition, with an emphasis on ASR under noisy conditions. After I graduated, I started working for Crabel Capital Management. In my first months, I tried using some of the machine learning (ML) approaches I learned in college to develop automated trading strategies, but I never succeeded in making that work. I’m still with Crabel, and up until recently, I’d almost completely abandoned the field of machine learning.

I’ve become interested in doing ML research again, and have acquainted myself with PyTorch, Facebook’s deep learning toolkit for Python. While there’s been a ton of progress in ML since I left it more than a decade ago, I don’t believe deep neural networks – not by themselves, anyway – will lead to artificial generalized intelligence (AGI), or even robust AI. As I’ve been reading papers and articles on ML/AI, I came across a concept known as latent tree models (LTM) I found pretty interesting, and I decided to play around with this model on basic datasets.

Latent Tree Models

In a latent tree model, the leaves of the tree correspond to observed variables, such as words in English, and the non-leaf nodes correspond to latent variables, which could be interpreted as higher-level concepts. In some of my recent work at Crabel, I’ve used the sklearn toolkit to perform agglomerative clustering on data, and one of my first thoughts when I initially read about LTM’s is that building a tree with agglomerative clustering on word embeddings seemed like a natural task. Word embeddings are numerical feature vectors assigned to words within a dictionary, and are typically learned through some form of deep learning.

I was unable to find any papers that took this exact approach, though I did find one which looked LTM’s on text data using word co-occurrences. Here’s a good survey paper on LTM’s for anyone interested.

I wrote some python code to learn the structure of an LTM using word embeddings from Stanford NLP’s GloVe embeddings, which is an open-source, freely available project. You can find my code and the data referenced below here on Github.

A standard LTM is a particular type of Bayesian network (a causal tree), where each edge has a link matrix of probabilities associated with it. The code I’ve written modifies the standard form to ignore the link matrices, and simply treat every parent node (latent variable) as a binary random variable (proposition), using a form of OR function on its children, according to the rule:

In contrast, in a typical causal tree, within the upward message-passing algorithm, the likelihood values are multiplied, which would be more analogous to an AND function. Another way to view my logic is to see it as an ANN with a tree structure, where the inputs are the words, all weights are fixed at 1, the bias for each node is fixed at -0.5, and the activation function is the step function originally seen in perceptrons. The output of the network is every hidden node, which forms the latent feature vector. I refer to this model as a modified LTM (MLTM) in the code.


To test my approach, I decided to start with a simple text classification task. In this task, we have a dataset of text with class labels for various categories. The goal is to train a classifier with labeled data to be able to determine the correct class for unlabeled test data.


The dataset I chose contains brief news articles from AG News, labeled with one of four news topics. I performed some filtering and pre-processing to clean it up before running it through my classification code.


The comparison baseline classifier I used is a naive Bayes classifier. It’s an extremely simple method, but it works quite well on this type of task. All spam filters were originally built using naive Bayes.

To build the MLTM, I use the AgglomerativeClustering class from the sklearn package. Here’s some sample code:

The parameters shown will cause the clustering algorithm to build a full binary tree with the number of non-leaf nodes (latent variables) equal to the number of leaf nodes (words) minus one. After the MLTM is built, I extract binary feature vectors from a set of text by setting the leaf node values to True (1) if the corresponding word appears in the text:

The beta values are propagated upward according to the first equation. All beta values from non-leaf nodes are extracted from the tree to form the binary feature vector.

My experimental MLTM classifier consists of an MLP with two hidden layers, with the MLTM binary feature vectors as inputs, implemented with PyTorch. I added an option to the MLTM to prune the tree down by grouping together observed variables to meet a minimum count of leaf nodes for any parent node.


The table below shows the test (out-of-sample) classification accuracy for the baseline and the MLTM-MLP classifier for various values of the pruning variable, min_descendants.

Min DescendantsClassification Accuracy
Naive Bayes88.32%

This is a four-class classification problem, so a random classifier should give an accuracy of around 25%. The MLTM-MLP classifier is able to outperform naive Bayes when the pruning parameter is no higher than 8. I find it interesting that my experimental classifier still performs reasonably well even when I prune the tree down by requiring every latent variable have a minimum of 256 descendant leaf nodes.

Going Forward

Text classification is a classic example of an NLP task, but it’s not terribly challenging or interesting at this point, at least not when the data is relatively clean. Going forward, I’ll be looking at using the MLTM in conjunction with recurrent neural networks (RNN) for machine translation tasks (e.g. translating English sentences to French).

2 thoughts on “Latent Tree Models Learned on Word Embeddings, Part 1

Leave a Reply