Identifying Unknown Authors of The Federalist Papers with Machine Learning

In 1787, eleven years after the Declaration of Independence was signed and just after the Constitutional Convention took place, a series of anonymous essays began to appear in New York newspapers urging the ratification of the U.S. Constitution. The creators of these influential documents were later identified as Alexander Hamilton, James Madison, and John Jay. Each of these founding fathers contributed essays individually to the collection of eighty-five documents known as The Federalist Papers, except for a handful that Hamilton and Madison composed together. The Federalist Papers are an essential source for understanding and interpreting the original intent of the Constitution. Yet even today it remains uncertain which of these men authored eleven of the papers. Experts agree that each of the disputed articles was written by either Alexander Hamilton or James Madison, but cannot distinguish between the two. Can modern natural language processing techniques determine who authored these eleven anonymous documents?

Natural language processing allows us to convert text documents into a data matrix. Although the human brain isn’t capable of comprehending language in this form, computers can use data matrices to analyze text mathematically. There are countless ways to encode The Federalist Papers as data. I chose to look at a simple feature: word frequency. The word frequency is the number of times a particular word appears in a document divided by the total number of words in that document. With this technique, we transform each document from the familiar format of paragraphs, sentences, and words into a long list of numbers, where each number is the frequency of a unique word. Individual articles from The Federalist Papers will have distinct word frequencies, and perhaps those frequencies can tell us something about the author.
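To make this concrete, here is a minimal sketch of the idea in Python. Throughout this post I’ll illustrate the steps with small Python sketches; they are hypothetical reconstructions on stand-in data, not the exact code behind the analysis.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Map each word to (times it appears) / (total words in the document)."""
    words = re.findall(r"[a-z']+", text.lower())  # a deliberately crude tokenizer
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Toy example; in the real analysis, `text` would be a full Federalist Paper.
text = "The powers of the government are divided among the branches."
print(word_frequencies(text)["the"])  # 0.3, since "the" is 3 of the 10 words
```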

Once the data is represented in mathematical space, we can design a model that predicts the author of a document from its mathematical representation. In this case, we want an algorithm that ties the word frequencies of a text to a particular author. I achieved this by combining two separate processes: principal component analysis (PCA) followed by linear discriminant analysis (LDA). Principal component analysis extracts the most important features from the data and transforms its representation again so that a model can analyze the text more easily. Linear discriminant analysis, emphasis on discriminant, models how word frequencies differ by author. Based on this model, LDA can then predict the author of a document.
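In code, the whole two-step pipeline can be sketched in a few lines. This is a stand-in reconstruction using scikit-learn, not the original implementation, and the data below is a random placeholder for the real word-frequency matrix:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Random stand-in data: 70 "papers" by 200 "words" of fake frequencies and
# made-up author labels; a real run would use the matrix built from the texts.
rng = np.random.default_rng(0)
X = rng.random((70, 200))
y = rng.choice(["Hamilton", "Madison", "Jay"], size=70)

# Step 1: PCA compresses each word-frequency vector to a few components.
# Step 2: LDA learns to separate the authors in that compressed space.
model = make_pipeline(PCA(n_components=20), LinearDiscriminantAnalysis())
model.fit(X, y)
print(model.predict(X[:3]))  # predicted authors for the first three papers
```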

  • PCA High-Level Overview: PCA performs a change of basis of the data (the basis being the set of axes that define the multi-dimensional space where the points sit) in order to reduce its dimensionality. Instead of representing each Federalist Paper by a long list of 10,000 word frequencies, PCA represents each paper by its coordinates along a specified, much smaller number of directions, say 50. This preserves the most important information in the data while allowing for simplicity and interpretability in the machine learning model. Furthermore, the machine learning algorithm is less likely to overfit; in other words, we can rely on the model to make accurate predictions on new data (the papers with unknown authors).

  • PCA Algorithm Technical Overview: Say we want to represent each Federalist Paper in d dimensions. PCA first finds the axis in the original high-dimensional space for which the coordinates of the data points projected onto that axis have the largest variance (so we preserve the greatest amount of information). It does this by identifying the eigenvalue-eigenvector pairs of the data’s covariance matrix. Each eigenvalue is the variance of the data points’ coordinates when projected onto the span of the corresponding eigenvector. The eigenvector associated with the largest eigenvalue is the first principal component. Next, PCA searches all axes orthogonal to the previous principal components to identify the next axis along which the coordinates have the largest variance. The process continues for d iterations, as sketched in the code below.
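To make the eigenvalue-and-eigenvector description concrete, here is a bare-bones numpy version of the procedure. This is a textbook sketch rather than the code used for the actual analysis, and the input matrix is a random stand-in for the real word-frequency data:

```python
import numpy as np

def pca(X, d):
    """Project the rows of X onto the d directions of greatest variance."""
    X_centered = X - X.mean(axis=0)           # center each word-frequency column
    cov = np.cov(X_centered, rowvar=False)    # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns eigenvalues ascending
    order = np.argsort(eigvals)[::-1]         # re-sort pairs by descending variance
    components = eigvecs[:, order[:d]]        # top-d eigenvectors: the principal axes
    return X_centered @ components            # coordinates in the new basis

X = np.random.default_rng(1).random((85, 300))  # stand-in for the real matrix
scores = pca(X, d=2)
print(scores.shape)  # (85, 2): each paper now lives in two dimensions
```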

These are a lot of complicated words. For those of us who don’t think in mathematical jargon, how can we understand what PCA is actually doing? A great disadvantage of machine learning algorithms is that they are often “black boxes”: the data and the algorithm sit so far out in abstract mathematical space that they are incomprehensible to the human brain, and we often don’t know the details of how the algorithm makes its decisions. However, PCA allows us to visualize an approximation of what The Federalist Papers look like in this new form by plotting the first two principal components (the ones that contain the most information about the documents). The figure below shows each Federalist Paper represented in this way. The colors indicate which founding father authored each text. PCA does not take the authorship of the documents into consideration, yet we can already see that the data points cluster by author; documents written by the same author are close to each other in mathematical space, where closeness is determined by patterns of word frequencies. This is a promising sign that word frequencies will indeed help distinguish which author wrote each of the documents.
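A plot like this can be drawn in a few lines with matplotlib. This sketch continues from the numpy PCA example above, with a made-up `authors` array standing in for the true labels:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up author labels for the 85 stand-in papers from the PCA sketch above.
authors = np.random.default_rng(2).choice(["Hamilton", "Madison", "Jay"], size=85)

# Scatter each paper's first two principal-component scores, colored by author.
for author, color in [("Hamilton", "tab:blue"),
                      ("Madison", "tab:orange"),
                      ("Jay", "tab:green")]:
    mask = authors == author
    plt.scatter(scores[mask, 0], scores[mask, 1], color=color, label=author)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.legend()
plt.show()
```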

Once the data is represented by its principal components, the model then performs linear discriminant analysis. LDA attempts to separate the documents in mathematical space based on their unique pattern of word frequencies, resulting in groupings based on authorship. The algorithm does this by modeling how word frequencies differ from author to author.

  • LDA High-Level Overview: First, LDA computes the “separability” between different authors’ word frequencies: the mathematical distance between the documents associated with one author and the documents associated with the other authors. Next, LDA estimates how much variability there is within the documents associated with each particular author. LDA uses these two values (separability between classes and variance within classes, where each class is an author) to project the data into a lower-dimensional space in which the data points are separated by authorship. In this lower-dimensional space, each text is represented by its values along a set of linear discriminant functions.
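In practice, libraries bundle these computations. A minimal sketch with scikit-learn’s `LinearDiscriminantAnalysis`, again on random stand-in data since the real matrices aren’t reproduced here:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X_pca = rng.random((70, 20))                        # stand-in PCA scores
y = rng.choice(["Hamilton", "Madison", "Jay"], 70)  # stand-in author labels

lda = LinearDiscriminantAnalysis()
scores_lda = lda.fit_transform(X_pca, y)  # values along the discriminant axes
print(scores_lda.shape)  # (70, 2): at most (number of authors - 1) discriminants
```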

Once again, LDA allows us to peek into the black box. Below, I have plotted several visualizations of what LDA does to the text data. These are again approximations, looking only at the first two linear discriminant functions associated with each document. Many of the plots do not show clear separation between the groups. This is actually a good sign. For eight of the nine plots below, I scrambled the authors of the papers, so that documents are no longer associated with their true authors. This effectively severs the relationship between author and word frequency. In these eight instances, LDA should not be able to use word frequencies to group the data points by author; word frequency cannot predict authorship if no relationship exists between the two. Nevertheless, LDA attempts to make the predictions, and it does not do a good job. Each plot with the colored-in data points corresponds to the data with randomly assigned authors. In these plots, there is overlap between the different colored points, showing that LDA cannot distinguish who wrote which paper. The plot with the open dots corresponds to the real data. Here, we see clear separation between the groups. This means there are distinctive patterns in the word frequencies used by the different Founding Fathers. If author and word frequency were unrelated, we would expect the true results to look like the manipulated results, with lots of overlap.
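The scrambling experiment itself is easy to imitate in code. A sketch reusing the stand-in `X_pca`, `y`, and imports from the LDA example above:

```python
# Shuffle the author labels so word frequencies carry no information about
# authorship, then refit LDA on the scrambled data.
rng = np.random.default_rng(4)
y_scrambled = rng.permutation(y)
lda_scrambled = LinearDiscriminantAnalysis().fit(X_pca, y_scrambled)
# Accuracy on scrambled labels should sit far below accuracy on the true ones.
print(lda_scrambled.score(X_pca, y_scrambled))
```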

Once the data is separated by group as illustrated above, imagine drawing fixed straight lines between the different groupings. These are called decision boundaries. To predict the author of a document, we represent its text using PCA followed by LDA as described above. The model then predicts the author based on where the resulting point falls in relation to the decision boundaries. If the point corresponding to the text falls in the Hamilton region, the model predicts that Alexander Hamilton is the author.
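In scikit-learn terms, the fitted LDA object exposes both the boundary scores and the resulting region assignment. A brief illustration with the stand-in objects from above:

```python
# decision_function returns one score per candidate author; predict reports
# the author whose region of the discriminant space the point falls into.
print(lda.decision_function(X_pca[:1]))  # boundary scores for one paper
print(lda.predict(X_pca[:1]))            # e.g. ['Hamilton']
```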

Finally, we have our model! The evidence above indicates that the model should work, but how do we quantify how well it works? How do we know if it will work on documents it hasn’t seen before, like the essays with uncertain authorship? The best way to determine the model’s accuracy and generalizability is cross-validation. Cross-validation randomly sets aside a specified number of Federalist Papers, called the validation set, so that the model is built using only the remaining papers, called the training set. We then perform PCA followed by LDA on the training set as described above to produce our classification model. This model is then used to predict the authors of the papers it has not seen before: the validation set. Because the true authors of the documents in the validation set are known, we can calculate the misclassification rate of our model. Over 1,000 trials of cross-validation, the model predicted the author incorrectly only 0.8% of the time. There are eleven essays with unknown authorship; if we take the average cross-validation error as an accurate proxy, we would expect roughly 11 × 0.008 ≈ 0.09 misclassifications, that is, to classify all eleven documents correctly. The plot below demonstrates the process of cross-validation. The data points in the validation set are signified by Xs surrounded by circles. The color of the X corresponds to the true author, whereas the color of the circle corresponds to the predicted author. In this example, the model predicted the authors of all five validation points correctly.
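A hedged sketch of this procedure, using scikit-learn’s `ShuffleSplit` to perform repeated random train/validation splits. The data is again a random placeholder, so the printed error rate will not match the 0.8% reported for the real papers:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(5)
X = rng.random((70, 200))                           # stand-in word frequencies
y = rng.choice(["Hamilton", "Madison", "Jay"], 70)  # stand-in author labels

model = make_pipeline(PCA(n_components=20), LinearDiscriminantAnalysis())
# 1,000 random splits: hold out 5 papers, fit on the rest, score the held-out 5.
cv = ShuffleSplit(n_splits=1000, test_size=5, random_state=0)
accuracy = cross_val_score(model, X, y, cv=cv)
print(f"misclassification rate: {1 - accuracy.mean():.3f}")
```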

An important consideration, however, is that several of the essays were written by both Alexander Hamilton and James Madison. It is possible that some of the documents with unknown authorship were similarly written by more than one author. We see below that the texts written jointly by Hamilton and Madison (marked with turquoise Xs) fall squarely within the Madison distribution rather than between the Hamilton and Madison distributions, as we might expect. Although the model does not have the option of predicting dual authorship, we might have expected to see evidence of it in the visualizations. This suggests that if any of the uncertain essays have more than one author, our algorithm is unlikely to reveal it.

With these considerations taken into account, let’s finally find out who wrote the remaining anonymous Federalist Papers. I created several slightly different models by varying both the text data and the model parameters. In some models, I removed “stop” words, commonly used words like “a”, “the”, and so on, from the text data. In other versions, I removed capitalization, punctuation, or topic-related words like “constitution”, “congress”, “bill of rights”, and more. I also varied the model by changing the number of principal components used in the PCA step. Comparing predictions from different models helps us assess how reliable and generalizable our results are. For ten of the eleven documents examined, all of these models made identical predictions. On Federalist Paper No. 55, however, the predictions were split. For each Federalist Paper with unknown authorship, the table below shows the probability with which the document was authored by Hamilton, Jay, or Madison according to one of the models. Probabilities close to 1 indicate high certainty about the authorship. Overall, given the consistency of the predictions across the various models, the high probabilities indicating that each model is confident in its predictions, and the very low cross-validation error, I feel confident that our approach correctly predicts most of the authors of the anonymous documents.
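A rough sketch of how such preprocessing variants might be generated; the documents, settings, and component counts below are placeholders, not my exact choices:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The powers of Congress are enumerated in the Constitution.",
    "The State governments retain certain residual powers.",
]  # placeholders for the full texts of the papers

for stop_words in (None, "english"):  # variant: keep or drop common stop words
    vectorizer = CountVectorizer(lowercase=True, stop_words=stop_words)
    counts = vectorizer.fit_transform(documents)
    frequencies = counts.multiply(1.0 / counts.sum(axis=1))  # counts to frequencies
    print(stop_words, frequencies.shape)
# The PCA step can be varied the same way, e.g. PCA(n_components=k)
# for several values of k.
```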

We cannot determine the authors with absolute certainty, however. We see in the table below that this particular model is almost 100% sure about the authorship of each Federalist Paper, indicated by the fact that the probabilities are all extremely close to 1. But these probabilities are only an estimate of the actual certainty of authorship. For example, each model reports a probability close to 1 when predicting the author of Federalist Paper No. 55, yet the models are split in their predictions for this document. In other words, although each model tells us it is highly certain who the author is, that certainty doesn’t necessarily translate to a correct prediction; if a single author wrote Federalist Paper No. 55, then some of the models must be wrong. An alternative explanation is that Federalist Paper No. 55 was written by both Hamilton and Madison, but the models are not allowed to predict both authors. This could explain the disagreement between the models.
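For reference, probabilities like those in the table can be read off LDA’s posterior class probabilities. With the stand-in pipeline from the cross-validation sketch above:

```python
model.fit(X, y)                     # refit on all papers with known authors
proba = model.predict_proba(X[:1])  # posterior probability for each author
print(dict(zip(model.classes_, proba[0].round(3))))
```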

The figures below visualize the word frequencies of the texts with unknown authorship compared to the texts with known authorship. The papers with uncertain authors are marked with blue Xs, while a turquoise X distinguishes Federalist Paper No. 55. The color of the circle around each X indicates the predicted author. In the top plot, we see that one model predicts Hamilton as the author of Federalist Paper No. 55, while in the bottom plot, a slightly different model that did not consider topic-related words predicts Madison.

We will never pull back the curtain that separates us from the past to definitively see who wrote the eleven unidentified Federalist Papers, but as machine learning techniques evolve, we draw closer to the answer. I have demonstrated one simple way of predicting the authors of these eleven documents. There are endless other methods to explore. However, here, we get a rare opportunity to peek inside the “black box” of machine learning algorithms and visualize how the texts are represented in mathematical space and how the model makes its predictions. As machine learning becomes increasingly complex, we tend to lose this accountability. Sometimes the simplest methods are also the best.
