Essay Classification with Trigram Language Model
Completion: September 2023
This model is trained on the Brown corpus, a roughly one-million-word sample of written American English compiled at Brown University in 1961.
The model computes the log probability of a sentence as the sum of smoothed trigram log probabilities, where smoothing is done by linear interpolation of trigram, bigram, and unigram estimates. The program then computes the model's perplexity on a held-out test corpus, measuring how well the model predicts unseen text.
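The core scoring logic looks roughly like the sketch below. This is an illustrative reconstruction rather than the original code: the class name TrigramModel, the equal interpolation weights, and the START/STOP padding scheme are all assumptions.

```python
import math
from collections import defaultdict


class TrigramModel:
    """Sketch of a linearly interpolated trigram model (names are illustrative)."""

    def __init__(self, sentences, lambdas=(1 / 3, 1 / 3, 1 / 3)):
        # sentences: iterable of tokenized sentences (lists of word strings).
        # In practice, rare and unseen words would also be mapped to an UNK token.
        self.lambdas = lambdas
        self.unigrams = defaultdict(int)   # (w,)          -> count
        self.bigrams = defaultdict(int)    # (w1, w2)      -> count
        self.trigrams = defaultdict(int)   # (w1, w2, w3)  -> count
        self.total_words = 0
        for sent in sentences:
            padded = ["START", "START"] + sent + ["STOP"]
            self.bigrams[("START", "START")] += 1  # context of the first word
            for i in range(2, len(padded)):
                self.unigrams[(padded[i],)] += 1
                self.bigrams[(padded[i - 1], padded[i])] += 1
                self.trigrams[(padded[i - 2], padded[i - 1], padded[i])] += 1
                self.total_words += 1

    def _mle(self, counts, context_counts, ngram):
        """Maximum-likelihood estimate: count(ngram) / count(context)."""
        denom = context_counts.get(ngram[:-1], 0)
        return counts.get(ngram, 0) / denom if denom else 0.0

    def smoothed_trigram_probability(self, w1, w2, w3):
        """Linear interpolation of trigram, bigram, and unigram estimates."""
        l1, l2, l3 = self.lambdas
        p_tri = self._mle(self.trigrams, self.bigrams, (w1, w2, w3))
        p_bi = self._mle(self.bigrams, self.unigrams, (w2, w3))
        p_uni = self.unigrams.get((w3,), 0) / self.total_words
        return l1 * p_tri + l2 * p_bi + l3 * p_uni

    def sentence_logprob(self, sent):
        """Sum of log2 probabilities of each word given its two predecessors."""
        padded = ["START", "START"] + sent + ["STOP"]
        return sum(
            math.log2(self.smoothed_trigram_probability(padded[i - 2],
                                                        padded[i - 1],
                                                        padded[i]))
            for i in range(2, len(padded))
        )

    def perplexity(self, sentences):
        """2 ** (-average log2 probability per token) over a corpus."""
        logprob, tokens = 0.0, 0
        for sent in sentences:
            logprob += self.sentence_logprob(sent)
            tokens += len(sent) + 1  # include the STOP token
        return 2 ** (-logprob / tokens)
```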
The model is then evaluated on a classification task using essays written by non-native speakers of English for the ETS TOEFL test. Each essay is scored by skill level as low, medium, or high; only essays scored 'high' or 'low' are considered. A separate language model is trained on the training set for each of these two categories, and these models are used to automatically score unseen essays: the perplexity of each language model on an essay is computed, and the essay is assigned to the class whose model yields the lower perplexity.
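The classification step can be sketched as follows, assuming two trained models (model_high and model_low, instances of the TrigramModel sketch above) and essays represented as lists of tokenized sentences. The function name classify_essay is illustrative.

```python
def classify_essay(essay_sentences, model_high, model_low):
    """Assign the label whose language model gives the lower perplexity."""
    pp_high = model_high.perplexity(essay_sentences)
    pp_low = model_low.perplexity(essay_sentences)
    return "high" if pp_high < pp_low else "low"
```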
The model's accuracy is calculated as (correct predictions / total predictions).
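A hypothetical accuracy computation over a labeled test set, reusing classify_essay from the previous sketch:

```python
def accuracy(labeled_essays, model_high, model_low):
    """Fraction of essays whose predicted label matches the true label.

    labeled_essays: iterable of (essay_sentences, true_label) pairs.
    """
    correct = sum(
        classify_essay(sentences, model_high, model_low) == label
        for sentences, label in labeled_essays
    )
    return correct / len(labeled_essays)
```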
This model achieved a classification accuracy greater than 80%.
Tags: Natural Language Processing Python Columbia