BERT Lexical Substitution

Completion: November 2023

Task:
In this task, the goal is to find lexical substitutes for individual target words in context. For example, given the following sentence: 'Anyway, my pants are getting tighter every day.' the goal is to propose an alternative word for tighter, such that the meaning of the sentence is preserved. Such a substitute could be 'constricting', 'small' or 'uncomfortable.'
To complete this task the pretrained model, DistilBERT from HuggingFace, is used to find substitutes.

Five different approaches are taken to find lexical substitutes: WordNet Frequency Baseline, Lesk Algorithm, Word2Vec Embeddings, BERT Pretrained Embeddings, Query with Chatgpt API.

WordNet Frequency Baseline
Given a context, use WordNet to find the possible synynomy with the highest total occurrence frequency in the same context in WordNet.
Lesk Algorithm Word Sense Disambiguation (WSD) is performed using the Lesk Algorithm to select a synset for the target word where the definition of the synset and the context of the target word are similar. The algorithm then returns the most frequent synonym from that synset as a substitute.
Word2Vec Embeddings Find possible synonyms for the target word using WordNet. Return the synonym that is most similar to the target word according to the Word2Vec embedding.
BERT Pretrained Embedding Using Keras, feed in an embedded sentenced with the target word masked. BERT will returned the candidate for the target word given the context.
Query Chatgpt API Feed the sentence into a Chatgpt prompt using Chatgpt's API. Chatgpt will give the best substitute. Results are improved using few-shot prompting.

Tags: Natural Language Processing Python HuggingFace Rest API Word2Vec Neural Network TensorFlow Keras Prompting Columbia

How to contact me


Email
cjd2186@columbia.edu