Finetuning BERT for Semantic Role Labeling
Completion: December 2023
Data Preparation/Encoding:
The data comes from the OntoNotes 5.0 dataset, whose semantic role annotations extend PropBank.
Using the BERT tokenizer, sentences are converted into token IDs and attention masks; the attention mask tells BERT which positions are real tokens and which are padding, so the model only attends to words that matter for the context. A sketch of this encoding step is shown below.
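The following is a minimal sketch of the encoding step, assuming a fast BERT tokenizer and an illustrative example sentence; the exact checkpoint and sequence length here are placeholders.

```python
# Encode a pre-split sentence into token IDs and an attention mask (illustrative setup).
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

words = ["The", "cat", "chased", "the", "mouse"]
encoding = tokenizer(
    words,
    is_split_into_words=True,   # keep word boundaries so labels can be aligned later
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)

input_ids = encoding["input_ids"]            # token IDs fed to BERT
attention_mask = encoding["attention_mask"]  # 1 for real tokens, 0 for padding
word_ids = encoding.word_ids()               # maps each subword back to its source word (or None)
```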
Model Composition and Training:
A pretrained BERT model is frozen, and a classification layer is added on top. BERT supplies contextualized embeddings for the input, and the classifier is trained on the OntoNotes 5.0 data by minimizing the loss and backpropagating to update the classifier weights, improving prediction accuracy over training. A sketch of this setup follows.
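Below is a minimal sketch of the frozen-BERT-plus-classifier setup, assuming a toy batch and a small placeholder label set; the real label inventory and training loop come from the OntoNotes 5.0 SRL tags.

```python
# Frozen BERT encoder with a trainable token-classification head (illustrative sketch).
import torch
import torch.nn as nn
from transformers import BertModel

NUM_LABELS = 5  # placeholder; the real SRL label set is much larger

class SRLClassifier(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for param in self.bert.parameters():
            param.requires_grad = False  # freeze BERT; only the head is trained
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.last_hidden_state)  # (batch, seq_len, num_labels)

model = SRLClassifier(NUM_LABELS)
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks positions to ignore

# Dummy batch; in practice these come from the tokenizer and the dataset.
input_ids = torch.randint(0, 30522, (2, 32))
attention_mask = torch.ones(2, 32, dtype=torch.long)
labels = torch.randint(0, NUM_LABELS, (2, 32))
labels[attention_mask == 0] = -100  # ignore padding positions in the loss

logits = model(input_ids, attention_mask)
loss = loss_fn(logits.view(-1, NUM_LABELS), labels.view(-1))
loss.backward()     # gradients flow only into the classifier head
optimizer.step()
```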
Decoding:
The classifier returns logits representing the predicted label for each token in the sentence. Decoding maps these logits to label strings and aligns them with the original words. Once decoded, the model's performance can be evaluated by how well the predicted labels match the ground-truth labels. A sketch of this decoding step is shown below.
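A minimal sketch of decoding, assuming a hypothetical id-to-label mapping and the word-ID alignment produced by the fast tokenizer above; only the first subword of each word is kept.

```python
# Map per-token logits back to one label per original word (illustrative label subset).
import torch

id2label = {0: "O", 1: "B-ARG0", 2: "B-ARG1", 3: "B-V", 4: "B-ARGM-TMP"}

def decode_predictions(logits, word_ids):
    """Convert (seq_len, num_labels) logits into a label per source word."""
    pred_ids = logits.argmax(dim=-1).tolist()  # highest-scoring label per subword token
    labels, previous_word = [], None
    for pred, word_id in zip(pred_ids, word_ids):
        if word_id is None or word_id == previous_word:
            continue  # skip special tokens, padding, and subword continuations
        labels.append(id2label[pred])
        previous_word = word_id
    return labels

# Usage: predicted = decode_predictions(logits[0], encoding.word_ids())
# The predicted labels can then be compared against the gold labels for evaluation.
```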
Tags: Natural Language Processing Python Neural Network HuggingFace PyTorch Finetuning Columbia