
Complete Guide on Language Modelling: Unigram Using Python

Language modelling is the task of determining the probability of a sequence of words. Language models are useful in many Natural Language Processing applications such as machine translation, speech recognition and optical character recognition. Modern language models rely on neural networks and predict a word from its surrounding words. In this article, however, we will discuss the most classic of language models: the n-gram model.


In natural language processing, an n-gram is a sequence of n words. For example, “Python” is a unigram (n = 1), “Data Science” is a bigram (n = 2) and “Natural language processing” is a trigram (n = 3). Here our focus will be on implementing a unigram (single-word) model in Python, but the quick sketch below makes the different orders concrete.
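As a minimal illustration (a sketch using NLTK's ngrams helper, which ships with NLTK), the three orders look like this:

from nltk import ngrams

tokens = "natural language processing is fun".split()

print(list(ngrams(tokens, 1)))  # unigrams:  ('natural',), ('language',), ...
print(list(ngrams(tokens, 2)))  # bigrams:   ('natural', 'language'), ...
print(list(ngrams(tokens, 3)))  # trigrams:  ('natural', 'language', 'processing'), ...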

Assumptions For a Unigram Model

1.  The probability of a word depends only on how often that word occurs among all the words in the dataset.

2.  The probability of a word is independent of all the words that precede it.
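Together these assumptions mean the probability of a sentence is simply the product of its word probabilities, with P(w) = count(w) / N. A minimal sketch with a toy token list (a stand-in, not the Reuters corpus):

from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]  # toy stand-in token list
counts = Counter(corpus)
total = len(corpus)

def unigram_prob(word):
    # P(w) = count(w) / N
    return counts[word] / total

def sentence_prob(sentence):
    # independence assumption: P(w1 ... wn) = P(w1) * ... * P(wn)
    prob = 1.0
    for word in sentence.split():
        prob *= unigram_prob(word)
    return prob

print(sentence_prob("the cat"))  # (2/6) * (1/6)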

Code Implementation

Import all the libraries required for this project (NumPy is included here because the model below uses it for sampling).

import nltk
import numpy as np
nltk.download('reuters')
nltk.download('punkt')
from nltk.corpus import reuters

The Reuters corpus consists of 10,788 documents from the Reuters financial newswire service.
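As a quick sanity check (assuming the download above succeeded), the document count can be confirmed directly:

print(len(reuters.fileids()))  # 10788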

Store the words in a list.

words = list(reuters.words())  # flat list of every token in the corpus
print(len(words))              # total number of tokens

We will start by creating a class and defining every function in it. The idea is to generate words that continue a starting sentence using the n-gram model. Predicting the next word with a bigram or trigram model runs into sparsity problems: many specific word pairs and triples never appear in the corpus, so their estimated probabilities are zero. To sidestep this issue we use the unigram model, which does not depend on the previous words, as the rough check below suggests.
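A rough sketch of the sparsity argument (the example pair is hypothetical and counts will vary): a specific word pair tends to be far rarer in the corpus than either of its individual words.

pair = ("economic", "growth")  # hypothetical example pair
bigrams = list(zip(words, words[1:]))
print(bigrams.count(pair))   # many arbitrary pairs never occur at all
print(words.count(pair[0]))  # single words are much less likely to be unseen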

Let’s build a unigram model over the Reuters corpus and use it to propose next words; the unigram probability of the resulting sentence is calculated at the end.

class NGrams:
    def __init__(self, words, sentence):
        self.words = words
        self.sentence = sentence
        self.tokens = sentence.split()

    def get_tokens(self):
        return self.tokens

    def add_tokens(self, value):
        # append the chosen word to the running sentence
        self.tokens.append(value)
        return self.tokens

    def unigram_model(self):
        # draw candidate next words from the corpus; under the unigram
        # assumption each draw is independent of the sentence so far
        self.next_words = np.random.choice(self.words, size=1000)
        return self.next_words

Next we compute the relative frequency of each word in the sample drawn by the unigram model above and select the top three words by probability.

    def get_top_3_next_words(self, next_words):
        # count how often each sampled word appears
        next_words_dict = dict()
        for word in next_words:
            if word not in next_words_dict:
                next_words_dict[word] = 1
            else:
                next_words_dict[word] += 1
        # convert counts to relative frequencies
        for i, j in next_words_dict.items():
            next_words_dict[i] = np.round(j / len(next_words), 2)
        # sort by probability (then alphabetically) and keep the top three
        return sorted(next_words_dict.items(), key=lambda k: (k[1], k[0]), reverse=True)[:3]

    def model_selection(self):
        top_words = self.get_top_3_next_words(self.unigram_model())
        print("unigram-model")
        return top_words
start_sent = "the price of"  # example starting sentence; any prompt works
model = NGrams(words=words, sentence=start_sent)

for i in range(5):
    values = model.model_selection()
    print(values)
    value = input()          # type one of the suggested words
    model.add_tokens(value)

The model prints the top three candidate words with their probabilities at each step. We pick one of them to continue the starting sentence and repeat the process five times. The result is displayed below.

print(model.get_tokens())

The final step is to join the tokens produced by the unigram model into a single sentence.

print(" ".join(model.get_tokens()))

Final Thoughts

In this article, we have discussed the concept of the unigram model in Natural Language Processing. As a next step, bigram and trigram models can be explored to generate words after a sentence. I hope this article is useful to you.
