Spell checkers are software applications that check words against a digital dictionary to ensure they are spelled correctly. Words that the spell checker identifies as misspelled are usually highlighted or underlined. Among the many spell-checking tools and applications available, this post focuses on NeuSpell, a neural, Python-based spelling correction toolkit. The following key points are addressed in this article.
Table of Contents
- How does the Spell Checker Work?
- Under the Hood of NeuSpell
- Models in NeuSpell
- Implementation details of NeuSpell
- Implementing NeuSpell
Let’s start the discussion by understanding how various tools work for spell correction.
How does the Spell Checker Work?
When presenting a document to clients, professors, or any other audience, saying something smart and valuable is crucial. However, if your content is riddled with typos, misspellings, and errors, most people are likely to overlook it. Perfect copy is a sign of professionalism, and most businesses expect nothing less from their documentation. A spell checker program or the spell checking functions provided by a word processor are two useful tools that computer users can use to edit their documents.
The most common type of error in written text is misspelt words. As a result, spell checkers are commonplace, appearing in a variety of applications such as search engines, productivity and collaboration tools, and messaging platforms. However, many high-performing spelling correction systems are developed by businesses and trained on massive amounts of proprietary user data.
Many freely available off-the-shelf correctors, such as Enchant, GNU Aspell, and JamSpell, do not make effective use of the context around a misspelt word. For example, they fail to use context to decide whether thaught should become taught or thought: “Who thaught you calculus?” vs. “I never thaught I’d be given the fellowship.”
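To make the limitation concrete, here is a minimal sketch of a context-free corrector. It is not how Enchant, GNU Aspell, or JamSpell are implemented; the toy vocabulary, the function names, and the plain Levenshtein scoring are all assumptions chosen for illustration. Because the corrector looks at one word at a time, it has to return the same suggestion for thaught in both sentences.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


# Hypothetical toy vocabulary, only for this illustration.
VOCAB = ["taught", "thought", "who", "you", "calculus", "i", "never", "so"]


def correct_word(word: str) -> str:
    """Return the closest vocabulary word; ties go to the earlier entry."""
    if word in VOCAB:
        return word
    return min(VOCAB, key=lambda v: edit_distance(word, v))


def correct_tokens(tokens):
    """Correct each token in isolation, ignoring its neighbours."""
    return [correct_word(t) for t in tokens]


# "taught" and "thought" are both one edit away from "thaught",
# so the surrounding words cannot change the outcome.
print(correct_tokens(["who", "thaught", "you", "calculus"]))
print(correct_tokens(["i", "never", "thaught", "so"]))
```

Both calls pick the same replacement for thaught, which is exactly the behaviour a context-aware model is meant to avoid.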
Under the Hood of NeuSpell
In their paper, Sai Muralidhar Jayanthi et al. propose a spelling correction toolkit called NeuSpell. The toolkit consists of several neural models that accurately capture the context around misspellings. To train these neural spell correctors, the authors curate synthetic training data for spelling correction in context using several text noising strategies.
For word-level noising, these strategies use a lookup table, and for character-level noising, they use a context-based character-level confusion dictionary. To populate the lookup table and the confusion dictionary, the authors harvest isolated misspelling-correction pairs from various publicly available sources.
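The word-level strategy can be sketched roughly as follows. This is only an illustration of noising with a lookup table: the table contents, the function name noise_tokens, and the replace_prob parameter are assumptions, not values or APIs taken from NeuSpell.

```python
import random

# Hypothetical lookup table: correct word -> misspellings seen in the wild.
MISSPELLINGS = {
    "receive": ["recieve", "receve"],
    "forward": ["foward", "forwrd"],
    "definition": ["defintion", "defition"],
}


def noise_tokens(tokens, table, replace_prob=0.2, rng=None):
    """Swap some tokens for a known misspelling from the lookup table.

    replace_prob is an assumed knob controlling how often a token that
    has an entry in the table is actually replaced.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for tok in tokens:
        options = table.get(tok)
        if options and rng.random() < replace_prob:
            noisy.append(rng.choice(options))
        else:
            noisy.append(tok)
    return noisy


clean = ["i", "look", "forward", "to", "the", "definition"]
noisy = noise_tokens(clean, MISSPELLINGS, replace_prob=1.0, rng=random.Random(0))
# Each (noisy, clean) pair becomes one synthetic training example.
print(list(zip(noisy, clean)))
```

Keeping the clean sentence alongside the noised one yields aligned pairs, which is what a model needs in order to learn corrections in context.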
NeuSpell is an open-source toolkit for English spelling correction. The toolkit includes ten different models that are tested against naturally occurring misspellings from a variety of sources. When models are trained on the authors’ synthetic examples, correction rates improve by 9% (absolute) compared to training on randomly sampled character perturbations.
The correction rate increases by another 3% when richer contextual representations are used. The toolkit lets users run both the proposed and the existing spelling correction systems through a unified command line and a web interface.
Models in NeuSpell
The toolkit includes ten different spelling correction models: (i) two off-the-shelf non-neural models, (ii) four published neural models for spelling correction, and (iii) four extensions developed by the authors. Details of the existing systems follow:
SC-LSTM
It uses semi-character representations fed through a bi-LSTM network to correct misspelt words. The semi-character representations combine one-hot embeddings for the first, last, and bag of internal characters.
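A semi-character vector can be sketched in a few lines. The lowercase a-z alphabet, the ordering of the three parts, and the function names below are assumptions made for illustration; the actual model may handle digits, casing, and punctuation differently.

```python
import string

ALPHABET = string.ascii_lowercase
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    """One-hot vector for a single character (all zeros if unknown)."""
    vec = [0] * len(ALPHABET)
    if ch in INDEX:
        vec[INDEX[ch]] = 1
    return vec


def bag_of_chars(chars):
    """Character counts, ignoring order."""
    vec = [0] * len(ALPHABET)
    for ch in chars:
        if ch in INDEX:
            vec[INDEX[ch]] += 1
    return vec


def semi_character(word):
    """Concatenate one-hot(first), bag(internal), and one-hot(last)."""
    word = word.lower()
    if not word:
        return [0] * (3 * len(ALPHABET))
    return one_hot(word[0]) + bag_of_chars(word[1:-1]) + one_hot(word[-1])


# Swapping internal characters leaves the representation unchanged.
print(semi_character("form") == semi_character("from"))
```

Since the internal characters are stored as an unordered bag, transposition typos inside a word map to the same vector as the correctly spelled word, leaving the bi-LSTM to resolve the ambiguity from context.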
CHAR-LSTM-LSTM
The model creates word representations by feeding each character into a bi-LSTM. These representations are then fed into a second bi-LSTM that has been trained to predict the correction.
CHAR-CNN-LSTM
This model, like the previous one, builds word-level representations from individual characters, but it uses a convolutional network instead of a bi-LSTM to do so.
BERT
The model uses a pre-trained transformer network. Word representations are obtained by averaging the sub-word representations of each word, and are then fed to a classifier that predicts the word’s correction.
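The averaging step can be illustrated without any deep learning library. In this sketch, the function name average_subwords and the word_ids alignment list are assumptions; in practice the sub-word vectors would come from the transformer and the alignment from its tokenizer.

```python
def average_subwords(subword_vectors, word_ids):
    """Average the vectors of all sub-words that belong to the same word.

    subword_vectors: list of equal-length lists of floats, one per sub-word.
    word_ids: parallel list giving the index of the word each sub-word
              belongs to (an assumed alignment format).
    """
    sums, counts = {}, {}
    for vec, wid in zip(subword_vectors, word_ids):
        if wid not in sums:
            sums[wid] = [0.0] * len(vec)
            counts[wid] = 0
        sums[wid] = [s + v for s, v in zip(sums[wid], vec)]
        counts[wid] += 1
    return [[s / counts[wid] for s in sums[wid]] for wid in sorted(sums)]


# Word 0 is split into two sub-words, word 1 into a single one.
vectors = [[1.0, 3.0], [3.0, 5.0], [2.0, 2.0]]
print(average_subwords(vectors, [0, 0, 1]))  # [[2.0, 4.0], [2.0, 2.0]]
```

The result has exactly one vector per input word, so the classifier can emit one correction per word regardless of how the tokenizer split it.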
GNU Aspell
To score candidate words, it employs a combination of the Metaphone phonetic algorithm, Ispell’s near-miss strategy, and a weighted edit distance metric.
To better capture the context around a misspelt token, the authors enhanced the SC-LSTM model with deep contextual representations from pre-trained ELMo and BERT. Because the best point to integrate such embeddings varies by task, they append them either to the semi-character embeddings before these are fed to the bi-LSTM, or to the bi-LSTM’s output. The toolkit currently includes four such trained models: ELMo or BERT coupled with a semi-character-based bi-LSTM model at the input or at the output.
Implementation details of NeuSpell
In NeuSpell, neural models are trained by treating spelling correction as a sequence labelling task, in which a correct word is labelled as itself and a misspelt word is labelled as its correction. Labels that are not in the vocabulary are marked as UNK. The models are trained to output, through a softmax layer, a probability distribution over a finite vocabulary for each word in the input text sequence.
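The labelling scheme can be sketched as follows. The UNK string, the function name make_labels, and the toy vocabulary are assumptions for illustration and are not taken from the NeuSpell code base.

```python
UNK = "<UNK>"  # assumed placeholder string for out-of-vocabulary labels


def make_labels(noisy_tokens, clean_tokens, vocab):
    """Build one label per input token for sequence labelling.

    The label is always the clean word at that position: a correct input
    word is labelled as itself, a misspelt one as its correction. Clean
    words missing from the output vocabulary are mapped to UNK.
    """
    if len(noisy_tokens) != len(clean_tokens):
        raise ValueError("noisy and clean sequences must be aligned")
    return [clean if clean in vocab else UNK for clean in clean_tokens]


noisy = ["i", "luk", "foward", "to", "receving", "your", "reply"]
clean = ["i", "look", "forward", "to", "receiving", "your", "reply"]
vocab = {"i", "look", "forward", "to", "receiving", "your"}  # "reply" left out

print(make_labels(noisy, clean, vocab))
```

Because the input and the label sequence have the same length, the model only has to make one vocabulary-sized classification decision per token.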
During training, the authors used convolution filters of sizes 50, 100, 100, and 100 with lengths 2, 3, 4, and 5, respectively, in the CNNs, and set the hidden size of the bi-LSTM network in all models to 512. Dropout with a rate of 0.4 was applied to the bi-LSTM outputs, and the models were trained using cross-entropy loss.
For models with a BERT component, the authors used the BertAdam optimizer, and for the rest, the Adam optimizer, both with their default parameter settings. They used a batch size of 32 examples and trained with a patience of 3 epochs.
During inference, UNK predictions are replaced with their corresponding input words before the results are evaluated. The models are then assessed on accuracy (the percentage of correct words among all words) and on word correction rate (the percentage of misspelt tokens that are corrected).
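A minimal sketch of this post-processing and of the two metrics, assuming token-aligned inputs, predictions, and targets. The function names and the UNK string are hypothetical, and this is not the evaluation code shipped with the toolkit.

```python
UNK = "<UNK>"  # assumed placeholder string for out-of-vocabulary predictions


def replace_unk(predictions, inputs):
    """Fall back to the input word wherever the model predicted UNK."""
    return [inp if pred == UNK else pred
            for pred, inp in zip(predictions, inputs)]


def evaluate(inputs, predictions, targets):
    """Return (accuracy, word correction rate) for one aligned sequence."""
    preds = replace_unk(predictions, inputs)
    # Accuracy: correct words among all words.
    correct = sum(p == t for p, t in zip(preds, targets))
    accuracy = correct / len(targets) if targets else 0.0
    # Correction rate: only tokens that were misspelt in the input count.
    misspelt = [(p, t) for i, p, t in zip(inputs, preds, targets) if i != t]
    fixed = sum(p == t for p, t in misspelt)
    correction_rate = fixed / len(misspelt) if misspelt else 0.0
    return accuracy, correction_rate


inputs = ["i", "luk", "foward", "to", "receving", "your", "reply"]
predictions = ["i", "look", UNK, "to", "receiving", "your", UNK]
targets = ["i", "look", "forward", "to", "receiving", "your", "reply"]

accuracy, correction_rate = evaluate(inputs, predictions, targets)
print(f"accuracy={accuracy:.3f}, correction rate={correction_rate:.3f}")
```

In this toy case, the UNK fallback rescues the correctly spelled reply but leaves foward uncorrected, so six of seven words are right and two of three misspellings are fixed.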
The AllenNLP and Hugging Face libraries were used for ELMo and BERT, respectively. All of the neural models in this toolkit are implemented with the PyTorch library, and they are compatible with both CPU and GPU environments.
Now let’s see how we can implement NeuSpell.
Implementing NeuSpell
To proceed, we need to install NeuSpell. We can either clone its official repository and install the dependencies from the requirements file mentioned there, or install it directly with pip: pip install neuspell.
Import all the dependencies
import neuspell
from neuspell import BertChecker, CnnlstmChecker
Now instantiate the BertChecker Class and download the pre-trained model.
checker_bert = BertChecker()
# Download the pre-trained BERT model
checker_bert.from_pretrained()
Now let’s take some samples of incorrectly spelled sentences and see how the model can correct them.
checker_bert.correct("I luk foward to receving your reply")
And here is the output.
Let’s take another example:
checker_bert.correct_strings(["Thee wors are often used together. You can go to the defition of spellig or the defintion of mistae. Or, see other combintions with mistke.", ])
A nice feature of this toolkit is that we can pass a text file directly and get the cleaned version back as a text file, using a single line of code as below.
checker_bert.correct_from_file(src="/content/History_100.txt")
The above code returns a clean_version.txt in the local directory.
We can also go a step further and evaluate our text files. To do so, we need to pass the original clean file and the corrupted file to checker_bert.evaluate() as shown below.
checker_bert.evaluate(clean_file="clean_version.txt", corrupt_file="History_100.txt")
Final Words
Through this post, we have seen how spell-checking tools can play a vital role. Against the backdrop of the various toolkits available, we discussed NeuSpell, a spelling correction toolkit with ten different models. Unlike popular open-source spell checkers, NeuSpell’s models accurately capture the context around misspelt words, and we have seen how to use them in practice.