Last updated December 14, 2021
In AI News

Google Adds Fast Wordpiece Tokenization To Tensorflow

Google’s LinMaxMatch approach improves performance, makes computation faster and reduces complexity

Share

Published on December 14, 2021

by Meeta Ramnani

Google presented the ‘Fast WordPiece Tokenization’ at EMNLP 2021, where they developed an improved end-to-end WordPiece tokenisation system. It has the capability to speed up the tokenisation process, saving computing resources and reducing the overall model latency. In comparison to traditional algorithms, this approach improved performance up to 8x faster as it reduces the complexity of the computation by order of magnitude. Google has applied this in a number of systems and has released it in TensorFlow Text.

Using the greedy longest-match-first strategy to tokenise a single word, WordPiece iteratively picks the longest prefix of the remaining text that matches a word in the model’s vocabulary. This approach is called MaxMatch and has been used in Chinese word segmentation since the 1980s. Despite its wide use in NLP, it is still computation intensive. Google has proposed an alternative to this – the LinMaxMatch, which has a tokenisation time that is strictly linear with respect to n.

In other terms, if tire-matching cannot match an input character for a given node, the standard algorithm backtracks to the last character where a token was matched and restarts the trie matching procedure from there. This results in repetitive and wasteful iterations. Instead of backtracking, Google’s method triggers a ‘failure transition’ in two steps. It first collects the precomputed tokens stored at that node, known as ‘failure pops’ and then follows the precomputed ‘failure link’ to a new node from where the trie matching process continues. As n operations are required to read the entire input, LinMaxMatch is asymptotically optimal for the MaxMatch problem.

As the existing systems pre-tokenise the input text by splitting it into words by whitespace, punctuation and characters, and then call WordPiece tokenisation on each resulting word, Google has proposed an end-to-end WordPiece tokeniser. It combines pre-tokenisation and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie-matching and failure transitions and only checks for whitespace and punctuation characters among the few input characters that are not handled by the loop. This makes it more efficient as it traverses the input only once, performs fewer whitespace/punctuation checks, and skips the creation of intermediate words.

📣 Want to advertise in AIM? Book here

Meeta Ramnani

Meeta’s interest lies in finding out real practical applications of technology. At AIM, she writes stories that question the new inventions and the need to develop them. She believes that technology has and will continue to change the world very fast and that it is no more ‘cool’ to be ‘old-school’. If people don’t update themselves with the technology, they will surely be left behind.