Google presented the ‘Fast WordPiece Tokenization’ at EMNLP 2021, where they developed an improved end-to-end WordPiece tokenisation system. It has the capability to speed up the tokenisation process, saving computing resources and reducing the overall model latency. In comparison to traditional algorithms, this approach improved performance up to 8x faster as it reduces the complexity of the computation by order of magnitude. Google has applied this in a number of systems and has released it in TensorFlow Text.
Using the greedy longest-match-first strategy to tokenise a single word, WordPiece iteratively picks the longest prefix of the remaining text that matches a word in the model’s vocabulary. This approach is called MaxMatch and has been used in Chinese word segmentation since the 1980s. Despite its wide use in NLP, it is still computation intensive. Google has proposed an alternative to this – the LinMaxMatch, which has a tokenisation time that is strictly linear with respect to n.
In other terms, if tire-matching cannot match an input character for a given node, the standard algorithm backtracks to the last character where a token was matched and restarts the trie matching procedure from there. This results in repetitive and wasteful iterations. Instead of backtracking, Google’s method triggers a ‘failure transition’ in two steps. It first collects the precomputed tokens stored at that node, known as ‘failure pops’ and then follows the precomputed ‘failure link’ to a new node from where the trie matching process continues. As n operations are required to read the entire input, LinMaxMatch is asymptotically optimal for the MaxMatch problem.
As the existing systems pre-tokenise the input text by splitting it into words by whitespace, punctuation and characters, and then call WordPiece tokenisation on each resulting word, Google has proposed an end-to-end WordPiece tokeniser. It combines pre-tokenisation and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie-matching and failure transitions and only checks for whitespace and punctuation characters among the few input characters that are not handled by the loop. This makes it more efficient as it traverses the input only once, performs fewer whitespace/punctuation checks, and skips the creation of intermediate words.