Google launches TensorFlow.Text – Text processing in TensorFlow
Google has unveiled TensorFlow.Text (TF.Text), a newly launched library for preprocessing text for language models, built on TensorFlow, the company’s end-to-end open source platform for machine learning (ML).
“TensorFlow provides a wide breadth of ops that greatly aid in building models from images and video. However, there are many models that begin with text and the language models built from these require some preprocessing before the text can be fed into the model,” explained Robby Neale, a software engineer on the TensorFlow team, in a blog post on Medium.
“TF.Text is a TensorFlow 2.0 library that can be easily installed using PIP and is designed to ease this problem by providing ops to handle the preprocessing regularly found in text-based models, and other features useful for language modeling not provided by core TensorFlow,” he added.
The advantage of using these ops for text preprocessing is that they are executed inside the TensorFlow graph, so the same preprocessing runs at both training and serving time. TF.Text also includes tokenizers that break a string of text into tokens such as words, numbers, and punctuation.
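To illustrate the kind of output such a tokenizer produces, here is a plain-Python sketch (not the TF.Text API, which runs as ops inside the TensorFlow graph) that splits text into word, number, and punctuation tokens:

```python
import re

# Illustrative only: a regex-based tokenizer that separates words,
# numbers, and punctuation, mimicking the token boundaries a text
# preprocessing op would emit. TF.Text itself performs this work
# with graph ops rather than Python regexes.
TOKEN_RE = re.compile(r"\d+|\w+|[^\w\s]")

def simple_tokenize(text):
    """Return a list of word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("It costs $12.50, right?"))
# ['It', 'costs', '$', '12', '.', '50', ',', 'right', '?']
```

Doing the equivalent work in-graph means the model's exported graph carries its own preprocessing, instead of relying on matching Python code at serving time.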
It can tokenize on white space, on Unicode script boundaries, and into predetermined sequences of word fragments such as suffixes and prefixes that Google calls “wordpieces” – the subword units commonly used in BERT models (a pretraining technique for language models). Along with the tokenizers, the library also includes ops for normalization, n-grams, sequence constraints for labeling, and more.
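The idea behind wordpiece tokenization can be sketched in plain Python as a greedy longest-match lookup: repeatedly take the longest fragment of the word that appears in the vocabulary, marking continuation pieces with a `##` prefix as BERT does. The vocabulary below is a hypothetical toy example (real wordpiece vocabularies are learned from data), and this is a simplified sketch of the algorithm, not TF.Text's implementation:

```python
# Sketch of greedy longest-match wordpiece tokenization.
# The vocabulary here is a made-up toy; real vocabularies are learned.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching vocabulary pieces,
    left to right. Continuation pieces get a '##' prefix, as in BERT."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no fragment matched: emit the unknown token
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

Because rare words decompose into known fragments, a model can cover a large effective vocabulary with a compact fixed set of pieces.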
The news of TF.Text comes just days after the beta release of TensorFlow 2.0, which brings a streamlined set of APIs, deeper Keras integration, and runtime improvements for eager execution.