In natural language processing (NLP), most use cases involve text classification. Tasks such as sentiment analysis and positive/negative review detection depend entirely on classification modelling. There are, however, situations where the target is a number rather than a category, and applying regression modelling to text data is known as text regression. In this article, we discuss text regression and how to implement it. The major points covered are listed below.
Table of contents
- What is text regression
- Implementing text regression
- Importing and preprocessing data
- Text regression model
- Predicting values
Let’s start by understanding text regression.
What is text regression?
We can think of text regression as a method that uses attributes extracted from text data as covariates in a regression model. Many fields call for this kind of analysis, for example predicting a salary from the text of a job description, or predicting the number of views a web page receives from its content. The basic difference between text classification and text regression is the target variable: in classification the target is categorical, whereas in regression it is a continuous numerical value. In this article, we aim to learn how to apply this machine learning method to natural language data.
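To make the idea of "text attributes as covariates" concrete, here is a minimal sketch in plain Python. It is not the method used later in this article: it assumes simple bag-of-words counts as features and a linear model fitted by gradient descent, and the job titles, salary figures, and function names are all made up for illustration.

```python
# Toy text regression: word counts are the covariates, a number is the target.
# All data and helper names below are hypothetical.

def tokenize(text):
    return text.lower().split()

def build_vocab(texts):
    # Map every distinct token to a column index.
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab):
    # Bag-of-words count vector; unknown tokens are ignored.
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def predict(weights, bias, vec):
    return bias + sum(w * x for w, x in zip(weights, vec))

def fit_linear(X, y, lr=0.05, epochs=2000):
    # Batch gradient descent on the mean squared error.
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for vec, target in zip(X, y):
            err = predict(weights, bias, vec) - target
            grad_b += err
            for j, x in enumerate(vec):
                grad_w[j] += err * x
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

texts = [
    "senior python engineer",
    "junior python engineer",
    "senior data scientist",
    "junior data analyst",
]
salaries = [120.0, 70.0, 130.0, 60.0]  # hypothetical targets, in thousands

vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]
weights, bias = fit_linear(X, salaries)

senior_pred = predict(weights, bias, vectorize("senior engineer", vocab))
junior_pred = predict(weights, bias, vectorize("junior engineer", vocab))
print(senior_pred, junior_pred)
```

The point of the sketch is only that the model outputs a real number for each text; the AutoKeras model used below learns its own text representation instead of relying on hand-built counts.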
Implementing text regression
To implement a text regression model, we first need data. For this article, we use the IMDB dataset. Let’s start by importing and preprocessing the data.
Importing and preprocessing data
Real-life text data with a numeric target is hard to come by, so we use a standard NLP benchmark, the IMDB dataset. IMDB is a classification dataset, so we need to turn it into a regression problem: we treat the labels 0 and 1 as numerical values and use them as the target variable. The dataset is available from the URL used in the code below. Let’s download the data.
import tensorflow as tf
IMDB = tf.keras.utils.get_file(
fname="aclImdb.tar.gz",
origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
extract=True,
)
Setting the path to IMDB data.
import os
IMDB_DIR = os.path.join(os.path.dirname(IMDB), "aclImdb")
Extracting test and training files from the downloaded data.
from sklearn.datasets import load_files
labels = ["pos", "neg"]
train = load_files(
os.path.join(IMDB_DIR, "train"), shuffle=True, categories=labels
)
test = load_files(
os.path.join(IMDB_DIR, "test"), shuffle=False, categories=labels
)
Converting the texts and labels to NumPy arrays
import numpy as np
x_train = np.array(train.data)
y_train = np.array(train.target)
x_test = np.array(test.data)
y_test = np.array(test.target)
Let’s check the data
print(x_train.shape)
print(y_train.shape)
print(x_train[10][:50])
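Both shapes should be (25000,), since the IMDB training split holds 25,000 labelled reviews, and the third line prints the start of one review. With that, data collection and preprocessing are complete, and we are ready to fit a regression model on the data.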
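One detail worth keeping in mind: load_files returns each document as raw bytes unless an encoding argument is passed, so the review excerpt printed above is a bytes literal. If plain strings are preferred for later steps, a small helper along these lines could decode them. The name decode_reviews and the sample inputs are hypothetical and not part of the original walkthrough.

```python
def decode_reviews(raw_reviews, encoding="utf-8"):
    """Decode bytes items to str, replacing invalid bytes; leave str items as-is."""
    return [
        r.decode(encoding, errors="replace") if isinstance(r, bytes) else r
        for r in raw_reviews
    ]

# Made-up sample mixing bytes and str, as a stand-in for train.data
sample = [b"Great film, would watch again.", "already a string"]
decoded = decode_reviews(sample)
print(decoded)
```

The same helper could be applied to train.data and test.data before they are converted to arrays.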
Text regression model
In this article, we use a class from the AutoKeras library; more information can be found in the AutoKeras documentation. The TextRegressor class in this library lets us perform regression on text data. We can install the library in our environment using the following line of code.
!pip install autokeras
Once the installation is completed we can use the class by importing it. Let’s see how we can do it.
import autokeras
from autokeras import TextRegressor
text_reg = TextRegressor(overwrite=True, max_trials=1)
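In the above code, we import the TextRegressor class and create a model instance. The max_trials argument sets how many candidate models the search tries; it is set to 1 here to keep the run short, and a larger value such as 10 would explore more candidates. Setting overwrite=True discards the results of any previous search saved under the same project.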
Let’s fit the model with 5 epochs.
text_reg.fit(x_train, y_train, epochs=5)
While the model is being fitted, AutoKeras prints a training log (not reproduced here) that reports the loss and mean_squared_error, together with their validation counterparts, at every epoch. AutoKeras is an automated machine learning library: it helps us obtain an optimized model through hyperparameter tuning and model testing. The epochs argument is optional; if it is left unspecified, AutoKeras adapts the number of epochs automatically.
Predicting values
After fitting the model we are ready to predict on test data using the optimized model.
y_pred = text_reg.predict(x_test)
print(y_pred)
The values in the prediction array are floats. Let’s compare them with the values in the test set.
print(y_test)
The values in the test set are binary (0 or 1). As discussed, the model treated the target as a numerical value and predicted continuous values accordingly, as in any regression model.
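To put a number on how close the float predictions are to the binary targets, one could compute the mean squared error and, by thresholding at 0.5, a rough accuracy. The following is a minimal sketch in plain Python with made-up demo values; the function names are illustrative, and it assumes the predictions have first been flattened into a simple list of floats.

```python
def mean_squared_error(y_true, y_pred):
    # Average squared difference between targets and predictions.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def threshold_accuracy(y_true, y_pred, threshold=0.5):
    # Turn each float prediction into 0 or 1, then count matches.
    hits = sum(int(p >= threshold) == t for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Made-up values standing in for y_test and the flattened y_pred
y_true_demo = [1, 0, 1, 0]
y_pred_demo = [0.9, 0.2, 0.4, 0.1]

mse = mean_squared_error(y_true_demo, y_pred_demo)
acc = threshold_accuracy(y_true_demo, y_pred_demo)
print(mse, acc)
```

The 0.5 threshold is an arbitrary choice that only makes sense here because the original labels were 0 and 1; for a genuinely continuous target, the mean squared error alone would be the natural summary.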
Final words
In this article, we discussed text regression, a method of performing regression analysis on text data, and we looked at the TextRegressor class from the AutoKeras library, which makes a text regression model easy to set up, fit, and use for prediction.