Humans do not start learning everything from scratch; they relate new things to what they already know and make inferences from that. For example, someone who already knows how to ride a bicycle does not need to relearn braking or other basics when learning to ride a motorcycle; they just add the extra information to what they already know. Traditional neural networks cannot do this, and this shortcoming is what recurrent neural networks (RNNs) address. RNNs are networks with loops that allow information to persist, and LSTMs (long short term memory networks) are a special kind of recurrent neural network that is very useful when dealing with sequential data such as time series and NLP data. There are various types of LSTM models; broadly, we can divide them into three kinds:
- Forward LSTM
- Backward LSTM
- Bidirectional LSTM (Bi-LSTM)
As the names suggest, forward and backward LSTMs are unidirectional: they process the sequence in only one direction, either forwards or backwards, whereas a bidirectional LSTM processes the data in both directions to persist the information.
What is Bi-LSTM?
Bidirectional long short term memory (Bi-LSTM) is a type of LSTM model that processes the data in both the forward and the backward direction. This flow of data in both directions is what makes a Bi-LSTM different from other LSTMs.
For example, consider the sentence ‘I love the movie, it was a fantastic feeling watching it’ and two classes, like and hate. From the meaning of the sentence we need to decide which class it belongs to. A unidirectional LSTM layer will read the sentence in one direction only, either ‘I love … watching it’ or ‘it watching … love I’, and persist the information of that single sequence. A bidirectional LSTM layer processes the sentence in both directions and persists information of both kinds, start-to-end and end-to-start, so that whenever words like ‘love’ or ‘fantastic’ appear again in any sentence the model can classify it into the like class.
The image below represents a single forward LSTM layer.
And the below image represents a Bi-LSTM model.
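To make this concrete, here is a minimal sketch (in Keras, which we also use later in this article) showing that wrapping an LSTM in a Bidirectional layer runs it over the sequence in both directions and concatenates the two outputs, so the feature dimension doubles; the layer sizes here are arbitrary and chosen only for illustration.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional

# A plain forward-only LSTM: one pass over the sequence, 64 output features.
uni = Sequential([Embedding(10000, 128, input_length=200), LSTM(64)])
print(uni.output_shape)   # (None, 64)

# The same LSTM wrapped in Bidirectional: the forward and backward outputs
# are concatenated, so the feature dimension doubles to 128.
bi = Sequential([Embedding(10000, 128, input_length=200), Bidirectional(LSTM(64))])
print(bi.output_shape)    # (None, 128)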
This article focuses on Bi-LSTM with attention. To learn more about Bi-LSTM in depth, you can go to this article, where I have explained more about Bi-LSTM and how to develop it. The one thing we do differently here is use an attention layer to make the model more accurate.
Attention Mechanism
The attention mechanism is one of the most valuable breakthroughs in deep learning in recent years and has been used broadly in NLP problems.
So what does attention mean here?
Let’s consider an example where we need to recognize one person in a group photograph of people we know. To find that one person, our mind generates an image of them and matches it against the faces in the photo. In other words, our mind is paying attention only to the generated image of that person while ignoring everyone else. Focusing on only one person in a group like this can be considered attention.
Before the introduction of the attention mechanism, the basic LSTM or RNN model was based on an encoder-decoder system. The encoder processes the input data into a context vector, which acts as a summary of the input. This summary then goes through the decoder, which interprets the data and produces the translation.
If the summary of the data is not good, the decoder understands and translates the data poorly. This is the shortcoming of the basic model: its accuracy is good for short inputs, but for long inputs the summarization does not work well and the model gives bad results. This is called the long-range dependency problem of RNNs and LSTMs.
Let’s say we need to predict the next word from the context of the sentence “despite paying from google pay, he started paying from phonepe because he is more comfortable with phonepe.” In this group of words we want to predict ‘comfortable’ using the words ‘paying’ and ‘phonepe’. Here we need to give more weight to ‘phonepe’ and ‘paying’ and less to ‘google pay’. A basic LSTM gets confused between the words and can sometimes predict the wrong word.
So whenever this type of situation occurs, the decoding step needs to search the encoded sequence for the most relevant information; this idea is called ‘attention’.
A simple structure of the bidirectional LSTM model with attention can be represented by the above image.
The encoder states read and summarize the sequential data, the attention layer assigns weights to the summarized parts, and the decoder state can then translate it better and the model can predict it better.
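As a rough, hypothetical sketch of this idea (not the exact attention layer we build later), the code below scores each encoder state against a query vector, turns the scores into weights with a softmax, and returns the weighted sum, so the most relevant time steps contribute the most to the context passed on to the decoder.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical encoder states: 5 time steps with 4 features each.
encoder_states = np.random.randn(5, 4)
# A query vector, e.g. the current decoder state.
query = np.random.randn(4)

scores = encoder_states @ query                             # one relevance score per time step
weights = softmax(scores)                                   # attention weights, summing to 1
context = (weights[:, None] * encoder_states).sum(axis=0)   # weighted summary of the states
print(weights, context.shape)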
Implementation of Bi-LSTM with Attention
Next in the article we will implement a simple Bi-LSTM model and a Bi-LSTM model with attention and see the difference in the results.
Importing the libraries.
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.datasets import imdb
In this process I am using the IMDB dataset provided by keras.datasets, which contains classified movie reviews from viewers. To know more about the data, the reader can go to this link.
Importing the dataset.
n_unique_words = 10000
(x_train, y_train),(x_test, y_test) = imdb.load_data(num_words=n_unique_words)
Output:
Sequencing the data.
maxlen = 200
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
y_train = np.array(y_train)
y_test = np.array(y_test)
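As a quick check of my own (not part of the original walkthrough), the shapes after padding should show 25,000 training and 25,000 test reviews, each truncated or padded to 200 tokens.

# Every review is now a fixed-length integer sequence of length maxlen.
print(x_train.shape, y_train.shape)   # expected: (25000, 200) (25000,)
print(x_test.shape, y_test.shape)     # expected: (25000, 200) (25000,)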
Defining a Bi-LSTM model.
model = Sequential()
model.add(Embedding(n_unique_words, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Output:
Here we have made a model without an attention mechanism. Let’s see the results.
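The training call for this first model is not shown above, so here is a sketch of how it can be trained; the batch size of 64 is an assumed value (it is not specified in the article), while the 12 epochs match the runs later on.

batch_size = 64  # assumed value, reused by the later fit calls
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=12,
                    validation_data=(x_test, y_test))
print(history.history['loss'])
print(history.history['accuracy'])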
Here we can see the losses and the accuracy of the model. Now we will define an attention layer.
Importing the libraries.
from keras.layers import *
from keras.models import *
from keras import backend as K
Defining the attention class.
class attention(Layer):
    def __init__(self, return_sequences=True):
        self.return_sequences = return_sequences
        super(attention, self).__init__()

    def build(self, input_shape):
        # One weight per input feature and one bias per time step.
        self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                                 initializer="normal")
        self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                                 initializer="zeros")
        super(attention, self).build(input_shape)

    def call(self, x):
        # Score each time step, normalize the scores and weight the inputs.
        e = K.tanh(K.dot(x, self.W) + self.b)
        a = K.softmax(e, axis=1)
        output = x * a
        if self.return_sequences:
            return output
        return K.sum(output, axis=1)
Here I have defined a class called attention in which the two main functions are build() and call(). Let’s see what these functions do for the mechanism.
def build(self, input_shape):
    self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                             initializer="normal")
    self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                             initializer="zeros")
Inside build(), the function defines the weights and biases. If an LSTM layer’s output shape is (None, 64, 128), then the attention weight W will have shape (128, 1) and the bias b will have shape (64, 1).
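For instance, building the layer directly on that input shape should confirm this (a quick check of my own):

layer = attention()
layer.build((None, 64, 128))
print(layer.W.shape, layer.b.shape)   # expected: (128, 1) (64, 1)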
def call(self, x):
    e = K.tanh(K.dot(x, self.W) + self.b)
    a = K.softmax(e, axis=1)
    output = x * a
In call(), the function takes the product of the inputs and the weights, adds the bias term, and passes the result through ‘tanh’, which is followed by a softmax over the time axis. The softmax turns these scores into attention weights (the alignment scores), which are then multiplied with the inputs.
So if in the model we set return_sequences to True in our attention layer, the layer will return a 3D output when the input is 3D; with return_sequences=False it sums over the time axis and returns a 2D output.
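As a quick sanity check of this behaviour (my own addition, using the same layer sizes as the models below), feeding a Bi-LSTM output through the attention layer with each setting should show the following shapes.

from keras.layers import Input, Embedding, LSTM, Bidirectional

inp = Input(shape=(maxlen,))
seq = Bidirectional(LSTM(64, return_sequences=True))(Embedding(n_unique_words, 128)(inp))
print(seq.shape)                                      # (None, 200, 128)
print(attention(return_sequences=True)(seq).shape)    # (None, 200, 128): 3D in, 3D out
print(attention(return_sequences=False)(seq).shape)   # (None, 128): summed over time, 2D out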
Making a model using attention where return_sequences is True.
model2 = Sequential()
model2.add(Embedding(n_unique_words, 128, input_length=maxlen))
model2.add(Bidirectional(LSTM(64, return_sequences=True)))
model2.add(attention(return_sequences=True)) # receive 3D and output 3D
model2.add(Dropout(0.5))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()
Output:
Here we can see that we have added an attention layer to the model. Let’s check the results.
history3d = model2.fit(x_train, y_train,
                       batch_size=batch_size,
                       epochs=12,
                       validation_data=[x_test, y_test])
print(history3d.history['loss'])
print(history3d.history['accuracy'])
Output:
Here we can see that the accuracy and losses of the model on the data have changed drastically: with the plain Bi-LSTM model we got an accuracy of around 72% for 12 epochs, and after adding attention to the model the accuracy increased to 99% and the loss decreased to 0.0285.
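To compare the two runs visually, the recorded histories can be plotted as sketched below; this assumes matplotlib is available and that the baseline history from the earlier training sketch was stored in the variable history.

import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='Bi-LSTM')                # baseline model
plt.plot(history3d.history['accuracy'], label='Bi-LSTM + attention')  # model with attention
plt.xlabel('epoch')
plt.ylabel('training accuracy')
plt.legend()
plt.show()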
We can also check the model for 2D outputs by setting return_sequences to False.
model3 = Sequential()
model3.add(Embedding(n_unique_words, 128, input_length=maxlen))
model3.add(Bidirectional(LSTM(64, return_sequences=True)))
model3.add(attention(return_sequences=False)) # receive 3D and output 2D
model3.add(Dropout(0.5))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.summary()
Output:
Let’s check the performance of the model.
history2d = model3.fit(x_train, y_train,
                       batch_size=batch_size,
                       epochs=12,
                       validation_data=[x_test, y_test])
print(history2d.history['loss'])
print(history2d.history['accuracy'])
Output:
Here we can again see that the accuracy has reached around 99% and the loss is also lower.
In this article we have seen how the Bi-LSTM model works in both directions, and we have seen how the attention mechanism boosts the performance of the model. Attention can be used with any RNN model. Keras also provides a built-in attention layer, which you can check here (a minimal usage sketch is shown below). I encourage you to use it with real-life data and different models to see how we can improve the results further.
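As a rough usage sketch of that built-in layer (my own example, using keras.layers.Attention as self-attention over the Bi-LSTM outputs instead of the custom class above), a functional-API model mirroring model3 could look like this.

from keras.layers import Input, Embedding, LSTM, Bidirectional, Attention
from keras.layers import GlobalAveragePooling1D, Dropout, Dense
from keras.models import Model

inputs = Input(shape=(maxlen,))
x = Embedding(n_unique_words, 128)(inputs)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = Attention()([x, x])          # built-in dot-product attention used as self-attention
x = GlobalAveragePooling1D()(x)  # collapse the time axis to get a 2D tensor
x = Dropout(0.5)(x)
outputs = Dense(1, activation='sigmoid')(x)

model4 = Model(inputs, outputs)
model4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()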