This is the second part of the article. Please read the first part here – Semantic Similarity in Natural Language Processing. Part 1.
Custom model
It is time to change the approach and try training a custom model tailored for a specific task. Below I will provide an example of a relatively simple Siamese network trained on a dataset from Kaggle.
This is not something we would use in a production-grade system. Unfortunately, I am unable to share either the architecture of the models or the datasets we’ve used to train models in production. Having said that, I think the example below will give you a strong intuition about what a semantic similarity model could look like.
Assuming we will start from scratch, let’s import the required libraries.
!pip install -q -U trax #install trax because Colab VMs do not have it pre-installed
import pandas as pd #pandas for loading and working with dataset
import numpy as np #numpy for operations with matrices and other math
import os #os for operations with files
import nltk #nltk for tokenization
nltk.download("punkt") #download the nltk Punkt tokenizer models (used by word_tokenize)
import trax #trax and its components
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp
from collections import defaultdict #to create dictionary-like objects with a default value
from functools import partial #fix part of a function's arguments and get a new function (we will use it to create the loss layer)
import random
random.seed(111)
Download the dataset from Kaggle and load it:
data = pd.read_csv('<path/to/the/dataset.csv>')
If you run data.head(), you will see that our dataset contains pairs of questions and a label indicating whether a particular pair is a duplicate.
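As a quick, purely illustrative sanity check (assuming the question1, question2, and is_duplicate columns used throughout this article), you can inspect the shape and the columns we rely on:
#illustrative check of the columns used in the rest of the article
print(data.shape)
print(data[["question1", "question2", "is_duplicate"]].head())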
Divide the dataset into train and test parts:
n_train = 300000
n_test = 10240
data_train = data[:n_train]
data_test = data[n_train:n_train+n_test]
del(data)
Select indexes of all duplicate questions:
train_idx = (data_train["is_duplicate"] == 1).to_numpy()
train_idx = [i for i,x in enumerate(train_idx) if x]
Turn the train dataset into NumPy arrays so we can feed them to the model:
Q1_train_words = np.array(data_train["question1"][train_idx])
Q2_train_words = np.array(data_train["question2"][train_idx])
Prepare the test dataset the same way:
Q1_test_words = np.array(data_test["question1"])
Q2_test_words = np.array(data_test["question2"])
y_test = np.array(data_test["is_duplicate"])
Next, we will create arrays that we will fill with tokenized sentences:
#train
Q1_train = np.empty_like(Q1_train_words)
Q2_train = np.empty_like(Q2_train_words)
#test
Q1_test = np.empty_like(Q1_test_words)
Q2_test = np.empty_like(Q2_test_words)
Create a vocabulary dictionary. With defaultdict, all out-of-vocabulary tokens will map to 0.
vocab = defaultdict(lambda: 0)
Set the padding token to 1. In case you are not familiar with padding, here it is in brief: sentences have different lengths, which makes it difficult to feed them to the model as a single tensor. To overcome this challenge, we take the length of the longest sentence in a batch and pad all the others to the same length.
vocab["<PAD>"] = 1
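Here is a minimal sketch of what padding does; the token ids are made up for illustration:
#two tokenized questions of different lengths (hypothetical ids)
toy_q1 = [4, 17, 23]
toy_q2 = [4, 9]
#pad both to the length of the longest one using the <PAD> id
max_len = max(len(toy_q1), len(toy_q2))
toy_q1 = toy_q1 + [vocab["<PAD>"]] * (max_len - len(toy_q1))
toy_q2 = toy_q2 + [vocab["<PAD>"]] * (max_len - len(toy_q2))
print(toy_q1, toy_q2)  #[4, 17, 23] [4, 9, 1]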
Tokenize the train sentences and add their words to the vocabulary:
for idx in range(len(Q1_train_words)):
    Q1_train[idx] = nltk.word_tokenize(Q1_train_words[idx])
    Q2_train[idx] = nltk.word_tokenize(Q2_train_words[idx])
    q = Q1_train[idx] + Q2_train[idx]
    #build vocabulary where keys are words and values are indexes
    for word in q:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
Tokenize test sentences:
for idx in range(len(Q1_test_words)):
    Q1_test[idx] = nltk.word_tokenize(Q1_test_words[idx])
    Q2_test[idx] = nltk.word_tokenize(Q2_test_words[idx])
Now we will convert the arrays of words (tokenized sentences) into arrays of integers. Remember that we’ve created a vocabulary that maps each word in the train set to a number, and that any word we did not see in the training data will be replaced by 0.
#convert train dataset tokenized sentences to integers
for i in range(len(Q1_train)):
    Q1_train[i] = [vocab[word] for word in Q1_train[i]]
    Q2_train[i] = [vocab[word] for word in Q2_train[i]]
#convert test dataset tokenized sentences to integers
for i in range(len(Q1_test)):
    Q1_test[i] = [vocab[word] for word in Q1_test[i]]
    Q2_test[i] = [vocab[word] for word in Q2_test[i]]
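To see what the conversion produced, you can print the first training question both as words and as token ids (the exact output will vary with your data):
#illustration: the first training question as words and as token ids
print(Q1_train_words[0])
print(Q1_train[0])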
For the model’s training, we will need train and validation data, so let’s take part of the train data for validation purposes.
split = int(len(Q1_train) * 0.8)
train_Q1, train_Q2 = Q1_train[:split], Q2_train[:split]
val_Q1, val_Q2 = Q1_train[split:], Q2_train[split:]
Helper functions
Now we will create a helper function that generates batches of data. We will use it to train and evaluate our model and to serve predictions. The function receives question pairs (Q1 and Q2) and the desired batch size, and yields batches as tuples. Each tuple consists of two arrays and looks like this: ( [q1.1, q1.2, q1.3, …], [q2.1, q2.2, q2.3, …] ). q1.1 is a duplicate of q2.1 but is not a duplicate of any other question in the batch.
def data_generator(Q1, Q2, batch_size, pad=1, shuffle=True):
    #initialize variables
    input1, input2 = [], []
    idx = 0
    len_q = len(Q1)
    question_index = [*range(len_q)]
    #shuffle questions if necessary
    if shuffle:
        random.shuffle(question_index)
    #launch an infinite loop that yields batches
    while True:
        #check if we are not exceeding the size of the dataset
        if idx >= len_q:
            #if yes, start over
            idx = 0
            if shuffle:
                random.shuffle(question_index)
        #get the question pair at the current index position
        q1 = Q1[question_index[idx]]
        q2 = Q2[question_index[idx]]
        #increment index
        idx += 1
        #start preparing the output arrays
        input1.append(q1)
        input2.append(q2)
        #wait until we have a full batch
        if len(input1) == batch_size:
            #calculate the maximum length to which we will pad all sentences
            max_len = max(max([len(q) for q in input1]),
                          max([len(q) for q in input2]))
            #round it up to the nearest power of 2
            max_len = 2**int(np.ceil(np.log2(max_len)))
            b1, b2 = [], []
            #perform padding
            for q1, q2 in zip(input1, input2):
                q1 = q1 + [pad] * (max_len - len(q1))
                q2 = q2 + [pad] * (max_len - len(q2))
                #add padded sentences to the batch
                b1.append(q1)
                b2.append(q2)
            #return the new batch
            yield np.array(b1), np.array(b2)
            #reset the arrays for the next batch
            input1, input2 = [], []
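As a quick illustration of the generator’s output, you can pull a single small batch and check its shape; both arrays share the same, power-of-two padded length:
#illustration: take one batch of 4 question pairs and check the shapes
b1, b2 = next(data_generator(train_Q1, train_Q2, 4, vocab["<PAD>"]))
print(b1.shape, b2.shape)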
The following helper function will generate our model. It is good practice to create such a function. For example, later we might want to perform Neural Architecture Search as described in Production ML: Model Tuning and Neural Architecture Search, where such a generator function comes in handy.
The model generator function will not have any required parameters because we pretty much have all we need: we can calculate the size of our vocabulary and set a default number of embedding dimensions. It returns an instance of the model.
def Siamese(vocab_size=len(vocab), d_model=128):
    #function normalizing the output vectors to unit length
    def normalize(x):
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))
    #prepare a sequential question processor
    processor = tl.Serial(
        #layer producing embeddings
        tl.Embedding(vocab_size=vocab_size, d_feature=d_model),
        #LSTM layer
        tl.LSTM(n_units=d_model),
        #average the LSTM outputs over the sequence (time) axis
        tl.Mean(axis=1),
        #normalize the output
        tl.Fn("Normalize", lambda x: normalize(x))
    )
    #combine two copies of the processor (shared weights) to process the two questions in parallel
    model = tl.Parallel(processor, processor)
    return model
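You can instantiate the model and print it to see the layer structure that Trax builds:
#display the layer structure of the Siamese model
print(Siamese())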
Next, we need a helper function that calculates the Triplet Loss, and another function that packages it into a model layer. The Triplet Loss calculator receives two arrays of shape (batch_size, model_dimension), one for Q1 and one for Q2.
def TripletLossFn(v1, v2, margin=0.25):
    #calculate the dot products of the normalized vectors (cosine similarities)
    scores = fastnp.dot(v1, v2.T)
    #calculate batch size
    batch_size = len(scores)
    #the diagonal entries of the scores matrix are the positive (duplicate) pairs
    positive = fastnp.diagonal(scores)
    #push the diagonal down so it cannot be picked as a negative
    negative_without_positive = scores - 2.0 * fastnp.eye(batch_size)
    #find the closest negative (largest off-diagonal similarity) in each row
    closest_negative = negative_without_positive.max(axis=1)
    #create a matrix with zeros on the diagonal and the negative scores elsewhere
    negative_zero_on_duplicate = scores * (1.0 - fastnp.eye(batch_size))
    #calculate the mean of the negative elements in each row
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)
    #A = subtract `positive` from `margin` and add `closest_negative`
    triplet_loss1 = fastnp.maximum(0, margin - positive + closest_negative)
    #B = subtract `positive` from `margin` and add `mean_negative`
    triplet_loss2 = fastnp.maximum(0, margin - positive + mean_negative)
    #add the two losses together and take the `fastnp.mean` of it
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
    #it was simple, right? ;)
    return triplet_loss
#create a layer from the function above
def TripletLoss(margin=0.25):
    triplet_loss_fn = partial(TripletLossFn, margin=margin)
    return tl.Fn("TripletLoss", triplet_loss_fn)
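Here is a tiny sanity check of the loss with hand-made, L2-normalized vectors (illustrative numbers only): each question vector matches its duplicate exactly and is orthogonal to the other pair, so the loss should be 0.
#two toy batches of already-normalized vectors
v1 = np.array([[1.0, 0.0], [0.0, 1.0]])
v2 = np.array([[1.0, 0.0], [0.0, 1.0]])
print(TripletLossFn(v1, v2))  #expect 0.0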
And finally, we create a function that orchestrates the model training. It receives a model generator, a Triplet Loss layer generator, a learning rate schedule, train and validation data generators, and (optionally) the path where we would like it to save the trained model. It creates and returns an instance of a training loop that we will use to perform the training.
def train_model(Siamese, TripletLoss, lr_schedule, train_generator, val_generator, output_dir="model/"):
    #get the full path to the model directory
    output_dir = os.path.expanduser(output_dir)
    #initialize the training task
    train_task = training.TrainTask(
        labeled_data=train_generator,
        loss_layer=TripletLoss(),
        optimizer=trax.optimizers.Adam(0.001),
        lr_schedule=lr_schedule
    )
    #initialize the evaluation task
    eval_task = training.EvalTask(
        labeled_data=val_generator,
        metrics=[TripletLoss()],
        n_eval_batches=3
    )
    #initialize the training loop
    training_loop = training.Loop(
        Siamese(),
        train_task,
        eval_tasks=[eval_task],
        output_dir=output_dir
    )
    return training_loop
Now we will prepare and launch training:
batch_size = 256
#a learning rate schedule with warm-up followed by reciprocal square-root decay
#(a reasonable default; any Trax schedule would work here)
lr_schedule = trax.lr.warmup_and_rsqrt_decay(400, 0.01)
#initialize the training data generator
train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab["<PAD>"])
#initialize the validation data generator
val_generator = data_generator(val_Q1, val_Q2, batch_size, vocab["<PAD>"])
#initialize the training loop
training_loop = train_model(Siamese, TripletLoss, lr_schedule, train_generator, val_generator)
#kick off training for 1000 steps
training_loop.run(1000)
After a while, you will see output that looks like the picture below:
When the training is done, the model is saved to the output directory:
Let us load the trained model and evaluate its accuracy:
#initialize model and load saved weights
model = Siamese()
model.init_from_file("/content/model/model.pkl.gz")
#model evaluation helper function
#receives validation data, labels, evaluation threshold above which we consider questions to be similar
#vocabulary, data generator function, and batch size
def classify(test_Q1, test_Q2, y, threshold, model, vocab, data_generator=data_generator, batch_size=64):
    #initiate the accuracy counter
    accuracy = 0
    #launch the evaluation loop, one batch at a time
    for i in range(0, len(test_Q1), batch_size):
        #call the data generator without shuffling and ask it to yield the next batch of vectors
        q1, q2 = next(data_generator(test_Q1[i:i + batch_size], test_Q2[i:i + batch_size],
                                     batch_size, vocab["<PAD>"], shuffle=False))
        #take the matching batch of labels
        y_test = y[i:i + batch_size]
        #get the two vectors from the Siamese model
        v1, v2 = model((q1, q2))
        #for every element in the batch
        for j in range(batch_size):
            #calculate cosine similarity
            d = np.dot(v1[j], v2[j].T)
            #check if the similarity is higher than the threshold
            res = d > threshold
            #increment accuracy if the judgement was correct
            accuracy += (y_test[j] == res)
    #calculate the total accuracy fraction
    accuracy = accuracy / len(test_Q1)
    return accuracy
We have everything we need to calculate model accuracy:
accuracy = classify(Q1_test, Q2_test, y_test, 0.7, model, vocab, batch_size=512)
print("Accuracy of the Model:", accuracy)
You should see something like ‘Accuracy of the Model: 0.68544921875’.
Now we will test whether the model performs better on subtle differences in concepts than the pre-trained model from the first part. We will create another helper function that receives two questions, the model, the vocabulary, and a data generator, and returns the classification result.
def predict(question1, question2, threshold, model, vocab, data_generator=data_generator, verbose=False):
    #tokenize the questions
    q1 = nltk.word_tokenize(question1)
    q2 = nltk.word_tokenize(question2)
    Q1, Q2 = [], []
    #get the integer representation of the words in each sentence
    for word in q1:
        Q1 += [vocab[word]]
    for word in q2:
        Q2 += [vocab[word]]
    #create padded batches of size 1
    Q1, Q2 = next(data_generator([Q1], [Q2], 1, vocab["<PAD>"]))
    #create vector representations of the questions
    v1, v2 = model((Q1, Q2))
    #calculate cosine similarity
    d = fastnp.dot(v1[0], v2[0].T)
    #check whether the similarity score is above the threshold
    res = d > threshold
    if verbose:
        print("Q1 = ", question1, "\nQ2 = ", question2)
        print("similarity score = ", d)
        print("result = ", res)
    return res
Let us run our question pairs through the evaluation:
#pair 1
question1 = "How are you?"
question2 = "Are you fine?"
example1 = predict(question1, question2, 0.8, model, vocab, verbose=True)
#pair 2
question1 = "Do you enjoy eating the dessert?"
question2 = "Do you like hiking in the desert?"
example2 = predict(question1, question2, 0.8, model, vocab, verbose=True)
#pair 3
question1 = "How are you?"
question2 = "Do you like hiking in the desert?"
example3 = predict(question1, question2, 0.8, model, vocab, verbose=True)
#pair 4
question1 = "How are you?"
question2 = "How are you today?"
example4 = predict(question1, question2, 0.8, model, vocab, verbose=True)
The results should be similar to the numbers below:
question1 = "How are you?"
question2 = "Are you fine?"
similarity score = 0.9263276
result = True
question1 = "Do you enjoy eating the dessert?"
question2 = "Do you like hiking in the desert?"
similarity score = 0.7217777
result = False
question1 = "How are you?".lower().split()
question2 = "Do you like hiking in the desert?"
similarity score = 0.6647145
result = False
question1 = "How are you?"
question2 = "How are you today?"
similarity score = 0.8646995
result = True
We can see how the model captured the difference between the visually similar “Do you enjoy eating the dessert?” and “Do you like hiking in the desert?”. It was also able to capture that “How are you?” and “Are you fine?” are conceptually closer than the pair “How are you?” (in general) and “How are you today?”. The difference is subtle but often critical for angry people writing about their issues with the software.
The model size should not exceed 40 MB, which is definitely an improvement. With quantization techniques (which I will describe in further articles), it can be as small as 10 MB, allowing us to deploy it to a mobile or edge device.
Conclusion
In this article, we tested two possible approaches in Semantic Similarity segmentation.
The first was to use a pre-trained model. It was a lightweight approach that was quick to implement, but it had some challenges with accuracy and with model deployment parameters such as size.
The second was building and training our own neural net tailored for a specific task. This approach definitely takes more time and effort, especially given that we’ve only scratched the surface: in a production environment, both the model architecture and the data transformation and preparation pipelines will be more complicated.
But the second approach demonstrated better performance, and the model size was much smaller than in the first approach.
The second approach also comes at a higher cost: we need an appropriate dataset, which usually takes time and effort to collect, and training and tuning require more time and expertise.
I hope this article gave you a nice overview of possible approaches to Semantic Similarity segmentation, and you will be able to make a wise decision when the time comes.
The following articles will talk about model quantization and other fun stuff, so until next time, and may the Force be with You!