Custom Class for GloVe Embeddings in a Scikit-learn Pipeline

In this tutorial, I'm going to share how I implemented GloVe vector embeddings for a text classification task using a Scikit-learn Pipeline. I will explain how Pipelines work in general and how to create a custom Scikit-learn class for GloVe embeddings, while working through an example classification task on a dataset from Kaggle.


This article is divided into several sections, as outlined below. The "Creating the GloveVectorTransformer Custom Class" section contains the code for the custom class, so you can jump straight there if that is all you need. However, if you are interested in the details of how it all works and want to see it used in an example project, I'd encourage you to read the entire article.

Table of Contents

  1. The Dataset
  2. How Scikit-learn Pipelines Work
  3. How Custom Scikit-learn Classes Work
  4. Creating the GloveVectorTransformer Custom Class
  5. Using the GloveVectorTransformer Custom Class in the pipeline
  6. Conclusion

1. The Dataset

The dataset that we will use to demonstrate the process comes from the Kaggle competition: Real or Not? NLP with Disaster Tweets. The dataset contains a column called "text" for the text of the tweet and a column called "target" which is 1 if the tweet is about a real disaster or 0 if not. The "train.csv" file contains data for training and the "test.csv" file is used for the final prediction and making a submission to the competition. In this exercise, we will work with just the "train.csv" file.

You can download the dataset from the competition page on Kaggle.

Let's start writing some code by importing this data and performing some basic analysis and pre-processing.

import re
import numpy as np
import pandas as pd
import spacy
import string
nlp = spacy.load("en_core_web_lg")

This will import all the libraries that we will need. We also create the nlp object by loading spaCy's large English model, which we will later use inside our custom class.
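
As a quick optional sanity check, you can confirm that the model ships with word vectors. en_core_web_lg provides 300-dimensional GloVe vectors, which is exactly what our custom class will rely on later:

# Optional: verify that the model provides 300-dimensional word vectors.
doc = nlp("disaster")
print(doc.vector.shape, doc.has_vector)
# Outputs:
# (300,) True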

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, metrics, svm
from sklearn.utils import shuffle

We will use the ColumnTransformer and Pipeline classes to create our pipeline. The TransformerMixin and BaseEstimator classes will be used to create the custom class. I have chosen an SVM (svm.SVC) as the classifier for this demonstration.

Initially, I will use the TfidfVectorizer class to showcase how a pipeline can be used to solve this problem without creating a custom class. Then I will replace the TfidfVectorizer class with the custom one in the subsequent sections.

Loading in the data

SEED = 40
df = pd.read_csv('train.csv') # Your location might be different.
df = shuffle(df, random_state=SEED)
print(df['target'].value_counts())
# Outputs:
# 0    4342
# 1    3271
# Name: target, dtype: int64

This will load and shuffle the data. We also print the frequency count of the two label values, just as a sanity check. Next, we will pass the "text" column through some functions that perform basic text pre-processing.
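
If you prefer proportions to raw counts, value_counts(normalize=True) shows the class balance directly: roughly 57% of the tweets are non-disaster (0) and 43% are real disasters (1), so the classes are only mildly imbalanced and plain accuracy is a reasonable first metric here.

# Optional: view the class balance as proportions instead of raw counts.
print(df['target'].value_counts(normalize=True))
# Outputs (approximately):
# 0    0.57034
# 1    0.42966
# Name: target, dtype: float64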

Cleaning the data

def clean_text(text):
    # remove_URL
    url = re.compile(r'https?://\S+|www\.\S+')
    text =  url.sub(r'', text)

    # remove_html
    html = re.compile(r'<.*?>')
    text = html.sub(r'', text)

    # remove_emoji
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags = re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # remove_punct
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)

    return text


df['text'] = df['text'].apply(clean_text)
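
A quick spot check with a made-up tweet (hypothetical text, not from the dataset) shows the URL, HTML tag, and punctuation being stripped. Note the leftover double space where the URL used to be, which is harmless for our purposes:

# Spot-checking clean_text on a hypothetical example tweet.
sample = "Forest fire near La Ronge! https://t.co/abc123 <b>evacuate</b>"
print(clean_text(sample))
# Outputs:
# Forest fire near La Ronge  evacuate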

Now that we have loaded the data and performed some pre-processing, we can vectorize the text and then train a classifier. I will do this in the next steps while explaining the concepts behind the process.

2. How Scikit-learn Pipelines Work

Typically, every machine learning problem involves two major steps: data preparation and building the machine learning model itself. The data preparation step usually takes up the majority of the time, as it involves operations that can be quite tedious to implement, such as handling missing data, normalizing or scaling numerical data, and one-hot encoding categorical data. Scikit-learn offers several "transformer" classes, such as Normalizer, StandardScaler, and OneHotEncoder, to perform many of these operations, and it also allows us to create custom transformer classes for our own needs, which is what we will do in this tutorial.

Scikit-learn Pipelines offer a way to streamline the operations mentioned above so that the entire process is easier to implement, optimize, and experiment with later on. A pipeline is made up of steps: every step except the last must be a transformer, while the last step can be either a transformer or an estimator. In this section, I will build a classifier using a pipeline and Scikit-learn's TfidfVectorizer.
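
To make the mechanics concrete, here is a minimal, self-contained sketch on toy numeric data (entirely hypothetical, unrelated to our tweets): when fit is called, each transformer's fit_transform runs in order and its output feeds the next step; when predict is called, only transform runs on each step before the final prediction.

from sklearn.preprocessing import StandardScaler

# A toy illustration: every step before the last must implement
# fit/transform; the last step only needs fit/predict.
toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # transformer step
    ('clf', svm.SVC()),            # final estimator step
])
toy_X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]]
toy_y = [0, 0, 1, 1]
toy_pipeline.fit(toy_X, toy_y)             # scaler.fit_transform, then clf.fit
print(toy_pipeline.predict([[1.5, 1.0]]))  # scaler.transform, then clf.predict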

Preparing data for training

# Setting Features and labels
X = df.copy().drop(['target'], axis=1)
y = df.copy()['target']

# Train-Test split
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.15, random_state=SEED)

# Creating Vectors for the text column and dropping the rest
column_preprocessor = ColumnTransformer(
    [
        ('text_tfidf', TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}'), 'text'),
    ],
    remainder='drop',
    n_jobs=1
)

The ColumnTransformer class of Scikit-learn is used to apply transformers to specific columns in the dataset. It takes in a list of transformers, each specified as a three-element tuple: (name of the transformer, transformer object, column(s) to apply it to). Here we apply TfidfVectorizer to the "text" column and name the transformer "text_tfidf". The remainder='drop' option drops all columns that weren't specified.
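
As an optional sanity check, you can fit the ColumnTransformer on its own and inspect the output: the row count matches the training set, and the column count is the size of the learned TF-IDF vocabulary (which depends on your split, so I leave it as a placeholder):

# Optional: inspect the transformed output before building the full pipeline.
tfidf_matrix = column_preprocessor.fit_transform(X_train)
print(tfidf_matrix.shape)
# Outputs something like:
# (6471, <vocabulary size>)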

Defining the pipeline and training

pipeline = Pipeline([
    ('column_preprocessor', column_preprocessor),
    ('svm', svm.SVC(kernel='rbf', C=1.2, gamma=0.2))
])

# Training
pipeline.fit(X_train, y_train)

The pipeline consists of just two steps. The first is the ColumnTransformer object we defined earlier, and the second is the svm classifier. All the data, for both training and inference, will go through these two steps.

# Printing Accuracy score
predictions = pipeline.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))
# Outputs:
# 0.809106830122592

3. How Custom Scikit-learn Classes Work

If we want to implement transformations in our data pipeline that cannot be achieved with Scikit-learn's built-in options, we need to create our own custom transformers. Scikit-learn has guidelines on how to do this by writing our own classes. All transformers and estimators in Scikit-learn are implemented as Python classes. For our custom transformer to be compatible with a Scikit-learn pipeline, it must be implemented as a class with methods such as fit, transform, fit_transform, get_params and set_params. We can either write all of these methods ourselves, or write only the transformation we need and inherit everything else from the base classes Scikit-learn provides. This concept is called inheritance in Python, and it is what we will use to create our custom transformer class.
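
To see why inheritance saves us work, here is roughly what TransformerMixin contributes, in simplified form (a sketch of the idea, not Scikit-learn's actual source):

# Simplified sketch of what TransformerMixin gives subclasses for free
# (illustrative only -- not Scikit-learn's real implementation):
class SimplifiedTransformerMixin:
    def fit_transform(self, X, y=None, **fit_params):
        # Chains fit() and transform() into a single call.
        if y is None:
            return self.fit(X, **fit_params).transform(X)
        return self.fit(X, y, **fit_params).transform(X)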

4. Creating the GloveVectorTransformer Custom Class

Scikit-learn provides two convenient base classes, TransformerMixin and BaseEstimator. Inheriting from TransformerMixin gives us fit_transform for free, while BaseEstimator supplies get_params and set_params. In our class, the fit method only needs to return self, so the only method with any real work in it is transform.

class GloveVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300  # en_core_web_lg word vectors are 300-dimensional

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self.
        return self

    def transform(self, X):
        # One sentence vector per text: spaCy's doc.vector is the
        # average of the GloVe vectors of the tokens in the document.
        return np.array([self.nlp(text).vector for text in X])

The transform method uses the spaCy object (nlp) we created earlier to build and return sentence vectors for the texts passed to it. We can now use this class in our pipeline.
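
Because fit_transform is inherited from TransformerMixin, the class also works on its own, outside a pipeline. Each text maps to a single 300-dimensional sentence vector (the example texts below are made up):

# Standalone usage example of the custom transformer.
vectors = GloveVectorTransformer(nlp).fit_transform(
    ["a flood warning was issued", "what a lovely sunny day"]
)
print(vectors.shape)
# Outputs:
# (2, 300)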

5. Using the GloveVectorTransformer Custom Class in the pipeline

Following is the final code for training. We simply replaced the TfidfVectorizer class with our custom GloveVectorTransformer class.

Final code for training

column_preprocessor = ColumnTransformer(
    [
        ('text_glove', GloveVectorTransformer(nlp), 'text'),
    ],
    remainder='drop',
    n_jobs=1
)

pipeline = Pipeline([
    ('column_preprocessor', column_preprocessor),
    ('svm', svm.SVC(kernel='rbf', C=1.2, gamma=0.2))
])

print("start pipeline fit")
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(metrics.accuracy_score(y_test, predictions))
# Outputs:
# 0.8239929947460596

So as you can see, by using our custom GloveVectorTransformer class, the accuracy improved from about 0.809 with TfidfVectorizer to about 0.824, a modest but clear gain.

6. Conclusion

I hope this tutorial was useful to you. I also have a public Kaggle notebook that you may find helpful; the code there trains a model on the entire dataset and makes a submission to the competition.

Thank you for reading!