Summary: Multiclass Classification, Naive Bayes, Logistic Regression, SVM, Random Forest, Gradient Boosting, XGBoost, BERT, Imbalanced Dataset

Task: The goal of this project is to build a classification model that accurately classifies text documents into predefined categories. The dataset consists of a collection of customer complaints in the form of free text along with their corresponding departments (i.e., the predefined categories). The original dataset is available here. However, I created a new dataset from it, which only includes the "complaint_id", "product_group" and "text" columns.

In order to achieve this goal, I performed the following steps:

  1. Get/Load the dataset

  2. Explore the data

  3. Prepare the data

  4. Build the model

  5. Fine-tune the model

Functions

load_data[source]

load_data(file_path)

Load the CSV file and return a DataFrame.
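A minimal sketch of what load_data might look like (the real implementation lives in the project source; this simply wraps pandas):

import pandas as pd

def load_data(file_path):
    # Read the CSV file at `file_path` into a DataFrame
    return pd.read_csv(file_path)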

clean_text[source]

clean_text(text)

Clean the text by removing special characters, punctuation, etc.

preprocess_corpus[source]

preprocess_corpus(df, column='text')

Preprocess the entire corpus including cleaning the text documents and return the updated dataframe.
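A plausible sketch, assuming preprocess_corpus simply applies clean_text to every document in the given column (the actual implementation may do more):

def preprocess_corpus(df, column='text'):
    # Clean every document in `column` and return the updated dataframe
    df[column] = df[column].astype(str).apply(clean_text)
    return df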

encode_labels[source]

encode_labels(labels)

Encode the class labels into numbers.
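A sketch based on sklearn's LabelEncoder (an assumption, but consistent with how the returned encoder is used later in the notebook):

from sklearn.preprocessing import LabelEncoder

def encode_labels(labels):
    # Fit an encoder that maps each class label to an integer id
    label_encoder = LabelEncoder()
    label_encoder.fit(labels)
    return label_encoder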

compute_tfidf[source]

compute_tfidf(corpus, stop_words='english', ngram_range=(1, 1), max_features=None)

Calculate the TF-IDF features for all the text documents and return a (documents, features) matrix.

train_test_model[source]

train_test_model(model, X_train, X_test, y_train, y_test, labels)

Train and test the model using the training and test data sets. Return the predictions, accuracy and metric reports.
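A hedged sketch of train_test_model, reconstructed from how it is called and what it prints in the cells below (the actual implementation may differ):

from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support

def train_test_model(model, X_train, X_test, y_train, y_test, labels):
    print('Start training...', end='')
    model.fit(X_train, y_train)
    print('done!')
    print('Start testing...', end='')
    predictions = model.predict(X_test)
    print('done!')
    accuracy = accuracy_score(y_test, predictions)
    report = classification_report(y_test, predictions, target_names=labels)
    # (precision, recall, f1) averaged over classes, used later for the comparison plots
    prf = precision_recall_fscore_support(y_test, predictions, average='weighted')[:3]
    return predictions, accuracy, report, prf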

Load the Dataset

file_path = os.path.join('wells_data', 'case_study_data.csv')
print('Load the data...', end='')
df = load_data(file_path)
print('done.')
Load the data...done.

Explore the Data

After the data is loaded, we should examine and gain insight from the data.

df.head()
complaint_id product_group text
0 2815595 bank_service On XX/XX/2017 my check # XXXX was debited from...
1 2217937 bank_service I opened a Bank of the the West account. The a...
2 2657456 bank_service wells fargo in nj opened a business account wi...
3 1414106 bank_service A hold was placed on my saving account ( XXXX ...
4 1999158 bank_service Dear CFPB : I need to send a major concern/com...
categories = df.product_group.unique()
print('Categories: ', categories)
print()
df.product_group.value_counts().plot(kind='bar', title='Categories vs Number of Documents', cmap='plasma')
Categories:  ['bank_service' 'credit_card' 'credit_reporting' 'debt_collection' 'loan'
 'money_transfers' 'mortgage']

<matplotlib.axes._subplots.AxesSubplot at 0x7ff01a860b50>

The bar chart above shows that our dataset is imbalanced, i.e. the number of observations per class is not equally distributed. For example, there are more than 8 times as many "credit_reporting" documents as "money_transfers" documents. There are several common ways to deal with imbalanced datasets; I decided to use stratification, or stratified sampling. This technique maintains the same class distribution in every subset of the data used during training, which removes variance in the class proportions across subsets and makes training more stable. Specifically, I use a stratified train/test split and stratified 5-fold cross-validation (see below) to manage this imbalanced dataset.
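As a quick illustration (a snippet added here for clarity, not part of the original analysis), we can verify that a stratified split preserves the class proportions of the full dataset:

from sklearn.model_selection import train_test_split

# Class proportions in the full dataset vs. in a stratified 20% test split
full_dist = df.product_group.value_counts(normalize=True)
_, test_part = train_test_split(df, test_size=0.2, stratify=df.product_group, random_state=42)
test_dist = test_part.product_group.value_counts(normalize=True)
print(pd.concat([full_dist.rename('full'), test_dist.rename('stratified_test')], axis=1))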

# Create a dataframe that contains the number of words in each document
dlength_df = pd.DataFrame({'doc_length': df.text.apply(lambda x: len(x.split()))})

# Group the documents based on their number of words (i.e. length)
grouped = dlength_df.groupby('doc_length')

indices = grouped.indices
word_count = []
doc_count = []
for length, docs in indices.items():
    word_count.append(length)
    doc_count.append(len(docs))

# Plot the distribution of words vs documents in the corpus
plt.figure(figsize=(8,5))
plt.plot(word_count, doc_count)
plt.xlabel('Word count in document')
plt.ylabel('Number of Documents')
plt.title('Word count vs Number of documents')

# Plot the Cumulative distribution of documents length
plt.figure(figsize=(8,5))
plt.hist(word_count, 30, density=True, histtype='step', cumulative=True, label='Complaints', color='red', linewidth=2)
plt.xlabel('Word count in document')
plt.ylabel('Fraction of Documents')
plt.legend(loc='upper left')
plt.title('Cumulative distribution of documents length')
plt.show()

The charts above show that more than 70% of the documents have fewer than 1000 words, and fewer than 5% of the documents have 2000 words or more.
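These fractions can be verified directly from dlength_df (a small check added here for illustration):

frac_under_1000 = (dlength_df.doc_length < 1000).mean()
frac_2000_plus = (dlength_df.doc_length >= 2000).mean()
print('Documents with fewer than 1000 words: {:.1%}'.format(frac_under_1000))
print('Documents with 2000 or more words:    {:.1%}'.format(frac_2000_plus))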

Preprocess the Dataset

One of the primary tasks before developing a model on textual data is cleaning the text. This usually includes lowercasing the text, removing stop words, and removing punctuation, numeric and/or special characters. The functions preprocess_corpus() and clean_text() clean up our dataset.
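A minimal sketch of what clean_text could look like (regex-based; the project's actual function may do more, e.g. handling the XX/XXXX redaction tokens visible in the samples below):

import re

def clean_text(text):
    # Lowercase, keep only letters and whitespace, collapse repeated spaces
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text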

# Randomly show 3 complaints
samples = [np.random.randint(len(df)) for i in range(3)]
for index in samples:
    print(df['text'][index])
    print()
I 've tried disputing a couple of fraudulent accounts on my credit report, and the Bureau has taken some of it off but only the collection things. I 've sent letters to this company, XXXX XXXX, and XXXX XXXX  and have gotten little response. Not sure what is being looked into when the bureaus send me the results showing verified. How can they verify something that 's not mine? Most recently about a month ago I did send in an Identity Theft document and all of the items it asked me to include. Feel like after six months of disputing and working hard that it 's really going nowhere. Would like to have this stuff taken off my credit report bc it is not mine. It has been well over 30 days on more than one occasion, and neither XXXX XXXX or this bureau has done anything to address the creditors disregard for the federal laws. I leave out of the country for months at a time, and someone must have used my information and apply for credits because I was always getting declined no one could tell me what happened all I know is I never had XXXX XXXX account. I found out that the bureaus collect money from creditors when it comes to credit reporting consumers. Read that the agencies make more off bad reporting than good which makes me think this is being kept on my credit report so this company can profit off my misfortune.

I opened up a Coinbase account to trade Cryptocurrencies. First purchase on XX/XX/XXXX. XXXX bitcoin for XXXX. Second purchase on XX/XX/XXXX. XXXX Etherium for XXXX dollars. Sold XXXX bitcoin for XXXX ( after fees ) on XX/XX/XXXX. Sold XXXX etherium for XXXX on XX/XX/XXXX. Currently have {$180.00} in a stagnant USD account on coinbase. I am unable to withdrawal the money to either a bank account or credit/debit card. I have not taken action with the company, because upon research this is not an uncommon problem and the company does not take action when it occurs. Per my research this is the best coarse of action to get my money back.

I had a repossession back in early XX/XX/XXXX with a balance of 2100.00. I purchased a vehicle for my sons father and he did n't pay when we split. In XX/XX/XXXX, I disputed the debt after 7 years and the repo was taken off my credit report. Today I received a letter insisting I pay the debt. There are Barr laws in NC. I have n't heard from this debtor in years since I had the debt removed from my credit. It 's been 10 years.

# Preprocess the corpus, i.e. clean the text
print('Cleaning the text...', end='')
df = preprocess_corpus(df, column='text')
print('done.')
Cleaning the text...done.

Prepare the Dataset

After text preprocessing, we need to perform feature engineering and data preparation. We convert the text to a vector representation using the TF-IDF method. In this process, I consider the following options:

  • Remove stop words

  • Remove domain-specific stop words, i.e. words that are rare (appear in 3 documents or less) or too common (occur in more than 90% of the documents)

  • Utilize unigrams, bigrams, etc

  • Use all the features (words) or a subset of them

  • Encode the class labels into numbers

Please see the function compute_tfidf() for more information.
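A sketch of how compute_tfidf might wrap sklearn's TfidfVectorizer to cover the options above (the min_df/max_df thresholds here are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

def compute_tfidf(corpus, stop_words='english', ngram_range=(1, 1), max_features=None):
    print('Computing tfidf features...', end='')
    vectorizer = TfidfVectorizer(stop_words=stop_words,
                                 ngram_range=ngram_range,
                                 max_features=max_features,
                                 min_df=4,     # drop words that appear in 3 documents or less
                                 max_df=0.9)   # drop words that occur in more than 90% of documents
    X = vectorizer.fit_transform(corpus)
    print('done!')
    return X, vectorizer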

# Create tfidf features from the text
stop_words    = 'english'
ngram_range   = (1, 1)
max_features  = None
X, vectorizer = compute_tfidf(df['text'], stop_words, ngram_range, max_features)

# Encode the labels
labels = df.product_group.unique()
label_encoder = encode_labels(labels)
y = label_encoder.transform(df.product_group)
Computing tfidf features...done!

Split the Dataset into Train and Test Sets

We make use of the train_test_split() function with its stratify option. It determines the class distribution and maintains the same distribution in both the train and test sets.

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
print('Training set Shape: {}  | Test set Shape: {}'.format(X_train.shape, X_test.shape))
Training set Shape: (214688, 33028)  | Test set Shape: (53673, 33028)

Build the Model

In this section, I will build several different models for our multiclass classification task and compare them at the end. There is a wide variety of techniques that could be used; I decided to apply the following methods:

  • Naive Bayes (NB)

  • Logistic Regression

  • Linear SVM

  • Random Forest

  • Gradient Boosting / XGBoost

  • BERT

Naive Bayes Classifier

nb_model = MultinomialNB()
print('Number of documents = {}  |  Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, nb_prf = train_test_model(nb_model, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(nb_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
# plt.show()
Number of documents = 214688  |  Number of features = 33028
Start training...done!
Start testing...done!
Total time: 0.28s
accuracy: 0.809028748160155
====================================================================================================
                  precision    recall  f1-score   support

    bank_service       0.79      0.70      0.74      4014
     credit_card       0.75      0.75      0.75      5911
credit_reporting       0.79      0.88      0.83     16246
 debt_collection       0.81      0.79      0.80     12292
            loan       0.82      0.71      0.76      6207
 money_transfers       0.98      0.30      0.46       947
        mortgage       0.88      0.93      0.90      8056

        accuracy                           0.81     53673
       macro avg       0.83      0.72      0.75     53673
    weighted avg       0.81      0.81      0.81     53673

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ff128c65d50>

Logistic Regression

log_model = LogisticRegression(penalty='l2', max_iter=500)
print('Number of documents = {}  |  Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, log_prf = train_test_model(log_model, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(log_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
Number of documents = 214688  |  Number of features = 33028
Start training...done!
Start testing...done!
Total time: 111.02s
accuracy: 0.84783783280234
====================================================================================================
                  precision    recall  f1-score   support

    bank_service       0.81      0.81      0.81      4014
     credit_card       0.81      0.81      0.81      5911
credit_reporting       0.86      0.86      0.86     16246
 debt_collection       0.83      0.84      0.84     12292
            loan       0.82      0.77      0.80      6207
 money_transfers       0.83      0.73      0.78       947
        mortgage       0.92      0.93      0.93      8056

        accuracy                           0.85     53673
       macro avg       0.84      0.82      0.83     53673
    weighted avg       0.85      0.85      0.85     53673

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ff0181d5ed0>

SVM

svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)
print('Number of documents = {}  |  Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, svm_prf = train_test_model(svm, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(svm, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
Number of documents = 214688  |  Number of features = 33028
Start training...done!
Start testing...done!
Total time: 5.59s
accuracy: 0.8447822927729026
====================================================================================================
                  precision    recall  f1-score   support

    bank_service       0.77      0.82      0.80      4014
     credit_card       0.78      0.84      0.81      5911
credit_reporting       0.89      0.82      0.86     16246
 debt_collection       0.84      0.84      0.84     12292
            loan       0.79      0.81      0.80      6207
 money_transfers       0.69      0.82      0.75       947
        mortgage       0.91      0.94      0.93      8056

        accuracy                           0.84     53673
       macro avg       0.81      0.84      0.83     53673
    weighted avg       0.85      0.84      0.85     53673

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ff01a860bd0>

Random Forest

rf_model = RandomForestClassifier(n_estimators=100, max_depth=100, min_samples_split=10, n_jobs=-1, verbose=0)
print('Number of documents = {}  |  Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, rf_prf = train_test_model(rf_model, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(rf_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
Number of documents = 214688  |  Number of features = 33028
Start training...done!
Start testing...done!
Total time: 280.10s
accuracy: 0.8414659139604643
====================================================================================================
                  precision    recall  f1-score   support

    bank_service       0.81      0.76      0.79      4014
     credit_card       0.82      0.75      0.79      5911
credit_reporting       0.82      0.93      0.87     16246
 debt_collection       0.84      0.84      0.84     12292
            loan       0.88      0.69      0.77      6207
 money_transfers       0.93      0.44      0.60       947
        mortgage       0.90      0.93      0.91      8056

        accuracy                           0.84     53673
       macro avg       0.86      0.76      0.80     53673
    weighted avg       0.84      0.84      0.84     53673

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ff01a899c50>

Gradient Boosting

gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=10)
predictions, accuracy, metrics_report, gb_prf = train_test_model(gb_model, X_train, X_test, y_train, y_test, labels)
print('accuracy: {}'.format(accuracy))
print(metrics_report)
plot_confusion_matrix(gb_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")

Start training...
      Iter       Train Loss   Remaining Time 
         1      274073.2086          104.34m
         2      240114.7651          104.04m
         3      215548.5383          102.49m
         4      196626.8849          100.89m
         5      181335.8100           99.41m
         6      168818.3973           97.47m
         7      158202.7714           95.42m
         8      149309.3867           93.23m
         9      141564.8736           91.14m
        10      134992.9559           88.86m
        20       97935.1018           65.44m
        30       82408.8805           42.23m
        40       73946.2785           20.42m
        50       67992.7312            0.00s
Training done!
Number of documents = 214688  |  Number of features = 10000

Start testing...
done!
accuracy: 0.8323179252137947
                  precision    recall  f1-score   support

    bank_service       0.80      0.78      0.79      3994
     credit_card       0.80      0.78      0.79      5964
credit_reporting       0.85      0.86      0.86     16310
 debt_collection       0.79      0.83      0.81     12229
            loan       0.80      0.77      0.79      6139
 money_transfers       0.75      0.63      0.68       911
        mortgage       0.92      0.90      0.91      8126

        accuracy                           0.83     53673
       macro avg       0.82      0.79      0.80     53673
    weighted avg       0.83      0.83      0.83     53673

Issues with Gradient Boosting

In gradient boosting, trees are built sequentially, so training takes longer. In our case, the large dataset makes training take significantly longer still. As you can observe in the cell above, training took over 104 minutes and the accuracy is 83%, which places it near the bottom, only slightly better than the Naive Bayes classifier.

Since I wanted to try a few other configurations of gradient boosting and see whether they improve performance, I decided to install the GPU-supported version of the XGBoost algorithm on Google Colab. After searching, I found the XGBoost library here and installed it on Colab. I then implemented two forms of the XGBoost algorithm:

  1. xgb: This is the direct xgboost library.

  2. XGBClassifier: This is the sklearn wrapper for XGBoost, which lets us use sklearn's grid search with parallel processing (see the grid-search sketch after the XGBClassifier cell below).

The next two cells contain the two versions of the XGBoost algorithm I implemented; the code is copied from Colab. Although the GPU-supported version was much faster, the performance didn't improve.

XGBClassifier Implementation for sklearn

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

xgb_model = XGBClassifier(objective='multi:softmax', n_estimators=100, learning_rate=0.3, max_depth=4, subsample=0.8, n_iter_no_change=2, verbosity=1)
xgb_param = xgb_model.get_xgb_params()
xgb_param['num_class'] = 7
# Cross-validate the number of boosting rounds (xgb_train is the DMatrix built in the next cell)
cvresult = xgb.cv(xgb_param, xgb_train, num_boost_round=xgb_model.get_params()['n_estimators'], nfold=5, early_stopping_rounds=10, verbose_eval=True)
xgb_model.set_params(n_estimators=cvresult.shape[0])
predictions, accuracy, metrics_report, xgb_prf = train_test_model(xgb_model, X_train, X_test, y_train, y_test, labels)
print('accuracy: {}'.format(accuracy))
print(metrics_report)
plot_confusion_matrix(xgb_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
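Because XGBClassifier follows the sklearn estimator API, a grid search can be layered on top of it. The snippet below is only a sketch with an illustrative parameter grid; it was not run as part of this project and would be expensive on a dataset of this size:

from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid over two hyperparameters, evaluated with 3-fold CV in parallel
param_grid = {'max_depth': [4, 8], 'learning_rate': [0.1, 0.3]}
grid = GridSearchCV(XGBClassifier(objective='multi:softmax', n_estimators=100),
                    param_grid, scoring='accuracy', cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)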

XGB Implementation

xgb_train = xgb.DMatrix(X_train, label=y_train)
xgb_test = xgb.DMatrix(X_test, label=y_test)

# Set up parameters for xgboost
param = {}
# Use softmax multi-class classification
param['objective'] = 'multi:softmax'
param['eta'] = 0.05
param['max_depth'] = 12
param['nthread'] = 4
param['num_class'] = 7
param['tree_method'] = 'gpu_hist'  # train on the GPU (requires a GPU-enabled xgboost build)
param['gpu_id'] = 0

watchlist = [(xgb_train, 'train'), (xgb_test, 'test')]
num_round = 5
bst = xgb.train(param, xgb_train, num_round, watchlist)

# Get predictions and compute the test error
pred = bst.predict(xgb_test)
error_rate = np.sum(pred != y_test) / y_test.shape[0]
print('Test error using softmax = {}'.format(error_rate))
[0] train-merror:0.19384    test-merror:0.22162
[1] train-merror:0.18778    test-merror:0.21685
[2] train-merror:0.18410    test-merror:0.21366
[3] train-merror:0.18071    test-merror:0.21107
[4] train-merror:0.17842    test-merror:0.20849
Test error using softmax = 0.208484712984182

Performance Comparison

xx = np.array([1, 2, 3])
width = 0.15
# Weighted-average precision/recall/f1 for gradient boosting, copied from the report above
gb_prf = np.array([0.83, 0.83, 0.83])
ax = plt.subplot(111)
ax.bar(xx - width, height=np.array(nb_prf), width=width, color='b', align='center', label='NB', tick_label=['Precision', 'Recall', 'f1'])
ax.bar(xx, height=np.array(log_prf), width=width, color='g', align='center', label='Log_Reg')
ax.bar(xx + width, height=np.array(svm_prf), width=width, color='r', align='center', label='SVM')
ax.bar(xx + 2*width, height=np.array(rf_prf), width=width, color='y', align='center', label='Random Forest')
ax.bar(xx + 3*width, height=np.array(gb_prf), width=width, color='black', align='center', label='Gradient Boosting')


plt.xlabel('Classification Metrics')
plt.ylabel('Scores')
plt.legend(loc='lower right')
plt.show()

Training Time Comparison

The plot below shows that Naive Bayes is by far the fastest, followed by SVM, with training times of 0.28s and 5.59s, respectively. I intentionally omitted gradient boosting from the plot because its very large training time would distort the visualization.

tr_times = [('NB', 0.28), ('Log_Reg', 111.02), ('SVM', 5.59), ('Random_Forest', 280.10)]
x_vals = []
h_vals = []
for t in tr_times:
    x_vals.append(t[0])
    h_vals.append(t[1])
    
ax = plt.subplot(111)
ax.bar(np.linspace(0,1,4) - width, height=np.array(h_vals), width=width, color='Green', tick_label=x_vals)
plt.xlabel('Techniques')
plt.ylabel('Training Time (s)')
plt.show()

Cross Validation Implementation of the Models

The second approach to coping with the imbalanced dataset is stratified cross-validation. For this purpose, I used StratifiedKFold from sklearn, which creates the folds by preserving the percentage of samples from each class. Please note that cross-validation in general is computationally expensive. Below are the results for all the previous techniques.
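For reference, sklearn can also compute stratified cross-validation scores in a single call. The snippet below is a sketch of that alternative; the cells that follow keep the explicit loop so that the per-fold classification reports can be collected:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# One-call stratified 5-fold cross-validation for a single model
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(MultinomialNB(), X, y, cv=skfold, scoring='accuracy', n_jobs=-1)
print('mean accuracy: {:.2f}'.format(scores.mean()))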

Naive Bayes Cross Validation

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
nb_model = MultinomialNB()

accs = []
reports = []
st_time = time.time()
for train_index, test_index in skfold.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, nb_prf = train_test_model(nb_model, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)
    
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))   
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Total time: 1.34s
mean accuracy: 0.81

Logistic Regression Cross Validation

log_model = LogisticRegression(penalty='l2', max_iter=500)
accs = []
reports = []
st_time = time.time()
for train_index, test_index in skfold.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, log_prf = train_test_model(log_model, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))   
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Total time: 633.33s
mean accuracy: 0.85

SVM Cross Validation

svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)
accs = []
reports = []
st_time = time.time()
for train_index, test_index in skfold.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, svm_prf = train_test_model(svm, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time)) 
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Total time: 28.41s
mean accuracy: 0.84

Random Forest Cross Validation

rf_model = RandomForestClassifier(n_estimators=100, max_depth=100, min_samples_split=10, n_jobs=-1, verbose=0)
accs = []
reports = []
st_time = time.time()
for train_index, test_index in skfold.split(X,y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, rf_prf = train_test_model(rf_model, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)

en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time)) 
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Start training...done!
Start testing...done!
Total time: 1274.23s
mean accuracy: 0.84

Result Comparison

The cross-validation results for all the methods are very close. The mean accuracy of random forest and SVM is 84%, logistic regression is still the front runner with 85%, and naive Bayes ranks last with 81%.

Multiclass Classification Using BERT

There are several different deep learning techniques available for (text) classification, such as recurrent sequence models (LSTMs, RNNs), convolutional neural networks (CNNs), and their combinations, which can produce extremely good results. However, since the introduction of the transformer architecture in 2017, many transformer-based models have been developed that led to new state-of-the-art results in a wide variety of NLP tasks, including question answering and classification. Many scientific articles demonstrate this, such as this and this. Therefore, I decided to utilize a transformer model for my classification problem.

I first used the transformers library and built a BERT model, but due to an unresolved issue it wasn't training correctly. Therefore, I switched to another library called ktrain and built the model there. Please see below for more information.

X = df['text'].to_list()

# Encode the labels
labels = df.product_group.unique()
label_encoder = encode_labels(labels)
y = label_encoder.transform(df.product_group)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

My Implementation

import tensorflow as tf
from tqdm import tqdm
from transformers import BertTokenizer, RobertaTokenizer
from transformers import TFBertModel, TFBertPreTrainedModel, TFBertForSequenceClassification
from transformers import glue_convert_examples_to_features, InputExample

def convert_data_into_input_example(x, y):
    input_examples = []
    for i in tqdm(range(len(x))):
        example = InputExample(
            guid= None,
            text_a= x[i],
            text_b= None,
            label= str(y[i])
        )
        input_examples.append(example)
    return input_examples

def bert_compatible_format(bdset):  
    input_ids, attention_mask, token_type_ids, labels = [], [], [], []
    for in_ex in bdset:
        input_ids.append(in_ex.input_ids)
        attention_mask.append(in_ex.attention_mask)
        token_type_ids.append(in_ex.token_type_ids)
        labels.append(in_ex.label)

    labels = np.vstack(labels)
    return ([np.asarray(input_ids), np.asarray(attention_mask), np.asarray(token_type_ids)], labels)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=False)
train_input_examples = convert_data_into_input_example(X_train, y_train)
val_input_examples = convert_data_into_input_example(X_test, y_test)


label_list = label_encoder.transform(labels)
label_list = [str(i) for i in label_list.tolist()]
bert_train_dataset = glue_convert_examples_to_features(examples=train_input_examples, tokenizer=tokenizer, max_length=128, task='mrpc', label_list=label_list)
bert_val_dataset = glue_convert_examples_to_features(examples=val_input_examples, tokenizer=tokenizer, max_length=128, task='mrpc', label_list=label_list)

x_train, y_train = bert_compatible_format(bert_train_dataset)
x_val, y_val     = bert_compatible_format(bert_val_dataset)

def example_to_features(input_ids, attention_masks, token_type_ids, y):
    return {"input_ids": input_ids,
            "attention_mask": attention_masks,
            "token_type_ids": token_type_ids},y


train_ds = tf.data.Dataset.from_tensor_slices((x_train[0], x_train[1], x_train[2], y_train)).map(example_to_features).shuffle(100).batch(64)
val_ds   = tf.data.Dataset.from_tensor_slices((x_val[0], x_val[1], x_val[2], y_val)).map(example_to_features).batch(64)
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(labels))  # 7 output classes; the default classification head assumes 2
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
EPOCHS = 3

# Train the model
history = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Multiclass Text Classification Using ktrain

I used the ktrain library to implement BERT. "ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. It is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners."

import ktrain
from ktrain import text

(x_train,  y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=X_train, y_train=y_train,
                                          x_test=X_test, y_test=y_test,
                                          class_names=labels.tolist(),
                                          preprocess_mode='bert',
                                          maxlen=128)
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
# find good learning rate
learner.lr_find()             # briefly simulate training to find good learning rate
learner.lr_plot()             # visually identify best learning rate
learner.fit_onecycle(2e-5, 4)

# Use a Predictor object capable of making predictions on new raw data.
predictor = ktrain.get_predictor(learner.model, preproc)

predictor.get_classes()

# Predict a new document
predictor.predict(X_test[0:1])

# let's save the predictor for later use
predictor.save('/tmp/my_predictor')

# reload the predictor
reloaded_predictor = ktrain.load_predictor('/tmp/my_predictor')

# make a prediction on the same document to verify it still works
reloaded_predictor.predict(X_test[0:1])

CPU Run

begin training using onecycle policy with max lr of 2e-05...
Train on 214688 samples
Epoch 1/4
  4530/214688 [..............................] - ETA: 18:36:08 - loss: 1.4979 - accuracy: 0.4507

GPU Run

I used Google Colab to utilize the GPU. The program started with no problem; however, at the end of the second epoch Colab stopped and restarted the environment due to a lack of GPU resources. Each epoch took close to three hours to execute, and the accuracy was already quite good (87%) at that point; I expected it to go beyond 90% with more training. The log below shows a run that completed three epochs and reached 91% training accuracy.


begin training using onecycle policy with max lr of 3e-05...
Train on 214688 samples
Epoch 1/3
214688/214688 [==============================] - 10178s 47ms/sample - loss: 0.5175 - accuracy: 0.8252
Epoch 2/3
214688/214688 [==============================] - 10165s 47ms/sample - loss: 0.3903 - accuracy: 0.8684
Epoch 3/3
214688/214688 [==============================] - 10200s 48ms/sample - loss: 0.2637 - accuracy: 0.9110
<tensorflow.python.keras.callbacks.History at 0x7f59883c4668>

Challenges and Discussion

This project had several challenges including:

  • The dataset was fairly large, which made it interesting but also computationally demanding.

  • The dataset was imbalanced in terms of number of documents in different classes. Also, the length of documents varied from 1 to over 5000 words.

  • The textual content needed plenty of cleaning

Since I did not have easy access to GPU resources, I wasn't able to get the result that I expected from BERT. That said, the best result among all six models I trained belongs to BERT, with 91% accuracy.

In conclusion, the transformer-based BERT model outperformed all the other models (even though its training wasn't fully completed), but its training time was significantly higher and it requires a GPU. Naive Bayes, on the other hand, is exceptionally fast and gives reasonable results. SVM and logistic regression sat in the middle, being fast to train and highly accurate, respectively. Random forest and gradient boosting were the slowest and less accurate. Standard/traditional methods need feature engineering to improve performance, whereas transformer models are pretrained and can be used end to end.