Task: The goal of this project is to build a classification model that accurately classifies text documents into predefined categories. The dataset consists of a collection of customer complaints in the form of free text along with their corresponding departments (i.e. the predefined categories). The original dataset is available here. However, I created a new dataset from it, which only includes the "complaint id", "product_group" and "text" columns.
In order to achieve this goal, I performed the following steps:
Get/Load the dataset
Explore the data
Prepare the data
Build the model
Fine-tune the model
file_path = os.path.join('wells_data', 'case_study_data.csv')
print('Load the data...', end='')
df = load_data(file_path)
print('done.')
df.head()
categories = df.product_group.unique()
print('Number of categories: ', len(categories))
print()
df.product_group.value_counts().plot(kind='bar', title='Categories vs Number of Documents', cmap='plasma')
The bar chart above shows that our dataset is imbalanced, i.e. the number of observations per class is not equally distributed. For example, there are more than 8 times as many "credit_reporting" documents as "money_transfers" documents. There are several common ways to deal with imbalanced datasets; I decided to use stratification (stratified sampling). This technique keeps the class proportions the same in every subset of the data, which gives more stability during training and evaluation. Specifically, I use a stratified train/test split and stratified k-fold cross-validation (5 folds in the implementation below) to manage this imbalanced dataset.
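Before the actual split further below, a quick illustrative check (not part of the original pipeline; the random_state is an arbitrary choice) shows what stratification buys us: the class proportions in a stratified hold-out set match those of the full dataset almost exactly.
# Illustrative check: a stratified split preserves the class proportions.
from sklearn.model_selection import train_test_split
_, holdout = train_test_split(df, test_size=0.2, stratify=df.product_group, random_state=42)
print(df.product_group.value_counts(normalize=True).round(3))
print(holdout.product_group.value_counts(normalize=True).round(3))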
# Create a dataframe that contains the number of words for each document
dlength_df = pd.DataFrame({'doc_length': df.text.apply(lambda x: len(x.split()))})
# Group the documents based on their number of words (i.e. length)
grouped = dlength_df.groupby('doc_length')
indices = grouped.indices
word_count = []
doc_count = []
for w, d in indices.items():
    word_count.append(w)
    doc_count.append(len(d))
# Plot the distribution of words vs documents in the corpus
plt.figure(figsize=(8,5))
plt.plot(word_count, doc_count)
plt.xlabel('Word count in document')
plt.ylabel('Number of Documents')
plt.title('Word count vs Number of documents')
# Plot the Cumulative distribution of documents length
plt.figure(figsize=(8,5))
plt.hist(word_count, 30, density=True, histtype='step', cumulative=True, label='Complaints', color='red', linewidth=2)
plt.xlabel('Word count in document')
plt.ylabel('Fraction of Documents')
plt.legend(loc='upper left')
plt.title('Cumulative distribution of documents length')
plt.show()
Looking at the charts above, more than 70% of the documents have fewer than 1,000 words, and fewer than 5% of the documents contain 2,000 words or more.
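These fractions can also be computed directly from dlength_df rather than read off the chart:
# Compute the same fractions directly from the document lengths.
frac_under_1000 = (dlength_df.doc_length < 1000).mean()
frac_2000_plus = (dlength_df.doc_length >= 2000).mean()
print('< 1000 words: {:.1%} | >= 2000 words: {:.1%}'.format(frac_under_1000, frac_2000_plus))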
Preprocess the Dataset
One of the primary tasks before developing a model on textual data is cleaning the text. This usually includes lowercasing the text, removing stop words, removing punctuation, numbers and/or special characters, and so on. The functions preprocess_corpus() and clean_text() will clean up our dataset.
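Those two helpers aren't reproduced in this notebook; the sketch below shows roughly what they might look like. The regex and the NLTK English stop-word list are my assumptions, not necessarily the exact implementation used in the project.
# A minimal sketch (assumed implementation, not the project's exact code).
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                       # lowercase
    text = re.sub(r'[^a-z\s]', ' ', text)     # drop punctuation, digits, special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return ' '.join(tokens)

def preprocess_corpus(df, column='text'):
    df[column] = df[column].apply(clean_text)
    return df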
# Randomly show 3 complaints
samples = [np.random.randint(len(df)) for i in range(3)]
for index in samples:
    print(df['text'][index])
    print()
# Preprocess the corpus, i.e. clean the text
print('Cleaning the text...', end='')
df = preprocess_corpus(df, column='text')
print('done.')
Prepare the Dataset
After text preprocessing, we need to perform feature engineering and data preparation. We convert the text to a vector representation using the TF-IDF method. In this process, I consider the following options:
Remove stop words
Remove domain-specific stop words, i.e. words that are rare (appear in 3 documents or fewer) or too common (occur in more than 90% of the documents)
Utilize unigrams, bigrams, etc
Use all the features (words) or a subset of them
Encode the class labels into numbers
Please see the function compute_tfidf() for more information.
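compute_tfidf() isn't shown in this notebook either; it most likely wraps sklearn's TfidfVectorizer along the lines of the sketch below. The min_df/max_df values mirror the rare/common-word thresholds listed above, but the exact arguments are an assumption.
# Assumed shape of compute_tfidf(); the real helper may differ in details.
from sklearn.feature_extraction.text import TfidfVectorizer

def compute_tfidf(texts, stop_words, ngram_range, max_features):
    vectorizer = TfidfVectorizer(stop_words=stop_words,
                                 ngram_range=ngram_range,
                                 max_features=max_features,
                                 min_df=4,      # drop words appearing in 3 documents or fewer
                                 max_df=0.9)    # drop words appearing in more than 90% of documents
    X = vectorizer.fit_transform(texts)
    return X, vectorizer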
# Create tfidf features from the text
stop_words = 'english'
ngram_range = (1, 1)
max_features = None
X, vectorizer = compute_tfidf(df['text'], stop_words, ngram_range, max_features)
# Encode the labels
labels = df.product_group.unique()
label_encoder = encode_labels(labels)
y = label_encoder.transform(df.product_group)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
print('Training set Shape: {} | Test set Shape: {}'.format(X_train.shape, X_test.shape))
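For completeness, encode_labels() used above is presumably a thin wrapper around sklearn's LabelEncoder, something like the sketch below (an assumption, not the project's exact code).
# Assumed implementation of encode_labels().
from sklearn.preprocessing import LabelEncoder

def encode_labels(labels):
    label_encoder = LabelEncoder()
    label_encoder.fit(labels)
    return label_encoder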
Build the Model
In this section, I will build several different models for our multiclass classification task and compare them at the end. There is a wide variety of techniques that could be used; I decided to apply the following methods, each trained and evaluated with a shared train_test_model() helper (sketched right after this list):
Naive Bayes (NB)
Logistic Regression
Linear SVM
Random Forest
XGBoosting
BERT
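The train_test_model() helper itself isn't reproduced in this notebook. The sketch below shows the behaviour the cells rely on (fit, predict, then accuracy, a classification report, and the weighted precision/recall/F1 triple); the exact signature and internals are an assumption.
# Assumed behaviour of train_test_model(); the project's helper may differ.
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support

def train_test_model(model, X_train, X_test, y_train, y_test, target_names):
    print('Start training...')
    model.fit(X_train, y_train)
    print('Training done!')
    print('Start testing...')
    predictions = model.predict(X_test)
    print('done!')
    accuracy = accuracy_score(y_test, predictions)
    report = classification_report(y_test, predictions, target_names=sorted(target_names))
    p, r, f1, _ = precision_recall_fscore_support(y_test, predictions, average='weighted')
    return predictions, accuracy, report, (p, r, f1)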
nb_model = MultinomialNB()
print('Number of documents = {} | Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, nb_prf = train_test_model(nb_model, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(nb_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
# plt.show()
log_model = LogisticRegression(penalty='l2', max_iter=500)
print('Number of documents = {} | Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, log_prf = train_test_model(log_model, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(log_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)
print('Number of documents = {} | Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, svm_prf = train_test_model(svm, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(svm, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
rf_model = RandomForestClassifier(n_estimators=100, max_depth=100, min_samples_split=10, n_jobs=-1, verbose=0)
print('Number of documents = {} | Number of features = {}'.format(X_train.shape[0], X_train.shape[1]))
st_time = time.time()
predictions, accuracy, metrics_report, rf_prf = train_test_model(rf_model, X_train, X_test, y_train, y_test, labels)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('accuracy: {}'.format(accuracy))
print('='*100)
print(metrics_report)
plot_confusion_matrix(rf_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
gb_model = GradientBoostingClassifier(n_estimators=50, max_depth=10)
predictions, accuracy, metrics_report, gb_prf = train_test_model(gb_model, X_train, X_test, y_train, y_test, labels)
print('accuracy: {}'.format(accuracy))
print(metrics_report)
plot_confusion_matrix(gb_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
Start training...
Iter Train Loss Remaining Time
1 274073.2086 104.34m
2 240114.7651 104.04m
3 215548.5383 102.49m
4 196626.8849 100.89m
5 181335.8100 99.41m
6 168818.3973 97.47m
7 158202.7714 95.42m
8 149309.3867 93.23m
9 141564.8736 91.14m
10 134992.9559 88.86m
20 97935.1018 65.44m
30 82408.8805 42.23m
40 73946.2785 20.42m
50 67992.7312 0.00s
Training done!
Number of documents = 214688 | Number of features = 10000
Start testing...
done!
accuracy: 0.8323179252137947
precision recall f1-score support
bank_service 0.80 0.78 0.79 3994
credit_card 0.80 0.78 0.79 5964
credit_reporting 0.85 0.86 0.86 16310
debt_collection 0.79 0.83 0.81 12229
loan 0.80 0.77 0.79 6139
money_transfers 0.75 0.63 0.68 911
mortgage 0.92 0.90 0.91 8126
accuracy 0.83 53673
macro avg 0.82 0.79 0.80 53673
weighted avg 0.83 0.83 0.83 53673
Issues with Gradient Boosting
In gradient boosting, trees are built sequentially, so training takes longer. Since our dataset is large, the training time grows significantly. As you can observe in the cell above, training took over 104 minutes and the accuracy is 83%, which ranks it fourth, only slightly better than the Naive Bayes classifier.
Since I wanted to try a few other configurations of gradient boosting to see if they improve performance, I decided to install the GPU-supported version of the XGBoost algorithm on Google Colab. After searching, I found the XGBoost library here and installed it on Google Colab. After that, I implemented two forms of the XGBoost algorithm:
xgb: This is the direct xgboost library.
XGBClassifier: This is a sklearn wrapper for XGBoost. This enables us to use sklearn’s Grid Search with parallel processing.
The next two cells include the two versions of XGBoost I implemented; the code was copied back from Colab. Although the GPU-supported version was much faster, the performance didn't improve. (An illustrative grid-search sketch with the sklearn wrapper follows the first cell.)
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
xgb_model = XGBClassifier(objective='multi:softmax', n_estimators=100, learning_rate=0.3, max_depth=4, subsample=0.8, n_iter_no_change=2, verbosity=1)
xgb_param = xgb_model.get_xgb_params()
xgb_param['num_class'] = 7
cvresult = xgb.cv(xgb_param, xgb_train, num_boost_round=xgb_model.get_params()['n_estimators'], nfold=5, early_stopping_rounds=10, verbose_eval=True)  # xgb_train is the DMatrix built in the next cell
xgb_model.set_params(n_estimators=cvresult.shape[0])  # set_params belongs to the estimator, not the xgb module
predictions, accuracy, metrics_report, xgb_prf = train_test_model(xgb_model, X_train, X_test, y_train, y_test, labels)
print('accuracy: {}'.format(accuracy))
print(metrics_report)
plot_confusion_matrix(xgb_model, X_test, y_test, display_labels=labels, xticks_rotation='vertical', cmap="BuPu")
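The copied cells don't actually show a grid search, but with the sklearn wrapper one would look roughly like this sketch; the parameter grid, scoring choice and tree_method='gpu_hist' are illustrative assumptions, not the settings that were actually run.
# Illustrative grid search over the sklearn XGBoost wrapper (assumed settings).
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier

param_grid = {'max_depth': [4, 8, 12],
              'learning_rate': [0.1, 0.3]}
search = GridSearchCV(XGBClassifier(objective='multi:softmax', n_estimators=100,
                                    tree_method='gpu_hist'),
                      param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)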
xgb_train = xgb.DMatrix(X_train, label=y_train)
xgb_test = xgb.DMatrix(X_test, label=y_test)
# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
param['eta'] = 0.05
param['max_depth'] = 12
param['nthread'] = 4
param['num_class'] = 7
param['gpu_id'] = 0
watchlist = [(xgb_train, 'train'), (xgb_test, 'test')]
num_round = 5
bst = xgb.train(param, xgb_train, num_round, watchlist)  # train once, with the train/test watchlist for per-round evaluation
# get prediction
pred = bst.predict(xgb_test)
error_rate = np.sum(pred != y_test) / y_test.shape[0]
print('Test error using softmax = {}'.format(error_rate))
[0] train-merror:0.19384 test-merror:0.22162
[1] train-merror:0.18778 test-merror:0.21685
[2] train-merror:0.18410 test-merror:0.21366
[3] train-merror:0.18071 test-merror:0.21107
[4] train-merror:0.17842 test-merror:0.20849
Test error using softmax = 0.208484712984182
xx = np.array([1,2,3])
width = 0.15
gb_prf = np.array([0.83, 0.83, 0.83])  # weighted precision/recall/F1 for gradient boosting, hard-coded from the report above
ax = plt.subplot(111)
ax.bar( xx - width, height=np.array(nb_prf), width=width, color='b', align='center', label='NB', tick_label=['Precision', 'Recall', 'f1'])
ax.bar(xx ,height=np.array(log_prf), width=width, color='g', align='center', label='Log_Reg')
ax.bar(xx + width, height=np.array(svm_prf), width=width, color='r', align='center', label='SVM')
ax.bar(xx + 2*width, height=np.array(rf_prf), width=width, color='y', align='center', label='Random Forest')
ax.bar(xx + 3*width, height=np.array(gb_prf), width=width, color='black', align='center', label='Gradient Boosting')
plt.xlabel('Classification Metrics')
plt.ylabel('Scores')
plt.legend(loc='lower right')
plt.show()
Training Time Comparison
The plot below shows that Naive Bayes is incredibly fast, followed by SVM, with training times of 0.28s and 5.59s, respectively. I intentionally omitted gradient boosting from the plot because its very large training time would make the other bars hard to compare.
tr_times = [('NB', 0.28), ('Log_Reg', 111.02), ('SVM', 5.59), ('Random_Forest', 280.10)]
x_vals = []
h_vals = []
for t in tr_times:
    x_vals.append(t[0])
    h_vals.append(t[1])
ax = plt.subplot(111)
ax.bar(np.linspace(0,1,4) - width, height=np.array(h_vals), width=width, color='Green', tick_label=x_vals)
plt.xlabel('Techniques')
plt.ylabel('Training Time')
plt.show()
Cross Validation Implementation of the Models
The second approach I use for coping with the imbalanced dataset is stratified cross-validation. For this purpose, I used StratifiedKFold from sklearn, which creates the folds by preserving the percentage of samples for each class. Please note that cross-validation in general is computationally expensive. Below are the results for all the previous techniques.
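As an aside, sklearn's cross_val_score gives the same mean accuracy in a couple of lines; I keep the explicit loops below because they also collect the per-fold classification reports.
# Equivalent shorthand to the loops below (mean accuracy only).
from sklearn.model_selection import StratifiedKFold, cross_val_score
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(MultinomialNB(), X, y, cv=skfold, scoring='accuracy')
print('mean accuracy: {:.2f}'.format(scores.mean()))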
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
nb_model = MultinomialNB()
accs = []
reports = []
f1_scores = []
st_time = time.time()
for train_index, test_index in skfold.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, nb_prf = train_test_model(nb_model, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
log_model = LogisticRegression(penalty='l2', max_iter=500)
accs = []
reports = []
st_time = time.time()
for train_index, test_index in skfold.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, log_prf = train_test_model(log_model, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
svm = LinearSVC(class_weight='balanced', verbose=False, max_iter=10000, tol=1e-4, C=0.1)
accs = []
reports = []
st_time = time.time()
for train_index, test_index in skfold.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, svm_prf = train_test_model(svm, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
rf_model = RandomForestClassifier(n_estimators=100, max_depth=100, min_samples_split=10, n_jobs=-1, verbose=0)
accs = []
reports = []
st_time = time.time()
for train_index, test_index in skfold.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    predictions, accuracy, metrics_report, rf_prf = train_test_model(rf_model, X_train, X_test, y_train, y_test, labels)
    accs.append(accuracy)
    reports.append(metrics_report)
en_time = time.time()
print('Total time: {:.2f}s'.format(en_time-st_time))
print('mean accuracy: {:.2f}'.format(np.mean(accs)))
Multiclass Classification Using BERT
There are several deep learning techniques available for (text) classification, such as recurrent sequence models (RNNs, LSTMs), Convolutional Neural Networks (CNNs) and their combinations, that can produce extremely good results. However, since the introduction of the transformer architecture in 2017, many transformer-based models have been developed, leading to new state-of-the-art results on a wide variety of NLP tasks, including question answering and classification. Many scientific articles demonstrate this, such as this and this. Therefore, I decided to utilize a transformer model for my classification problem.
I first used the transformers library and built a BERT model, but due to an issue I couldn't track down, it wasn't training correctly. Therefore, I switched to another library called ktrain and rebuilt the model. Please see below for more information.
X = df['text'].to_list()
# Encode the labels
labels = df.product_group.unique()
label_encoder = encode_labels(labels)
y = label_encoder.transform(df.product_group)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
from transformers import BertTokenizer, RobertaTokenizer
from transformers import TFBertModel, TFBertPreTrainedModel, TFBertForSequenceClassification
from transformers import glue_convert_examples_to_features, InputExample
def convert_data_into_input_example(x, y):
    input_examples = []
    for i in tqdm(range(len(x))):
        example = InputExample(
            guid=None,
            text_a=x[i],
            text_b=None,
            label=str(y[i])
        )
        input_examples.append(example)
    return input_examples
def bert_compatiable_format(bdset):
    input_ids, attention_mask, token_type_ids, labels = [], [], [], []
    for in_ex in bdset:
        input_ids.append(in_ex.input_ids)
        attention_mask.append(in_ex.attention_mask)
        token_type_ids.append(in_ex.token_type_ids)
        labels.append(in_ex.label)
    labels = np.vstack(labels)
    return ([np.asarray(input_ids), np.asarray(attention_mask), np.asarray(token_type_ids)], labels)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=False)
train_input_examples = convert_data_into_input_example(X_train, y_train)
val_input_examples = convert_data_into_input_example(X_test, y_test)
label_list = label_encoder.transform(labels)
label_list = [str(i) for i in label_list.tolist()]
bert_train_dataset = glue_convert_examples_to_features(examples=train_input_examples, tokenizer=tokenizer, max_length=128, task='mrpc', label_list=label_list)
bert_val_dataset = glue_convert_examples_to_features(examples=val_input_examples, tokenizer=tokenizer, max_length=128, task='mrpc', label_list=label_list)
x_train, y_train = bert_compatiable_format(bert_train_dataset)
x_val, y_val = bert_compatiable_format(bert_val_dataset)
def example_to_features(input_ids, attention_masks, token_type_ids, y):
    return {"input_ids": input_ids,
            "attention_mask": attention_masks,
            "token_type_ids": token_type_ids}, y
train_ds = tf.data.Dataset.from_tensor_slices((x_train[0], x_train[1], x_train[2], y_train)).map(example_to_features).shuffle(100).batch(64)
val_ds = tf.data.Dataset.from_tensor_slices((x_val[0], x_val[1], x_val[2], y_val)).map(example_to_features).batch(64)
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(labels))  # the default head has 2 labels; we have 7 product groups
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
EPOCHS = 3
# Train the model
history = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
Multiclass Text Classification Using ktrain
I used ktrain library to implement BERT. "ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. It is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners."
import ktrain
from ktrain import text
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=X_train, y_train=y_train,
x_test=X_test, y_test=y_test,
class_names=labels.tolist(),
preprocess_mode='bert',
maxlen=128)
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
# find good learning rate
learner.lr_find() # briefly simulate training to find good learning rate
learner.lr_plot() # visually identify best learning rate
learner.fit_onecycle(2e-5, 4)
# Use a Predictor object capable of making predictions on new raw data.
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.get_classes()
# Predict a new document
predictor.predict(X_test[0:1])
# let's save the predictor for later use
predictor.save('/tmp/my_predictor')
# reload the predictor
reloaded_predictor = ktrain.load_predictor('/tmp/my_predictor')
# make a prediction on the same document to verify it still works
reloaded_predictor.predict(X_test[0:1])
GPU Run
I used Google Colab to utilize its GPU resources. The program started with no problem; however, at the end of the second epoch it stopped due to a lack of GPU resources and the environment restarted. Every epoch took about 1.30 hours to execute, and the accuracy was already quite good (87%). I think it would have gone beyond 90% had the run not been interrupted.
begin training using onecycle policy with max lr of 3e-05...
Train on 214688 samples
Epoch 1/3
214688/214688 [==============================] - 10178s 47ms/sample - loss: 0.5175 - accuracy: 0.8252
Epoch 2/3
214688/214688 [==============================] - 10165s 47ms/sample - loss: 0.3903 - accuracy: 0.8684
Epoch 3/3
214688/214688 [==============================] - 10200s 48ms/sample - loss: 0.2637 - accuracy: 0.9110
<tensorflow.python.keras.callbacks.History at 0x7f59883c4668>
Challenges and Discussion
This project had several challenges, including:
The dataset was fairly large, which made the problem interesting but also expensive to train on.
The dataset was imbalanced in terms of the number of documents per class. Also, the length of the documents varied from 1 to over 5,000 words.
The textual content needed plenty of cleaning.
Since I did not have easy access to GPU resources, I wasn't able to get the results I expected. That said, the best result among all six models I trained belongs to BERT, with 91% accuracy.
In conclusion, the transformer-based BERT model outperformed all the other models (even though its training wasn't completely finished!), but its training time was significantly higher and it needs a GPU. Naive Bayes, on the other hand, is exceptionally fast and gives reasonable results. SVM and logistic regression were in the middle: fast in training and high in accuracy, respectively. Random forest and gradient boosting were the slowest and less accurate. Standard/traditional methods need feature engineering to improve performance, whereas transformer models are pretrained and can be used end to end.