The Transformer is built around the idea of using attention to speed up training. Its primary design goal was to enable parallel processing of the words in a sentence, i.e. to process the entire sentence at once. Such parallelism is not possible in RNNs, LSTMs, or GRUs, since they consume the words of the input sentence one at a time. The first Transformer was proposed in the paper Attention Is All You Need. A TensorFlow implementation of it is available here, and Harvard's NLP group provides a guide annotating the paper with a PyTorch implementation. Transformer models come in different shapes, sizes, and architectures, and each has its own way of accepting input data via tokenization.
BERT Overview
BERT (Bidirectional Encoder Representations from Transformers) is a “new method of pre-training language representations” developed by Google in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding and released in late 2018. Since it is pre-trained on large generic corpora (Wikipedia and BooksCorpus), it can be used for a wide variety of NLP tasks such as text classification, translation, summarization, and question answering. Here is the abstract from the paper:
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT provides pre-trained language models for English and 103 other languages that you can fine-tune to fit your needs. Fine-tuning a model means continuing to train it briefly on our own dataset, starting from an already trained checkpoint. Here, we’ll see how to fine-tune the multilingual model to do sentiment analysis. To do that, we follow the steps below:
Load and preprocess the data so that it can be used by the model.
Set up a training loop using Keras' fit API and train the model on the training data.
Evaluate the model on the test data and compare the predictions to the actual labels.
Sentiment Analysis on Farsi Text
The following implementation shows how to use the Transformers library to obtain state-of-the-art results on the sequence classification task. This library "provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch".
There are plenty of examples of text classification on English text. Thus, the goal of my implementation is to perform sentiment classification on non-English text, in this case Farsi (Persian). As it takes a long time to run on a CPU, I created a Google Colab notebook version of the entire implementation, which you can access here.
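For reference, the code below assumes the following imports. This is only a sketch: exact versions may differ from the original notebook, and glue_convert_examples_to_features may not be exported by the newest Transformers releases.
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (BertTokenizer, TFBertForSequenceClassification,
                          glue_convert_examples_to_features)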
Load the data
For this tutorial, I used a Farsi dataset from Kaggle. It contains over 68,000 comments about books, gathered from a book website called taaghche.
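First we load the CSV into a pandas DataFrame. The filename below is only a guess, so adjust it to whatever the downloaded Kaggle file is actually called:
import pandas as pd

# Load the comments into a DataFrame (hypothetical filename)
df = pd.read_csv('taghche.csv')
print(df.shape)
df.head()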
# Remove the unnecessary columns
df = df.drop(columns=['date', 'bookname', 'bookID', 'like'])
# Binarize the rating: below 3 is negative ('0'), 3 and above is positive ('1')
df.loc[(df.rate < 3), 'label'] = '0'
df.loc[(df.rate >= 3), 'label'] = '1'
df = df.drop(columns=['rate'])
df.head()
df_pos = df.loc[df.label == '1']
df_neg = df.loc[df.label == '0']
print('Positive examples: {} Negative examples: {}'.format(len(df_pos),len(df_neg)))
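The create_small_dataset helper used next is defined in the accompanying notebook; here is a plausible sketch, assuming it simply samples an equal number of examples from each class and concatenates them into one balanced DataFrame:
import pandas as pd

def create_small_dataset(df_pos, df_neg, dataset_size):
    # Hypothetical reconstruction: dataset_size examples per class
    small = pd.concat([df_pos.sample(dataset_size),
                       df_neg.sample(dataset_size)])
    return small.reset_index(drop=True)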
dataset_size = 5000
dataset = create_small_dataset(df_pos, df_neg, dataset_size)
#shuffle the dataset
dataset = dataset.sample(frac=1).reset_index(drop=True)
dataset
train_data, test_data = train_test_split(dataset, test_size=0.2)
print('train data size: {} test data size: {}'.format(len(train_data), len(test_data)))
Convert the Data into BERT-Specific Format
We need to transform our data into a format BERT understands. This involves two steps. First, we create a list of InputExample objects using the constructor provided by the Transformers library. Every InputExample must have the following structure:
text_a is the text we want to classify.
text_b is used if we're training a model to understand the relationship between sentences (i.e., is text_b a translation of text_a? Is text_b an answer to the question asked by text_a?). This doesn't apply to our task, so we can leave text_b blank.
label is the label of our example, in our case '0' (negative) or '1' (positive).
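The convert_data_into_input_example helper used below is only shown in the notebook; here is a minimal sketch of what it might look like, assuming the DataFrame columns 'comment' and 'label' used elsewhere in this post and the InputExample class from the Transformers library:
from transformers import InputExample

def convert_data_into_input_example(data):
    # Hypothetical reconstruction: wrap each row of the DataFrame in an
    # InputExample (text_b stays empty for single-sentence classification)
    examples = []
    for i in range(len(data)):
        examples.append(InputExample(guid=str(i),
                                     text_a=data.iloc[i]['comment'],
                                     text_b=None,
                                     label=data.iloc[i]['label']))
    return examples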
train_input_examples = convert_data_into_input_example(train_data)
val_input_examples = convert_data_into_input_example(test_data)
Next, we need to preprocess our data (i.e., the InputExamples) so that it matches the data BERT was trained on. Therefore, we need to do a few things:
Lowercase our text (if we're using a BERT lowercase model)
Tokenize it (i.e. "sally says hi" -> ["sally", "says", "hi"])
Break words into WordPieces (i.e. "calling" -> ["call", "##ing"])
Map our words to indexes using a vocab file that BERT provides
Add the special "[CLS]" and "[SEP]" tokens (see the README)
Append "index" and "segment" tokens to each input (see the BERT paper)
Tokenization
Deep learning models accept only certain kinds of input: vectors of integers, with each value representing a token. Each string of text must therefore first be converted to a list of token indices before it can be fed to the model. The tokenizer takes care of that for us. We also need to add BERT's special tokens to the list of ids.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
text = 'I liked that book very much!'
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)
text_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
print('text ids:', text_ids)
text_ids_with_special_tokens = tokenizer.build_inputs_with_special_tokens(text_ids)
print('text ids with special tokens: ', text_ids_with_special_tokens)
Happily, there is a much simpler method that performs all of the previous steps (tokenize, convert to indices, and add special tokens) in a single call.
MAX_SEQ_LENGTH = 128
encoded_bert_text = tokenizer.encode(text, add_special_tokens=True, max_length=MAX_SEQ_LENGTH)
# encoded_bert_text = tokenizer.encode(text, add_special_tokens=True, max_length=MAX_SEQ_LENGTH, return_tensors='tf')
print('encoded text: ', encoded_bert_text)
decoded_text_with_special_token = tokenizer.decode(encoded_bert_text)
decoded_text_without_special_token = tokenizer.decode(encoded_bert_text, skip_special_tokens=True)
print('decoded text with special token: ', decoded_text_with_special_token)
print('decoded text without special token: ', decoded_text_without_special_token)
However, there is still additional information we need to prepare and manage, such as the attention mask and the token type ids. Thankfully, the Transformers library has a method that directly converts a list of InputExamples into features BERT understands: glue_convert_examples_to_features.
label_list = ['0', '1']
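# task='mrpc' reuses the GLUE MRPC processing settings (a two-class
# classification task); the label_list given here overrides the MRPC labels.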
bert_train_dataset = glue_convert_examples_to_features(examples=train_input_examples, tokenizer=tokenizer, max_length=MAX_SEQ_LENGTH, task='mrpc', label_list=label_list)
bert_val_dataset = glue_convert_examples_to_features(examples=val_input_examples, tokenizer=tokenizer, max_length=MAX_SEQ_LENGTH, task='mrpc', label_list=label_list)
for i in range(3):
    # print('Example: {}'.format(bert_train_dataset[i]))
    print('Example: {')
    print('  Input_ids: {}'.format(bert_train_dataset[i].input_ids))
    print('  attention_mask: {}'.format(bert_train_dataset[i].attention_mask))
    print('  token_type_ids: {}'.format(bert_train_dataset[i].token_type_ids))
    print('  label: {}'.format(bert_train_dataset[i].label))
    print('}')
# Let's take a look at one example from the dataset
ex = bert_train_dataset[0]
in_ids = ex.input_ids
decoded_sentence = tokenizer.decode(in_ids, skip_special_tokens=True)
print(decoded_sentence)
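# Load the pre-trained multilingual BERT model with a sequence-classification
# head and compile it with Adam, sparse categorical cross-entropy, and accuracy.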
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
Passing bert_train_dataset (which is supposed to work) directly to model.fit() did NOT work for me, so I had to work around it.
# This way did NOT work, so I had to work around it:
model.fit(bert_train_dataset, validation_data=bert_val_dataset, epochs=3)
Workaround
I took bert_train_dataset, created a separate list for each feature (i.e. input_ids, attention_mask, token_type_ids, and label), built tf.data datasets from those lists, and passed them to model.fit(). Please see the Transformers documentation for more details.
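The helpers my_solution and example_to_features are defined in the notebook; below is a minimal sketch of what they might look like, reconstructed from how they are used in the following code:
import numpy as np

def my_solution(features):
    # Collect each field of the InputFeatures objects into its own array
    input_ids = np.array([f.input_ids for f in features])
    attention_mask = np.array([f.attention_mask for f in features])
    token_type_ids = np.array([f.token_type_ids for f in features])
    labels = np.array([f.label for f in features])
    return [input_ids, attention_mask, token_type_ids], labels

def example_to_features(input_ids, attention_mask, token_type_ids, y):
    # Package the three tensors into the dict format the Keras model expects
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'token_type_ids': token_type_ids}, y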
x_train, y_train = my_solution(bert_train_dataset)
x_val, y_val = my_solution(bert_val_dataset)
print('x_train shape: {}'.format(x_train[0].shape))
print('x_val shape: {}'.format(x_val[0].shape))
train_ds = tf.data.Dataset.from_tensor_slices((x_train[0], x_train[1], x_train[2], y_train)).map(example_to_features).shuffle(100).batch(32)
val_ds = tf.data.Dataset.from_tensor_slices((x_val[0], x_val[1], x_val[2], y_val)).map(example_to_features).batch(64)
print('Format of model input examples: {} '.format(train_ds.take(1)))
EPOCHS = 5
history = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
Training Details
Epoch 1/5
250/250 [==============================] - 137s 548ms/step - loss: 0.5838 - accuracy: 0.6894 - val_loss: 0.5079 - val_accuracy: 0.7665
Epoch 2/5
250/250 [==============================] - 134s 535ms/step - loss: 0.4867 - accuracy: 0.7747 - val_loss: 0.5002 - val_accuracy: 0.7650
Epoch 3/5
250/250 [==============================] - 134s 534ms/step - loss: 0.4131 - accuracy: 0.8227 - val_loss: 0.5112 - val_accuracy: 0.7685
Epoch 4/5
250/250 [==============================] - 134s 534ms/step - loss: 0.3515 - accuracy: 0.8558 - val_loss: 0.5692 - val_accuracy: 0.7630
Epoch 5/5
250/250 [==============================] - 134s 535ms/step - loss: 0.2901 - accuracy: 0.8871 - val_loss: 0.6147 - val_accuracy: 0.7575
plot_history(history)
Plot the Loss and Accuracy
We can see the training and validation loss, as well as the accuracy, in the plots below. Note that after epoch 3 the validation loss keeps growing, which means the model begins to overfit. Therefore, training for 3 epochs would be enough and would actually work better, since epoch 3 gives the best validation accuracy of 0.7685.
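The plot_history call above relies on another helper from the notebook; here is a minimal matplotlib sketch, assuming the standard Keras History object returned by model.fit():
import matplotlib.pyplot as plt

def plot_history(history):
    # Plot training/validation loss and accuracy side by side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(history.history['loss'], label='train loss')
    ax1.plot(history.history['val_loss'], label='val loss')
    ax1.set_xlabel('epoch')
    ax1.set_ylabel('loss')
    ax1.legend()
    ax2.plot(history.history['accuracy'], label='train accuracy')
    ax2.plot(history.history['val_accuracy'], label='val accuracy')
    ax2.set_xlabel('epoch')
    ax2.set_ylabel('accuracy')
    ax2.legend()
    plt.show()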
predictions = model.predict(val_ds)
print(predictions[0].shape)
print()
predictions_classes = np.argmax(predictions[0], axis = 1)
for i in range(10):
    print('comment: {}\n, actual label: {}, predicted label: {}'.format(
        test_data.iloc[i]['comment'], val_input_examples[i].label, predictions_classes[i]))
comment: کسالت آور بود
ولی چون یک داستان نسباتا واقعیست ، به خاطر حقایقی که در اون وجود داشت بخوبی ما رو با فرهنگ و اعتقادات مردم اون دوره انگلیس آشنا میکنه
داستان کشتی تایتانیک
, actual label: 0, predicted label: 1
comment: شیوایی کلام میتوانست بهتر باشد
, actual label: 0, predicted label: 0
comment: باسلام،برای قشرخاصی طراحی شده کتاب ودرکش برای عموم کمی سخت وگیج کننده است
, actual label: 0, predicted label: 1
comment: بد نبود- راضی کننده بود
, actual label: 1, predicted label: 0
comment: من به عنوان یه نویسنده این مجله تقاضا میکنم همه ازش حمایت کنند! این یه کار نمونه ای علمی ترویجی نو تو زمینه زبان کوردی هست و نیاز به حمایت همه داره
, actual label: 1, predicted label: 0
comment: دوستشدارمدوستش خواهی داشت✌
, actual label: 1, predicted label: 0
comment: خیلی قشنگ بود
مرسی طاقچه جون
, actual label: 0, predicted label: 1
comment: همه کتاب داره درباره جزییات ظاهر بقیه و محیط صحبت میکنهچنگی ب دل نمیزنهبسیار خسته کنندس
, actual label: 0, predicted label: 0
comment: کتاب رو خوندم
به نظرم کتاب مفیدی میتونه باشه
فقط طیق معمول کتب روانشناسی داستان زیاد داره
خودم فقط قسمتهای غیر داستانی رو خوندم اگر ابهام ایجاد میشد داستانش رو هم میخوندم
کلا خوبه
حداقل متوجه میشید زبان عشق خودتون چیه و ب طرف مقابلتون میتوتید بگید با من اینطوری باش
, actual label: 1, predicted label: 1
comment: کتاب با بررسی وقایع کربلا و تطبیق آنها با زندگی امروز به ما هشدار میدهد تا به سمت همان فرهنگی که فرزند رسول خدا را به شهادت رساند حرکت نکنیم و این کار را با دلایل منطقی و بسیار هوشمندانه انجام داده است
پیشنهاد میکنم این کتاب رو حتماً بخونید چون جواب بسیاری از سوالاتتون رو خواهید گرفت
, actual label: 1, predicted label: 1
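The get_prediction helper used in the next cell is also defined only in the notebook; here is a minimal sketch, assuming it encodes the raw sentences the same way as the training data and returns (sentence, predicted class) pairs:
import numpy as np

def get_prediction(sentences):
    # Encode and pad each sentence (0 is BERT's [PAD] token id)
    encoded = [tokenizer.encode(s, add_special_tokens=True,
                                max_length=MAX_SEQ_LENGTH)[:MAX_SEQ_LENGTH]
               for s in sentences]
    input_ids = np.array([ids + [0] * (MAX_SEQ_LENGTH - len(ids))
                          for ids in encoded])
    attention_mask = (input_ids != 0).astype(np.int32)
    # The model's output tuple has the logits as its first element
    logits = model.predict({'input_ids': input_ids,
                            'attention_mask': attention_mask})[0]
    return [(s, int(c)) for s, c in zip(sentences, np.argmax(logits, axis=1))]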
# pred_sentences = ['به نظر کتاب خوبی نمی ومد', ' رایگان بودنش فوق العادش میکند']
pred_sentences = [comment for comment,l in test_data.sample(20).values]
predictions = get_prediction(pred_sentences)
for p in predictions:
    print(p)
('واقعا چطور همچین کتابایی اجازه نشر پیدا میکنن؟', 0)
('روش چی ؟', 0)
('بنظرم کتاب زندگینامه شهید رو رایگان کردن صورت خوشی نداره! چون اونی که براش مهمه بخونه پولشو میده میخره کسی هم که سلیقه مطالعه کتابش این مدلی نیست، رایگانم بشه نمیخونه', 0)
('سوالات کجا میاد کجا باید جواب بدیم', 0)
('سمنوپزان ، مسلول و زن زیادی خوب بودند ؛ در کل ادم رو به گذشته های دور و شیوه زندگی اونها میبره و از این جهت فوق العاده است\nممنون از طاقچه بابت برنامه خوبش و به خاطر کتابهایی که رایگان کردید', 1)
('کتاب فوق العاده ای بود به خیلی از شبهات در غالب داستان جواب داده حتما پیشنهاد میکنم', 1)
('چقدر قشنگ بالا و پایین زندگی رو نشون داده بود ', 0)
('ترجمه اش اصلا خوب نیست', 0)
('واقعا این کتاب عالییییییییییییییییییییییییییییییییییییییییییییییییه یه حرف نگفته اس که غمباد شده تو دلمون واقعا دست مریزاد سرکاره خانوم عالیشاهی با آرزوی بهترین ها و بالاترین درجات برای شما نویسنده محترمه', 1)
('واقعاً قیمت ها رو بر چه اساس تعیین میکنید نمیدونم قیمت نسخه چاپی ۵۰ تومان هست که با توجه به قیمت کاغذ منطقی هست اما نسخه الکترونیک برای یک کتاب نباید انقدر گران باشد بنظرم طاقچه باید محدودیت های مشخص شده قرار بده برای قیمت', 0)
('واقعا صداش خوب نیست چون انگار داره برا بچه ها قصه میگه همچین متن هایی رو باید یه نفر با صدای قوی ومحکم وبا جذبه قرائت کنه ', 0)
('خیلی داستان جذابی نداشت و در آخر داستان هم\u200c اسلام هراسی بیداد میکرد', 0)
('سلام من این کتاب را قبلا خریداری کردم نمی دانم چرا الان فایل کتاب را باز نمی کند ؟', 0)
('آیا حرف دلواپسان درست نبود؟', 0)
('عالی', 1)
('کتاب سطح پایین با نثری درهم پیچیده البته ترجمه نامناسب', 0)
('قشنگ بود', 1)
('ترجمه خیلی خیلی خیلی بد و غیرقابل فهم', 0)
('یکی از بهترین و ساده ترین کتاب ها برای کسایی که دنبال راه حلی هستند تا در زندگی مالی موفق شوند', 1)
('خیلی عالی بود لذت بردم امیدوارم شما هم فراموش نکنید اولین نفری بودم که به ازادیش فکر کردم به درخت، به ماهی، به دلیران تنگسیر، به پایانی خوش', 1)
print("Evaluating the BERT model")
model.evaluate(val_ds)
Evaluating the BERT model
32/32 [==============================] - 9s 293ms/step - loss: 0.6147 - accuracy: 0.7575
[0.6146795749664307, 0.7574999928474426]