A review: preprocessing techniques and data augmentation for sentiment analysis

In the literature, machine learning-based studies of sentiment analysis are usually supervised, which requires pre-labeled datasets that are large enough for the target domain. Such datasets are tedious, expensive and time-consuming to build, and models trained on them handle unseen data poorly. This paper takes a semi-supervised approach to Vietnamese sentiment analysis, for which datasets are limited. We summarize many preprocessing techniques used to clean and normalize data, including negation handling and intensification handling, to improve performance. We also present data augmentation techniques, which generate new data from the original data to enrich the training set without user intervention. Our experiments cover various aspects and obtain competitive results that may motivate further proposals.

Sentiment analysis faces many challenges. The first is text structure. Hussein [1] surveyed sentiment analysis challenges by comparing many past studies and identified three types of text structure: (i) structured sentiments, which appear in formal text; (ii) unstructured sentiments, which appear in informal, free text; and (iii) semi-structured sentiments, which lie between the two. Unstructured sentiment is the most difficult to work with because the writer is not required to comply with any constraints and may use slang, misspellings, incorrect grammar, etc. This makes the text structure hard to analyze; detecting negation in particular is a big challenge that affects both sentiment detection and text structure evaluation. Another challenge is selecting the most relevant techniques or approaches to classify the sentiment polarities.
According to Medhat et al. [2], there are three main approaches to sentiment analysis: the lexicon-based approach, the machine learning approach and the hybrid approach. The lexicon-based approach relies on emotional lexicons to detect customers' emotions; its main drawback is that it depends on the context and the language. Sentiment analysis is in fact a text classification problem, so machine learning classifiers can be applied to predict emotional polarities. Soleymani et al. [3] surveyed sentiment analysis methods, including those for text, and reviewed many previous studies in supervised and unsupervised learning.
The traditional approach is usually supervised learning, using classifiers such as Naive Bayes, SVM, logistic regression and ensembles of voting classifiers, together with feature selection to retain useful features and discard redundant ones. This approach depends on the size and quality of the pre-labeled datasets, which are scarce or unavailable for many applications. They are tedious to collect, expensive and time-consuming to build, tied to a particular domain, and ineffective at handling unseen data. Vietnamese training data in particular are not abundant, which has held back many proposals from research teams. This motivates us to develop an effective solution for text classification in general and sentiment analysis in particular.

Related works
Our study is based on semi-supervised learning and uses many preprocessing techniques to normalize the data, such as emoji icon replacement, elongated character removal, negation handling and intensification handling. Furthermore, we investigate how to generate additional training data automatically in order to improve model performance when training data are limited.
Preprocessing techniques are frequently used in natural language processing to prepare text for classification. Reviews in e-commerce systems, blogs and social media are informal, so they contain much noisy information that is unnecessary for detecting sentiment. Preprocessing cleans and normalizes the text and keeps only useful information. Symeonidis et al. [4] and Effrosynidis et al. [5] summarized the preprocessing techniques and performed experiments showing that they significantly improve classifier accuracy. Fernández-Gavilanes et al. [6] used a sentiment lexicon created by means of an automatic polarity expansion algorithm, together with natural language processing techniques such as the detection of polarity conflicts or concessive subordinate clauses.
Singh and Kumari [7] evaluated the effects of preprocessing on Twitter data and reported improved classifier performance. They removed URLs, hashtags, user mentions, punctuation and stop words, and replaced slang words with actual words using n-grams. Similarly, Jianqiang and Xiaolin [8] evaluated preprocessing on five Twitter datasets, including expanding acronyms, replacing negation, and removing URLs, numbers and stop words.
Negation handling removes negative forms to reduce ambiguity in the sentences to be classified; it is one of the important preprocessing techniques widely applied in sentiment analysis. AL-Sharuee et al. [9] used SentiWordNet 3.0 to prepare the underlying text for further processing and to handle common linguistic forms such as intensifiers, negation and contrast. In the next phase, they proposed binary ensemble clustering by assembling the results of a modified k-means algorithm, where the selected features are the adjectives and adverbs in all documents.
Emoji icons have boomed in e-commerce systems, blogs and social media, and they express much sentiment. Fernández-Gavilanes et al. [10] constructed a novel emoji sentiment lexicon using unsupervised sentiment analysis based on the definitions given by the emoji creators in Emojipedia, and created lexicon variants from the sentiment distribution of informal texts. Wang and Castanon [11] examined the relationship between emotional icons and sentiment polarities. They confirmed that a few emotional icons are strong signals of sentiment polarity, but that another group of emotional icons conveys complicated sentiment that is hard to map to a polarity.
Data augmentation has been in the spotlight in recent years: additional training data are generated automatically from a limited training set, which can be considered a form of semi-supervised learning. Sennrich et al. [12] and Sugiyama and Yoshinaga [13] used the back translation technique to generate training data and improve the performance of translation models. Fadaee et al. [14] also proposed a novel approach that augments the training data to improve translation quality substantially; it targets low-frequency words by generating new sentence pairs that contain rare words in synthetically created contexts. Kobayashi [15] proposed contextual augmentation, which stochastically replaces words with other words predicted by a bidirectional language model at those word positions. The language model is improved with a label-conditional architecture, which allows the model to augment sentences without breaking label compatibility.
Query expansion (QE) is also an effective way to obtain more data. Azad et al. [16] surveyed QE techniques in information retrieval (IR), whose purpose is to reformulate the original query to enhance IR effectiveness. This can be applied by submitting the original query to a search engine and selecting the most relevant retrieved results as new queries.
Şahin and Steedman [17] used the dependency tree to create new data by removing dependency links (crop) and moving tree fragments around the root (rotate). This proposal was inspired by two image augmentation techniques, cropping and rotating, and their experiments showed improvements for the majority of low-resource languages.
Wei and Zou [18] applied several easy data augmentation (EDA) techniques, namely synonym replacement, random swap, random insertion and random deletion, to generate new data. Although these techniques are easy to implement and do not depend on any external resources, they improve performance substantially.
In this paper, we aim to improve the accuracy of models trained on a limited number of samples by using preprocessing techniques to normalize and clean the data and by automatically generating additional training samples from the original ones. To evaluate our approach, we apply several well-known classifiers, namely logistic regression and support vector machine, and ensembles of classifiers such as one-vs-one and one-vs-all.
Our main contribution is an evaluation of the effects of preprocessing techniques and data augmentation for Vietnamese. We summarize the preprocessing techniques, investigate data augmentation techniques, and run experiments to examine whether training data can be generated automatically for Vietnamese to improve the accuracy of the algorithms. This is a necessary and meaningful step given the limited Vietnamese datasets: it enriches the Vietnamese data and saves the time and cost of building a pre-labeled dataset for a given domain.
The rest of this paper is organized as follows: Sect. 3 presents our background and approach, the experiments are presented in Sect. 4, and Sect. 5 gives the conclusions and future work.

Preprocessing techniques
Most recent studies in sentiment analysis focus on user-generated texts, which follow the writers' habits and are informal. Hence, it is necessary to clean and normalize the language and to remove noisy information before classification.
Vietnamese segmentation: this is always a required step when working with Vietnamese. For example, " " (this is a wonderful phone) is tokenized as " " (using the pyvi library). Unlike English, where words are separated by whitespace and punctuation, a Vietnamese word may consist of several tokens that must be grouped together; otherwise, the meaning of the sentence can differ greatly from the intended one.
Lowercase is a classic preprocessing technique that converts all text to lowercase. Words that differ only in case are merged, so the dimensionality of the problem is reduced; for example, " " (good) and " " (Good) map to the same dimension. This technique has been widely used by many researchers [19][20][21].
Stop words removal: stop words are function words that usually carry little meaning and no sentiment, but appear with high frequency in texts. They should be removed to reduce the dimensionality and the computational cost and to improve performance. The set of stop words is not fully predefined and depends on the application. In our experiments, the stop word list is determined from the term frequency and inverse document frequency weights in the collected datasets.
Elongated characters removal: some characters in a word are repeated one or more times to emphasize sentiment. This unnecessarily increases the dimensionality because classifiers treat the elongated forms as different words, and such forms may even be ignored due to their low frequency. Elongated characters removal therefore maps each elongated form back to its source word so that they share the same dimension. For example, "quuuuáaaa" is replaced by "quá" (so), and " " is replaced by " " (wonderful). The experiments of Symeonidis et al. [4] showed that this technique improves performance.
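The following is a minimal sketch of elongated characters removal, not the implementation used in the experiments. It assumes that repeated letters should be compared by their base letter, so that an accented "á" followed by plain "a" characters counts as one run; the function name `collapse_elongation` is hypothetical. Note that this naive rule also collapses legitimately doubled letters, so a real system would need an exception list.

```python
import unicodedata


def collapse_elongation(word: str) -> str:
    """Collapse runs of the same base letter, keeping its diacritics."""
    clusters = []  # each entry is [base_letter, combining_marks]
    # NFD splits an accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Attach the mark to the current letter (skip duplicates).
            if clusters and ch not in clusters[-1][1]:
                clusters[-1][1] += ch
        elif clusters and clusters[-1][0].lower() == ch.lower():
            # Same base letter as the previous one: part of an elongation.
            continue
        else:
            clusters.append([ch, ""])
    collapsed = "".join(base + marks for base, marks in clusters)
    # Recompose to NFC so the output uses precomposed Vietnamese letters.
    return unicodedata.normalize("NFC", collapsed)


print(collapse_elongation("quuuuáaaa"))
```

Comparing base letters rather than whole characters is what lets the example from the text, "quuuuáaaa", reduce to "quá" instead of "quáa".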
Abbreviations or wrong-spelling lexicons replacement: abbreviations and misspelled words have become a habit and are common in reviews on social media and e-commerce systems. They should also be merged into the source word, for example dth -> (cute); iu, êu -> yêu (lovely); omg -> oh my god; k -> không (not); sd -> (use); ote, okay, oki, uki, oke -> ok; -> (wonderful); , xs -> (excellent); wá, qá -> quá (so). Currently, the list of abbreviations and misspellings has been prepared manually for our experiments, based on observing reviews in social media and our collected datasets. Kim [19] corrected common spelling mistakes by using AutoMap. Symeonidis et al. [4] also mentioned this technique in a comparison of preprocessing techniques.
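A lookup table is one simple way to apply such a manually prepared list. The sketch below uses only the mappings that are fully given in the text; the names `ABBREVIATIONS` and `expand_abbreviations` are hypothetical, and the input is assumed to be lowercased and whitespace-separated already.

```python
# Mappings taken from the examples in the text; a real list would be longer.
ABBREVIATIONS = {
    "iu": "yêu",
    "êu": "yêu",
    "omg": "oh my god",
    "k": "không",
    "ote": "ok",
    "okay": "ok",
    "oki": "ok",
    "uki": "ok",
    "oke": "ok",
    "wá": "quá",
    "qá": "quá",
}


def expand_abbreviations(text: str, table: dict = ABBREVIATIONS) -> str:
    """Replace each whitespace-separated token that appears in the table."""
    return " ".join(table.get(token, token) for token in text.split())


print(expand_abbreviations("oki omg"))
```

Matching whole tokens rather than substrings avoids rewriting the letter "k" inside other words.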
Emotional icons replacement: emotional icons are widely used in reviews and denote users' sentiment. Wang and Castanon [11] analyzed and compared the sentiment of tweets with and without emotional icons to provide evidence of the importance of emotional icons in expressing sentiment in social media. In our case, positive and negative icons are replaced by the "pos" and "neg" lexicons, respectively; for example, :) is replaced by "pos" and :( is replaced by "neg".
Punctuation removal: punctuation marks (excluding the underscore _, which is used for Vietnamese segmentation) usually do not affect the sentiment and should be removed to reduce noise; for example, " " (so beautiful!, love this phone!) becomes " ". However, some punctuation sequences carry sentiment, such as the positive icons :), :D, ;) and <3, so removing them might decrease classification accuracy. In our work, punctuation removal is therefore applied after emotional icons replacement. Kim [19] also removed punctuation, URLs and stop words that do not carry any sentiment to improve performance.
Numbers removal: numbers normally do not carry any sentiment, so they should be removed. However, this step should be performed after emotional icon replacement and wrong-spelling replacement, because some of those patterns contain numbers, such as :3, <3, 8|, 8-), etc.
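The ordering constraint described above (icons first, then punctuation, then numbers) can be sketched as follows. The icon lists contain only the icons whose polarity is stated in the text, and the function names are hypothetical; this is an illustration of the ordering, not the code used in the experiments.

```python
import re
import unicodedata

POSITIVE_ICONS = [":)", ":D", ";)", "<3"]
NEGATIVE_ICONS = [":("]


def replace_icons(text: str) -> str:
    """Replace emotional icons with the 'pos' / 'neg' lexicons."""
    for icon in POSITIVE_ICONS:
        text = text.replace(icon, " pos ")
    for icon in NEGATIVE_ICONS:
        text = text.replace(icon, " neg ")
    return text


def remove_punctuation(text: str) -> str:
    """Drop punctuation but keep the underscore used by segmentation."""
    # \w matches letters, digits and '_', so '_' survives this step.
    return re.sub(r"[^\w\s]", " ", text)


def remove_numbers(text: str) -> str:
    """Drop digit sequences."""
    return re.sub(r"\d+", " ", text)


def preprocess(text: str) -> str:
    # NFC keeps accented Vietnamese letters as single characters for \w.
    text = unicodedata.normalize("NFC", text)
    text = replace_icons(text)       # must run before the next two steps
    text = remove_punctuation(text)
    text = remove_numbers(text)
    return " ".join(text.split())    # normalize whitespace


print(preprocess("hài_lòng <3 !!! 10 :("))
```

Running `remove_numbers` or `remove_punctuation` first would turn "<3" into fragments that no longer match any icon, which is why the order matters.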
Part of Speech (POS) handling: POS tagging is an essential problem in natural language processing that assigns a part of speech, such as noun, verb, adjective or pronoun, to each word in a sentence. This helps to add semantic information to the text. In our work, POS tagging is used to retain the words that carry sentiment, namely nouns, adjectives, verbs and adverbs. Symeonidis et al. [4] also applied POS tagging and kept nouns, verbs and adverbs in their experiments. For example, in our case, for " " (that phone is so beautiful, I am so pleased), the part of speech of each word is " " (N: noun, P: pronoun, A: adjective, R: adverb), and the sentence becomes " ". We also used the pyvi library for POS tagging.
Negation handling is a challenge in sentiment analysis. For example, when " " (the product is not good) is vectorized term by term without considering the term "không" (not), it might be evaluated as a positive sentence instead of a negative one. Normally, when a negation lexicon (không (not), (not), (not yet), etc.) is followed by a positive or negative lexicon, the phrase should be replaced by an antonym of that lexicon; for example, the phrase " " (not good) is replaced by " " (bad), an antonym of " " (good), based on a wordnet. However, in the experiments of Symeonidis et al. [4], replacing negations with antonyms beat the baseline only for logistic regression on the SS-Twitter dataset and for Convolutional Neural Networks on the SemEval dataset. Xia et al. [22] even showed that many machine learning algorithms fail when negations are replaced with antonyms.
Our work detects negation based on negation terms (không (not), (not), (not yet)); Fernández-Gavilanes et al. [6] also estimated the negation scope based on a set of negators (not, no, never, neither). No Vietnamese wordnet is strong enough for antonym replacement in our case, so when a negation is followed by a positive lexicon, the phrase is replaced by the "not_pos" lexicon, and when a negation is followed by a negative lexicon, the phrase is replaced by the "not_neg" lexicon. After that, to expose the polarity of the remaining lexicons, we append the "pos" or "neg" lexicon whenever a positive or negative lexicon appears, respectively. In our experiments, this significantly improves the accuracy of the classifiers.
For example, in the sentence " " (this phone design is not nice, but its performance is good), "không đẹp" (not nice) is a negation phrase in which "đẹp" (nice) is a positive lexicon, so it is replaced by the "not_pos" lexicon, and " " (good) is also a positive lexicon, so the sentence becomes " not_pos, pos".
Intensification handling: intensifier lexicons such as " " (very), "quá" (too), " " (a bit) and "khá" (pretty) emphasize, increase or decrease the semantic strength of the lexicons that precede or follow them. Handling them is necessary to detect the degree of customers' satisfaction. Fernández-Gavilanes et al. [6] applied intensification treatment as a preprocessing technique and used parsing to determine which semantic orientation was altered.
In our work, if the program detects an increasing intensifier lexicon that precedes or follows a positive or negative lexicon, it appends the "strong_pos" or "strong_neg" lexicon, respectively. If it detects a decreasing intensifier lexicon that precedes or follows a positive or negative lexicon, it appends the "pos" or "neg" lexicon, respectively. This step is applied after negation handling.
For example, in the sentence " " (nice design, strong configuration, I'm very pleased), " " is an increasing intensifier lexicon followed by the positive lexicon " " (pleased), so the sentence becomes " strong_pos". For intensification handling, we have also prepared a list of intensifier lexicons frequently used in Vietnamese, grouped into increasing ( ) and decreasing ( ) semantics.
Other techniques related to word morphology, such as stemming and lemmatization, are not used because Vietnamese is a non-inflectional language in which each word has only one form.
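The negation and intensification rules described above can be sketched on a list of segmented, lowercased tokens. The tiny lexicons below are illustrative stand-ins, not the lists used in the experiments, and the function names are hypothetical. Two details are assumptions, since the text does not fix them: the sentiment word itself is kept in front of its tag, and an increasing intensifier yields "strong_pos" or "strong_neg" in place of the plain tag.

```python
import unicodedata


def _nfc(word: str) -> str:
    return unicodedata.normalize("NFC", word)


# Illustrative lexicons only; the real lists are prepared manually.
POSITIVE = {_nfc(w) for w in ("đẹp", "tốt", "hài_lòng", "good", "nice")}
NEGATIVE = {_nfc(w) for w in ("xấu", "tệ", "bad")}
NEGATORS = {_nfc(w) for w in ("không", "chưa")}
INCREASING = {_nfc(w) for w in ("rất", "quá")}


def polarity(token):
    if token in POSITIVE:
        return "pos"
    if token in NEGATIVE:
        return "neg"
    return None


def tag_sentiment(tokens):
    """Apply negation handling, then polarity / intensification tags."""
    tokens = [_nfc(t) for t in tokens]
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Negator followed by a sentiment word: replace the whole phrase.
        if tok in NEGATORS and nxt is not None and polarity(nxt):
            out.append("not_" + polarity(nxt))
            i += 2
            continue
        pol = polarity(tok)
        out.append(tok)
        if pol:
            prev = tokens[i - 1] if i > 0 else None
            if {prev, nxt} & INCREASING:
                out.append("strong_" + pol)  # increasing intensifier nearby
            else:
                out.append(pol)  # decreasing intensifier or none
        i += 1
    return out


print(tag_sentiment(["thiết_kế", "không", "đẹp", "nhưng", "hiệu_năng", "tốt"]))
```

Handling negation inside the same pass, before the polarity check, mirrors the ordering stated in the text: intensification handling is applied after negation handling.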

Data augmentation
Data augmentation was originally used in image classification, where image data are increased through operations such as rotation, translation, scaling and adding noise. Similarly, data augmentation has been applied to text classification by increasing the text data with various techniques. Augmentation is more complex for text, and many studies have investigated how to obtain new data and improve their quality without user intervention. This enriches the original training data and increases model accuracy. Note, however, that data augmentation is only useful for small datasets.
First, we present the EDA techniques introduced by Wei and Zou [18] and apply them to Vietnamese.
Synonym replacement: words that are not stop words are chosen at random and replaced by one of their synonyms to form a new sentence. Random Swap randomly swaps two non-stopword words, n times. Random Insert finds a random synonym of a non-stopword word and inserts it at a random position in the sentence, n times. Random Delete randomly removes each word in the sentence with probability p. These processes are repeated until the expected amount of training data is obtained. For example, applying these techniques to the sentence " " (great! I'm so pleased) may generate new sentences as follows:
- Synonym Replacement: " " (" " is a synonym of "hài_lòng").
- Random Swap: "hài_lòng! " (swapping " " and "hài_lòng").
- Random Insert: " " (" " is also a synonym of "hài_lòng").
- Random Delete: " " (deleting the word "tôi").
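The four EDA operations can be sketched as follows. This is an illustrative version under stated assumptions, not the code of Wei and Zou [18] or of our experiments: the synonym dictionary and stop word set are supplied by the caller, all function names are hypothetical, and random deletion keeps at least one word so that no empty sentence is produced.

```python
import random


def synonym_replacement(tokens, n, synonyms, stop_words, rng):
    """Replace up to n non-stopword tokens by a random synonym."""
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tokens)
                  if t not in stop_words and synonyms.get(t)]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        tokens[i] = rng.choice(synonyms[tokens[i]])
    return tokens


def random_swap(tokens, n, stop_words, rng):
    """Swap two randomly chosen non-stopword tokens, n times."""
    tokens = list(tokens)
    positions = [i for i, t in enumerate(tokens) if t not in stop_words]
    if len(positions) < 2:
        return tokens
    for _ in range(n):
        i, j = rng.sample(positions, 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_insert(tokens, n, synonyms, stop_words, rng):
    """Insert a synonym of a random non-stopword token, n times."""
    tokens = list(tokens)
    candidates = [t for t in tokens
                  if t not in stop_words and synonyms.get(t)]
    for _ in range(n if candidates else 0):
        word = rng.choice(candidates)
        position = rng.randint(0, len(tokens))  # len(tokens) means append
        tokens.insert(position, rng.choice(synonyms[word]))
    return tokens


def random_delete(tokens, p, rng):
    """Remove each token with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def eda(tokens, synonyms, stop_words, n=1, p=0.1, seed=0):
    """Return one augmented sentence per EDA technique."""
    rng = random.Random(seed)
    return [
        synonym_replacement(tokens, n, synonyms, stop_words, rng),
        random_swap(tokens, n, stop_words, rng),
        random_insert(tokens, n, synonyms, stop_words, rng),
        random_delete(tokens, p, rng),
    ]
```

Calling `eda` repeatedly with different seeds yields four new sentences per call, one per technique.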
Although these methods may produce meaningless sentences, they can still increase the accuracy of classifiers in experiments. The number of repetitions should be chosen according to the dataset size, because too much augmented data can lead to overfitting. The biggest disadvantage of these methods is that they do not preserve the meaning of the sentence in its context, so we next present more complex approaches that retain the meaning of the original sentence.
Back translation aims to obtain more training samples by means of translators, and many research teams have used it to improve translation models [12][13][14][15][23]. The technique uses a translator to translate the original data into another language and then feeds the translated data into an independent translator to translate it back into the original language. Normally, the back-translated data are never exactly the same as the original data. English is one of the languages with many training datasets for translation, whereas other languages lack such datasets, so English is used as the intermediate language to obtain more data.
For example, the sentence " " is translated into English by Google translator as "I really like to buy the device at this store", and this translated sentence is then translated back into Vietnamese by Google translator as " ". This approach is simple to understand and helps to augment data while retaining the meaning of the original data, but it needs effective translators. In our experiments, we used the Google Translation API to translate the original Vietnamese data into English and then back into Vietnamese to obtain augmented data.
Syntax-Tree Transformations are a rule-based approach to generating new data. A syntactic parser builds a syntax tree from the original data, and syntactic grammar rules then transform this tree into a new tree, from which a new sentence form is generated. There are many syntactic transformations, such as converting active voice to passive voice.
For example, the sentence " " (I like this phone) is parsed into " " (P: pronoun, N: noun, V: verb) and transformed into " " (this phone is liked by me). The generated data still retain the meaning of the original data, but this approach is computationally costly, especially for Vietnamese, whose sentence structures are complex.
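The round trip described above can be sketched independently of any particular translation service. In the sketch below, `translate` is a hypothetical callable supplied by the caller (for example, a thin wrapper around a translation API); its `(text, source, target)` signature is an assumption, as is the choice to drop results that come back identical to the original review.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: translate(text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def back_translate(reviews: Iterable[str],
                   translate: Translator,
                   source: str = "vi",
                   pivot: str = "en") -> List[str]:
    """Translate each review to the pivot language and back again."""
    augmented = []
    for review in reviews:
        pivot_text = translate(review, source, pivot)
        restored = translate(pivot_text, pivot, source)
        # Assumption: an unchanged or empty result adds no new sample.
        if restored.strip() and restored != review:
            augmented.append(restored)
    return augmented
```

Passing the translator in as a parameter keeps the sketch testable with a stub and avoids committing to one service's client library.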
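As a rough illustration of one such rule, the sketch below rewrites a flat subject-verb-object sequence of POS-tagged tokens into a passive form. It is a deliberate simplification that works on a tag sequence rather than a full syntax tree; the function name `to_passive`, the rule itself and the default passive marker "được" are assumptions, not the grammar used in the experiments.

```python
from typing import List, Optional, Tuple


def to_passive(tagged: List[Tuple[str, str]],
               marker: str = "được") -> Optional[str]:
    """Rewrite 'subject verb object' as 'object marker subject verb'.

    `tagged` is a list of (word, tag) pairs where tag "V" marks a verb.
    Returns None when the pattern does not apply.
    """
    verb_idx = next(
        (i for i, (_, tag) in enumerate(tagged) if tag == "V"), None
    )
    # Need at least one word before the verb and one word after it.
    if verb_idx in (None, 0) or verb_idx == len(tagged) - 1:
        return None
    subject = [word for word, _ in tagged[:verb_idx]]
    verb = tagged[verb_idx][0]
    obj = [word for word, _ in tagged[verb_idx + 1:]]
    return " ".join(obj + [marker] + subject + [verb])


sentence = [("tôi", "P"), ("thích", "V"), ("điện_thoại", "N"), ("này", "P")]
print(to_passive(sentence))
```

A production system would apply such rules to the parser's tree so that multi-clause sentences are handled clause by clause.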

Classifiers
Based on the experiments in [24], we choose the best-performing classifiers for our experiments, namely logistic regression, SVM, and the ensembles of classifiers OVO and OVR.
Logistic regression (LR) is a statistical approach for determining the relationship between a dependent variable y and a set of independent variables x. The label of a data point is predicted from the probability given by the logistic function and a predefined threshold in [0, 1]. The sigmoid function is often used as the logistic function.
Support vector machine (SVM) is a strong classifier that finds the hyperplane dividing the dataset into groups in a multi-dimensional space. The separating hyperplane must be equidistant from the two margin hyperplanes that pass through the nearest data points of the two groups. For datasets that are not linearly separable, SVM uses kernel functions to map the data points into a space in which they are linearly separable. Our experiments use the RBF (radial basis function) kernel:
\[ K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right) \]
where γ controls how far the influence of a single data point reaches when the hyperplane is computed. With low γ values, data points far from the separating hyperplane are also taken into account; with high γ values, only data points close to it are considered.
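To make the role of γ concrete, the sketch below evaluates the RBF kernel in its standard form, K(x, x') = exp(-γ ‖x − x'‖²), for a pair of points. The function name `rbf_kernel` and the sample values are illustrative only.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two points at distance 2: a larger gamma makes their similarity vanish
# faster, so only nearby points influence the decision boundary.
for gamma in (0.01, 0.1, 1.0):
    print(gamma, rbf_kernel([0.0, 0.0], [2.0, 0.0], gamma))
```

The kernel value is 1 for identical points and decays towards 0 as the distance grows, at a rate set by γ.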
One-vs-one (OVO) and one-vs-all (OVA, also written one-vs-rest, OVR) are ensembles of binary classifiers for multi-class problems. Each iteration of OVO takes a pair of classes and applies the binary classifier to label a data point; the final label is determined by majority voting over the iterations. In each iteration of OVR, the binary classifier determines whether a data point belongs to one particular class, and the final label is determined from the predicted probabilities. The computational cost of OVR is lower: with c classes, OVO needs c(c − 1)/2 iterations, whereas OVR takes only c.
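The difference in the number of binary problems can be checked with a short sketch. The helper names below are hypothetical, and the class labels are placeholders.

```python
from collections import Counter
from itertools import combinations


def ovo_tasks(classes):
    """One binary problem per unordered pair of classes: c(c - 1)/2."""
    return list(combinations(classes, 2))


def ovr_tasks(classes):
    """One binary problem per class against all the others: c."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def majority_vote(votes):
    """Final OVO label: the class predicted by most pairwise classifiers."""
    return Counter(votes).most_common(1)[0][0]


labels = ["negative", "neutral", "positive", "mixed"]
print(len(ovo_tasks(labels)), len(ovr_tasks(labels)))
```

With four classes, OVO trains six pairwise classifiers while OVR trains four, and the gap widens quadratically as the number of classes grows.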

Data preparation and feature extraction
We have prepared four Vietnamese datasets of short reviews on watches, phones and food, collected from the internet and from previous studies (see Sect. 6). Table 1 shows the size of each dataset used for validation, and Table 2 presents some representative samples of each dataset. The positive and negative lists contain Vietnamese positive and negative lexicons, respectively, including English lexicons commonly used in Vietnamese sentences such as happy, nice, good, bad, etc. The negation list is used to detect negation in a sentence, such as "không" (not), " " (not yet). The intensification list is grouped into increasing intensifiers (" " (very), "quá" (so), " ", " " (extremely)) and decreasing intensifiers ("tạm" (pretty), "khá" (pretty), "cũng" (also)).
To extract features, we used the tf × idf weight, which is widely used in natural language processing and scores highly in text classification. The feature dimensions are unigrams and bigrams. The F1 score is used to evaluate the performance of the approaches; it is the harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0.
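As a small illustration of the two notions used here, the sketch below lists the unigram and bigram terms of a tokenized sentence and computes the F1 score of one class as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The function names are hypothetical, and how F1 is averaged over classes in the experiments is not specified in the text, so only the per-class score is shown.

```python
def word_ngrams(tokens, n_values=(1, 2)):
    """Return all n-grams of the given sizes (unigrams, then bigrams)."""
    return [
        " ".join(tokens[i:i + n])
        for n in n_values
        for i in range(len(tokens) - n + 1)
    ]


def f1_score(y_true, y_pred, positive):
    """F1 of one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(word_ngrams(["điện_thoại", "rất", "đẹp"]))
print(f1_score(["pos", "pos", "neg"], ["pos", "neg", "neg"], "pos"))
```

Because it is a harmonic mean, F1 stays low whenever either precision or recall is low, unlike the arithmetic mean.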

The experiments
As mentioned earlier, the approach is based on semi-supervised learning with limited training data, to reduce the effort of building a pre-labeled dataset. To demonstrate the effects of the preprocessing and data augmentation techniques, we take 100 reviews from each dataset for training and use all remaining reviews as validation data. Various experiments have been performed with the well-known classifiers.
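The split described above can be sketched as follows. The text does not say how the 100 training reviews are chosen, so the random selection with a fixed seed is an assumption, and the function name `split_reviews` is hypothetical.

```python
import random


def split_reviews(reviews, n_train=100, seed=0):
    """Take n_train reviews for training and keep the rest for validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


train, validation = split_reviews([f"review {i}" for i in range(250)])
print(len(train), len(validation))
```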

Preprocessing techniques
The first column of Table 3 gives the F1 scores of the classifiers without preprocessing techniques (the baseline results), and the second column gives the scores with preprocessing techniques. Performance improves on all datasets when the preprocessing techniques are applied. Nguyen-Nhat [21] has also shown the effects of these techniques, even when using only one review for training. This indicates that they are reasonable techniques for normalizing data before feeding them to the classifiers in Vietnamese sentiment analysis.

Back translation
The back translation technique generated 100 new reviews for each polarity of the training data. Google translator was used, and each review was translated between Vietnamese (vi) and English (en). The third column of Table 3 gives the F1 scores of the classifiers when back translation is used to generate new reviews without preprocessing; the classifiers improve slightly over the baseline results on all four datasets.
The sixth column of Table 3 shows the results when the data are first normalized with the preprocessing techniques and back translation is then applied to generate 100 more new samples. The scores are much better than the baseline results and also better than using preprocessing or back translation alone.
The reason is that normalized data lead to better translations, so the new samples are also of better quality. As a result, combining the techniques improves the classifiers more than using either technique independently. The results show that this is a promising approach for augmenting training samples and improving performance in Vietnamese sentiment analysis.

Syntax-tree transformation
For Syntax-Tree transformation, we generated new reviews by converting each active-voice sentence of a review to passive voice. This also generated 100 new reviews for each polarity.
The F1 scores of this technique, shown in the fourth column of Table 3, are also better than the baseline results, and the results obtained when the preprocessing techniques are incorporated are presented in the seventh column of Table 3. As with back translation, the combination boosts the performance of the classifiers: normalized data help POS tagging work better, which leads to better converted sentences. Thus, this approach is also suitable for Vietnamese sentiment analysis.

EDA
As discussed above, the previous approaches retain the meaning and grammatical structure of the original data, but they are complex and need external resources. On the other hand, the EDA techniques are simple and easy to understand, do not need other predefined datasets or external resources, and still obtain promising performance in English [18].
We perform ten iterations of each EDA technique, so the 100 reviews of the original training data generate 4000 new reviews. The fifth column of Table 3 gives the results of these techniques without preprocessing; most of the scores are worse than the baseline results, and only logistic regression on dataset 1 is slightly better. However, in one further experiment we combined these techniques with the preprocessing techniques (the eighth column of Table 3). Almost all results are then better than both the baseline and the preprocessing-only results (the second column), except for dataset 4, which is better than the baseline results but slightly worse than the preprocessing-only results. These results indicate that the techniques depend on the dataset. Several factors may explain the worse results. Synonym replacement and random insertion depend on how the synonyms are obtained, and some synonyms are not suitable in a given domain context. For example, in " " (I am pleased the phone), the word "hài_lòng" (pleased) has many Vietnamese synonyms such as [" ", " ", " ", " "], and replacing "hài_lòng" by " " (contented) is better than replacing it by " " (satisfied) in this context. Random deletion sometimes deletes words that carry the sentiment of the sentence.
In short, the visualized results for the experimental datasets (Figs. 1, 2, 3 and 4) show that the preprocessing techniques robustly improve performance in Vietnamese sentiment polarity classification. Moreover, data augmentation is a promising way to obtain abundant training samples and boost the accuracy of the classifiers. Based on our experiments, back translation and Syntax-Tree transformation are reasonable approaches, and the EDA techniques have the potential to improve Vietnamese sentiment polarity classification.

Conclusions and future works
In this paper, we have applied semi-supervised learning to Vietnamese sentiment analysis, summarizing preprocessing techniques to normalize data and data augmentation techniques to generate new training data from limited original training data. We have performed many experiments to show the effects of these techniques when applied to Vietnamese. For most of the experimental datasets, the accuracy of the classifiers improves. Moreover, ensembles of data augmentation techniques were also tested and obtained competitive results. These experimental results show that the approaches are reasonable and suitable for Vietnamese; they save the cost and time of building a pre-labeled dataset and move gradually towards domain independence.
The performance of these techniques depends on the original training data used to generate new data, and it would be better if the predefined resources that support preprocessing were large enough. Therefore, we will investigate how to process slang words and concessive words, collect intentional misspellings and abbreviations from social media, select better synonyms, and propose novel data augmentation approaches that produce higher-quality training samples, especially in the Vietnamese context. In short, these results are promising and can motivate our team and others to improve the scores for this attractive problem in the future.