.. _text_representation:

Text Representation
==========================================

.. image:: https://github.com/INGEOTEC/text_models/actions/workflows/test.yaml/badge.svg
   :target: https://github.com/INGEOTEC/text_models/actions/workflows/test.yaml

.. image:: https://badge.fury.io/py/text-models.svg
   :target: https://badge.fury.io/py/text-models

.. image:: https://coveralls.io/repos/github/INGEOTEC/text_models/badge.svg?branch=develop
   :target: https://coveralls.io/github/INGEOTEC/text_models?branch=develop

.. image:: https://dev.azure.com/conda-forge/feedstock-builds/_apis/build/status/text_models-feedstock?branchName=main
   :target: https://dev.azure.com/conda-forge/feedstock-builds/_build/latest?definitionId=16894&branchName=main

.. image:: https://img.shields.io/conda/vn/conda-forge/text_models.svg
   :target: https://anaconda.org/conda-forge/text_models

.. image:: https://img.shields.io/conda/pn/conda-forge/text_models.svg
   :target: https://anaconda.org/conda-forge/text_models

.. image:: https://readthedocs.org/projects/text-models/badge/?version=latest
   :target: https://text-models.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

Solving a text categorization problem usually starts by deciding which text transformation to use; the traditional approach would be to decide on a :ref:`bow`. A BoW representation using q-grams of characters and n-grams of words produces a satisfactory baseline (with term frequency-inverse document frequency as its weighting scheme and a linear support vector machine as the classifier, e.g., `B4MSA `_). However, it cannot encode more information than that used to learn the BoW and train the classifier.

Techniques from semi-supervised learning have been used to incorporate additional information into text representations. The basic idea is to convert a dataset into a labeled dataset where the labels are identified automatically. The :ref:`emoji` developed here follows this idea. The dataset used to train these representations is a set of tweets collected from Twitter's open stream in Arabic, Chinese, English, French, Portuguese, Russian, and Spanish.

On the other hand, many competitions and datasets follow a structure equivalent to the approach mentioned above; these competitions can be used to increase the number of available text representations. The :ref:`dataset` describes the datasets and competitions used to create the text representations.

.. _bow:

Bag of Words (BoW) Representation
--------------------------------------

The first step to including more information in the representation is to be able to represent the text in a format amenable to different machine learning algorithms. In particular, a BoW representation with q-grams of characters (2, 3, and 4) and words is used; the constraint is that the q-grams are computed only within words, and consequently, there are no q-grams spanning two words. The exception is Chinese, which only uses q-grams of 1, 2, and 3. The BoW was learned from 524,288 tweets randomly selected from the `text_models `_ collection. The BoW model is implemented with `microTC `_; the only particular characteristic is that only the 16,384 most frequent tokens were kept in the representation.
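The following is a minimal sketch of the within-word constraint described above, using scikit-learn instead of microTC: the ``char_wb`` analyzer produces character q-grams only inside word boundaries, so no q-gram spans two words. It is only an illustrative approximation; the distributed models are built with microTC, and their tokenizer, vocabulary, and weighting differ (for instance, word tokens would have to be added separately).

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> # Illustration only: 'char_wb' restricts character q-grams to the inside of words
>>> vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4), max_features=16384)
>>> Xs = vec.fit_transform(['Hi', 'It is working'])  # sparse document-term matrix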
The BoW models for the different languages are found in:

* `Arabic (ar) `_
* `Chinese (zh) `_
* `English (en) `_
* `French (fr) `_
* `Portuguese (pt) `_
* `Russian (ru) `_
* `Spanish (es) `_

These representations can be used as follows:

>>> from EvoMSA.utils import load_bow
>>> bow = load_bow(lang='en')
>>> X = bow.transform(['Hi', 'It is working'])
>>> X.shape
(2, 16384)

where the texts *Hi* and *It is working* are transformed into a matrix in :math:`\mathbb{R}^{2 \times 16384}`.

The BoW contains words and q-grams; let us depict the words with the highest weights in a wordcloud, which can be done with the following code.

>>> from matplotlib import pylab as plt
>>> from wordcloud import WordCloud
>>> _ = {bow.id2token[id]: w for id, w in bow.token_weight.items()}
>>> words = {k: v for k, v in _.items() if k[:2] != 'q:'}
>>> word_cloud = WordCloud().generate_from_frequencies(words)
>>> plt.imshow(word_cloud, interpolation='bilinear')
>>> plt.axis("off")

..
    plt.tight_layout()
    plt.savefig('bow-wordcloud.png')

.. image:: bow-wordcloud.png

.. _emoji:

Emoji Text Representation
--------------------------------

The procedure to transform the dataset into a labeled dataset is similar to the one used in `deepmoji `_. The idea is to use the emojis in the text as the labels. The process selects and removes the emojis in the tweets and keeps them as the labels of the text. The emojis kept are the ones that appear alone at least 1024 times, i.e., the text contains only one emoji.

The second step is to use the labeled dataset and the BoW model with a classifier; a Linear Support Vector Machine (implemented in :py:class:`~sklearn.svm.LinearSVC`) was chosen. The methodology used is one versus the rest. The positive class corresponds to a maximum of 262,144 tweets containing only the label at hand, whereas the negative class corresponds to 262,144 tweets randomly selected from the other labels. If there are not enough tweets, then as many tweets as possible are used, as long as the set contains the same number of positive and negative examples. For the negative class, only tweets with a unique label are used; however, if there are not enough, tweets with multiple labels are allowed as long as they do not contain the positive class.

The following code uses the emoji representation in English, using only the representation of the emoji identified with index 2.

>>> from EvoMSA.utils import load_bow, load_emoji
>>> bow = load_bow(lang='en')
>>> emo = load_emoji(lang='en', emoji=2)
>>> X = bow.transform(['This is funny', 'This is lovely'])
>>> emo.decision_function(X)
array([-0.89521935, 0.46341599])

The function :py:func:`~text_models.utils.emoji_information` can be used to know the available emojis. The information is stored in a dictionary where the keys are emojis, and the values contain additional information such as the performance (i.e., recall), the number of examples of the positive class (i.e., number), and the identifier (i.e., emoji). The following code shows an example:

>>> from text_models.utils import emoji_information
>>> emoji = emoji_information(lang='en')
>>> emoji['❤️']
{'recall': 0.6546198309081498, 'emoji': 2, 'number': 673530}
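To make the one-versus-the-rest scheme described above concrete, the following sketch fits a single per-emoji model from a handful of made-up examples, reusing the :py:data:`bow` loaded above; the distributed models were, of course, trained on the balanced tweet samples just described, so this is an illustration only.

>>> from sklearn.svm import LinearSVC
>>> # hypothetical labels: 1 when the removed emoji is the target, 0 otherwise
>>> pos = ['This is lovely', 'I love this so much']
>>> neg = ['This is funny', 'What a goal']
>>> y = [1] * len(pos) + [0] * len(neg)
>>> svc = LinearSVC().fit(bow.transform(pos + neg), y)
>>> svc.decision_function(bow.transform(['This is lovely']))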
Complementing the previous wordcloud, the following code computes a wordcloud of the words that contribute the most to deciding the positive class of the emoji model :py:data:`emo`.

>>> w = emo.coef_[0]
>>> _ = {bow.id2token[id]: _w * w[id] for id, _w in bow.token_weight.items() if w[id] > 0}
>>> words = {k: v for k, v in _.items() if k[:2] != 'q:'}
>>> word_cloud = WordCloud().generate_from_frequencies(words)
>>> plt.imshow(word_cloud, interpolation='bilinear')
>>> plt.axis("off")

..
    plt.tight_layout()
    plt.savefig('emoji-wordcloud.png')

.. image:: emoji-wordcloud.png

.. _dataset:

Dataset Text Representation
---------------------------------

The idea of the Dataset Text Representation is, on the one hand, to increase the number of representations and, on the other, to test their impact on the performance of a text classifier. The datasets used are in Arabic, Chinese, English, and Spanish; these are text categorization problems taken from competitions such as SemEval, TASS, and IberLEF, among others. The models were created using the one-versus-the-rest approach, even when there are only two classes. Consequently, in a binary problem, there will be two models, corresponding to each class being the positive class.

The function :py:func:`~text_models.utils.dataset_information` can be used to know which datasets are available. It returns a dictionary where the keys correspond to the dataset names and the values are the labels.

>>> from text_models.utils import dataset_information
>>> dataset = dataset_information(lang='en')
>>> dataset['HA']
array(['negative', 'neutral', 'positive'], dtype='<U8')

These models can be used as follows, where the parameter :py:data:`k` indicates the index of the label treated as the positive class (in this case, *positive*).

>>> from EvoMSA.utils import load_dataset, load_bow
>>> bow = load_bow(lang='en')
>>> ha = load_dataset(lang='en', name='HA', k=2)
>>> X = bow.transform(['Good moorning!', 'It is a cold and rainy day'])
>>> ha.decision_function(X)
array([ 0.8267016, -0.1237163])

where the examples presented correspond to a correct classification.
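Since there is one model per label, a complete prediction can be sketched by evaluating every model and selecting the label with the largest decision value. The following is only an illustration under that assumption (the raw decision values of independently trained models are not calibrated), not the procedure used by the library.

>>> import numpy as np
>>> labels = dataset['HA']
>>> models = [load_dataset(lang='en', name='HA', k=k) for k in range(len(labels))]
>>> Xha = bow.transform(['Good moorning!', 'It is a cold and rainy day'])
>>> df = np.array([m.decision_function(Xha) for m in models])
>>> labels[df.argmax(axis=0)]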
Following a procedure equivalent to the one used to compute the emoji wordcloud, this time the wordcloud is computed for the dataset *HA* on the positive class.

>>> w = ha.coef_[0]
>>> _ = {bow.id2token[id]: _w * w[id] for id, _w in bow.token_weight.items() if w[id] > 0}
>>> words = {k: v for k, v in _.items() if k[:2] != 'q:'}
>>> word_cloud = WordCloud().generate_from_frequencies(words)
>>> plt.imshow(word_cloud, interpolation='bilinear')
>>> plt.axis("off")

..
    plt.tight_layout()
    plt.savefig('dataset-wordcloud.png')

.. image:: dataset-wordcloud.png

.. _dataset-emoji:

Dataset and Emoji Text Representations
------------------------------------------

Considering that there is a linear model for each emoji and dataset, it is feasible to visualize them with the aim of learning more about the similarities and differences between the models. Before starting the comparison, let us load the libraries used in the procedure.

>>> from EvoMSA.utils import load_bow, load_emoji, emoji_information, dataset_information, load_dataset
>>> from sklearn.metrics.pairwise import cosine_distances
>>> from sklearn.decomposition import PCA
>>> from matplotlib import pylab as plt
>>> from joblib import Parallel, delayed
>>> from tqdm import tqdm
>>> import numpy as np

The models are vectors where each component corresponds to a token; the BoW assigns a weight to each token. These weights are incorporated in the comparison by computing the element-wise product of the coefficients obtained by the SVM (see :py:data:`m.coef_`) and the weights :py:data:`w`.

>>> def weights(models: list, lang: str):
>>>     bow = load_bow(lang=lang)
>>>     w = np.array([bow.token_weight[i] for i in range(len(bow.token_weight))])
>>>     return np.array([m.coef_[0] * w for m in models])

The first step is to download the :ref:`emoji`.

>>> LANG = 'es'
>>> emoji_info = emoji_information(lang=LANG).items()
>>> emoji_models = Parallel(n_jobs=-1)(delayed(load_emoji)(lang=LANG, emoji=k) for k in tqdm(range(len(emoji_info))))

The :ref:`dataset` can be retrieved using the following code.

>>> dataset_info = dataset_information(lang=LANG)
>>> problems = []
>>> [[problems.append(dict(name=name, lang=LANG, k=k)) for k in range(len(labels))] for name, labels in dataset_info.items()]
>>> dataset_models = Parallel(n_jobs=-1)(delayed(load_dataset)(**x) for x in tqdm(problems))

The representations are stored in :py:data:`emoji_models` and :py:data:`dataset_models`; these contain the coefficients estimated with the SVM. The missing step is to incorporate the weights of the BoW model, which is done with :py:func:`weights`. These representations are stacked to form a matrix, and the cosine distances between all pairs are computed, as can be observed in the following code.

>>> X = np.vstack([weights(emoji_models, lang=LANG), weights(dataset_models, lang=LANG)])
>>> distances = cosine_distances(X)

The final step is to visualize :py:data:`distances` using :py:class:`~sklearn.decomposition.PCA` to reduce the number of dimensions to two. The figure presents the emojis in black and the datasets in red.

>>> pca = PCA(n_components=2).fit(distances)
>>> for x, y in pca.transform(distances[:len(emoji_models)]):
>>>     plt.plot(x, y, 'k.')
>>> for x, y in pca.transform(distances[len(emoji_models):]):
>>>     plt.plot(x, y, 'r.')

..
    plt.tick_params(axis='both', bottom=False, labelbottom=False, left=False, labelleft=False)
    plt.tight_layout()
    plt.savefig('emoji-dataset-vis.png')

.. image:: emoji-dataset-vis.png

Example
~~~~~~~~~~~~~~~~~~

Let us assume that there is a text classification problem that one wants to visualize using the above procedure. The first step is to load the needed libraries.

>>> from EvoMSA import base
>>> from microtc.utils import tweet_iterator
>>> import os
>>> from sklearn.svm import LinearSVC

The second step is to load the dataset into memory; this can be done with the following instructions.

>>> tweets = os.path.join(os.path.dirname(base.__file__), 'tests', 'tweets.json')
>>> D = list(tweet_iterator(tweets))

It is time to estimate the parameters of the linear SVM and to scale the coefficients obtained with the tokens' weights.

>>> bow = load_bow(lang='es')
>>> m = LinearSVC().fit(bow.transform(D), [x['klass'] for x in D])
>>> w = np.array([bow.token_weight[i] for i in range(len(bow.token_weight))])
>>> P = [w * c for c in m.coef_]

The problem is represented in the matrix :py:data:`P`; the next step is to use :py:data:`P` and :py:data:`X` to compute the cosine distances as follows:

>>> P_dis = cosine_distances(P, X)

Let us depict the problem points along with the emoji and dataset points. In the following figure, the first 64 emojis are in black, the problem points are in red, and the rest of the emojis and datasets are in lightgrey.

>>> for x, y in pca.transform(distances[64:]):
>>>     plt.plot(x, y, '.', color='lightgrey')
>>> for x, y in pca.transform(distances[:64]):
>>>     plt.plot(x, y, 'k.')
>>> for x, y in pca.transform(P_dis):
>>>     plt.plot(x, y, 'r.')

..
    plt.tick_params(axis='both', bottom=False, labelbottom=False, left=False, labelleft=False)
    plt.tight_layout()
    plt.savefig('emoji64-problem.png')

.. image:: emoji64-problem.png
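Beyond the scatter plot, the distances in :py:data:`P_dis` can also be inspected directly. The following sketch, which is not part of the original procedure, reports the emoji model closest to each class of the problem; it assumes that the *emoji* field returned by :py:func:`~text_models.utils.emoji_information` corresponds to the index used by :py:func:`~EvoMSA.utils.load_emoji`, as in the loading loop above.

>>> # map the column index of the emoji block in X to the emoji character
>>> index2emoji = {v['emoji']: k for k, v in emoji_info}
>>> for label, row in zip(m.classes_, P_dis):
>>>     nearest = int(np.argmin(row[:len(emoji_models)]))
>>>     print(label, index2emoji[nearest])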
.. _text_categorization:

Text Categorization
-----------------------------------

The idea is that a figure similar to the one produced above provides information about the performance of a system developed on the text representations used. To complement the previous example, the following code uses the first 64 emojis and the BoW as text representations. These representations are combined using a stacked generalization approach (see `EvoMSA `_).

>>> from EvoMSA.utils import linearSVC_array
>>> from EvoMSA.model import LabeledDataSet
>>> coef, intercept = linearSVC_array(emoji_models[:64])
>>> emo = LabeledDataSet(textModel=bow, coef=coef, intercept=intercept)
>>> evomsa = base.EvoMSA(models=[[bow, 'sklearn.svm.LinearSVC'], [emo, 'EvoMSA.model.SVCWrapper']], stacked_method='sklearn.naive_bayes.GaussianNB', TR=False).fit(D, [x['klass'] for x in D])

The final step is to use the model to predict; the following code predicts two sentences in Spanish, i.e., *Buenos días!* (Good morning!) and *Es un día frío y lluvioso* (It is a cold and rainy day).

>>> evomsa.predict(['Buenos días!', 'Es un día frío y lluvioso'])
array(['P', 'N'], dtype='<U1')
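To go beyond a visual impression, one may also want a quantitative estimate of the contribution of the added representations. The following is a minimal sketch, assuming a simple train/test split of :py:data:`D` (whose entries have *text* and *klass* fields, as in EvoMSA's example file) and macro-recall as an example metric; it is not part of the library and serves only as an illustration.

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import recall_score
>>> y = [x['klass'] for x in D]
>>> Dtrain, Dtest, ytrain, ytest = train_test_split(D, y, test_size=0.3, random_state=0)
>>> # fit the same stacked model on the training split only
>>> model = base.EvoMSA(models=[[bow, 'sklearn.svm.LinearSVC'], [emo, 'EvoMSA.model.SVCWrapper']], stacked_method='sklearn.naive_bayes.GaussianNB', TR=False).fit(Dtrain, ytrain)
>>> hy = model.predict([x['text'] for x in Dtest])
>>> recall_score(ytest, hy, average='macro')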