Celebrity profiling through linguistic analysis of digital social networks

Moreno-Sandoval, Luis G.; Pomares-Quimbaya, Alexandra; Alvarado-Valencia, Jorge A.

doi:10.1186/s40649-021-00097-w

Research
Open access
Published: 26 August 2021


Luis G. Moreno-Sandoval ORCID: orcid.org/0000-0001-8853-2455^1,2,
Alexandra Pomares-Quimbaya^1,2 &
Jorge A. Alvarado-Valencia^1,3

Computational Social Networks volume 8, Article number: 16 (2021)


Abstract

Digital social networks have become an essential source of information because celebrities use them to share their opinions, ideas, thoughts, and feelings. This makes digital social networks one of the preferred means for celebrities to promote themselves and attract new followers. This paper proposes a feature selection model for classifying celebrity profiles based on their use of the digital social network Twitter. The model analyzes lexical, syntactic, symbolic, participation, and complementary information features of celebrities’ posts and uses them to estimate their demographic and influence characteristics. Classification with these new features achieves an F1-score of 0.65 in Fame, 0.88 in Gender, 0.37 in Birth year, and 0.57 in Occupation. With these new features, the average accuracy improves by up to 0.14. As a result, features extracted from linguistic cues improved the performance of predictive models of Fame and Gender and facilitated explanations of the model results. In particular, the use of the third person singular was highly predictive in the model of Fame.

Introduction

Digital social networks (DSN) have become popular as a means of spreading information and connecting people with like-minded ones [1]. The capacity to spread opinions shows a general phenomenon with relevant implications in the context of social influence [2].

Public accessibility of DSN, along with the ability to share and exchange opinions, thoughts, and feelings, among others, allows people to connect not only with friends and family, but also with any celebrity on the network [3]. This ability has been evident in the growth of DSN communication [4]. However, the success of such communication attempts depends on the level of trust that members have in each other [1, 5], considering that opinions have helped to influence the feelings and emotions of the public [6].

For this reason, interest in researching microblog communities with services such as Twitter is growing exponentially [7], due to the massive production of written information that each user generates. This information includes personal data such as name, photograph, location, etc.; quantitative data such as the number of followers and people they follow; and also their timeline, which is the chronology of their messages, both public and private. Likewise, a user can follow another one by agreeing to receive the messages that the other user posts [8].

On the other hand, language variation is permanent and evident in the new ways of writing in DSN. Such variation is not necessarily random, but highly related to social factors [9]. In fact, linguistic ethnography holds that:

“to a considerable degree, language and the social world are mutually shaping, and that close analysis of situated language use can provide both fundamental and distinctive insight into the mechanisms and dynamics of social and cultural production in everyday activity” [10]

When people share their opinions in DSN (such as Twitter), they might also be revealing demographic, social and/or psychosocial information about themselves. For example, Schwartz and colleagues [8] indicate that their research has been driven toward an integral exploration of the language that differentiates people, giving a new perspective on psychosocial processes. This yields results such as identifying the words most commonly used by people with self-esteem issues, or how the possessive words used to refer to sentimental companions vary between men and women.

Rangel and colleagues [11] point out that due to the huge amount of information available on social networking platforms, it is possible to obtain information about different attributes such as gender, age, personality, native language, or political orientation from the analysis of an author’s profile.

Considering that celebrities frequently use DSN to communicate and connect with their followers [12], and understanding that a user’s behavioral profile is reflected in their messages through their writing patterns [13], it is essential to detect whether a user is a celebrity in order to determine the influence they may have on other users of social networks [14] and to know what the impact of a comment made by this user would be. This provides information to measure the influence of celebrities on their followers by means of the corpus of their texts.

Our motivation for writing this paper is to explore the predictive and explanatory capacities of linguistic features for the demographic and influence variables of celebrities using DSN. The main objective of the research is to understand how these linguistic features, which are found in the texts that celebrities publish on DSN, generate new information that allows celebrities to be classified according to their demographic and influence variables. Moreover, these new variables derived from the texts can indicate uses of language that reveal specific sociolects and idiolects, which are useful for analyzing the celebrity’s profile and for increasing the accuracy of the classification models.

Recognizing this opportunity, this article formally addresses the linguistic analysis of celebrities using DSN and proposes a model with 18 features that quantify the outcome of five types of analysis: lexical, syntactic, symbolic, participation, and complementary information. The lexical analysis covers the average use of words and lexical diversity. The syntactic analysis studies the personal pronouns most commonly used by celebrities. The symbolic analysis studies how symbolic contents such as emojis and hashtags are used. The participation analysis quantifies the features of participation in the network (mentions and retweets). Finally, the complementary information analysis examines the references that the celebrity makes to other media (URLs).
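As a rough illustration of the five analysis types, the following sketch computes a few simple per-tweet averages from a list of tweets. The function name extract_features, the regular expressions, the pronoun list, the "RT @" retweet convention, and the type-token ratio used as a stand-in for lexical diversity are all assumptions made for this example; they are not the paper's 18 feature definitions, and emojis are omitted for brevity.

```python
import re

# Illustrative patterns only; these are assumptions, not the paper's definitions.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
WORD = re.compile(r"[a-z']+")
THIRD_PERSON_SINGULAR = {"he", "she", "it", "him", "her", "his", "hers", "its"}


def extract_features(tweets):
    """Return per-author averages for a handful of linguistic cues."""
    if not tweets:
        raise ValueError("at least one tweet is required")
    totals = {"hashtags": 0, "mentions": 0, "urls": 0,
              "retweets": 0, "third_person": 0}
    tokens = []
    for text in tweets:
        is_retweet = text.startswith("RT @")
        totals["retweets"] += int(is_retweet)
        totals["urls"] += len(URL.findall(text))
        # Remove URLs first so "#" or "@" inside a link is not miscounted.
        no_urls = URL.sub(" ", text)
        totals["hashtags"] += len(HASHTAG.findall(no_urls))
        totals["mentions"] += len(MENTION.findall(no_urls))
        plain = MENTION.sub(" ", HASHTAG.sub(" ", no_urls))
        if is_retweet:
            plain = plain[2:]  # drop the leading "RT" marker
        words = WORD.findall(plain.lower())
        totals["third_person"] += sum(w in THIRD_PERSON_SINGULAR for w in words)
        tokens.extend(words)
    n = len(tweets)
    features = {f"{name}_per_tweet": value / n for name, value in totals.items()}
    features["words_per_tweet"] = len(tokens) / n
    # Type-token ratio: distinct words divided by total words.
    features["lexical_diversity"] = len(set(tokens)) / len(tokens) if tokens else 0.0
    return features
```

Calling extract_features on the full list of an author's tweets returns one dictionary per author, which could then serve as a row in a feature table for a classifier.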

The difference between this paper and the one presented [12] at the Conference and Labs of the Evaluation Forum (CLEF) is that this paper proposes a new feature selection model and explains how this model helps to increase accuracy. The work presented at Plagiarism analysis, Authorship identification, and Near-duplicate detection (PAN) at CLEF only presented several classification models and showed the accuracy obtained with different principles, but included nothing associated with feature selection.

This study is organized into eight sections. The second section summarizes the background, covering authors who have worked in areas related to the analysis of DSN, identification of profiles in texts, detection of demographic and social variables in texts, and the influence of celebrities. The third section presents the methodology, with the steps needed to determine the features of the digital identity that describe celebrities’ characteristics. The fourth section describes the data preparation: the corpus description, the exclusion of redundant measures, and the application of the methodology. The fifth section shows the results of the explanatory models, with the significance of each feature found in the digital identity. The sixth section validates the celebrity classification model and its predictive ability with the selected features, quantifying the improvement in accuracy. Finally, the seventh and eighth sections present the conclusions and future work.

Background

Relevant background on celebrity detection has three elements: first, a basic background in social networks; second, a review of works related to author profiling, including the Machine Learning classification models tested and the demographic and social variables that have been found valuable in the author profiling task; finally, a review of works that specifically address the study and prediction of celebrities’ influence.

Social networks

According to Aggarwal [15] a social network is defined as:

“a network of interactions or relationships, where the nodes consist of participants and the edges consist of relationships or interactions between these participants.”

Social network analysis (SNA), therefore, seeks to discover different types of patterns in the relationships among the nodes of the network [16], which allows these communities to be described. Thanks to the Internet, there is an interactive dialogue platform of digital relationships that emulates physical interactions and makes it possible to keep the different participants of the network in contact [17, 18]. This creates not only new forms of sharing information but also new forms of communication, one possible effect of which is a change in personal opinions or decisions due to the influence of new contacts [19]. Therefore, SNA is now of great interest for determining how language can be used to describe communities and their collective subjectivity through sociolects.

With the vast exchange of information over the Internet, users in social networks are leaving a digital trail; for example, every day, Facebook members post 3.2 billion likes and comments, and 340 million tweets are sent out on Twitter [19]. This trail contains associated information in the form of texts, images, URLs, or audio, and thus generates a social structure built by each user in their own network based on their connections with other users [20]. Therefore, the availability of large amounts of data on the web has given new motivation to the use of statistical and computational tools in the area of Social Network Analysis (SNA), owing to their growing popularity [15], combined with Natural Language Processing (NLP). Fan et al. [21] apply that combination to reduce the harmful effects caused by the spread of rumors in a social network, using the independent cascade (IC) model and the linear threshold (LT) model.
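For readers unfamiliar with the independent cascade model mentioned above, the following minimal sketch simulates one run of it. It is a generic textbook version under simplifying assumptions (a single uniform activation probability prob for every edge, and a graph given as a plain dictionary of out-neighbors); it does not reproduce the rumor-blocking method of Fan et al. [21], and the names used are chosen only for this example.

```python
import random


def independent_cascade(graph, seeds, prob=0.1, rng=None):
    """Simulate one run of the independent cascade (IC) model.

    graph maps each node to an iterable of its out-neighbors.
    seeds is the initially active set; prob is an assumed uniform
    activation probability for every edge.
    Returns the set of nodes that are active when the cascade stops.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                # A newly active node gets one chance per inactive neighbor.
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active
```

Because a single run is random, the expected spread is usually estimated by averaging the size of the returned set over many runs, each with its own random.Random instance or seed.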

Consequently, the work oriented to computational linguistics has focused on the analysis of the corpus found in conversations shared in social networks to analyze opinions, feelings, emotions and in general, the expression of private status on certain individuals [2, 22, 23].

Author profiling

Author profiling has been approached from different perspectives that converge on the question of how to describe or profile an author. One perspective studies the problem from a computational point of view, giving all the relevance to classification. Another takes the sociolinguistic point of view, where language is understood as a process of social construction that develops over time and describes the dialects, sociolects, or chronolects associated with the authors.

Some examples of these perspectives are the methodologies that took first, third, and fifth place in celebrity profiling at the PAN at CLEF event. First, Radivchev and colleagues [24] vectorized the users’ tweets with Term Frequency-Inverse Document Frequency (TF-IDF), taking into account the top 10,000 features from word bigrams, and used a combination of logistic regression and Support Vector Machines (SVM). In contrast, Martinc and colleagues [25] selected a logistic regression classifier with word unigram and character tetragram features, where the classifier and its hyper-parameters were chosen with a grid search. Finally, Petrik and Chuda [26] extracted the text features with TF-IDF using bigrams and trigrams to capture word relationships, and then combined them with a random forest of 200 decision trees as the classification model.
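To make the TF-IDF over word bigrams idea concrete, here is a small standard-library sketch. It uses the plain textbook weighting tf × log(N / df) with whitespace tokenization, which is an assumption for illustration; the cited systems may have used different tokenizers, smoothed weighting variants, or library implementations. The function names and the top_k parameter are hypothetical.

```python
import math
from collections import Counter


def word_bigrams(text):
    """Split on whitespace and return adjacent word pairs as strings."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def tfidf_bigrams(documents, top_k=10000):
    """Return one {bigram: weight} dictionary per document.

    Only the top_k most frequent bigrams in the corpus are kept,
    mirroring the idea of limiting the vocabulary size.
    """
    counts = [Counter(word_bigrams(doc)) for doc in documents]
    corpus_freq = Counter()
    doc_freq = Counter()
    for c in counts:
        corpus_freq.update(c)       # total occurrences in the corpus
        doc_freq.update(c.keys())   # number of documents containing it
    vocab = {bigram for bigram, _ in corpus_freq.most_common(top_k)}
    n_docs = len(documents)
    vectors = []
    for c in counts:
        length = sum(c.values())
        vectors.append({
            bigram: (tf / length) * math.log(n_docs / doc_freq[bigram])
            for bigram, tf in c.items()
            if bigram in vocab
        })
    return vectors
```

With this weighting, a bigram that appears in every document receives a weight of zero, while bigrams that are frequent in one author's tweets but rare elsewhere receive the highest weights. The resulting sparse vectors are the kind of input that a linear classifier such as logistic regression or an SVM would consume.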

Profile classification

Theoretical and empirical studies have demonstrated a strong relationship between social factors and linguistic attitudes, since language is perceived as a social activity that reflects and influences social reality [11, 27].

In fact, for Rangel and colleagues [11], the analysis of shared contents aims to:

“ predict different attributes of the authors, such as gender, age, personality, native language, or political orientation. Therefore, social networks are playing a vital role in identifying what people think because they can reinforce political ideas or even influence the way of thinking.”

The relationship between personality traits and language use has been widely studied in psycholinguistics, which analyzes how language use varies with personal characteristics. Initial research on author profiling focused mainly on formal texts and blogs. At present, however, researchers mainly focus on digital social networks, where language is more spontaneous and less formal [11].

A common feature of social media communication is that it is delivered through short messages, which often do not use standard language varieties [28]. As a result, there are connections that traditional analysis, such as word categorization of vocabulary, cannot capture; instead, the data itself drives an integral exploration of the language that differentiates people [8].

Consequently, social activities represent a great challenge for the selection and identification of the user profile, which is caused mainly by the diversity of texts and complex social structures [11, 29, 30].

Demographic and social variables

Jadhav and Mhetre [20] and Simaki and colleagues [27] indicate a connection between social networks and personal behavior on the web, identifying the relationship and influence between social factors and a person’s language. In fact, Milroy and Milroy [31] point out that one of the most important contributions of Labov’s (1972) “quantitative paradigm” to the study of language has been the systematic examination of the relationship between language variation and “speaker” variables such as age, ethnicity, gender, social network, and social class.

Due to this growing interest, the extraction of demographic information from text has been studied, and important approximations have been made by authors like Przybyla and Teisseyre [32], who identified demographic characteristics such as education, party association, and year of birth. In contrast, Simaki and colleagues [27] used texts to determine an author’s gender, moving from a qualitative to a quantitative analysis, while [33] explored the differences between male and female writing in a large subset of the British National Corpus.

Nguyen and colleagues [22] and Romaine [34] state that linguistic variations occur over long and non-immediate periods of time in a sociolect. This means that the corpus of each generation has its own linguistic characteristics, in which people of different genders and ages tend to have different linguistic features. This is strongly related to the social influence and identity they have in the usage of language [27].

As for the characterization of “occupation”, authors such as Sloan and colleagues [35] used a search engine designed to identify the socioeconomic group of a tweet. The 2010 Standard Occupational Classification (SOC) system is used by U.S. federal statistical agencies to classify workers and jobs into occupational categories.

Celebrities’ influence

Celebrities are some of the most common users of DSN, which they use to promote their careers and gain followers [36]. Social networks have therefore been a revolutionary scenario for these individuals, because these platforms allow them to share any information with their fans [12]. This demonstrates that a minority influences an exceptional number of people, becoming an important factor in the creation of public opinion [37].

In order to know a celebrity’s influence on the network, it is necessary to specify who influences whom. However, evidence of such influence in real-world networks is limited, and only a few studies have attempted to obtain it empirically [38].

To determine this influence, it is necessary to know that there are celebrities who use only one social network. For example, words like “YouTuber” refer to a person whose primary social network is YouTube, or to a person who uses only this social network in search of a high reputation.

The development of micro-celebrities is more evident on Instagram, Facebook, Twitter, and other social platforms [39], which leads to different categories of celebrities on different social networks. Therefore, the database of this study presents celebrity profiling by hierarchical levels.

It is well known that identifying profiles is not easy. Although there are exciting approximations, computational linguistics requires an integrated approach that provides elements to understand patterns of linguistic variation [31] related to ethnographic and social factors. This study presents a model, and its validation, to detect celebrities from variables identified and explained in the development of this study.

When trying to identify a user as a celebrity on Twitter, authors such as Wang and Kraut [40] argued that a specific topic and its continued usage in the user’s tweets affect the number of followers in two modalities: homophily and network externalities. However, Hutto and colleagues [41] created a theory based on forecasting models that also included the topic of tweets but, unlike Wang and Kraut, did not find that continued usage of a topic predicted more followers. Therefore, it is important to raise new proposals for the prediction of celebrities, not only because of the number of followers, but also because more work is required to understand the importance of the published contents in engaging an audience [42].

Meanwhile, Li and colleagues [18] indicated that, to detect opinion leaders in social networks, academic studies generally consider the semantic analysis of users’ comments or the emotional analysis of contents published by users based on positive or negative comments, as well as the analysis of feelings to define the relevance of the connection between users and followers. However, the detection of opinion leaders with semantic analysis or analysis of emotions is not always suitable for complex social networks, so Wang [43] proposed a method of extracting community opinion leaders based on a hierarchical structure.

Deep learning applied on feature selection in social network

Neural networks are based on the idea of representing the process of pattern recognition and classification that the human brain performs [44]. Research fields have applied this idea and evolved the models to increase their classification performance. Casas [45] mentions the phenomenon of replacing statistical and optimization models in order to understand travel behaviors and traffic management in geography.

Popularity is a critical issue in celebrities’ behavior, since an increase in the degree of their fame is often the result of implementing marketing strategies in the networks. Thus, research with neural networks becomes more relevant when determining the critical factors for popularity or the key active times for making posts on social networks. Hsu et al. [46] developed research to improve the performance of classifiers in social network popularity prediction tasks; they implemented a multimodal approach by integrating the images included in the post and their social information into a Convolutional Neural Network (CNN). Huang et al. [47] built a deep neural network model (Long short-term memory (LSTM) and CNN) with embedding in the responsible factors to improve the predictions of long-time popularity in social networks.

In turn, the social network’s characteristics involve vital indicators for the promotion of popularity. Retweets or hashtags contain relevant information about the interests of the communities participating in a separate communication thread generating topics of interest for celebrities, which can influence and achieve higher popularity in the network. Zhang et al. [48] proposed a neural network to predict retweeting behavior by weighting a layer of different interests from a clustering process to identify the core tweets of the cluster. Li et al. [49] modeled a CNN and LSTM-RNNs by improving existing classifiers to make hashtag recommendations by tweet representations that included word embedding generation, sentence composition, tweet composition, and hashtag classification.

PAN at CLEF is an international initiative that has been promoting the research of its excellence network in the fields of Digital Text Forensics and Stylometry for ten years. As a result, the best research groups around the world in the fields of Natural Language Processing (NLP) and Information Retrieval (IR) meet annually to participate in the Author Profiling, Author Verification, Authorship Attribution, and Style Change Detection tasks. In the 2019 edition, which included the tasks of Bots and Gender Profiling, Celebrity Profiling, and Style Change Detection, we participated in the Celebrity Profiling task, obtaining second place.^{Footnote 1}

Table 1 presents proposals for profiling celebrities, including the characteristics used by authors who have worked on identifying profiles through DSN.

Table 1 Demographic and social variables for profile detection
