Since the inception of social media, a prodigious number of status updates, tweets, and comments have been posted online. The language people use to express themselves can provide clues about the kind of people they are, online and off. Current efforts to understand personality from writing samples rely on theories and survey data from the 1980s. New research from the New England Complex Systems Institute (NECSI) uses social media data to successfully identify differences and similarities between users without prior assumptions. This is a step toward building a better understanding of the psychology of human personality.

Some personality psychologists do study publicly available data in addition to solicited surveys. However, they still start with predefined traits like extroversion, neuroticism or narcissism and correlate them with the writing. In other research, linguists have used algorithms to identify topics of conversation, but they do not have much to say about the personalities of the conversationalists. NECSI’s approach uses an unguided process that identifies word usage patterns among individual users without prior assumptions.

NECSI researchers analyzed tens of thousands of posts on the social media website Reddit to understand the characteristics of its users. The words written by each user were tallied and sorted, creating a distribution of individuals arranged according to the similarities or differences in their vocabularies. Pronouns and other very common words were excluded when arranging the users. Surprisingly, users still showed strong variations in pronoun use. The way people refer to themselves and others has previously been correlated with personality. The new analysis also associates these personality markers with topics of conversation.
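The tallying step described above can be illustrated with a minimal sketch. It is not the researchers' actual method: the common-word list, the sample posts, the user names, and the use of cosine similarity to compare vocabularies are all assumptions made here for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical list of very common words (including pronouns) to exclude.
# The study's actual exclusion list is not given in the source.
COMMON_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "on", "it",
    "i", "we", "us", "our", "you", "he", "his", "him", "she", "her", "they",
}


def tally_words(posts):
    """Count each word a user wrote, skipping the very common ones."""
    counts = Counter()
    for post in posts:
        for word in re.findall(r"[a-z']+", post.lower()):
            if word not in COMMON_WORDS:
                counts[word] += 1
    return counts


def cosine_similarity(a, b):
    """Compare two word tallies; 0.0 means no shared vocabulary.

    Cosine similarity is an assumed stand-in for whatever measure
    the study used to compare vocabularies.
    """
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample data, for illustration only.
posts_by_user = {
    "user_a": ["The Bruins won on the ice", "NHL puck drop tonight"],
    "user_b": ["Canucks lost, the puck hit the ice"],
    "user_c": ["New video games and a console"],
}

vocab = {user: tally_words(posts) for user, posts in posts_by_user.items()}
sim_ab = cosine_similarity(vocab["user_a"], vocab["user_b"])
sim_ac = cosine_similarity(vocab["user_a"], vocab["user_c"])
```

In this toy data, the two hockey posters share words such as "puck" and "ice," so they score as more similar to each other than either does to the video game poster, who shares no vocabulary with them.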

Topics emerged from the data as clusters of words frequently used by similar subsets of users. These word clusters include interests or hobbies like “hockey,” “global politics,” and “video games.” Other topics have no obvious theme. Because these topics emerge from the data through an unguided process, they are not biased by prior assumptions.

Importantly, this analysis preserves the complex relationships between topics and pronoun use. For example, a cluster of users was identified with the topic “hockey.” They frequently used words like NHL, puck, ice, Bruins, and Canucks. These users also typed a lot of third person male pronouns (he, his, him), reflecting the male dominance of the sport. They also frequently used first person plural pronouns (we, us, our), suggesting a focus on teamwork.

“Understanding the ways people can be different from each other is one of the most exciting topics in science,” said Prof. Yaneer Bar-Yam, NECSI president and an author of the paper. “This paper shows how we can make progress.”

These preliminary findings establish the potential for identifying differences between individuals from abundant social media data. The words people use online can tell us about their patterns of behavior. This analysis can also lead to a better understanding of human personality, informing existing psychological models.

