The Problem
A couple of months ago in my MBA Communications course, I was required to take the Myers-Briggs Type Indicator (MBTI) personality test. It was one of those scantron-looking pamphlets that looks exactly like it did when my parents took it 40 years ago.[1. The MBTI was first published in 1962.] The questions are exactly what one would expect from a personality test, I…
After taking the test, everyone in the class scored their responses and I was surprised at how fascinated everyone was with their results. I admit that I too was quite interested in reading about the intricate details of what this piece of paper thought it knew about me. The professor told us all about the statistical validity of the MBTI and how it’s revolutionized the corporate world – except for one tiny thing: that people tend to answer the questions based on their “ideal selves” instead of who they actually are.
I started to question why, in today’s technologically advanced society, we were relying on individuals’ perceptions of themselves instead of observing the plethora of their actions that are visible online. With the amount of social media activity and online information available to anyone, I wondered why we couldn’t use all of that data to make more accurate predictions of personality.
The Experiment
I have a bit of a background in analytics and linguistics, so I decided to run a short experiment. Knowing that Twitter provides an API that gives anyone near-complete access to everything published on its platform, I thought it would be a perfect instrument for the experiment.[2. I purposely chose Twitter and asked for volunteers because of the openness of the platform. Everyone knows, or should know, that Twitter is 100% transparent and open. (The Library of Congress records your tweets.)]
That evening I went home and found a non-proprietary, open-source personality test that wouldn’t require me to pay a fee every time someone filled it out. Instead of the four MBTI dichotomies, this previously validated test measures one’s Big Five personality traits: extraversion, agreeableness, conscientiousness, neuroticism, and openness. Then I used Zoomerang, an online survey tool, to have a sample of people take the test and volunteer their Twitter handles. Since I was learning Ruby at the time, I wrote some software using the Twitter Ruby framework to slice and dice the portions of the Twitter API that I thought might be relevant for determining each of these five personality traits.
I imported the results of the sample’s personality tests as binary dummy variables indicating whether or not each person leaned towards a particular personality trait. Since this open-source personality test had already been verified as statistically valid, all I needed to do at that point was find the combinations of coefficients and independent variables that could consistently and reliably predict the correct trait. I imported the output of my software into the same spreadsheet as the binary personality data so that my algorithms could use zero as the center point of each trait. (So, for example, if the calculation from the extraversion algorithm came out above zero, that person would be an extravert; if it came out below zero, they would be an introvert.) With all of this data in hand, I used some statistical software to run multiple regressions to see which independent variables, if any, explained a substantive portion of the variance in any of the dependent variables (the individual traits).
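For readers who want to see the shape of that regression step, here is a rough sketch in Python (not the Ruby and spreadsheet setup I actually used). It fits an ordinary least-squares model with a small hand-rolled solver. The two features, the toy numbers, and the choice to code each trait as +1 or -1 so that zero sits in the middle are all assumptions made for illustration, not the variables or data from my experiment.

```python
def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, coef_2, ...].
    """
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def trait_score(row, coefs):
    """Linear score: intercept plus the weighted sum of the features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Toy, made-up rows (hypothetical features):
# (mentions per tweet, share of tweets that are replies).
features = [[0.9, 0.6], [0.2, 0.1], [0.7, 0.5], [0.1, 0.3], [0.8, 0.2], [0.3, 0.2]]
# Survey outcome coded +1 (extravert) or -1 (introvert); this coding is an assumption.
extraversion = [1, -1, 1, -1, 1, -1]

coefs = fit_least_squares(features, extraversion)
new_user = [0.6, 0.4]
label = "extravert" if trait_score(new_user, coefs) > 0 else "introvert"
print(coefs, label)
```

A real analysis would also look at how much variance each variable explains and whether the coefficients are significant, which is the part I left to the statistical software.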
I played around with political affiliation a little but, since I didn’t actually test this against the surveys, it’s purely conjecture. Using a spectrum of generally perceived political biases, such as Fox News being more right-leaning and NPR being more left-leaning, I proposed counting the mentions of and links to each of these networks in the content of an individual’s tweets.
I baselined an integer at zero and decided that for each URL observed from Fox News, the integer would increment by one. For each tweet that included a URL from NPR, it would decrement by one. Finally, if there were a sufficient number of overall links (I didn’t want to be thrown off by a fluke one-off mention), the result would serve as a crude indicator of political affiliation. I’ve considered tweaking this method to track only the media outlets at the edges (the three furthest left and right?) or even using a weighted average to weigh outlets at the media-bias extremes more heavily than those in the middle, but I have yet to return to the experiment.
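The counting scheme is simple enough to sketch in a few lines of Python. This is an illustration rather than the code I ran: the domain names, the `MIN_LINKS` cutoff of five, and the assumption that shortened links have already been expanded to their full URLs are all placeholders of mine.

```python
from urllib.parse import urlparse

# Hypothetical outlet weights: +1 for a right-leaning outlet, -1 for a left-leaning one.
OUTLET_LEAN = {"foxnews.com": 1, "npr.org": -1}
# Assumed cutoff so that a fluke one-off link does not decide the result.
MIN_LINKS = 5


def outlet_lean(url):
    """Return +1, -1, or 0 depending on which tracked outlet (if any) the URL points to."""
    host = (urlparse(url).hostname or "").lower()
    for domain, lean in OUTLET_LEAN.items():
        if host == domain or host.endswith("." + domain):
            return lean
    return 0


def political_score(urls, min_links=MIN_LINKS):
    """Sum the leans of all tracked links; return None if there are too few to judge."""
    counted = [lean for lean in map(outlet_lean, urls) if lean != 0]
    if len(counted) < min_links:
        return None
    return sum(counted)


links = [
    "https://www.foxnews.com/politics/a",
    "https://www.foxnews.com/politics/b",
    "https://www.foxnews.com/us/c",
    "https://foxnews.com/opinion/d",
    "https://www.npr.org/2012/01/01/e",
    "https://example.com/unrelated",
]
print(political_score(links))  # positive leans right, negative leans left
```

Tracking only the outlets at the extremes, or weighting them more heavily, would just mean changing the entries and values in `OUTLET_LEAN`.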
My base conclusions are far from scientific, as my sample size was incredibly small. In addition, I set no minimum “level” for a personality trait (a score of 0.001 would still be returned as an extravert), so if I were to move forward with this research, I would refine the algorithms and raise the bar for sufficiency by requiring the calculation to be, for example, above one or below negative one before displaying a result with more confidence. That being said, the crude algorithms do provide some insight into personality through passive observation. (I posted the algorithms at identitweet.com so anybody can view their traits.)
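To make the difference between the two rules concrete, here is a small Python sketch of the stricter cutoff. The function name, the `margin` parameter, and the “inconclusive” label are my own illustrative choices; the site itself only ever applied the plain above-or-below-zero rule.

```python
def classify(score, positive, negative, margin=0.0):
    """Label a trait score, treating anything within +/- margin of zero as inconclusive."""
    if score > margin:
        return positive
    if score < -margin:
        return negative
    return "inconclusive"


# The original rule: any score above zero counts, however small.
print(classify(0.001, "extravert", "introvert"))
# The stricter rule: the score must clear one in either direction.
print(classify(0.001, "extravert", "introvert", margin=1.0))
print(classify(-1.4, "extravert", "introvert", margin=1.0))
```

The right margin would have to come from a larger sample; one and negative one are only the example values from the paragraph above.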
Moving Forward
In addition to personality traits and political affiliation, I believe that the internet gives us a vehicle for determining a great deal about people through their actions online. (By “people,” I’m not simply referring to the narrow task of determining whether someone is an extravert or not – I think we can determine a great deal about society as a whole.)
There’s a lot of potential in this field of study for behavioral investigation. I find myself wondering if it would be possible to monitor individuals’ activity in order to predict psychological ailments such as depression or signs of suicide. Could we identify how well suited someone would be for a particular job? Could we serve advertisements to individuals based on their particular way of perceiving information? In terms of society as a whole, could we monitor the sentiment (positive vs. negative) of tweets that include the word “Obama” from everyone in California versus everyone in Alabama to serve as a real-time political polling device? What about brand sentiment? By location? And by hour of the day?
It’s fascinating to think about what one could determine, purely by knowing where and when to listen.
Follow me on Twitter @startuprob. Thank you to my lovely wife, Stephanie Johnson, for providing valuable feedback on this essay.
April 2012 Update: Due to the increased efficiency developed in the algorithms, the IdentiTweet site’s validity was falling behind so I took it down. I’m currently working on an alternative use for the technology.