Then there are geography, income and race. For instance, the term "suttin" (a variant of "something") has been associated with Boston-area tweets, while the acronym "ikr" (an expression meaning "I know, right?") is popular in the Detroit area. Tweets containing the word "awesome," one study showed, are more likely to emanate from wealthy neighborhoods. Emoticons often appear in tweets sent from areas with a large Hispanic population.
These facts may seem frivolous. But research like this provides insights into how people purposefully and unwittingly use words to signal who they are. "Language is really a window into people's sense of personal identity," Eisenstein says. Tweet trends also make it possible to guess the demographics of senders when no information is explicitly provided--a huge asset for anyone trying to use Twitter to sell products, get out a message or collect statistics. For example, researchers at the Mitre Corp., a nonprofit science and technology group, came up with an algorithm that could correctly determine someone's sex 75% of the time on the basis of just their tweets. It outperformed humans and worked much faster. "We can't personally read all the tweets," says Carnegie Mellon professor Noah Smith, who has used Twitter to study economic confidence and presidential approval rates. "But you can write a computer program."
Smith--a specialist in natural language processing--and Eisenstein are researching the diffusion of new words, a pursuit that was much more painstaking in the pre-Internet days. Using Twitter, their team is constructing what Smith describes as "subway maps around the United States showing where words tend to move." What they've found is that race may matter as much as geography. A term coined in Jackson, Miss., for instance, might turn up in Memphis--both places that have a high percentage of African Americans--but not spread to Nashville, where the majority is white. Other researchers are following trails of tweets to investigate how rumors and urban legends change as they're passed from person to person.
Tweeters are generally oblivious to the possibility that their messages might be scrutinized, which is a boon to researchers who want to analyze natural speech rather than the kind of edited text you find in the pages of a magazine. "They don't feel like they're being observed by people in white lab coats," Smith says. "They really are just doing their thing." But that presents an ethical quandary too. Even if tweets are meant to be public--and are presented anonymously in academic papers--tweeters typically aren't consenting to be part of a study. As Zimmer says, "It's a gray area."
The data also has drawbacks. Though Twitter users number more than 200 million, they are not a random sample; they skew young and urban. People can lie about themselves. And no matter how much information academics can guess at, there are details--like income and education level--that they'd be able to get in a traditional study but can't get from a microblog. "When you're using these brute-force techniques of data collection, the picture of the individual gets lost," says Zimmer.