Making Every Word Count

If you’re like me, you’ve wasted time taking online quizzes like the one my friend challenged me to take: Name the 100 most frequently used English words in five minutes. (I got 45.)

You could waste all the time you’d like, as Top 100 word lists abound. Word-frequency rankings are part — albeit just a sliver — of the vast output from studies of language corpora, or large collections of written and sometimes spoken text. Researchers parse such data to help make sense of our ever-evolving language.

But the results of these rankings differ widely. Taking a snapshot of English in all its diverse incarnations is devilishly tricky and expensive. Computers and the Internet can make research simpler. But they also add to the challenge because they can distort language patterns.

Tension between size, cost and representativeness runs through all corpus research, raising questions about its quantitative findings. Transcripts of university lectures and television programs are favored sources for spoken language, but they can differ markedly from private chatter. And speech, in general, diverges from writing. “People don’t say ‘yes’ anymore in interviews,” Alison Duguid, a linguist at the University of Siena, Italy, offers by way of example. “They say ‘absolutely.’”

English can look very different when viewed through different prisms. “The” is the universal ranking champion, but “be” might place second or 22nd, depending on whether all conjugations, such as “is” and “was,” get counted. “I” was the most commonly used word in 11,700 10-minute conversations recorded in 2002 and 2003. It appeared 984,359 times, according to David Graff, the lead programmer analyst for the Linguistic Data Consortium at the University of Pennsylvania, which maintains corpora. “You” was runner-up, appearing 702,941 times.

In a collection of newspaper articles from the same time period, “I” ranked 30th and “you” ranked 43rd. “Yeah,” “um,” “uh” and “uh-huh” also made the Top 100 in conversations, but not in newspapers.
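The ranking swings described above come down to counting choices: whether each surface form is tallied on its own, or whether conjugations are grouped under one headword. A minimal sketch of the difference, using a made-up snippet of text and a hypothetical mini lemma table (real corpora use full morphological dictionaries):

```python
import re
from collections import Counter

# Tiny illustrative sample; actual corpora hold millions of words.
text = "The cat is on the mat. The dogs were in the yard, and I was there too."

# Naive tokenization: lowercase alphabetic tokens only.
tokens = re.findall(r"[a-z']+", text.lower())

# Option 1: count each surface form separately.
surface_counts = Counter(tokens)

# Option 2: fold conjugations into one lemma (hypothetical mini-table).
lemmas = {"is": "be", "was": "be", "were": "be", "am": "be", "are": "be"}
lemma_counts = Counter(lemmas.get(tok, tok) for tok in tokens)

print(surface_counts["is"], surface_counts["was"], surface_counts["were"])
print(lemma_counts["be"])  # grouped, so "be" climbs the ranking
```

Counted apart, "is," "was" and "were" each appear once here; folded together, "be" jumps to three occurrences, which is exactly why it can place second in one ranking and 22nd in another.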

The proper construction of corpora matters to a lot of people. Dictionary publishers use corpora to determine the most-common definitions for versatile words. Literature researchers need them to compare the work of a given author with the norms for language. Linguists use them to track the introduction of new words (“Facebook”) and the diminution of older ones (“britches”).

Microsoft uses corpora to help correct misspellings in its Word software. It has licensed over one trillion words of English text in each of the past two years, and bolsters its collection with emails exchanged on its Hotmail program, with identifying details removed, according to a spokeswoman. “Text corpora is the lifeblood of most of our development and testing processes,” says Mike Calcagno, general manager of the Microsoft group that manages Word.

Computers have spawned a burst of activity in the field. But even computers don’t suffice for the daunting task of word collecting and counting. Brown University’s one-million-word corpus was considered adequate in the 1960s. Today, the 100-million-word British National Corpus is considered small — and dated — because it predates the Internet and other sources of new language.

It’s easy to build bigger collections using the Web, but that gives short shrift to genres that don’t often make it online, notably fiction. It also ignores spoken words, which are underrepresented in corpora because they are so much harder and more expensive to collect.

Without enough spoken-language data, subtleties may not emerge. “The word ‘rife’ only occurs in negative contexts,” says Anne O’Keeffe, a linguist at Mary Immaculate College, the University of Limerick, Ireland. “We are never rife with money,” despite that affliction’s appeal.
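Patterns like the one Ms. O’Keeffe describes are typically spotted with a keyword-in-context (KWIC) concordance, which pulls every hit for a word together with a few words on either side so the surrounding company becomes visible. A bare-bones sketch, with invented sample sentences standing in for corpus lines:

```python
import re

# Hypothetical sample sentences standing in for real corpus lines.
corpus = [
    "The city was rife with corruption and decay.",
    "Rumors were rife after the sudden resignation.",
    "The old building was rife with mold.",
]

def kwic(lines, keyword, width=3):
    """Return each hit with `width` words of context on either side."""
    hits = []
    for line in lines:
        words = re.findall(r"\w+", line.lower())
        for i, w in enumerate(words):
            if w == keyword:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                hits.append(f"{left} [{keyword}] {right}")
    return hits

for hit in kwic(corpus, "rife"):
    print(hit)
```

Lined up this way, the right-hand contexts (“corruption,” “mold”) make the word’s negative leanings hard to miss — the kind of subtlety that never surfaces without enough data to line up.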

In assembling the British National Corpus, it cost as much to collect 10 million spoken words as 50 million written ones, says Lou Burnard. He worked in the early 1990s on building the corpus, which included the recorded conversations of 200 Britons. “It would be great to do another BNC, but we don’t have the funding,” he adds.

The intended American counterpart to the BNC faces similar problems. The American National Corpus, an array of text including the 9/11 Commission Report and Berlitz travel guides, contains a mere 22 million words.

This newspaper is remembered fondly by linguists for donating a large chunk of its archives in the late 1980s and early 1990s for corpus research. The Wall Street Journal’s oeuvre was an imperfect representation of English, however. For one thing, the financial sense of “stock” predominated over meanings tied to livestock and soup.

“It is really crucial that you have a corpus that is well-balanced,” says Princeton University linguist Christiane Fellbaum.

In the years since, the Web has eclipsed the Journal as the go-to repository of words. It is now the primary source for Oxford University Press’s corpus for the Oxford English Dictionary, which once relied on the BNC. John Mansfield, who works on developing Web sources for Oxford, agrees that the Web is short on fiction and conversational English. Otherwise, he says, “you’ve just got an incredible diversity of every kind of text.”

Nancy Ide, chairwoman of the computer-science department at Vassar College, who manages the American National Corpus, points out a major failing of Web-based corpora: Without copyright permission for all of this text, researchers can’t share and analyze it fully. Also, it’s difficult to isolate American English from British English and other variants online.

Oxford has devised precise, if arbitrary, targets for Web categories. Blogs get about the same share as law, science, business and medicine combined. “Blog” itself, incidentally, merits nary a mention in corpora assembled a decade ago.

Potentially skewed results for corpora have caused any number of headaches. Even guides for English teachers often don’t reflect changes in the language. “The Reading Teacher’s Book of Lists” tracks frequently used words based on a corpus from the early 1970s, says Edward Fry, a co-author of the book and a retired educational psychologist at Rutgers University. “Computer,” for one, is not on the list.