Google Researcher Finds Most-Used English Words, Letters

“Etaoin srhldcu” may read like nonsense to most English speakers at first blush, but as it turns out, the combination is quite significant. It represents, in order, the most-used letters in the English language, according to a new survey of 743 billion words conducted by Google’s head of research Peter Norvig.

The survey, which was publicized by Google Research on Monday, was an update to the seminal 1965 survey of some 20,000 words gathered from a variety of printed sources — books, magazines, newspapers — conducted by Mark Mayzner, a former Bell Labs researcher.

Mayzner’s survey involved a lengthy and painstaking process of identifying each word occurrence and transferring it over to Hollerith (IBM) punch cards and running them through a sorter.

Mayzner recently contacted Google’s Norvig via email to see if Norvig was interested in repeating the experiment using Google’s much more voluminous English-language database: the entire Google Books collection of scanned English volumes. Norvig accepted the challenge. Using the Google Books Ngram Viewer (which shows word popularity over time), Norvig created a new dataset of some 97,565 unique words, collectively repeated 743.8 billion times, which he noted on his blog is about 37 million times as many occurrences as the 20,000-word sample that Mayzner assembled (743.8 billion divided by 20,000 is roughly 37 million). Norvig’s sample also included over 3 trillion individual letters.

On his website, Norvig published the results of his word and letter frequency tabulation.
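The article does not show how Norvig ran the tabulation, so the following is only a minimal Python sketch of the general idea: start from a table of words and their occurrence counts, then weight each letter by how often its word appears. The function name `letter_frequencies` and the tiny `sample_counts` table are made up for illustration and are not taken from Norvig's data or code.

```python
from collections import Counter

def letter_frequencies(word_counts):
    """Tally letters, weighting each word by its occurrence count."""
    letters = Counter()
    for word, count in word_counts.items():
        for ch in word.lower():
            if ch.isalpha():  # skip apostrophes, hyphens, digits
                letters[ch] += count
    return letters

# Hypothetical toy counts, NOT figures from the survey.
sample_counts = {"the": 50, "of": 30, "and": 25, "to": 20}

letters = letter_frequencies(sample_counts)
total_letters = sum(letters.values())

# Print letters from most to least frequent with their share of all letters.
for ch, n in letters.most_common():
    print(f"{ch}: {n} ({n / total_letters:.1%})")
```

Run over a full word-count table instead of the toy one, the same loop would produce a ranking of the kind summarized by “etaoin srhldcu.”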

Previously, Linotype machines assumed the most commonly used letters to be, in order, “Etaoin shrdlu,” and had their keyboard letter order arranged accordingly. Here are the most-used English letters, according to the new survey:

And here are the most frequently appearing English words, according to Norvig’s work:

Among the other intriguing findings of the new survey: the average English word is 4.79 letters long, and 80 percent of English words are between 2 and 7 letters long. The most common 2-letter combination is “th,” while the most common 7-letter combination is “present.”
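Statistics like these can be derived from the same kind of word-count table. The sketch below is an illustration under assumptions, not Norvig's published method: it assumes letter combinations are counted inside individual words, and the helper names and the `sample` table are hypothetical.

```python
from collections import Counter

def mean_word_length(word_counts):
    """Average letters per word, weighted by occurrence count."""
    total_letters = sum(len(w) * c for w, c in word_counts.items())
    total_words = sum(word_counts.values())
    return total_letters / total_words

def share_in_length_range(word_counts, lo, hi):
    """Fraction of word occurrences whose length is between lo and hi."""
    total_words = sum(word_counts.values())
    in_range = sum(c for w, c in word_counts.items() if lo <= len(w) <= hi)
    return in_range / total_words

def ngram_counts(word_counts, n):
    """Count n-letter sequences within words, weighted by word count."""
    grams = Counter()
    for w, c in word_counts.items():
        for i in range(len(w) - n + 1):
            grams[w[i:i + n]] += c
    return grams

# Hypothetical toy counts, NOT figures from the survey.
sample = {"the": 50, "of": 30, "that": 10, "present": 5, "a": 5}

avg_len = mean_word_length(sample)
share_2_to_7 = share_in_length_range(sample, 2, 7)
top_bigram = ngram_counts(sample, 2).most_common(1)[0]
top_7gram = ngram_counts(sample, 7).most_common(1)[0]
print(avg_len, share_2_to_7, top_bigram, top_7gram)
```

Because every statistic is weighted by occurrence count, a very common short word such as “the” pulls the average length down far more than a rare long word pushes it up.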

Check out these and other findings at Norvig’s website. Norvig previously served as NASA’s lead computer scientist, and has worked at Google since 2001.
