Library Of Congress Twitter Archive Nears Finish, Remains Unusable

Screenshot of Twitter's fail whale error message.

January 4, 2013 7:16 a.m.


Updated 12:26 p.m. EST, Friday, January 4

Almost three years after the project was first announced, the Library of Congress said on Friday that it expects to finish, by the end of January, a research archive of every tweet publicly posted on Twitter since the service launched in 2006. The archive will remain unusable for the foreseeable future, however, because of technical challenges the agency said it encountered during the project.

Specifically, the Library of Congress (LOC) wrote in a white paper (PDF) published online Friday that to date it has amassed an archive of 170 billion tweets and that it has almost completed its initial objectives, which include creating a chronological archive of tweets posted between 2006 and 2010 and a separate archive of every tweet posted since then.

“This month, all those objectives will be completed,” the LOC’s white paper states.

But the LOC is still struggling with “technology challenges to making the archive accessible to researchers and policymakers.” Chief among them: even when searching only the archive of older tweets, the system currently takes over 24 hours to execute a search for a single keyword.

As the white paper notes: “This is an inadequate situation in which to begin offering access to researchers, as it so severely limits the number of possible searches.”

So despite receiving over 400 inquiries and research requests from around the globe since the project was announced, the LOC hasn’t yet given a single researcher access to the archive and has no set date for when it may begin to do so. The archive was never meant to be fully open to the public, however; access was always to be restricted to accredited academic researchers.

The news of the project’s status comes almost two years after the LOC first began transferring tweets into its archive, a process that started in February 2011 and was facilitated by Gnip, a social media enterprise company based in Colorado that was handpicked by Twitter for the job.

As the Library explains it, Twitter gives Gnip access to its full firehose, the real-time stream of all public tweets posted by all of Twitter’s 200 million active users. Gnip in turn organizes those tweets into “hour-long segments” and uploads them to a secure server from which the Library can pull them.
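To make the “hour-long segments” idea concrete, here is a minimal sketch of how a stream of timestamped tweets could be grouped by the hour in which each was posted. It is illustrative only: the white paper does not describe Gnip's actual file format or tooling, and the function names and the (timestamp, text) record shape are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timezone-aware timestamp to the start of its UTC hour."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def segment_by_hour(tweets):
    """Group (timestamp, text) pairs into hour-long segments.

    Hypothetical helper: the record shape is an assumption, not Gnip's format.
    """
    segments = defaultdict(list)
    for ts, text in tweets:
        segments[hour_bucket(ts)].append(text)
    return dict(segments)


# Three made-up tweets: two in the 12:00 hour, one in the 13:00 hour.
sample = [
    (datetime(2013, 1, 4, 12, 5, tzinfo=timezone.utc), "first"),
    (datetime(2013, 1, 4, 12, 59, 59, tzinfo=timezone.utc), "second"),
    (datetime(2013, 1, 4, 13, 0, tzinfo=timezone.utc), "third"),
]

segments = segment_by_hour(sample)
for start in sorted(segments):
    print(start.isoformat(), len(segments[start]))
```

Each key marks the start of an hour, and each value holds the tweets posted during that hour, which is roughly the unit the Library says it pulls from the secure server.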

The LOC doesn’t blame Gnip for the difficulty that it has encountered in setting up the archive, but doesn’t exactly suggest the social media enterprise firm has gone out of its way to help the project, either, writing in the white paper that recently, “Library officials met with Gnip senior management in Washington” to set up a way to retrieve tweets through Gnip’s current products.

Gnip, for its part, provided the following statement to TPM about the progress and its role:

“Gnip believes Twitter represents the largest archive of human behavior to have ever existed. We’re thrilled that we’re able to partner with the Library of Congress to help make this data available to researchers. At Gnip, we believe that the value from social data is limitless and often get inquiries from academic researchers looking to analyze social data from Twitter. We’re excited by the progress the Library of Congress has made so far.”

The white paper notes that the LOC is still pushing to make its Twitter archive available, sooner rather than later, to any researcher who asks, though how much sooner is anyone’s guess: “In the near term, the Library is working to develop a basic level of access that can be implemented while archival access technologies catch up.”

Twitter’s growing popularity has, of course, only made the Library of Congress Twitter research archive project more difficult to achieve: The LOC explains in its white paper that when it first started accepting tweets back in 2011, it was taking in 140 million per day, but it now receives up to half a billion tweets per day.

That figure exceeds what Twitter itself has publicly said is posted on the service on average (Twitter CEO Dick Costolo said in November 2012 that Twitter averaged about 350 million tweets per day).
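For a sense of scale, the short calculation below converts the quoted volumes into tweets per second and compares them. It assumes the LOC's half-billion figure is a daily count and treats it as a flat 500 million, even though the white paper describes it as an upper bound; the variable names are ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Daily volumes quoted in the article.
LOC_2011_PER_DAY = 140_000_000      # LOC intake when transfers began in 2011
LOC_NOW_PER_DAY = 500_000_000       # "up to half a billion", taken as 500 million (assumption)
TWITTER_AVG_PER_DAY = 350_000_000   # average cited by Twitter's CEO, November 2012

rate_2011 = LOC_2011_PER_DAY / SECONDS_PER_DAY
rate_now = LOC_NOW_PER_DAY / SECONDS_PER_DAY
growth = LOC_NOW_PER_DAY / LOC_2011_PER_DAY
vs_twitter_avg = LOC_NOW_PER_DAY / TWITTER_AVG_PER_DAY

print(f"2011 intake: about {rate_2011:,.0f} tweets per second")
print(f"Current peak intake: about {rate_now:,.0f} tweets per second")
print(f"Growth since 2011: {growth:.2f}x")
print(f"Peak intake vs. Twitter's stated average: {vs_twitter_avg:.2f}x")
```

On those assumptions, the Library's peak intake is roughly three and a half times what it was in 2011, and about 1.4 times the daily average Twitter itself has cited.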

In terms of data storage, the 2006 through 2010 Twitter archive alone accounts for 20 terabytes (2.3 terabytes compressed), and that covers just the first 21 billion tweets, posted at a time when Twitter had far fewer and less active users.
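Those storage figures can be turned into a rough per-tweet size and a compression ratio. The sketch below is back-of-the-envelope arithmetic only: it assumes a terabyte means 10^12 bytes, which the white paper does not specify, and the helper name is ours.

```python
# Figures quoted in the article for the 2006-2010 archive.
TWEETS = 21_000_000_000   # "the first 21 billion tweets"
RAW_TB = 20.0             # uncompressed size in terabytes
COMPRESSED_TB = 2.3       # compressed size in terabytes

# Assumption: decimal terabytes (10**12 bytes). The source does not say
# whether the Library used decimal or binary units.
BYTES_PER_TB = 10**12


def bytes_per_tweet(total_tb: float, tweets: int) -> float:
    """Average bytes stored per tweet for an archive of the given size."""
    return total_tb * BYTES_PER_TB / tweets


raw_per_tweet = bytes_per_tweet(RAW_TB, TWEETS)
compressed_per_tweet = bytes_per_tweet(COMPRESSED_TB, TWEETS)
compression_ratio = RAW_TB / COMPRESSED_TB

print(f"Uncompressed: about {raw_per_tweet:.0f} bytes per tweet")
print(f"Compressed: about {compressed_per_tweet:.0f} bytes per tweet")
print(f"Compression ratio: about {compression_ratio:.1f} to 1")
```

Under that assumption, each archived tweet averages a little under a kilobyte uncompressed, far more than 140 characters of text would need, presumably because each record carries data beyond the message itself.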

So while the Library has made headway on its project of archiving all the pithy 140-character transmissions of Twitter’s users for research purposes, it still has a long way to go before it can actually make them useful to anyone.

Separately, Twitter in late December unveiled a feature allowing individual users to download personalized archives of their own tweets.

Late update: Modified to add Gnip’s statement in copy.