Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.
Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
How It Works
Select public data sets are hosted on Amazon EC2 for free as Amazon Elastic Block Store (Amazon EBS) snapshots. Amazon EC2 customers can access this data by creating their own personal Amazon EBS volumes, using the public data set snapshots as a starting point. They can then access, modify, and perform computation on these volumes directly from their Amazon EC2 instances, paying only for the compute and storage resources that they use. Where available, researchers can also use pre-configured Amazon Machine Images (AMIs) with tools like Inquiry by BioTeam to perform their analysis.
To get started using the Public Data Sets on AWS, simply perform these three easy steps:
- Sign up for an Amazon EC2 account.
- Launch an Amazon EC2 instance.
- Create an Amazon EBS volume using the Snapshot ID listed in the catalog below for your chosen snapshot.
The ElasticFox Getting Started Guide provides a simple walkthrough of how to launch an instance and create an Amazon EBS volume using ElasticFox, a convenient Firefox plug-in. Alternatively, see the Amazon EC2 Getting Started Guide.
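For readers who prefer to script the third step, the following is a minimal sketch rather than an official procedure. It assumes the boto3 library's EC2 client (not mentioned in the original text) and uses its `create_volume`, `get_waiter`, and `attach_volume` calls. The function names, the device name `/dev/sdf`, the region, and the instance ID are illustrative assumptions. The volume must be created in the same Availability Zone as the instance it will be attached to, and the sketch cannot confirm that any particular snapshot ID is still available.

```python
from typing import Any


def create_volume_from_snapshot(ec2: Any, snapshot_id: str, availability_zone: str) -> str:
    """Create an EBS volume from a snapshot and return the new volume ID.

    `ec2` is expected to behave like a boto3 EC2 client (an assumption).
    """
    response = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = response["VolumeId"]
    # Wait until the volume reports the "available" state before using it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    return volume_id


def attach_to_instance(ec2: Any, volume_id: str, instance_id: str,
                       device: str = "/dev/sdf") -> None:
    """Attach the volume to a running instance under the given device name."""
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)


# Usage sketch (needs boto3 and AWS credentials; the region, zone, and
# instance ID below are placeholders, the snapshot ID is from the catalog):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   vol = create_volume_from_snapshot(ec2, "snap-5ad83b33", "us-east-1a")
#   attach_to_instance(ec2, vol, "i-0123456789abcdef0")
```

Passing the client in as an argument keeps the two functions easy to exercise with a stand-in object before running them against a real account. After attaching, the volume still has to be mounted from within the instance's operating system.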
- Annotated Human Genome Data provided by Ensembl
An annotated form of the Human Genome, well suited to biological research, released December 10, 2008. The first snapshot, the main Ensembl data, covers human and approximately 40 other species (see www.ensembl.org for a list) along with comparative genomics data (550GB). The second snapshot, the Ensembl BioMart, is a denormalized, query-optimized database that facilitates complex queries across one or more datasets (172GB).
- Main Ensembl (Linux/UNIX): snap-c78360ae
- Ensembl BioMart (Linux/UNIX): snap-c48360ad
- GenBank provided by the National Center for Biotechnology Information
An annotated collection of all publicly available DNA sequences, including more than 85.7 billion bases and 82.8 million sequence records (250GB).
- Linux/UNIX: snap-e249a38b (updated 03/20/2009)
- UniGene provided by the National Center for Biotechnology Information
A set of transcript sequences of well-characterized genes and hundreds of thousands of expressed sequence tags (ESTs), last updated December 9, 2008 (10 GB).
- Linux/UNIX: snap-5ad83b33
- Windows: snap-60d83b09
- A 3D Version of the PubChem Library provided by Rajarshi Guha at Indiana University
A 3D (single conformer) version of PubChem, a public database of chemical structures, in SD format (70 GB).
- Linux/UNIX: snap-a8dd3dc1
- Windows: snap-40dd3d29
- UGI Virtual Conformer Library provided by Rajarshi Guha at Indiana University
80GB of data in SD format on conformers for 500,000 molecules that can be used for virtual screening (85 GB)
- Linux/UNIX: snap-59d33330
- Windows: snap-48ce2r21
- PubChem Library provided by the National Center for Biotechnology Information
A data set of information on the biological activities of small molecules (230 GB)
- Linux/UNIX: snap-e6df3c8f
- Windows: snap-63d83b0a
- Various US Census Databases provided by The US Census Bureau
United States demographic data from the 1980 (2 GB), 1990 (50 GB), and 2000 US Censuses (200GB), summary information about Business and Industry (15 GB), and 2003-2006 Economic Household Profile Data (220 GB)
- 2000 US Census (Linux/UNIX): snap-92d333fb
- 2000 US Census (Windows): snap-36ce2e5f
- 1990 US Census (Linux/UNIX): snap-33f8185a
- 1990 US Census (Windows): snap-8cf818e5
- 1980 US Census (Linux/UNIX): snap-9df717f4
- 1980 US Census (Windows): snap-b6f818df
- 2003-2006 Economic Data (Linux/UNIX): snap-0bdf3f62
- 2003-2006 Economic Data (Windows): snap-4edd3d27
- Business and Industry Summary Data (Linux/UNIX): snap-5cf81835
- Business and Industry Summary Data (Windows): snap-8af818e3
- Various Labor Statistics Databases provided by The Bureau of Labor Statistics
Statistics on Inflation & Prices, Employment, Unemployment, Pay & Benefits, Spending & Time Use, Productivity, Workplace Injuries, International Comparisons, Employment Projections, and Regional Resources (15 GB)
- Linux/UNIX: snap-30f81859
- Windows: snap-8df818e4
- Various Transportation Databases provided by The Bureau of Transportation Statistics
Data and statistics from the US Department of Transportation on Aviation, Maritime, Highway, Transit, Rail, Pipeline, Bike/Pedestrian and other modes of transportation (15 GB)
- Linux/UNIX: snap-e1608d88
- Windows: snap-37668b5e
- DBpedia Knowledge Base provided by DBpedia.
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies. The knowledge base consists of 274 million pieces of information (RDF triples). It features labels and short abstracts for these things in 30 different languages; 609,000 links to images and 3,150,000 links to external web pages; and 4,878,100 external links into other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories (67GB).
Semantic extraction by DBpedia with contributions from the DBpedia Community, using data from Wikipedia.org. Snapshots prepared by the infochimps.org team using community curated metadata. Released under the GNU Free Documentation License.
- Linux/UNIX: snap-63cf3a0a (updated 03/20/2009)
- Windows: snap-09b75e60
- Freebase Data Dump provided by Freebase.com.
A data dump of all the current facts and assertions in the Freebase system. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people, and locations – all reconciled and freely available. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients. For more information, see the Freebase FAQ (26GB).
Data aggregated, processed and reconciled by freebase.com using data from Wikipedia.org, the freebase community, and many other open data sets. Snapshots prepared by the infochimps.org team using community curated metadata. Released under Creative Commons Attribution (CC-BY) license and the Freebase Terms of Service and Licensing Policy.
- Linux/UNIX: snap-86c93cef (updated 03/20/2009)
- Windows: snap-ab957cc2
- Wikipedia Extraction (WEX) provided by Freebase.com.
The Freebase Wikipedia Extraction (WEX) is a processed dump of the English language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. Freebase WEX is provided as a set of database tables in TSV format for PostgreSQL, along with tables providing mappings between Wikipedia articles and Freebase topics, and corresponding Freebase Types (66GB).
Semantic extraction by freebase.com, using data from Wikipedia.org. Snapshots prepared by the infochimps.org team using community curated metadata. Released under the GNU Free Documentation License.
- Linux/UNIX: snap-1781757e (updated 03/20/2009)
- Windows: snap-a6957ccf
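Snapshot IDs copied by hand from a catalog like this one are easy to mistype, so it can help to check their shape before requesting a volume. The sketch below is illustrative only. It relies on the fact that EBS snapshot IDs consist of `snap-` followed by hexadecimal digits (8 in the older format used in this catalog, 17 in the newer one). The function name and the dictionary layout are assumptions, and the three entries are copied from the listing above as a sample.

```python
import re

# "snap-" followed by 8 (legacy) or 17 (current) lowercase hex digits.
SNAPSHOT_ID = re.compile(r"snap-(?:[0-9a-f]{8}|[0-9a-f]{17})")


def invalid_snapshot_ids(catalog: dict) -> dict:
    """Return the catalog entries whose IDs do not look like EBS snapshot IDs."""
    return {
        label: snapshot_id
        for label, snapshot_id in catalog.items()
        if not SNAPSHOT_ID.fullmatch(snapshot_id)
    }


# A few entries copied from the catalog (labels are shortened for illustration).
catalog = {
    "UniGene (Linux/UNIX)": "snap-5ad83b33",
    "UGI Virtual Conformer Library (Linux/UNIX)": "snap-59d33330",
    "UGI Virtual Conformer Library (Windows)": "snap-48ce2r21",
}

print(invalid_snapshot_ids(catalog))
# → {'UGI Virtual Conformer Library (Windows)': 'snap-48ce2r21'}
```

A check like this only tests the format of an ID. It does not say whether the snapshot exists or which ID was intended, so a flagged entry needs to be confirmed against the provider's current listing.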