Java HashMap – details of implementation, Part 1

One of the most important and most widely used collections in Java is java.util.HashMap. This is the first post in a series describing the implementation details of java.util.HashMap.

java.util.HashMap is a hash table based implementation of the java.util.Map interface. There are some important features of this implementation (see the short example after the list):

  • it permits a null key as well as null values
  • there is no guarantee about the internal order of elements; in particular, the order can change as the collection grows in size
  • O(1) complexity (constant time, independent of collection size) of the get/put methods, assuming the hash function disperses elements well
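
A minimal sketch of these properties in practice (the class and variable names are mine, chosen only for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapFeaturesDemo {
        public static void main(String[] args) {
            Map<String, Integer> ages = new HashMap<>();
            ages.put("Alice", 30);
            ages.put("Bob", 25);
            ages.put(null, -1);                     // a null key is allowed

            System.out.println(ages.get("Alice"));  // 30, expected constant-time lookup
            System.out.println(ages.get(null));     // -1

            // Iteration order is not specified and may change as the map grows.
            for (Map.Entry<String, Integer> e : ages.entrySet()) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }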

Two methods must be implemented carefully if an element is intended to be used as a key in java.util.HashMap. These are the methods:

  • hashCode (defines the hash function over an element; it maps the element to an integer)
  • equals (defines equality of elements)

There is a contract between hashCode and equals: if two elements are equal, their hashCode values must be the same. The converse does not hold: two elements with the same hashCode are not necessarily equal.
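
A minimal sketch of a key class that satisfies this contract (the class and field names are hypothetical, used only for illustration):

    import java.util.Objects;

    public final class PersonKey {
        private final String firstName;
        private final String lastName;

        public PersonKey(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof PersonKey)) return false;
            PersonKey other = (PersonKey) o;
            return Objects.equals(firstName, other.firstName)
                    && Objects.equals(lastName, other.lastName);
        }

        @Override
        public int hashCode() {
            // Equal objects produce equal hash codes, so the contract holds.
            return Objects.hash(firstName, lastName);
        }
    }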

Internally, a HashMap is an array of a certain capacity. Each non-null element of this array is the head of a linked list in which the data elements themselves are stored. When an element is put into the HashMap, the array index is computed from the element's hash code. If there is already a linked list node at this position, we simply iterate through the linked list trying to find an equal element (equality is checked by calling the equals method). If an equal element is found, the new element replaces it. If no equal element is found, the new element is appended to the end of the linked list. Each list contains elements with similar hash codes (similar, because the hash codes are mixed and then reduced modulo the table size).
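
A rough sketch of how a bucket index can be derived from a hash code (this is a simplification, not the exact JDK code; the actual bit-mixing differs between JDK versions, and null keys are handled specially in the real implementation):

    // Simplified sketch of bucket selection, NOT the exact JDK implementation.
    public final class BucketIndexSketch {
        static int bucketIndex(Object key, int tableLength) {
            int h = key.hashCode();
            h ^= (h >>> 16);              // mix the higher bits into the lower ones
            return h & (tableLength - 1); // works because tableLength is a power of two
        }
    }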

Hence, the importance of a good hash function is clear. If the hash function is bad (e.g. it does not distribute hash codes uniformly), hash map performance will degrade. The worst case: the hash function always returns the same number regardless of object state. In such a case hash map performance will be the same as a linked list, i.e. O(n) (proportional to the collection size).
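
For example, a deliberately degenerate key class like the hypothetical one below pushes every entry into a single bucket:

    // Every instance hashes to the same bucket, so all entries end up in
    // one linked list and lookups degrade to O(n).
    final class BadKey {
        private final int id;

        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 42; // legal per the contract, but terrible for performance
        }
    }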

There are two parameters that affect HashMap performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data
structures are rebuilt) so that the hash table has approximately twice the number of buckets. The default initial capacity is 16 and the default load factor is 0.75, which offers a good tradeoff between time and space costs.
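
Both parameters can be supplied at construction time; a small sketch:

    import java.util.HashMap;
    import java.util.Map;

    public class CapacityDemo {
        public static void main(String[] args) {
            // Defaults: initial capacity 16, load factor 0.75.
            Map<String, String> defaults = new HashMap<>();

            // Explicit initial capacity, default load factor.
            Map<String, String> sized = new HashMap<>(64);

            // Explicit initial capacity and load factor.
            Map<String, String> tuned = new HashMap<>(64, 0.9f);
        }
    }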

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. So, when creating a HashMap to store a known number of elements N, it is good practice to set its initial capacity to roughly 1.1*N/0.75 (assuming the default load factor).
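
A small helper applying this rule (the class and method names are mine):

    import java.util.HashMap;

    public final class Maps {
        // Initial capacity of roughly 1.1 * N / 0.75 for an expected N entries,
        // following the sizing rule above (default load factor 0.75).
        public static <K, V> HashMap<K, V> newMapForSize(int expectedEntries) {
            int initialCapacity = (int) Math.ceil(1.1 * expectedEntries / 0.75);
            return new HashMap<K, V>(initialCapacity);
        }
    }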

The java.util.HashMap implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map; if no such object exists, the map can be wrapped using Collections.synchronizedMap.
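
A minimal sketch of the wrapping approach (names are mine; note that iteration over the synchronized wrapper still requires an explicit lock, as documented for Collections.synchronizedMap):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class SynchronizedMapDemo {
        // Wrap the map at creation time so no code ever touches the
        // unsynchronized HashMap directly.
        private static final Map<String, Integer> SHARED =
                Collections.synchronizedMap(new HashMap<String, Integer>());

        public static void main(String[] args) {
            SHARED.put("requests", 1);

            // Iteration must be guarded manually, even on the synchronized wrapper.
            synchronized (SHARED) {
                for (Map.Entry<String, Integer> e : SHARED.entrySet()) {
                    System.out.println(e.getKey() + " = " + e.getValue());
                }
            }
        }
    }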

The iterators returned by all of this class’s “collection view methods” are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator’s own remove method, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
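
A small sketch that typically triggers this behavior (fail-fast checking is best-effort, so the exception is not strictly guaranteed; the class name is mine):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class FailFastDemo {
        public static void main(String[] args) {
            Map<Integer, String> map = new HashMap<>();
            for (int i = 0; i < 10; i++) {
                map.put(i, "value" + i);
            }

            Iterator<Integer> it = map.keySet().iterator();
            while (it.hasNext()) {
                Integer key = it.next();  // typically throws ConcurrentModificationException
                                          // on the call after the modification below
                if (key == 0) {
                    map.put(100, "oops"); // structural modification outside the iterator
                }
            }
        }
    }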

Amazon Public Datasets, Part 4

Continuing the exploration of publicly available datasets from Amazon (Public Data Sets : Amazon Web Services). Below are datasets 31 to 40, with brief descriptions (alphabetically sorted).

  1. Japan Census Data
    Size: 5 GiB
    Created On: February 15, 2012 2:23 AM GMT
    Last Updated: March 4, 2012 3:22 AM GMT
    Source: Statistics Bureau, Ministry of Internal Affairs and Communications, Japan
    Description: Population Census of Japan (1995, 2000, 2005, 2010). This data set contains Japanese population census data from 1995, 2000, 2005, and 2010. This includes information on:
    - Sex, Age and Marital Status of Population
    - Structure and Housing Conditions of Households
    - Labour Force Status of Population
    - Industry (Major) of Employed Persons
  2. Labor Statistics Databases
    Size: 15GB
    Created On: April 1, 2009 11:22 PM GMT
    Last Updated: June 4, 2009 8:25 PM GMT
    Source: The Bureau of Labor Statistics
    Description: Statistics on Inflation & Prices, Employment, Unemployment, Pay & Benefits, Spending & Time Use, Productivity, Workplace Injuries, International Comparisons, Employment Projections, and Regional Resources
  3. M-Lab dataset: Network Diagnostic Tool (NDT)
    Size: 550GB
    Created On: December 9, 2009 1:34 AM GMT
    Last Updated: December 10, 2009 2:00 AM GMT
    Source: Rich Carlson, with Measurement Lab
    Description: NDT is a network performance testing system that allows end-users to attempt to identify computer configuration and network infrastructure problems that degrade their broadband experience. By running a short test between a user’s computer and an NDT server, the tool can provide information on a user’s connection speed and attempt to diagnose what, if any, problems exist. The server collects test results and records the user’s IP address, upload/download speed, packet headers and various TCP variables from the test. NDT is an open source project that is under active development by Internet2.
    This dataset includes the test results from users from around the world who have run NDT through Measurement Lab (M-Lab). This data set includes results from various tests run between February 2009 and September 2009. M-Lab is an open server platform on which researchers deploy network measurement tests that generate information on broadband performance for end users. You can run NDT on M-Lab here.
    The data will be of interest to researchers who are interested in studying the actual performance of Internet users’ broadband connections.
  4. M-Lab dataset: Network Path and Application Diagnosis tool (NPAD)
    Size: 10GB
    Created On: December 9, 2009 1:34 AM GMT
    Last Updated: December 10, 2009 2:00 AM GMT
    Source: Matt Mathis, with Measurement Lab
    Description: NPAD is a network performance testing system that helps end-users to diagnose some of the common problems affecting the last network mile and end-users’ systems. As NPAD transfers bulk data between a user’s computer and an NPAD server, it gathers detailed statistics about what mechanisms actually regulate performance. In doing so, the server collects test results and records the IP addresses, upload/download speed, packet headers and TCP variables of the test. NPAD is a joint project of the Pittsburgh Supercomputing Center and the National Center for Atmospheric Research, funded under NSF grant ANI-0334061.
    This dataset includes the test results from users from around the world who have run NPAD through Measurement Lab (M-Lab). This data set includes results from various tests run between February 2009 and September 2009. M-Lab is an open server platform on which researchers deploy network measurement tests that generate information on broadband performance for end users. You can run NPAD on M-Lab here.
    The data will be of interest to researchers who are interested in studying the actual performance of Internet users’ broadband connections.
  5. Marvel Universe Social Graph
    Size: 1 GB
    License: Public Domain
    Created On: February 3, 2011 12:00 AM GMT
    Last Updated: February 3, 2011 12:00 AM GMT
    Source: http://bioinfo.uib.es/~joemiro/marvel.html
    Description: A fun Marvel Comics character collaboration graph constructed by Cesc Rosselló, Ricardo Alberich, and Joe Miro from the University of the Balearic Islands. The Marvel Universe, that is, the artificial world that takes place in the universe of the Marvel comic books, is an example of a social collaboration network. They compare the characteristics of this universe to real-world collaboration networks, such as the Hollywood network, or the one created by scientists who work together in producing research papers. See arxiv.org.
    The Marvel Universe is an artificial social network that pretends to imitate a real social graph, but is closer to a real social graph than one might expect. This data, and a following analysis of how the graph has grown, can be used to contrast and refine the models for the social graphs that have been used to date, the ones that later on will imprint subjects as dissimilar as epidemiology or security.
  6. Material Safety Data Sheets
    Size: 3 GB
    License: Public Domain
    Created On: April 1, 2011 12:00 AM GMT
    Last Updated: April 1, 2011 12:00 AM GMT
    Source: hazard.com
    Description: Over 230,000 material safety data sheets for various products in plain text format. Information includes chemical components, first aid measures, storage and handling, and more.
  7. Million Song Dataset
    Size: 500 GB
    Created On: February 8, 2011 12:00 AM GMT
    Last Updated: February 8, 2011 12:00 AM GMT
    Source: http://labrosa.ee.columbia.edu/millionsong/
    Description: The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:
    - To encourage research on algorithms that scale to commercial sizes
    - To provide a reference dataset for evaluating research
    - As a shortcut alternative to creating a large dataset with The Echo Nest’s API
    - To help new researchers get started in the MIR field
  8. Million Song Sample Dataset
    Size: 5 GB
    Created On: February 8, 2011 12:00 AM GMT
    Last Updated: February 8, 2011 12:00 AM GMT
    Source: http://labrosa.ee.columbia.edu/millionsong/
    Description: This is a 10,000 song sample from the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks.
  9. Model Organism Encyclopedia of DNA Elements (modENCODE)
    Size: 5 TB
    Created On: April 19, 2012 9:25 PM GMT
    Last Updated: April 24, 2012 9:18 PM GMT
    Description: The modENCODE data set (~5 TB total) is a comprehensive encyclopedia of genomic functional elements in the model organisms C. elegans (a simple worm) and D. melanogaster (the fruit fly). The modENCODE consortium is formed by 11 primary projects, divided between worm and fly, spanning the domains of gene structure, mRNA and ncRNA expression profiling, transcription factor binding sites, histone modifications and replacement, chromatin structure, DNA replication initiation and timing, and copy number variation. The raw and interpreted data from this project is vetted by a data coordinating center (DCC) to ensure consistency and completeness.
  10. NASA NEX
    Created On: November 12, 2013 1:27 PM GMT
    Last Updated: November 12, 2013 1:27 PM GMT
    Source: NASA NEX
    Description: NASA NEX is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management and NASA remote-sensing data. Through NEX, users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows and results within and among other science communities.
    Three NASA NEX data sets are now available to all via Amazon S3. One data set, the NEX downscaled climate simulations, provides high-resolution climate change projections for the 48 contiguous U.S. states. The second data set, provided by the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on NASA’s Terra and Aqua satellites, offers a global view of Earth’s surface every 1 to 2 days. Finally, the Landsat data record from the U.S. Geological Survey provides the longest existing continuous space-based record of Earth’s land.

Amazon Public Datasets, Part 3

Continuing the exploration of publicly available datasets from Amazon (Public Data Sets : Amazon Web Services). Below are datasets 21 to 30, with brief descriptions (alphabetically sorted).

  1. Federal Reserve Economic Data – Fred
    Size: 1 GB
    Created On: May 5, 2009 12:40 AM GMT
    Last Updated: June 5, 2009 10:50 PM GMT
    Source: http://research.stlouisfed.org/fred2/
    Description: The Federal Reserve Economic Data (FRED) provides over twenty thousand time series of diverse US economic data, such as banking, interest rates, consumer price index and GDP. The data is available in csv, txt and xls in the same data set.
  2. Freebase Data Dump
    Size: 26GB
    Created On: April 7, 2009 6:38 PM GMT
    Last Updated: June 4, 2009 8:22 PM GMT
    Source: Freebase
    Description: A data dump of all the current facts and assertions in the Freebase system. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations — all reconciled and freely available. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.
  3. Freebase Quad Dump
    Size: 35 GB
    License: CC-BY
    Created On: June 24, 2011 6:04 PM GMT
    Last Updated: June 24, 2011 6:04 PM GMT
    Source: Freebase.com
    Description: A data dump of all the current facts and assertions in Freebase. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations — all reconciled and freely available. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.
  4. Freebase Simple Topic Dump
    Size: 5 GB
    License: CC-BY
    Created On: June 24, 2011 6:08 PM GMT
    Last Updated: June 24, 2011 6:08 PM GMT
    Source: Freebase.com
    Description: A data dump of the basic identifying facts about every topic in Freebase. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations — all reconciled and freely available. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.
  5. GenBank
    Size: 200GB
    Created On: March 25, 2009 8:15 PM GMT
    Last Updated: December 9, 2009 2:49 AM GMT
    Source: National Center for Biotechnology Information (NCBI)
    Description: GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30). There are approximately 85,759,586,764 bases in 82,853,685 sequence records in the traditional GenBank divisions and 108,635,736,141 bases in 27,439,206 sequence records in the WGS division as of February 2008.
  6. Google Books Ngrams
    Size: 2.2 TB
    Created On: January 5, 2011 6:11 PM GMT
    Last Updated: January 21, 2012 2:12 AM GMT
    Source: Google Books
    Description: A data set containing Google Books n-gram corpuses. This data set is freely available on Amazon S3 in a Hadoop friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License. The original dataset is available from http://books.google.com/ngrams/.
  7. Human Liver Cohort
    Size: 625 MB
    Created On: September 8, 2010 8:55 PM GMT
    Last Updated: September 8, 2010 8:56 PM GMT
    Source: Sage Bionetworks
    Description: Human Liver Cohort characterizing gene expression in liver samples with genotypes available for downloading from dbGAP (subject to human subjects protection approval). Clinical phenotypes are available in the form of Cytochrome P450 enzyme measurements.
  8. Human Microbiome Project
    Size: 5 TB
    Created On: September 25, 2013 9:29 PM GMT
    Last Updated: September 26, 2013 5:58 PM GMT
    Source: Human Microbiome Project
    Description: The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital track, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions. More information about the HMP is available at the NIH common fund.
  9. Illumina – Jay Flatley Human Genome Data Set
    Size: 350 GB
    Created On: January 13, 2010 7:21 AM GMT
    Last Updated: January 20, 2010 9:54 PM GMT
    Source: Illumina
    Description: This data set contains the raw export files of the first genome sequenced by Illumina Individual Genome Service using Illumina’s Genome Analyzer technology of paired 75-base reads. 92,254,659,274 bases were used to generate a consensus sequence with coverage of 32x average depth. The genome was obtained via peripheral blood of Jay Flatley, CEO of Illumina.
  10. Influenza Virus (including updated Swine Flu sequences)
    Size: 1GB
    Created On: April 29, 2009 12:20 AM GMT
    Last Updated: June 4, 2009 8:18 PM GMT
    Source: NCBI
    Description: This data set includes database and sequence data from the NIAID Influenza Genome Sequencing Project and GenBank. For more information on this data set refer to the NCBI Influenza Virus Resource.

Amazon Public Datasets, Part 2

Continuing the exploration of publicly available datasets from Amazon (Public Data Sets : Amazon Web Services). Below are datasets 11 to 20, with brief descriptions (alphabetically sorted).

  1. C57BL/6J by C3H/HeJ Mouse Cross (Sage Bionetworks)
    Size: 970 MB
    Created On: September 8, 2010 8:53 PM GMT
    Last Updated: September 8, 2010 8:53 PM GMT
    Source: Sage Bionetworks
    Description: C57BL/6J by C3H/HeJ mouse cross from the Jake Lusis lab at UCLA with clinical measurements on muscle, fat, brain, and liver tissue. Genotypes, gene expression, and clinical phenotypes are available, as well as network data.
  2. Common Crawl Corpus
    Size: 541 TB
    Created On: February 15, 2012 2:23 AM GMT
    Last Updated: March 17, 2014 5:51 PM GMT
    Source: Common Crawl Foundation - http://commoncrawl.org
    Description: Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.
    The most current crawl data sets include three different types of files: Raw Content, Text Only, and Metadata. The data sets from before 2012 contain only Raw Content files.
    For more details about the file formats and directory structure please see: New Crawl Data Available! | CommonCrawl.
    Common Crawl provides the glue code required to launch Hadoop jobs on Amazon Elastic MapReduce that can run against the crawl corpus residing in the Amazon Public Data Sets. By utilizing Amazon Elastic MapReduce to access the S3 resident data, end users can bypass costly network transfer costs.
    Common Crawl’s Hadoop classes and other code can be found in its GitHub repository.
  3. Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD)
    Size: 20GB
    Created On: August 22, 2009 6:15 PM GMT
    Last Updated: September 29, 2009 12:48 AM GMT
    Source: National Climatic Data Center (NCDC)
    Description: Data originally collected as part of the Global Surface Summary of Day (GSOD) by the National Climatic Data Center (NCDC). Data collected, transformed, and uploaded by Infochimps.org.
    Global summary of day data for 18 surface meteorological elements are derived from the synoptic/hourly observations contained in USAF DATSAV3 Surface data and Federal Climate Complex Integrated Surface Data (ISD). Historical data are generally available for 1929 to the present, with data from 1973 to the present being the most complete. For some periods, one or more countries’ data may not be available due to data restrictions or communications problems. In deriving the summary of day data, a minimum of 4 observations for the day must be present (allows for stations which report 4 synoptic observations/day). Since the data are converted to constant units (e.g, knots), slight rounding error from the originally reported values may occur (e.g, 9.9 instead of 10.0).
    The mean daily values described below are based on the hours of operation for the station. For some stations/countries, the visibility will sometimes ‘cluster’ around a value (such as 10 miles) due to the practice of not reporting visibilities greater than certain distances. The daily extremes and totals—maximum wind gust, precipitation amount, and snow depth—will only appear if the station reports the data sufficiently to provide a valid value. Therefore, these three elements will appear less frequently than other values. Also, these elements are derived from the stations’ reports during the day, and may comprise a 24-hour period which includes a portion of the previous day. The data are reported and summarized based on Greenwich Mean Time (GMT, 0000Z — 2359Z) since the original synoptic/hourly data are reported and based on GMT.
    As for quality control (QC), the input data undergo extensive automated QC to correctly ‘decode’ as much of the synoptic data as possible, and to eliminate many of the random errors found in the original data. Then, these data are QC’ed further as the summary of day data are derived. However, we expect that a very small % of the errors will remain in the summary of day data.
    The data are strictly ASCII, with a mixture of character data, real values, and integer values.
    Please see the README.txt, country-list.txt, and ish-history.txt files for more information on how to interpret weather measurements.
    **This data set can only be used within the United States. If you redistribute any of these data to others, you must include this same notification.**
  4. DBpedia 3.5.1
    Size: 17GB
    Created On: April 7, 2009 6:33 PM GMT
    Last Updated: August 10, 2010 3:23 PM GMT
    Source: http://dbpedia.org/
    Description: The DBpedia knowledge base currently describes more than 3.4 million things, out of which 1.5 million are classified in a consistent Ontology, including 312,000 persons, 413,000 places, 94,000 music albums, 49,000 films, 15,000 video games, 140,000 organizations, 146,000 species and 4,600 diseases. The DBpedia data set features labels and abstracts for these 3.4 million things in up to 92 different languages; 841,000 links to images and 5,081,000 links to external web pages; 9,393,000 external links into other RDF datasets, 565,000 Wikipedia categories, and 75,000 YAGO categories. The DBpedia knowledge base altogether consists of over 1 billion pieces of information (RDF triples) out of which 257 million were extracted from the English edition of Wikipedia and 766 million were extracted from other language editions.
  5. Denisova Genome
    Size: 159 GiB
    License: Following the Ft. Lauderdale principles on Community Resource Projects (see details below)
    Created On: February 6, 2012 8:00 AM GMT
    Last Updated: February 15, 2012 2:22 AM GMT
    Source: The Max Planck Institute for Evolutionary Anthropology
    Description: The high-coverage genome sequence of a Denisovan individual sequenced to ~30x coverage on the Illumina platform. Together with their sister group the Neandertals, Denisovans are the most closely related extinct relatives of currently living humans.
  6. Enron Email Data
    Size: 210 GB
    License: CC BY 3.0
    Created On: January 1, 1970 12:00 AM GMT
    Last Updated: February 15, 2012 2:26 AM GMT
    Source: Federal Energy Regulatory Commission (FERC)
    Description: Enron email data publicly released as part of FERC’s Western Energy Markets investigation converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The email is provided in Microsoft PST, IETF MIME, and EDRM XML formats.
  7. Ensembl – FASTA Database Files
    Size: 100GB
    Created On: June 8, 2009 11:33 PM GMT
    Last Updated: October 1, 2009 10:34 PM GMT
    Source: EMBL – EBI and the Wellcome Trust Sanger Institute
    Description: FASTA database files are sequence databases of transcript and translation models predicted by the Ensembl analysis and annotation pipeline, as well as by ab initio methods.
  8. Ensembl Annotated Human Genome Data (FASTA Release 73)
    Size: 210 GB
    Created On: May 24, 2010 5:59 PM GMT
    Last Updated: October 8, 2013 2:37 PM GMT
    Source: EMBL – EBI and the Wellcome Trust Sanger Institute
    Description: This data set provides scientists with the opportunity to research and understand this important area of biology. These snapshots include all the databases that are available at http://www.ensembl.org.
  9. Ensembl Annotated Human Genome Data (MySQL Release 73)
    Size: 210 GB
    Created On: April 6, 2009 8:31 PM GMT
    Last Updated: October 8, 2013 2:38 PM GMT
    Source: EMBL – EBI and the Wellcome Trust Sanger Institute
    Description: This data set provides scientists with the opportunity to research and understand this important area of biology. These snapshots include all the databases that are available at http://www.ensembl.org.
  10. Federal Contracts from the Federal Procurement Data Center (USASpending.gov)
    Size: 180GB
    Created On: April 21, 2009 8:43 PM GMT
    Last Updated: June 4, 2009 8:23 PM GMT
    Source: USASpending.gov
    Description: This data set is a dump from the Federal Procurement Data Center (FPDC), which manages the Federal Procurement Data System (FPDS-NG). FPDS-NG collects and disseminates procurement data — or information about contracts that the federal government gives to private companies. The FPDS-NG summarizes who bought what, from whom, and where. See https://www.fpds.gov.

Amazon Public Datasets, Part 1

Let’s explore publicly available datasets from Amazon. Here is a link to the page: Public Data Sets : Amazon Web Services. Below I’ll list datasets 1 to 10 from Amazon, with brief descriptions (alphabetically sorted).

  1. 1000 Genomes Project
    Size: 200 TB
    Created on: October 17, 2010 9:59 PM GMT
    Updated on: July 18, 2012 4:34 PM GMT
    Source: National Center for Biotechnology Information (NCBI)
    Description: The 1000 Genomes Project aims to build the most detailed map of human genetic variation, ultimately with data from the genomes of over 2,600 people from 26 populations around the world. The data contained within this release include results from sequencing the DNA of approximately the first 1,700 of over 2,600 people; the remaining samples are expected to be sequenced in 2012 and the data will be released to researchers as soon as possible. The data presented here, over 200 TB, is intended for use in analysis on Amazon EC2 or Elastic MapReduce, rather than for download.
  2. 1980 US Census
    Size: 5 GB
    Created On: April 2, 2009 5:42 PM GMT
    Last Updated: June 4, 2009 8:24 PM GMT
    Source: The US Census Bureau
    Description: Data from the 1980 US Census from the US Census Bureau
  3. 1990 US Census
    Size: 50GB
    Created On: April 2, 2009 12:42 AM GMT
    Last Updated: June 4, 2009 8:24 PM GMT
    Source: The US Census Bureau
    Description: Data from the 1990 US Census from the US Census Bureau
  4. 2000 US Census
    Size: 200GB
    Created On: April 1, 2009 11:33 PM GMT
    Last Updated: June 4, 2009 8:25 PM GMT
    Source: The US Census Bureau
    Description: Data from the 2000 US Census from the US Census Bureau
  5. 2003-2006 US Economic Data
    Size: 220GB
    Created On: April 8, 2009 12:31 AM GMT
    Last Updated: June 4, 2009 8:25 PM GMT
    Source: The US Census Bureau
    Description: US Economic Data for 2003-2006 from the US Census Bureau
  6. 2008 TIGER/Line Shapefiles
    Size: 125 GB
    Created On: April 17, 2009 4:50 PM GMT
    Last Updated: June 4, 2009 8:26 PM GMT
    Source: The US Census Bureau
    Description: This data set is a complete set of Census 2000 and Current shapefiles for American states, counties, subdivisions, districts, places, and areas. The data is available as shapefiles suitable for use in GIS, along with their associated metadata. The official source of this data is the US Census Bureau, Geography Division.
  7. 3D Version of the PubChem Library
    Size: 70GB
    Created On: April 1, 2009 7:12 PM GMT
    Last Updated: June 4, 2009 8:21 PM GMT
    Source: Rajarshi Guha at Indiana University / NCBI
    Description: This data set is a 3D Version of the PubChem Library. PubChem provides information on the biological activities of small molecules. It is a component of NIH’s Molecular Libraries Roadmap Initiative.
  8. AnthroKids – Anthropometric Data of Children
    Size: 1GB
    Created On: April 9, 2009 3:07 AM GMT
    Last Updated: June 4, 2009 8:19 PM GMT
    Source: Sandy Ressler – NIST
    Description: This data set includes the results of two studies which collected anthropometric data of children. The studies, conducted in 1975 and 1977, are available in a number of different formats. These studies were the result of a Consumer Product Safety Commission (CPSC) effort in the mid-seventies. The creation of a publicly accessible database is the result of a joint effort between the Information Technology Laboratory (ITL) at the National Institute of Standards and Technology (NIST) and the CPSC. Partial sponsorship came from the Systems Integration for Manufacturing Applications (SIMA) project at NIST.
  9. Apache Software Foundation Public Mail Archives
    Size: 200 GB
    Created On: August 15, 2011 10:00 PM GMT
    Last Updated: August 15, 2011 10:00 PM GMT
    Source: The Apache Software Foundation (http://www.apache.org)
    Description: A collection of all publicly available mail archives from the Apache Software Foundation (ASF), taken on July 11, 2011. This collection contains all publicly available email archives from the ASF’s 80+ projects (http://mail-archives.apache.org/mod_mbox/), including mailing lists such as Apache HTTPD Server, Apache Tomcat, Apache Lucene and Solr, Apache Hadoop and many more. Generally speaking, most projects have at least three lists: user, dev and commits, but some have more, some have fewer. The user lists are where users of the software ask questions on usage, while the dev list usually contains discussions on the development of the project (code, releases, etc.). The commit lists usually consist of automated notifications sent by the various ASF version control tools, like Subversion or CVS, and contain information about changes made to the project’s source code.
    Both tarballs and per project sets are available in the snapshot. The tarballs are organized according to project name. Thus, a-d.tar.gz contains all ASF projects that begin with the letters a, b, c or d, such as abdera.apache.org. Files within the project are usually gzipped mbox files.
  10. US Business and Industry Summary Data
    Size: 15GB
    Created On: April 8, 2009 12:35 AM GMT
    Last Updated: June 4, 2009 8:24 PM GMT
    Source: The US Census Bureau
    Description: Business and Industry Summary Data from the US Census Bureau

Nanu Nanu!

This is yet another software developer blog. I’ll post here mostly boring and annoying stories about making some shit for computers. Maybe one day this will get me hired at a higher salary (to be realistic, that will never happen =)) ). After that I’ll have no reason to post to this blog. Hah-hah-hah!