
I am looking to collect databases from real businesses and business-like entities, including those that have failed or otherwise become "past-tense". Read on if you or someone you know might have access to such things.

Background: I'm a data engineer with about 16 years in the industry under my belt. Something that has always frustrated me about the way we design and build systems is how poorly knowledge diffuses through the industry: we don't study what we do, and we especially don't study our failures.

As an example, the 2010s witnessed the full hype cycle (rise and fall) of "NoSQL" databases, such as MongoDB, Cassandra, DynamoDB, Riak, Aerospike, and many others. Did they turn out to be any good? Individually, in local circumstances, some engineers know the answer, or at least an answer. Collectively, we have no idea. This knowledge only spreads as the primary sources write blog posts (mostly terrible), or move on to new jobs and tell stories (distorted by all sorts of biases). What we should be doing is studying what was actually built, out in the open, where everyone can see it if they're interested.

Additionally, I find it very difficult to teach other engineers about data systems, in a scalable way, without open example material. There are many online courses in SQL and things of that nature, but they always deal with trivially small, trivially clean data sets, without any of the richness or messiness of Real World Data. Many years ago, my own skill in dealing with data grew by leaps and bounds the instant I was exposed to real business data and asked to solve real business problems with it.

To these ends, I am looking to collect real business data sets. I use the term "business" loosely, in the same sense that engineers often say "business logic". Non-profits, community efforts, and personal side projects all count. The key thing I'm after is custom-built databases, meaning they either started from a blank MySQL/Postgres/MongoDB/etc. instance or heavily customized an off-the-shelf system like WordPress or Salesforce.

I recognize there are thorny issues here with respect to intellectual property and personal data privacy. I do not expect anyone to just hand over a database and wish me well. We would have to work something out, whether that's an NDA, or thorough anonymization, or whatever.

In any event, if you possess a data set like this, and might be willing to share it for research purposes, please reply here and we can figure out how to connect and discuss.



An interesting, untapped source of data might be past legal cases where databases were made public in the process of discovery.

For example, during the electronic discovery (e-Discovery) process, the litigants may provide CSV files, which could then be processed into a more structured format. The Enron email dataset is one example: https://www.cs.cmu.edu/~enron/
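As a rough sketch of what "processing into a more structured format" could look like, here is one way to load a CSV export into a SQLite table using only the Python standard library. The column names, the sample rows, and the `load_csv_into_sqlite` helper are all made up for illustration; real discovery exports will have their own layouts and will likely need type handling and cleanup beyond this.

```python
import csv
import io
import sqlite3

# Hypothetical sample data; a real export would be read from a file instead.
SAMPLE = (
    "sender,recipient,subject\n"
    "alice@example.com,bob@example.com,Q3 forecast\n"
    "bob@example.com,alice@example.com,Re: Q3 forecast\n"
    "broken-row-with-too-few-fields\n"
)


def quote_identifier(name):
    """Quote a table or column name so odd characters in headers are safe."""
    return '"' + name.replace('"', '""') + '"'


def load_csv_into_sqlite(csv_text, conn, table):
    """Create `table` from the CSV header and insert every well-formed row.

    All columns are stored as TEXT. Returns (rows_loaded, rows_skipped),
    where skipped rows are those whose field count does not match the header.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    columns = ", ".join(f"{quote_identifier(h)} TEXT" for h in header)
    conn.execute(f"CREATE TABLE {quote_identifier(table)} ({columns})")

    good_rows, skipped = [], 0
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) == len(header):
            good_rows.append(row)
        else:
            skipped += 1  # keep a count so messy input is not silently lost

    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f"INSERT INTO {quote_identifier(table)} VALUES ({placeholders})",
        good_rows,
    )
    conn.commit()
    return len(good_rows), skipped


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    loaded, skipped = load_csv_into_sqlite(SAMPLE, connection, "messages")
    print(f"loaded={loaded} skipped={skipped}")
    for sender, subject in connection.execute(
        "SELECT sender, subject FROM messages ORDER BY sender"
    ):
        print(sender, subject)
```

Counting malformed rows instead of dropping them quietly seems worthwhile here, since the messiness is part of what makes this kind of data interesting to study.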

If you can somehow negotiate rights to make these databases publicly available, it might be a good idea to donate/upload the data to the Internet Archive or some universities for posterity: https://archive.org/



