Now we publish data snapshots
The OpenHatch project has an ongoing commitment to transparency and hackability. From the first day the website was live, you have been able to get the source code. Now, thanks to Karen Rustad’s work, you can download snapshots of the OpenHatch database (except some private user data).
This goes beyond our obligations under the Affero GPL. Read on for why the data matters.
Web app developers need more than just code
When you get your own instance of the OpenHatch code running, you have a fully functional version of the website, but with an empty database and only one user: yourself.
With an empty database:
- If you want to work on the bug importer, you have to start from scratch and download data from hundreds of projects’ bug trackers.
- Every page on your development environment is empty. If you want to tackle performance issues on some slow pages, like the maps of people, you won’t be able to reproduce the slow load time.
- If you want to improve the user interface for large, complex pages, you won’t be able to experiment with different representations of the data.
I personally used to work around this issue by copying the live database onto my development machine, private data and all. But as we get more contributors who don’t have access to that, it becomes increasingly important to provide the tools they need to identify and fix problems with the code.
So now we publish snapshots
As of this morning, we periodically publish a snapshot of the public data on the website. The data snapshot includes the public portions of the thousands of user profiles on the site. This is possible thanks to Karen Rustad, who responded to the request for help and wrote a substantial amount of the code that prevents private data from leaking into a snapshot.
You can learn more at the Importing a data snapshot page on our wiki. That page shows you how to get a snapshot and how to import it. If you find that the snapshots don’t work as you’d expect, do file a bug.
From a technical standpoint, there is a series of data whitelists: certain tables and certain columns are considered safe to share. A few require special scrubbing. Our automated tests verify that private data does not leak through into these snapshot files.
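To make the whitelist idea concrete, here is a minimal sketch in Python. It is not the actual OpenHatch snapshot code: the table names, column names, and the `snapshot_rows` and `assert_no_leak` helpers are all hypothetical, chosen only to illustrate the approach of exporting whitelisted columns, scrubbing a few of them, and checking the result for leaks.

```python
# Hypothetical whitelist: table name -> columns considered safe to publish.
# These names are illustrative, not the real OpenHatch schema.
SAFE_COLUMNS = {
    "profile_person": ["id", "username", "bio", "email"],
    "search_bug": ["id", "title", "status"],
}

# Hypothetical scrubbers for whitelisted columns that still need cleaning.
# Here the email column is kept in the output but its value is blanked.
SCRUBBERS = {
    ("profile_person", "email"): lambda value: "",
}


def snapshot_rows(table, rows):
    """Return copies of rows holding only whitelisted, scrubbed columns."""
    columns = SAFE_COLUMNS.get(table)
    if columns is None:
        # A table that is not on the whitelist is left out entirely.
        return []
    result = []
    for row in rows:
        clean = {}
        for column in columns:
            value = row.get(column)
            scrub = SCRUBBERS.get((table, column))
            clean[column] = scrub(value) if scrub else value
        result.append(clean)
    return result


def assert_no_leak(snapshot, private_values):
    """Fail if any known-private string appears anywhere in the snapshot."""
    dumped = repr(snapshot)
    for secret in private_values:
        assert secret not in dumped, "private value leaked: %r" % secret


if __name__ == "__main__":
    people = [{"id": 1, "username": "alice", "bio": "hi",
               "email": "alice@example.com", "password": "hunter2"}]
    exported = snapshot_rows("profile_person", people)
    assert_no_leak(exported, ["alice@example.com", "hunter2"])
    print(exported)
```

The point of a whitelist, as opposed to a blacklist, is that a newly added table or column stays out of the snapshot until someone explicitly marks it as safe.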
Openness isn’t always easy, but it’s what our users and developers deserve. What do you think — is it important to share this sort of data? How do other open source network services empower their developers?
This data has neat researchy implications. Someone should talk about those and/or actually do interesting researchy stuff with it.