Code Vigorous

Dustin J. Mitchell


22 Apr 2008

Some time back, I had been thinking of trying to trace the “social network” of open source developers. I’m interested questions like

  • How many open-source developers contribute significantly to multiple projects? How many projects?

  • Is the set of developers reasonably well-connected by the “works on a project with” relation?

  • What proportion of projects are basically single-developer projects?

  • What is the distribution of number of developers per project,and relative levels of contribution from those developers? How uniqueis the Linux kernel in having a large set of substantial contributors?

  • How many open-source projects are there (above some level of completeness and activity)?

Long story short, I had grand plans for parsing VC histories, ChangeLogs, AUTHORS files, and so on, but never got around to actually coding anything. Then I found Ohloh. They’ve taken the liberty of indexing a lot of publicly available code, and have some web-based knobs that you can tweak to clean up some of their data. They have REST API available to access their data, and some of their code (their SLOC counter) is open source. I’m pretty impressed.

I will probably try out the API and see if I can answer some of these questions, but I worry that their database is quite dirty. They rely on individual users to log in to the site and “claim” their identity – connecting e.g., their SourceForge login to their login to their logins on project-specific sites. I would like to see them apply some additional intelligence to matching users up; for example, email addresses appear in lots of locations (ChangeLogs, AUTHORS, commit messages), and provide a fairly reliable proxy for a particular person.