Estimating the number of active and stable FLOSS projects

A recurring debate among FLOSS supporters and detractors concerns the real number of active FLOSS projects. While it is easy to look at the main repository site (sourceforge.net), which boasts more than 100,000 projects, it is equally easy to look in more depth and realize that a significant number of those projects are abandoned or show no significant development. How many active and stable projects are really out there?

[Image: “Too many cereal choices” by PartsNpieces]

To obtain unbiased estimates in the context of the FLOSSMETRICS project, we performed a first search among the main repository sites and FLOSS announcement portals. We also set a strict activity requirement, namely an activity index between 80% and 100% and at least one file release in the last 6 months. Of the overall 155959 projects, only 10656 (6.8%) are “active” under this rather restrictive definition; a more relaxed release period of 1 year gives 14455 active projects, or 9.2%.
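As a rough illustration of the activity requirement described above, here is a minimal Python sketch. The `Project` record and its field names are hypothetical (they are not the FLOSSMETRICS or SourceForge data format), and treating "6 months" as 183 days is an assumption made only for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Project:
    name: str
    activity_pct: float  # hypothetical field: forge activity index, 0-100
    last_release: date   # hypothetical field: date of the latest file release


def is_active(p: Project, today: date,
              min_activity: float = 80.0, max_age_days: int = 183) -> bool:
    """Strict filter: high activity index AND a recent file release."""
    recent = (today - p.last_release) <= timedelta(days=max_age_days)
    return p.activity_pct >= min_activity and recent


def share_pct(part: int, total: int) -> float:
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Counts quoted in the post for the strict (6-month) definition.
print(share_pct(10656, 155959))
```

Passing a larger `max_age_days` (for example 365) corresponds to the more relaxed one-year release window mentioned in the text.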

However, while Sourceforge can rightly be considered the largest single repository, it is not the only potential source of projects. There are many other vertical repositories, among them BerliOS, Savannah and Gna!, some derived from the original version of the Sourceforge code and many more based on a rewritten version called GForge. Together these host a total of 23948 projects, among which (using a sample of 100 projects from each site) we found a similar share of active projects (between 8% and 10%).
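The sampling step can be sketched as a simple extrapolation: the share of active projects seen in a sample is scaled up to the full population. The function name is hypothetical, and applying the 8% to 10% range to the combined total of 23948 (rather than site by site) is a simplification for illustration only.

```python
def extrapolate_active(total_projects: int, sample_size: int,
                       active_in_sample: int) -> int:
    """Scale the active share observed in a sample up to the whole population."""
    if sample_size <= 0 or not 0 <= active_in_sample <= sample_size:
        raise ValueError("invalid sample counts")
    return round(total_projects * active_in_sample / sample_size)


# 8 to 10 active projects per sample of 100, applied to the 23948 projects.
low = extrapolate_active(23948, 100, 8)
high = extrapolate_active(23948, 100, 10)
print(low, high)
```

With samples of only 100 projects the uncertainty is considerable, so the result is best read as a range rather than a single number.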

The next step is to estimate how many projects of the overall FLOSS landscape are hosted on those sites. For this estimate we took the entire FreshMeat announcement database, as processed by the FLOSSmole project, and found that the projects with a homepage on one of the repository sites are 23% of the total. This count is, however, biased by the fact that not all projects are equally likely to be announced on FreshMeat: projects that are English-based and oriented towards a large audience have a much higher probability of being listed. To take this into account, we performed a search for non-English forges and for software oriented towards very specific areas, using data from past IST projects like Spirit and AMOS.

We found that non-English projects are significantly underrepresented in FreshMeat, but as the overall “business-readiness” of those projects is unclear (for example, there may be no translations available, or they may be specific to the legal environment of a single country), we have ignored them. Vertical projects are also underrepresented, especially those in scientific and technical areas, where the probability of being included is around 10 times lower than for other kinds of software. By using the results from Spirit, a sampling of project announcements in scientific mailing lists, and some repositories for the largest or most visible projects (like the CRAN archive, which hosts 1195 libraries and packages for the R statistics language), we reached a lower bound estimate of around 12000 “vertical” and industry-specific projects. So we have an overall lower bound estimate of around 195000 projects, of which we can estimate that 7% are active, leading to around 13000 active projects.

Of those, we can estimate (using data from Slashdot, FreshMeat and the largest GForge sites) that 36% fall in the “stable” or “mature” stage, leading to a total of around 5000 projects that can be considered suitable for an SME, that is, stable, with an active community and with recent releases. This number is a lower bound, obtained with rather severe assumptions; just extending the file release period from 6 months to one year nearly doubles the number of suitable projects. Also, this estimate does not try to assess the number of projects not listed in the announcement sites (even vertical application portals). This is deliberate: it would be difficult to estimate the reliability of such a measure, and the probability of a project attracting sustained community participation is lower if it is difficult to find information on the project in the first place. Such “out of bounds” projects would therefore probably not be a good opportunity for SME adoption in any case. By using a slightly more relaxed definition of “stability”, with an activity rating between 60% and 100% and at least one release in the last year, we obtain around 18000 stable and mature projects from which to choose; not a bad result, after all.
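The chain of estimates (total projects, then the active share, then the stable share) can be written out as a small calculation. This is only a sketch using the rounded figures quoted in the post; the function name is hypothetical, and integer percentages are used to avoid floating-point noise.

```python
def funnel(total_projects: int, active_pct: int, stable_pct: int) -> tuple[int, int]:
    """Return (active, stable) counts from a total and two integer percentages."""
    active = total_projects * active_pct // 100
    stable = active * stable_pct // 100
    return active, stable


# Rounded inputs from the post: ~195000 projects, 7% active, 36% of those stable.
active, stable = funnel(195_000, 7, 36)
print(active, stable)
```

The exact outputs differ slightly from the "around 13000" and "around 5000" quoted in the text because the post rounds at each step; the order of magnitude is what matters for a lower bound.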



8 thoughts on “Estimating the number of active and stable FLOSS projects”

  1. The activity criterion used underestimates the number of projects that provide useful software. A project may not have had a recent release because it is complete and has no known bugs, or no bugs significant enough to fix. Of course, it would be difficult to take this into account without a lot more work since it would be necessary to examine the status of each project.

  2. As mentioned in the text, this is meant to provide a lower bound on the number of available, active and stable projects. As such, we chose a very strict definition of activity, and we used each project's own declared “stability” status, even though this lowers the number of suitable projects further (there are many “beta” projects that are really stable). We have already found projects that are stable but not included in the count; an example is GNU make, which is stable but, having had no new release in one year, would not make it onto the list.
    It must be considered, however, that even projects that are more or less finished (no more bugs) may need a small recompile or modification to adapt to changing platforms and environments; in this sense, stable projects with no release in one year should be considered the exception and not the rule. Using a simple sampling approach, we estimate that those are less than 2% of our original count, so they would not raise the package count in a significant way. Our main objective was to demonstrate that the lower bound on the number of both stable and maintained packages was significant, and I believe that result was reached.
    Many thanks for your comment (and for reading the article thoroughly :-))

  3. For some of the forge sites that allow data extraction, such a list can be obtained through the FLOSSMOLE data source. For those sites that have no search functionality, or that provide only part of their database in a searchable way, statistical methods based on a sampling approach were used, and in this case no list (just the numbers) can be obtained. It is important to understand that what we were looking for was a lower bound on the number of active and stable projects, not a “final” list.

  4. Hi,
    I am currently doing research on open source firms. For a statistical model I need the number of projects registered on sourceforge year by year. Is there any way to extract this information from sourceforge?
