Estimating the number of active and stable FLOSS projects
A recurring debate among FLOSS supporters and detractors concerns the real number of active FLOSS projects. While it is easy to point to the main repository site (sourceforge.net), which boasts more than 100,000 projects, it is equally easy to look in more depth and realize that a significant number of those projects are abandoned or show no significant development. How many active and stable projects are really out there?
To obtain unbiased estimates in the context of the FLOSSMETRICS project, we performed a first search among the main repository sites and FLOSS announcement portals. We set a strict activity requirement, namely an activity index between 80% and 100% and at least one file release in the last 6 months. Of the overall 155959 projects, only 10656 (6.8%) are “active” under this rather restrictive definition; a more relaxed release period of 1 year gives 14455 active projects, or 9.2%.
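The activity filter described above can be sketched in a few lines of Python. This is only an illustration: the Project record, its field names, the 183-day reading of “6 months” and the sample data are all assumptions made for the example, not the actual FLOSSMETRICS tooling or data.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Project:
    """Hypothetical project record; field names are assumptions."""
    name: str
    activity_index: float  # 0-100, as reported by the hosting site
    last_release: date     # date of the most recent file release


def is_active(project, today, min_activity=80.0, max_release_age_days=183):
    """Apply the strict criterion: high activity index and a recent release."""
    recent = (today - project.last_release) <= timedelta(days=max_release_age_days)
    return project.activity_index >= min_activity and recent


def active_share(projects, today, **criteria):
    """Return the percentage of projects that pass the activity filter."""
    if not projects:
        return 0.0
    active = sum(1 for p in projects if is_active(p, today, **criteria))
    return 100.0 * active / len(projects)


# Made-up sample data, for illustration only.
today = date(2008, 1, 1)
projects = [
    Project("alpha", 95.0, date(2007, 11, 1)),  # active under both windows
    Project("beta", 85.0, date(2007, 3, 1)),    # active only with a 1-year window
    Project("gamma", 40.0, date(2007, 12, 1)),  # recent release, low activity
    Project("delta", 90.0, date(2005, 1, 1)),   # high activity index, stale release
]

strict = active_share(projects, today)
relaxed = active_share(projects, today, max_release_age_days=365)
print(f"strict (6 months): {strict:.1f}%")
print(f"relaxed (1 year):  {relaxed:.1f}%")
```

Widening the release window only changes one parameter, which is how the 6-month and 1-year percentages quoted above relate to each other.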
However, while SourceForge can rightly be considered the largest single repository, it is not the only potential source of projects. There are many other vertical repositories, among them BerliOS, Savannah, Gna! and others, some derived from the original version of the SourceForge code and many more based on a rewritten version called GForge. These host a total of 23948 projects, among which (using a sample of 100 projects from each site) we found a similar share of active projects (between 8% and 10%).
The next step is to estimate how many projects of the overall FLOSS landscape are hosted on those sites. For this estimate we took the entire FreshMeat announcement database, as processed by the FLOSSmole project, and found that the projects with a homepage on one of the repository sites are 23% of the total. This count is however biased, because the probability of being announced on FreshMeat is not equal for all projects: projects that are English-based and oriented towards a large audience have a much higher probability of being listed. To take this into account, we performed a search for non-English forges and for software oriented towards very specific areas, using data from past IST projects like Spirit and AMOS.
We found that non-English projects are significantly underrepresented in FreshMeat, but as the overall “business-readiness” of those projects is unclear (for example, there may be no translations available, or the software may be specific to a single country's legal environment), we have ignored them. Vertical projects are also underrepresented, especially in scientific and technical areas, where the probability of being included is around 10 times lower than for other kinds of software. By using the results from Spirit, a sample of project announcements in scientific mailing lists, and some repositories for the largest or most visible projects (like the CRAN archive, which hosts 1195 libraries and packages for the R statistics language), we reached a lower bound estimate of around 12000 “vertical” and industry-specific projects. So we have an overall lower bound estimate of around 195000 projects, of which we can estimate that 7% are active, leading to around 13000 active projects.
Of those, we can estimate (using data from Slashdot, FreshMeat and the largest GForge sites) that 36% fall in the “stable” or “mature” stage, leading to a total of around 5000 projects that can be considered suitable for an SME, that is, stable, with an active community and with recent releases. This number should be considered a lower bound, obtained with fairly severe assumptions; just extending the file release period from 6 months to one year nearly doubles the number of suitable projects. Also, this estimate does not try to assess the number of projects not listed in the announcement sites (even vertical application portals). This is deliberate: it would be difficult to estimate the reliability of such a measure, and a project's “findability” and its probability of sustained community participation are lower if it is difficult to find information on the project in the first place. Such “out of bounds” projects would therefore probably not be a good opportunity for SME adoption in any case. By using a slightly more relaxed definition of “stability”, with an activity rating between 60% and 100% and at least one release in the last year, we obtain around 18000 stable and mature projects from which to choose: not a bad result, after all.
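The chain of estimates behind the lower bound can be written out as a short calculation. The sketch below uses only the rounded figures quoted in the text (195000 projects, 7% active, 36% stable or mature); the variable names are illustrative, and the results are approximate because the inputs are themselves rounded.

```python
# Rounded figures quoted in the text.
TOTAL_PROJECTS_LOWER_BOUND = 195_000  # overall lower bound on FLOSS projects
ACTIVE_SHARE = 0.07                   # share passing the strict activity filter
STABLE_SHARE = 0.36                   # share of active projects that are stable or mature

# Step 1: projects that are active under the strict definition.
active_projects = round(TOTAL_PROJECTS_LOWER_BOUND * ACTIVE_SHARE)

# Step 2: active projects that are also stable or mature.
suitable_projects = round(active_projects * STABLE_SHARE)

print(f"active projects:   about {active_projects}")
print(f"suitable projects: about {suitable_projects}")
```

Both results land in the same range as the figures given in the text (around 13000 active and around 5000 suitable projects); the small differences come from rounding at each step.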