Skyscraper I love you

WikiVitals: Unofficial Wikipedia Statistics

The March 8th edition of The Economist carried this article about Wikipedia. This section piqued my interest:

Mr Lih and other inclusionists worry that [the current Wikipedia administration and bureaucracy] deters people from contributing to Wikipedia, and that the welcoming environment of Wikipedia’s early days is giving way to hostility and infighting. There is already some evidence that the growth rate of Wikipedia’s article-base is slowing. Unofficial data from October 2007 suggests that users’ activity on the site is falling, when measured by the number of times an article is edited and the number of edits per month. The official figures have not been gathered and made public for almost a year, perhaps because they reveal some unpleasant truths about Wikipedia’s health.

I thought ‘perhaps because they reveal some unpleasant truths about Wikipedia’s health’ was possibly a bit strong, especially given the lack of any properly referenced statistics. Whilst the official statistics might not be regularly updated at the moment, a full export of Wikipedia is still available at regular intervals. As such it’s possible, given a bit of work parsing the data, to generate some unofficial statistics. So that’s what I set out to do.

The results are at, or read on for more information about how they were created.

The official statistics for the various Wikipedia editions can be found here. As you can see, most editions are updated regularly; it’s mainly the English edition that is falling behind. That’s perhaps not surprising if you consider the size that the English Wikipedia has grown to. If the compression ratios given on the downloads site are accurate, then a fully uncompressed version of the XML export of the English edition is going to consume about 1.5TB of space.

Fortunately for us a 7-zip version (alongside a bzip2 version) is made available for download, which makes working with the data much more feasible. Dumps of the database are provided at regular intervals; however, it takes almost a month for a full export of the English Wikipedia to be generated. At the time of writing the most recent complete export is the March export, which you can find here.
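As a rough illustration of why the compressed dumps are workable, the bzip2 version can be read line by line without ever unpacking it to disk. This is a minimal Python sketch, not the scripts used for the results here; the function name `iter_dump_lines` and the file name in the usage note are made up for the example.

```python
import bz2


def iter_dump_lines(path):
    """Yield decoded lines from a bzip2-compressed dump, one at a time.

    The file is decompressed as it is read, so the uncompressed XML
    never has to exist on disk.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Something like `for line in iter_dump_lines("stub-meta-history.xml.bz2"): ...` would then feed whatever parser you like (the exact file name on the downloads site will differ).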

There are two dumps that can be used to produce the kind of statistical reports I was looking to do: ‘pages-meta-history.xml’ and ‘stub-meta-history.xml’. I initially started working with pages-meta-history, which contains a full dump of every page. As you can imagine, this makes working with the data rather slow, and an initial parse of the data was taking about 8 hours.

The stub-meta-history alternative provides all the revision history of the articles but not the actual article contents, which makes processing about 10 times faster. However, this has the obvious side effect that no decisions can be made based on the article contents. Officially, Wikipedia only counts pages that contain at least one internal link toward the article count; as we’re only looking at stubs, this distinction isn’t made. Notably this means that redirects are not distinguished from full articles, which increases the total page count dramatically, although, one would hope, proportionately.

To ease the development of the parsing scripts I used the data exports for the Simple English Wikipedia until I was happy that everything was working as expected. These have the advantage of being much smaller but still being in English. I’d highly recommend starting with these if you’re interested in taking a look at how the data export processes work; they’re much easier to work with.

The first step in getting the data into a workable format was to parse it into individual month ‘buckets’. The idea is that all revisions taking place in a specific month are added to that month’s ‘bucket’. This means that when it comes to generating statistics the generator only has to concern itself with the current ‘bucket’ (and potentially any retained information from previous months), rather than having to parse the whole article space each time.
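The bucketing idea can be sketched in a few lines of Python. This is only an illustration under assumptions: it takes revisions as (title, contributor, timestamp) tuples with ISO 8601 timestamps such as 2007-03-14T09:26:53Z, keeps the buckets in memory rather than on disk, and the function names are invented for the example.

```python
from collections import defaultdict


def month_key(timestamp):
    """Return the 'YYYY-MM' prefix of an ISO 8601 timestamp."""
    return timestamp[:7]


def bucket_revisions(revisions):
    """Group (title, contributor, timestamp) tuples by calendar month."""
    buckets = defaultdict(list)
    for title, contributor, timestamp in revisions:
        buckets[month_key(timestamp)].append((title, contributor, timestamp))
    return buckets


# Hypothetical sample data, not taken from a real dump.
revisions = [
    ("Perl", "Alice", "2007-03-14T09:26:53Z"),
    ("Perl", "Bob", "2007-03-20T11:00:00Z"),
    ("Python", "Alice", "2007-04-01T00:00:01Z"),
]
buckets = bucket_revisions(revisions)
```

For a dump the size of the English Wikipedia each bucket would more realistically be a file on disk that is appended to, but the grouping logic is the same.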

The spec for the XML export can be found here. Initially I wrote a parser using XML::Parser. Whilst this has the advantage of robustness, being a fully fledged XML parser, it has the disadvantage of being rather slow: a full parse of the English Wikipedia with XML::Parser would probably take a good few weeks. After ditching that plan I replaced it with a simple line parser. Fortunately the XML data provided is laid out consistently, with one element per line, making a line parser straightforward.
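To give a flavour of what a line parser looks like, here is a minimal Python sketch (not the original script). It assumes each of the `<title>`, `<timestamp>`, `<username>` and `<ip>` elements sits on a line of its own, as in the export format, and it ignores every other element. The function name and the sample data are made up for the example.

```python
import re
from html import unescape

# Matches a complete single-line element such as "<title>Perl</title>".
TAG = re.compile(r"^\s*<(title|timestamp|username|ip)>(.*?)</\1>\s*$")


def parse_revisions(lines):
    """Yield (title, contributor, timestamp, is_anonymous) per revision."""
    title = timestamp = contributor = None
    anonymous = False
    for line in lines:
        stripped = line.strip()
        match = TAG.match(line)
        if match:
            name, value = match.group(1), unescape(match.group(2))
            if name == "title":
                title = value
            elif name == "timestamp":
                timestamp = value
            elif name == "username":
                contributor, anonymous = value, False
            else:  # an <ip> element marks an anonymous edit
                contributor, anonymous = value, True
        elif stripped == "<revision>":
            # Reset per-revision state; the page title carries over.
            timestamp, contributor, anonymous = None, None, False
        elif stripped == "</revision>":
            yield title, contributor, timestamp, anonymous


# Hypothetical sample in the shape of the export format.
sample = """\
<page>
  <title>Perl</title>
  <revision>
    <timestamp>2007-03-14T09:26:53Z</timestamp>
    <contributor>
      <username>Alice</username>
    </contributor>
  </revision>
  <revision>
    <timestamp>2007-04-01T00:00:01Z</timestamp>
    <contributor>
      <ip>192.0.2.1</ip>
    </contributor>
  </revision>
</page>
""".splitlines()

parsed = list(parse_revisions(sample))
```

The trade-off is the one described above: this breaks the moment the layout changes, where a real XML parser would not, but it only does a regular expression match per line.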

When parsing the articles into the monthly ‘buckets’ a further division is made between those articles in the main namespace and those that aren’t. Whilst interesting stats could no doubt be generated from the pages located outside the main namespace, they are not processed any further at the moment. The namespaces excluded include:

  • Talk
  • User & User talk
  • Wikipedia & Wikipedia talk (used for discussion of Wikipedia policy)
  • Category & Category talk
  • Template & Template talk
  • Image & Image talk
  • List & List talk
  • Help & Help talk
  • MediaWiki & MediaWiki talk
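
One plausible way to make that split, given that the stubs only provide page titles, is to look at the prefix before the first colon. The sketch below is Python and rests on that assumption; the set simply mirrors the list above, and `in_main_namespace` is a name invented for the example.

```python
# Prefixes taken from the list of excluded namespaces above.
EXCLUDED_NAMESPACES = {
    "Talk",
    "User", "User talk",
    "Wikipedia", "Wikipedia talk",
    "Category", "Category talk",
    "Template", "Template talk",
    "Image", "Image talk",
    "List", "List talk",
    "Help", "Help talk",
    "MediaWiki", "MediaWiki talk",
}


def in_main_namespace(title):
    """Return True unless the title starts with an excluded 'Namespace:' prefix."""
    prefix, separator, _ = title.partition(":")
    return not (separator and prefix in EXCLUDED_NAMESPACES)
```

Titles that merely contain a colon, such as "2001: A Space Odyssey", stay in the main namespace because their prefix is not in the set.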

Once the data had been split into the monthly ‘buckets’ it was straightforward to run some simple reports against it. Currently reports are generated for:

  • The total number of articles at the end of the month
  • The number of new articles each month
  • The total number of edits each month
  • The average number of edits made per article each month
  • The number of articles receiving at least one edit per month
  • The total number of Wikipedians making 10 or more edits since they joined
  • The number of new Wikipedians making their 10th edit each month
  • The number of Wikipedians making at least one edit in the month
  • The number of Wikipedians making at least five edits in the month
  • The number of Wikipedians making 100 or more edits in the month

When talking about ‘Wikipedians’ we’re only concerning ourselves with logged-in users. Edits made anonymously are not tracked for those reports; however, they are included in the total edit counts.
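
Several of the reports above fall out of simple counting over a month’s bucket. The Python sketch below is an illustration, not the original scripts. It assumes each bucket is a list of (title, contributor, is_anonymous) tuples, and it reads “average edits per article” as edits divided by the number of articles edited that month, which is only one possible interpretation. All names are invented for the example.

```python
from collections import Counter


def monthly_report(bucket):
    """Summarise one month's bucket of (title, contributor, is_anonymous)."""
    edits_per_article = Counter(title for title, _, _ in bucket)
    # Only logged-in users count as Wikipedians.
    edits_per_user = Counter(user for _, user, anon in bucket if not anon)
    total_edits = len(bucket)  # anonymous edits are included here
    edited = len(edits_per_article)
    return {
        "total_edits": total_edits,
        "articles_edited": edited,
        "edits_per_edited_article": total_edits / edited if edited else 0.0,
        "wikipedians_1_plus": len(edits_per_user),
        "wikipedians_5_plus": sum(1 for n in edits_per_user.values() if n >= 5),
        "wikipedians_100_plus": sum(1 for n in edits_per_user.values() if n >= 100),
    }


def new_wikipedians(buckets_in_order, threshold=10):
    """Count, per month, the users whose lifetime edit count reaches threshold.

    buckets_in_order is a list of (month, bucket) pairs in date order; the
    lifetime counter is the 'retained information from previous months'.
    """
    lifetime = Counter()
    result = []
    for month, bucket in buckets_in_order:
        crossed = 0
        for _, user, anon in bucket:
            if anon:
                continue
            lifetime[user] += 1
            if lifetime[user] == threshold:
                crossed += 1
        result.append((month, crossed))
    return result


# Hypothetical sample data.
march = [("Perl", "Alice", False)] * 5 + [
    ("Python", "Bob", False),
    ("Perl", "192.0.2.1", True),
]
april = [("Perl", "Bob", False)]
report = monthly_report(march)
newcomers = new_wikipedians([("2007-03", march), ("2007-04", april)], threshold=2)
```

The running total of Wikipedians with 10 or more edits is then just the cumulative sum of the per-month figures from `new_wikipedians`.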

As mentioned above, we’re using the stubs to generate the stats, so the total article count is much higher than the official count.

Some extra reports I’d like to generate when I have some spare time include:

  • Percentage of edits coming from each decile of active Wikipedians
  • Number of deleted articles, using information from the logging.sql export
  • Average number of edits per article
  • Inactive Wikipedians, i.e. the number of Wikipedians who have contributed more than, say, 100 edits but who haven’t contributed for 6 months or so
  • Please comment if there’s anything else you’d like to see…

Once the data was analysed I used GD::Graph to produce the graphs. Whilst this is nice and easy to use, it doesn’t generate the most attractive graphs in the world. I came across the Google Chart API the other day, which looks like a nice easy way to generate charts with simple HTTP requests. If nothing else its charts look a bit prettier than the GD::Graph graphs, so I may rewrite with that.

As mentioned, the graphs can be found at, and clearly show that the period around March and April 2007 was a particularly popular time for Wikipedia editing, with a large number of edits from a large number of Wikipedians. After that there was a gentle decline in the number of contributing users until the end of 2007, followed by some increase in 2008. The total number of article edits has yet to hit its spring 2007 peak again. Whether this is cause for concern or just a sign of a maturing Wikipedia remains to be seen. It will be interesting to continue to track these statistics over the next year or so.

