As a researcher, I always get annoyed when I see the wanton abuse of data. Mashable’s article “Google+ Users Are Nearly All Male” is a great example of data abuse.
The data abuse starts by reporting data without noting anything about the methodology behind it. Data is meaningless if you don’t know how it was collected. For this article, they report on two different websites which claim to have analyzed Google+ user profiles. Neither of these websites says that it has looked at all of the profiles, and neither notes what method it used to sample the profiles that it did analyze.
The data abuse continues by ignoring the major differences in the data returned by the two sites. One says that 86.8% of sampled profiles are male, the other says that 73.7% are male. What explains a delta of more than 10 percentage points? I can come up with possibilities, but I don’t know if any of my potential explanations are correct. Each possibility would tell us a very different story. For example, if the difference is one of time (that is, one set of data was collected earlier than the other), then we’d learn something about the early-adopter curve. If the difference is one of sampling method, then we might learn about the relative strengths of each of those sampling methods for this type of dataset.
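To make the point concrete, here is a minimal sketch of why the missing methodology matters. It computes a rough 95% interval around each reported percentage using the standard normal approximation for a proportion. The sample sizes are purely hypothetical, since neither site reported one, and the function names are mine. Checking whether two intervals overlap is only a crude, conservative comparison, not a formal test, and it assumes random sampling, which we also have no evidence for.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Approximate 95% interval for a sample proportion (normal approximation)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return (p_hat - z * se, p_hat + z * se)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Shares of male profiles reported by the two sites.
p_a, p_b = 0.868, 0.737

# HYPOTHETICAL sample sizes: neither site said how many profiles it sampled.
for n in (25, 100, 1000):
    ci_a = proportion_ci(p_a, n)
    ci_b = proportion_ci(p_b, n)
    verdict = "overlap" if intervals_overlap(ci_a, ci_b) else "do not overlap"
    print(f"n={n}: {ci_a[0]:.3f}-{ci_a[1]:.3f} vs {ci_b[0]:.3f}-{ci_b[1]:.3f} ({verdict})")
```

With only a few dozen or a hundred randomly sampled profiles per site, the two intervals overlap, so the gap could plausibly be noise. With a thousand each, they do not, and the gap would need a real explanation such as timing or sampling method. Without knowing the sample size and the sampling method, we cannot tell which situation we are in.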
What really bothers me about this breathless repeating of such statistics is that there is no attempt at analysis. If we accept that the current Google+ users skew male, is this any different from the usual early-adopter curve? Or the early-adopter curve for social media? Or the early-adopter curve for new Google applications? Data without analysis is meaningless. Reporting on the data suggests that we should care, that there is something different here. But it appears that no one has bothered to answer such basic questions about the data.
We can do better than this. Let’s stop the blind reporting of data, and instead expend some effort on analyzing the data.
Let’s see: Google+ is a technology offering that was rolled out through a limited, invitation-only process. It was much hyped in the technology blogs, and the early registration numbers seem to show a strong skew toward male users. What industry has a very high proportion of male workers and is interested in new technology? It is quite possible that the adoption numbers, if accurate, simply reflect the gender balance in software engineering fields, particularly since Google+ was first rolled out to Google employees.
On a separate but related note, NPR broadcast a story today about how more and more people are getting lost for days in Death Valley, CA. It turns out that the popular GIS datasets include road information for the valley from century-old mining maps. The GIS datasets are used to generate the maps in standard GPS devices. So people are being directed down long-abandoned and now nonexistent mining roads, or directed in circles, and end up lost or stuck. This isn’t breathless reporting of data; rather, it is use of and reliance on data without any critical evaluation of whether mining map records for a desert belong in a modern consumer device dataset. Poor GPS data has been implicated in a couple of deaths in Death Valley.
And here’s a link to that NPR story about Death Valley and GPS data: http://www.npr.org/2011/07/26/137646147/the-gps-a-fatally-misleading-travel-companion