Biases on Social Media Data

Social media data is often used to gauge the opinion of online communities, for example by predicting sentiment or stances (e.g., political), to mention just two typical use cases. However, these analyses assume that the data sample truly represents the underlying demographics of the overall community, both in number and in characteristics, which in most cases is not true. As a result, extrapolating these results to larger populations usually does not work. This happens because social media data is inherently biased, mainly due to two facts: (1) not all people are equally active on social media platforms, and most of them are in fact passive; and (2) there are demographic biases in gender and age, among other attributes. Hence, the questions of how representative the data is and whether it can be made representative are of crucial importance. We also discuss related issues, such as using public samples of mostly private platforms, as well as typical errors in the analysis of such data.


Inequality of Access
The first challenge is that Internet penetration is very heterogeneous (from 98% in Iceland to 2% in South Sudan) and reaches just 59% of the world population.1 This mainly leaves out developing countries as well as poor people everywhere, creating an immediate wealth bias in the Internet as a data source. Indeed, penetration does not reach 40% in Africa, while it goes up to almost 95% in North America. In contrast, mobile phone penetration is larger than Internet penetration, especially in developing countries, reaching almost 67% of the Earth's population.

Social Media and Data Privacy
Social media or online social network (OSN) users are estimated at 3.8 billion, which represents almost 49% of the world population and 83% of the people with Internet access. In Table 1 we show the main online social networks, including implicit ones (chat apps) and excluding those focused on China (QQ, Qzone, Sina Weibo, and Kuaishou), with the exception of WeChat. These numbers must be considered proxies for the real numbers, as they come from different sources that do not always agree.
As we can see in Table 1, in many cases the data is not public, hence we need to assume that people who make their data public are similar to those who keep their data private. This is a very strong assumption and, to the best of our knowledge, there are no studies supporting it (such a study could only be done by each OSN itself). In addition, in many cases we sample the data through an API provided by the OSN. Hence, we also need to assume that this sample is random, which is probably not completely true, as at least some filtering might be done (e.g., of adult or hate content).
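One way to probe the randomness assumption, when a reference distribution is available, is a goodness-of-fit test comparing the demographic composition of the API sample against that reference. The sketch below uses hypothetical counts and segment labels (none of these numbers come from the text); it only illustrates the check, not any specific OSN.

```python
from scipy.stats import chisquare

# Hypothetical counts of sampled users by age segment (illustrative only).
sample_counts = [420, 310, 180, 90]        # e.g., 18-24, 25-34, 35-49, 50+
census_shares = [0.30, 0.28, 0.25, 0.17]   # assumed reference distribution

n = sum(sample_counts)
expected = [p * n for p in census_shares]

# Goodness-of-fit: does the sample follow the reference distribution?
stat, p_value = chisquare(sample_counts, f_exp=expected)

# A small p-value suggests the sample is NOT representative of the
# reference population, so reweighting (or caution) is warranted.
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")
```

Note that rejecting the null only tells us the sample is biased; it does not tell us whether the bias comes from the platform's user base or from the API's filtering.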

Biases
The first bias is activity bias [2]; that is, not all people are active all the time. Indeed, Nielsen proposed the 90-9-1 rule to convey participation inequality on the Internet [5]: in any segment of time, 90% of the people are passive (lurkers), 1% are heavily active, and 9% react to that activity. For example, in Table 1 we show monthly and daily active users when available, but the overall number of users of each OSN is in many cases much higher (e.g., for LinkedIn it is about 670 million). If we consider the percentage of active users that generate half the content, we found 7% and 2% for Facebook and Twitter, respectively [1]. These values are consistent with the 90-9-1 rule and imply that content per user is quite heterogeneous, not only in substance but also in volume.

Table 1: Main online social networks (some are implicit) and their characteristics (age ranges are partial).
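The concentration statistic above (the smallest fraction of users producing half the content) can be sketched as follows. The per-user post counts are made up to mimic a heavy-tailed activity distribution; they are not real OSN data.

```python
# A minimal sketch: what fraction of the most active users produces
# half of all content? (Counts below are illustrative, not real data.)

def fraction_for_half(posts):
    """Smallest fraction of users (most active first) whose posts
    account for at least 50% of the total volume."""
    counts = sorted(posts, reverse=True)
    total = sum(counts)
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        if running * 2 >= total:
            return i / len(counts)
    return 1.0

# 100 users: a few heavy posters and a long tail of near-lurkers.
posts = [500, 300, 200, 100, 50] + [10] * 20 + [1] * 75
print(fraction_for_half(posts))  # -> 0.02
```

With this synthetic distribution, 2% of users account for half the content, the same order of magnitude as the Twitter figure cited above.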
The second bias is demographic: gender, age, and other characteristics do not necessarily reflect the general population. As we can see in Table 1, some social media platforms are dominated by women while others are dominated by men (the numbers represent an informed guess, as in some cases there is no consensus among the different sources). Similarly, some OSNs are dominated by young people while others are dominated by adults over 35 years old.

Discussion
Demographic bias can be mitigated by segmenting users by gender and age range. To do this, we need to build classifiers for gender and age-range prediction, using users who self-report those attributes as training data. Then, we must compare the percentage of users in each segment with the latest census of the country where we are doing the analysis, to know how to weight each segment so that we recover its real representation. We did this in an analysis of Twitter data during the passage of a controversial abortion law in Chile and, when compared to actual surveys, we obtained representative results [4]. Figure 1 shows the differences in age segments for the self-reported, predicted, and overall population. In this same work we estimated that 43% of the users were women, much more than the world average, probably due to the topic under study. Hence, demographic biases can be very different from one country to another.
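The reweighting step described above is essentially post-stratification, and can be sketched as below. All segment labels, sample shares, census shares, and support rates are illustrative assumptions, not the figures from the Chilean study.

```python
# A hedged sketch of post-stratification: reweight an opinion estimate
# so each demographic segment counts as much as it does in the census.
# All numbers below are illustrative, not from any real study.

sample_share = {"18-24": 0.40, "25-34": 0.35, "35+": 0.25}  # in our data
census_share = {"18-24": 0.15, "25-34": 0.20, "35+": 0.65}  # population

# Observed support for some stance within each segment of the sample.
support = {"18-24": 0.70, "25-34": 0.55, "35+": 0.30}

# Naive estimate: average weighted by the (biased) sample composition.
naive = sum(sample_share[s] * support[s] for s in support)

# Post-stratified estimate: reweight each segment by its census share.
adjusted = sum(census_share[s] * support[s] for s in support)

print(f"naive = {naive:.3f}, adjusted = {adjusted:.3f}")
```

Because young users are overrepresented in the synthetic sample, the naive estimate (about 55%) overstates support relative to the census-weighted estimate (about 41%), which is exactly the kind of distortion the segmentation is meant to correct.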
Finally, when analyzing social media data, there is a tendency to look at the top-k influencers; for example, what percentage of them have a given property (e.g., are foreign). However, this is completely bogus, as the percentage depends on the value of k. In fact, we can choose k to maximize this percentage, which will then be much larger than the real percentage when all users are considered (a number much larger than k). We have seen this mistake in many studies, including some that are well publicized [3].
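The dependence on k is easy to demonstrate numerically. In the synthetic ranking below (purely illustrative), a hypothetical property happens to cluster near the top of the influence ranking, so a cherry-picked small k inflates the percentage far beyond its population value.

```python
# Why "% of top-k influencers with property X" is fragile: the figure
# depends entirely on k. Synthetic data, for illustration only.

# Users ranked by influence; True marks a hypothetical property
# (e.g., "is foreign") that clusters near the top of the ranking.
has_property = [True, True, True, False, True, False, False, False,
                False, False] + [False] * 90  # 100 users total

def pct_top_k(flags, k):
    """Percentage of the top-k ranked users having the property."""
    return 100 * sum(flags[:k]) / k

print(pct_top_k(has_property, 3))    # -> 100.0 (cherry-picked k)
print(pct_top_k(has_property, 10))   # -> 40.0
print(pct_top_k(has_property, 100))  # -> 4.0 (the real percentage)
```

Reporting the k = 3 figure would claim 100% when the true population rate is 4%, which is the inflation the text warns about.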