Bias Investigation in DBpedia - part 2

This data story continues the previous one, in which we asked whether the data in DBpedia is biased. Reading the previous data story is recommended and is linked here. The aim of this data story is to understand the results obtained when comparing predicate counts between men and women.

The first step is to define what counts as a man and a woman. For this, we rely on the previous story, in which we examined which option returns the most results and found that foaf:gender does. We therefore use it to build our query, which can be described roughly as follows:

"List all predicates with counts of where the subject is male and object is female as well as where the subject is female and the object is male under foaf:gender."

As the aim is to check for bias, we take a deeper dive into how DBpedia uses each of the predicates for male subjects and for female subjects. This requires post-processing the results of the above query, which was done in Python. The results are saved in this Google Colab notebook.
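The query described above could be sketched roughly as follows. This is only an illustration, not the exact query used in the notebook: it splits the work into two queries (one per direction), the function name `build_predicate_count_query` is hypothetical, and it assumes that DBpedia stores gender as the English-tagged literals "male"@en and "female"@en.

```python
def build_predicate_count_query(subject_gender, object_gender, count_name):
    """Build a SPARQL query that counts, per predicate, the triples linking
    a subject of one gender to an object of the other gender.

    Assumption: gender is stored as an English-tagged literal on foaf:gender.
    """
    return f"""PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?p (COUNT(*) AS ?{count_name})
WHERE {{
  ?s foaf:gender "{subject_gender}"@en .
  ?o foaf:gender "{object_gender}"@en .
  ?s ?p ?o .
}}
GROUP BY ?p
ORDER BY DESC(?{count_name})"""


# Male subject, female object.
mtf_query = build_predicate_count_query("male", "female", "mtfcount")
# Female subject, male object.
ftm_query = build_predicate_count_query("female", "male", "ftmcount")
```

The two query strings can be pasted into the public DBpedia SPARQL endpoint, and the two result sets then joined on the predicate during post-processing.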

Based on the post-processed data, we take the top five predicates where mtfcount (male subject, female object) was higher and the top five where ftmcount (female subject, male object) was higher, and try to understand the difference in each case.

The top five results where the difference is in favor of mtfcount are as follows:
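The ranking step might look something like the sketch below. The function name `rank_differences` and the toy counts are placeholders for illustration only. They are not taken from the notebook, and the real post-processing may differ.

```python
def rank_differences(mtf_counts, ftm_counts, top_n=5):
    """Return the predicates with the largest count difference in each direction.

    Both arguments map predicate -> count. A predicate missing from one side
    is treated as having a count of 0 there.
    """
    predicates = set(mtf_counts) | set(ftm_counts)
    diffs = {p: mtf_counts.get(p, 0) - ftm_counts.get(p, 0) for p in predicates}

    # Largest positive differences: mtfcount is higher.
    mtf_top = sorted((p for p in diffs if diffs[p] > 0),
                     key=lambda p: (-diffs[p], p))[:top_n]
    # Largest negative differences: ftmcount is higher.
    ftm_top = sorted((p for p in diffs if diffs[p] < 0),
                     key=lambda p: (diffs[p], p))[:top_n]

    return ([(p, diffs[p]) for p in mtf_top],
            [(p, -diffs[p]) for p in ftm_top])


# Toy input with placeholder predicates (not real DBpedia counts).
toy_mtf = {"ex:a": 10, "ex:b": 4, "ex:c": 7}
toy_ftm = {"ex:a": 3, "ex:b": 9, "ex:d": 2}
mtf_higher, ftm_higher = rank_differences(toy_mtf, toy_ftm)
```

Each returned pair holds a predicate and the number of extra results on the side that leads, which is the form used in the two lists of this story.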

  • dbo:monarch - 1374 more results.
  • dbo:associatedMusicalArtist - 1263 more results.
  • dbo:associatedBand - 1263 more results.
  • dbo:successor - 944 more results.
  • dbp:after - 847 more results.

The top five results where the difference is in favor of ftmcount are as follows:

  • dbo:predecessor - 1277 more results.
  • dbo:president - 1257 more results.
  • dbo:influencedBy - 1160 more results.
  • dbo:spouse - 1025 more results.
  • dbo:creator - 1024 more results.

To understand how these predicates are used, we run two different SPARQL queries for each of them. The first query shows what the results actually look like and what the predicate is meant to express. The second query gives a count of results, grouped by object.

Note: These queries have a limit set on them. If you would like to try them yourself, I recommend removing the limit and running them again.
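As a rough sketch of the two kinds of query, the helpers below build a listing query and a grouped count query for a given predicate. The helper names are hypothetical, the gender literals are assumed to be "male"@en and "female"@en, and passing `limit=None` drops the LIMIT clause.

```python
PREFIXES = """PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
"""


def _pattern(predicate, subject_gender, object_gender):
    # Triple patterns shared by both queries (gender literals are an assumption).
    return (
        f'  ?s foaf:gender "{subject_gender}"@en .\n'
        f'  ?o foaf:gender "{object_gender}"@en .\n'
        f"  ?s {predicate} ?o .\n"
    )


def _limit_clause(limit):
    # No LIMIT clause at all when limit is None.
    return f"\nLIMIT {limit}" if limit is not None else ""


def build_listing_query(predicate, subject_gender, object_gender, limit=100):
    """First query: list subject/object pairs to see how the predicate is used."""
    return (PREFIXES + "SELECT ?s ?o\nWHERE {\n"
            + _pattern(predicate, subject_gender, object_gender)
            + "}" + _limit_clause(limit))


def build_grouped_count_query(predicate, subject_gender, object_gender, limit=100):
    """Second query: count subjects per object, largest groups first."""
    return (PREFIXES + "SELECT ?o (COUNT(?s) AS ?n)\nWHERE {\n"
            + _pattern(predicate, subject_gender, object_gender)
            + "}\nGROUP BY ?o\nORDER BY DESC(?n)" + _limit_clause(limit))


# Example: men serving a female monarch, grouped by monarch, without a limit.
monarch_counts_query = build_grouped_count_query(
    "dbo:monarch", "male", "female", limit=None)
```

Swapping the two gender arguments gives the query for the opposite direction.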


Results from mtfcount

dbo:monarch

As we see from the queries below, dbo:monarch is used to refer to individuals who served under a monarch of the opposite gender. We see clearly that very few women served under a male monarch, whereas a considerable number of men served under female monarchs.

Men serving a female monarch.

Women serving a male monarch.

An interesting result here is that although the count of men serving is higher, the number of female monarchs listed is lower than the number of male monarchs. Queen Elizabeth II has the highest number of men serving under her, at 834 out of 1439 results, which makes sense as she was one of the longest-reigning monarchs. The highest result on the male side is Willem-Alexander of the Netherlands, who has 6 women serving under him out of the 65 results.

dbo:associatedMusicalArtist

This predicate seems to describe relations between producers or heads of record labels and artists of the opposite gender. The results below show that a larger number of male producers have worked with female artists.

Male subjects that worked with female musicians.

Female subjects that worked with male musicians.

We can also see here that the count of female artists with male producers is higher than the reverse. Counts for artists such as Rihanna (50 out of 5918) and Madonna (48 out of 5918) are very high compared to Prince (23 out of 4655) or Stevie Wonder (15 out of 4655). Additionally, as with dbo:monarch, the number of male artists listed is higher than the number of female artists, even though the overall counts point the other way.

dbo:associatedBand

The results obtained from this predicate are exactly the same as the results obtained for dbo:associatedMusicalArtist.

This raises the question of whether the two properties are linked, so that a subject with one of them automatically receives the other. Another possibility is that DBpedia treats solo artists as bands (a one-man band, for instance), which might explain why both predicates return the same results.
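One way to probe this question would be to compare the subject/object pairs returned for the two predicates. The sketch below assumes the pairs have already been fetched as lists of tuples. The function name and the toy pairs are placeholders, not real DBpedia results.

```python
def compare_predicate_pairs(pairs_a, pairs_b):
    """Compare two collections of (subject, object) pairs.

    Returns how many pairs are shared and how many appear on one side only.
    """
    a, b = set(pairs_a), set(pairs_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


# Toy pairs with placeholder identifiers.
band_pairs = [("ex:s1", "ex:o1"), ("ex:s2", "ex:o2")]
artist_pairs = [("ex:s1", "ex:o1"), ("ex:s2", "ex:o2"), ("ex:s3", "ex:o3")]
overlap = compare_predicate_pairs(band_pairs, artist_pairs)
```

If both one-sided counts come out as zero on the real data, every pair carries both predicates, which would support the idea that the two properties are filled together.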

dbo:successor

This predicate refers to an individual who succeeded another of the opposite sex. The results are not limited to political succession but also include royal succession as well as guild succession.

Men with female successors.

Women with male successors.

The only notable fact about this predicate is that more women have succeeded (or will succeed) men than the other way around.

dbp:after

This predicate seems to be for any and all objects that came after something or someone. The results are widely varied and do not have any set pattern from what we can tell.

Male subjects with female objects that came after.

Female subjects with male objects that came after.

From the above queries, we can see that the results are all over the place. David Johnston was a Governor General of Canada, which had Elizabeth II as its head of state since Canada is a constitutional monarchy. The next query lists Queen Victoria, who was the mother of Edward VII and was also the Queen before he ascended the throne after her death.


Results from ftmcount

dbo:predecessor

The results for this predicate seem fairly similar to those of its counterpart, dbo:successor. As with successor, they also contain results for royal and entertainment-based succession.

dbo:president

The predicate shows the number of individuals who served under a president of the opposite sex. The results seem to range from working directly under the president to being an ambassador during that period. There is also some overlap when an individual held their position for longer than the president's term.

The highest count is for Barack Obama (278 out of 1563), which is over 176 more results than George W. Bush, who served as President before him. On the other side, we have Cristina Fernández de Kirchner, who had 34 men serve under her. This discrepancy might be due to the fact that there have been very few female presidents throughout history, as well as the fact that in certain countries presidents do not hold all the power.

dbo:influencedBy

The predicate shows results for individuals that were influenced by individuals of the opposite sex.

The results are not indicative of much, as they are bound to be subjective. But it is interesting to see from the results below that Ayn Rand is the woman who influenced the most male figures, while Sigmund Freud is the man who influenced the most female figures.

dbo:spouse

The results list the number of spouses individuals had. The results range from monarchs to socialites and actors and also mythological characters.

An interesting find is that men have more spouses than the results show, because spouses without their own dedicated articles in Wikidata (and, by extension, Wikipedia) are not listed. For example, the Qianlong Emperor has 16 spouses according to DBpedia, but a manual count shows over 49 listed spouses. This seems to be due to the lack of articles for the remaining spouses, which prevents them from being tagged. It also tells us that spouse is probably used loosely: it is not restricted to a current marital spouse but also covers individuals who acted as consorts to a reigning monarch, as well as divorced or widowed individuals.

The highest counts of marriages for women come from actresses and socialites, with one rare occurrence of a Burmese Queen. There is also an interesting issue with the mythological Hindu character Draupadi, who is listed as having 4 husbands in the query whereas her DBpedia page shows 5 husbands (the Pandavas). This may be due to properties not being assigned properly or to incomplete data.

dbo:creator

This lists every fictional character that was created by an individual of the opposite sex. An interesting finding here is that more male creators have created female characters than female creators have created male characters.

The top two results for male creators are comic book artists Chris Claremont and Stan Lee with 25 and 24 creations respectively. On the female creators side, we have TV and Film producers Shonda Rhimes and Sylvia Anderson with 11 and 10 creations respectively.


T-test for understanding the results.

A t-test was performed again to test for the presence of bias in the dataset. We set the significance level to 0.05 and set the hypotheses as follows:

Null Hypothesis (H0): There is bias in the data.

Alternate Hypothesis (H1): There is no bias in the data.

The Google Colab file contains the code for the tests at the end of the file.

As we can see from the results, the p-value is 0.826, which is well above 0.05. We therefore fail to reject the null hypothesis as stated above, which is consistent with there being some bias in the data, although a high p-value does not by itself prove the null hypothesis.
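To make the decision rule concrete, here is a minimal sketch using only the Python standard library. It assumes a paired t-test on per-predicate counts, which may not be the test run in the notebook, and the helper names and toy numbers are illustrative. The p-value itself needs the t distribution (for example from a statistics library), so it is passed in rather than computed here.

```python
import math
import statistics


def paired_t_statistic(xs, ys):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 items")
    diffs = [x - y for x, y in zip(xs, ys)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def decision(p_value, alpha=0.05):
    """Standard rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Toy paired counts (placeholders, not DBpedia data).
t_value = paired_t_statistic([5, 7, 9], [4, 5, 6])
# Decision for the p-value reported in this story.
outcome = decision(0.826)
```

With a p-value of 0.826 and a significance level of 0.05, the rule returns "fail to reject H0", meaning the test does not provide evidence against whichever statement was chosen as the null hypothesis.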

Conclusion

Taking into consideration the results of both t-tests undertaken across the two data stories, we can say that there is some evidence of data bias. Based on our own browsing and reading of the data, there are also indications of data quality issues and of incomplete data.

In some cases the properties and values are incomplete, and in others they are overused. There is also the issue of one meaning being split across several properties. For example, several predicates are used to mark a subject as male, but foaf:gender is clearly the most used, which makes the other similar predicates redundant.

Thus, in conclusion we can say there exists some kind of data bias in DBpedia that stems from quality issues.