Bias Investigation in DBpedia - part 1

Bias is anything that shifts one's outlook on a subject to match one's own inclinations. It shows up in the smallest daily matters, such as food preferences, and in decisions taken by a collective. Humans as a whole tend towards bias because every individual has their own opinions. This investigation therefore tries to discern the presence and scope of bias in DBpedia as a whole.

This data story is part of a two-part investigation into whether bias exists in any form in the DBpedia dataset. To look for it, we narrow the search to relations between people, in this case between a man and a woman.

For starters, DBpedia seems to have a large number of relations related to spouses, most notably dbo:spouse. The next step was to decide which properties would be best for classifying people as male or female. A list of candidate properties was compiled and added to the query.

The query follows the basic pattern ?subject ?spouse ?object, meaning a subject that has the object as its spouse. As the following queries show, the subjects and objects are men, women, or a combination of both. There are many spouse properties, but we use dbo:spouse, which appears to be widely used. An OPTIONAL clause ensures that all results from dbo:spouse are kept, while results from dbp:spouse are kept only if they match the conditions or share any of the variables. The base idea of the query is thus as follows:

"List all subjects with certain properties that are spouses of objects with certain properties."

To get the most information out of the data, we execute the query three times: first with only men as subjects, then with only women as subjects, and finally with both men and women in subject and object positions. The results are counts, broken down by the number of results for each option. We then run a t-test and also check whether other gender declarations have enough results to be included in the previous queries.
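The three runs described above can be sketched as one query template with different gender settings. This is a minimal illustration, not the queries actually used: it only includes the foaf:gender literals "male"@en and "female"@en named in this story, and the helper names `gender_union` and `build_spouse_query` and the variable names `?subjectGender` and `?objectGender` are assumptions made for the example.

```python
PREFIXES = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
"""

# Only the two foaf:gender literals named in the text are used here;
# the full list of gender declarations from the real queries is not reproduced.
GENDER_LITERALS = {"man": '"male"@en', "woman": '"female"@en'}


def gender_union(var, genders):
    """Return one group per requested gender for ?var, joined with UNION."""
    branches = [
        f"{{ ?{var} foaf:gender {GENDER_LITERALS[g]} . "
        f"BIND({GENDER_LITERALS[g]} AS ?{var}Gender) }}"
        for g in genders
    ]
    return "\n  UNION\n  ".join(branches)


def build_spouse_query(subject_genders, object_genders):
    """Build a counting query over dbo:spouse pairs (hypothetical template)."""
    subject_block = gender_union("subject", subject_genders)
    object_block = gender_union("object", object_genders)
    return f"""{PREFIXES}
SELECT ?subjectGender ?objectGender (COUNT(*) AS ?count)
WHERE {{
  ?subject dbo:spouse ?object .
  OPTIONAL {{ ?subject dbp:spouse ?object . }}
  {subject_block}
  {object_block}
}}
GROUP BY ?subjectGender ?objectGender
ORDER BY DESC(?count)
"""


if __name__ == "__main__":
    # The three configurations: men as subjects, women as subjects, both.
    runs = {
        "men as subjects": (["man"], ["woman"]),
        "women as subjects": (["woman"], ["man"]),
        "both positions": (["man", "woman"], ["man", "woman"]),
    }
    for label, (subjects, objects) in runs.items():
        print(f"# {label}")
        print(build_spouse_query(subjects, objects))
```

Keeping the gender settings as parameters means the three runs differ only in their arguments, which makes the resulting counts easier to compare.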

Checking the query results

The first query looks at men in subject position and women in object position. In the results above, the foaf:gender values "male"@en and "female"@en have the most results, while the other options are nowhere near them.

This query returns 33634 results in total.

We see similar results in the query above: foaf:gender again has the most results, even with women in subject position and men in object position.

This query provides us with a total of 33599 results.

Lastly, the results of the previous queries are replicated in the final query, where men and women are present in both subject and object positions. From the above queries we can clearly say that foaf:gender has the most results, even though more than fourteen options from a wide variety of properties are listed. The difference between the first two results and those that follow is significantly large, which leads us to believe that foaf:gender is the property that is mostly used.

This query returns 19423 results.

When we compare the results between the queries themselves, we see that the first query has more male properties than the second, whereas the second has more female properties. Interestingly, the third query has more male properties, but the difference is almost negligible compared with the previous two queries.

Previous iterations

The current implementation is the result of a trial-and-error process. Initially, the options in the queries were written with OPTIONAL instead of UNION. Although OPTIONAL yielded more results, it could also include duplicates, which lowers the quality of the results. It was therefore replaced with UNION, which avoided these duplicates in our results.

Conducting T-tests

A t-test is a statistical test used to compare the means of two groups. A t-test has two hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1).

A significance level is chosen, normally 0.05, and the data is used to calculate the corresponding p-value. If the p-value is less than the significance level, we reject the null hypothesis. If it exceeds the significance level, we fail to reject the null hypothesis.
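For reference, one common form of the two-sample test statistic is Welch's version, which does not assume equal variances in the two groups. This story does not state which variant was used, so the choice here is an assumption. The symbols $\bar{x}_i$, $s_i^2$ and $n_i$ denote the sample mean, sample variance and size of group $i$.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

The p-value is the probability, under the null hypothesis, of observing a statistic at least as extreme as $t$, and it is this value that is compared with the significance level.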

We use this test to give credence to our hypothesis that there is bias present in the dataset. For this investigation, our hypotheses are as follows:

H0: The data present in the dataset is biased.

H1: The data present in the dataset is not biased.

This task was achieved in Python with additional libraries like scipy and pandas. The code and the subsequent results can be found in this Google Colab file.
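As a rough idea of what such a computation involves, the sketch below runs a two-sample t-test using only the Python standard library, so that it is self-contained. It is not the Colab code: it assumes Welch's variant, the function names are made up for this example, and the counts in the usage section are hypothetical placeholders rather than DBpedia results. With scipy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` would be the usual shortcut for the same test.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution (fine for small df; gamma overflows for very large df)."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=10_000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


if __name__ == "__main__":
    # Hypothetical placeholder counts, NOT values taken from DBpedia.
    male_counts = [120, 95, 110, 130]
    female_counts = [118, 99, 108, 127]
    t, df = welch_t(male_counts, female_counts)
    p = two_sided_p(t, df)
    print(f"t={t:.4f}, df={df:.2f}, p={p:.4f}, decision: {decide(p)}")
```

Numerical integration is used here only to avoid an external dependency. For real analyses a library routine is the safer choice.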

As the results show, the p-value is 0.9991, which is far higher than 0.05, so we fail to reject the null hypothesis. Failing to reject a null hypothesis does not prove it, but under our formulation the result is consistent with some bias being present in the dataset, or at least in the spouse properties.

Other gender classifications

An additional step we considered after the initial testing was to look into other gender classifications that might reside in DBpedia. We limited ourselves to foaf:gender because our previous queries had provided decent evidence that it was the most used property. If other gender classifications exist, they would therefore most likely appear in larger numbers under foaf:gender. The results do not indicate this, which leads to the conclusion that other genders are either poorly defined or lack representation in DBpedia as a whole.

The query above should list all the values that appear as objects of foaf:gender and provide a count of each.

To ensure we were not missing any details, we also tried a query with dbo:gender. The results were unsatisfactory, but the resulting query is listed below.
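A value-counting query of this kind can be sketched as a small template that takes the gender property as a parameter, so the same shape serves both foaf:gender and dbo:gender. The function name `build_gender_count_query` and the variable names are assumptions made for this example, and the exact queries used in the investigation may differ.

```python
PREFIXES = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
"""


def build_gender_count_query(gender_property="foaf:gender"):
    """Build a query that lists each value of the property with its count."""
    return f"""{PREFIXES}
SELECT ?gender (COUNT(?person) AS ?count)
WHERE {{
  ?person {gender_property} ?gender .
}}
GROUP BY ?gender
ORDER BY DESC(?count)
"""


if __name__ == "__main__":
    # One query per property mentioned in the text.
    for prop in ("foaf:gender", "dbo:gender"):
        print(build_gender_count_query(prop))
```

Ordering by descending count puts the dominant values first, which makes it easy to see whether any other gender value has a substantial number of results.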

Conclusion

As this is the first part of the investigation, we can say that there are indications of bias in the dataset, though they are limited in scope for now. The aim going forward is to broaden the scope and to find the predicates that yield the most results between men and women in both directions.