Data Quality

data-quality

Created on Aug 9th, 2020

This is where we collect approaches for detecting and resolving data quality issues.

Members

This query enumerates nodes with no human-readable label.

It is a best practice for each node to have a human-readable label.
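A minimal sketch of this check in Python, not the query from this collection. It assumes the graph is held in memory as (subject, predicate, object) tuples of plain strings, that only nodes in subject position are inspected, and that rdfs:label is the only labeling property; the function name and sample data are hypothetical.

```python
RDFS_LABEL = "rdfs:label"  # assumption: prefixed names stand in for full IRIs

def nodes_without_label(triples, label_property=RDFS_LABEL):
    """Return subject nodes that never occur with the label property."""
    subjects = {s for s, _, _ in triples}
    labeled = {s for s, p, _ in triples if p == label_property}
    return sorted(subjects - labeled)

# Hypothetical sample data: ex:utrecht has no label.
triples = [
    ("ex:amsterdam", RDFS_LABEL, "Amsterdam"),
    ("ex:amsterdam", "ex:partOf", "ex:netherlands"),
    ("ex:utrecht", "ex:partOf", "ex:netherlands"),
]
print(nodes_without_label(triples))
```

Other labeling properties (for example skos:prefLabel) could be passed through the label_property parameter.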

Created 3 months ago, 1 version

Enumerates classes that are not specified as the domain and/or range of any property.
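One way this check could look in Python, as a sketch rather than the query itself. It assumes in-memory (subject, predicate, object) string tuples, reads the description as "used as neither domain nor range", and only considers classes that are explicitly declared as rdfs:Class or owl:Class; all names are hypothetical.

```python
CLASS_TYPES = {"rdfs:Class", "owl:Class"}  # assumption: how classes are declared

def classes_unused_in_domain_or_range(triples):
    """Return declared classes that no property names as its domain or range."""
    declared = {s for s, p, o in triples if p == "rdf:type" and o in CLASS_TYPES}
    referenced = {o for _, p, o in triples if p in ("rdfs:domain", "rdfs:range")}
    return sorted(declared - referenced)

# Hypothetical sample vocabulary: ex:Pet is declared but never referenced.
triples = [
    ("ex:Person", "rdf:type", "owl:Class"),
    ("ex:Pet", "rdf:type", "owl:Class"),
    ("ex:knows", "rdfs:domain", "ex:Person"),
    ("ex:knows", "rdfs:range", "ex:Person"),
]
print(classes_unused_in_domain_or_range(triples))
```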

Created 3 months ago, 1 version

Enumerates properties that do not have a range specified.

Created 3 months ago, 1 version

Enumerates properties that do not have a domain specified.
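The checks for a missing range and a missing domain have the same shape, so both can be sketched with one hypothetical Python helper. The sketch assumes in-memory (subject, predicate, object) string tuples and only looks at properties that are explicitly declared; it is not the query from this collection.

```python
# Assumption: properties are declared with one of these classes.
PROPERTY_CLASSES = {"rdf:Property", "owl:DatatypeProperty", "owl:ObjectProperty"}

def properties_missing(triples, axiom):
    """Return declared properties lacking the axiom (rdfs:domain or rdfs:range)."""
    declared = {s for s, p, o in triples if p == "rdf:type" and o in PROPERTY_CLASSES}
    specified = {s for s, p, _ in triples if p == axiom}
    return sorted(declared - specified)

# Hypothetical sample vocabulary.
triples = [
    ("ex:name", "rdf:type", "owl:DatatypeProperty"),
    ("ex:name", "rdfs:domain", "ex:Person"),
    ("ex:knows", "rdf:type", "owl:ObjectProperty"),
    ("ex:knows", "rdfs:range", "ex:Person"),
]
print(properties_missing(triples, "rdfs:domain"))  # properties without a domain
print(properties_missing(triples, "rdfs:range"))   # properties without a range
```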

Created 3 months ago, 1 version

Data quality aspect

A dataset sometimes contains statements (i.e., triples) that are isolated from the rest of the knowledge graph.

Purpose

This query enumerates such isolated triples. These are not connected to the rest of the graph in any way.

Implementation

We check for the following:

  • Does the subject node have inlinks?
  • Does the object node have outlinks?
  • Are there other triples within this same record?

If none of the above is true, the triple is shown in this query as a disconnected edge.
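The three checks can be sketched in Python as follows. This is an illustration under stated assumptions, not the query itself: triples are distinct in-memory (subject, predicate, object) tuples, and a "record" is taken to mean all triples that share the same subject.

```python
from collections import Counter

def disconnected_triples(triples):
    """Return triples whose subject has no inlinks, whose object has no
    outlinks, and that are the only triple of their subject."""
    out_degree = Counter(s for s, _, _ in triples)  # triples per subject
    subjects = set(out_degree)
    objects = {o for _, _, o in triples}
    return [
        (s, p, o)
        for s, p, o in triples
        if s not in objects         # subject has no inlinks
        and o not in subjects       # object has no outlinks
        and out_degree[s] == 1      # no other triples in the same record
    ]

# Hypothetical sample data: the last triple touches nothing else.
triples = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:x", "ex:note", "orphan"),
]
print(disconnected_triples(triples))
```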

Created 3 months ago, 2 versions

Data quality aspect

Implicitly, every linked data node is an instance of rdfs:Resource. However, it is a best practice to make the instance-of assertion explicit. Also, a node is rarely a direct instance of rdfs:Resource, so making the assertion explicit often prompts the data publisher to think about a more descriptive class hierarchy.

Created 3 months ago, 3 versions

Data quality aspect

Terms that are used in the predicate position of at least one triple are implicitly instances of rdf:Property. However, it is a best practice to make explicit whether a property is a datatype property (instance of owl:DatatypeProperty) or object property (instance of owl:ObjectProperty).
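A sketch of how such predicates could be found in Python, assuming in-memory (subject, predicate, object) string tuples. The function name and the ignore list are hypothetical and not taken from the query in this collection.

```python
OWL_PROPERTY_KINDS = {"owl:DatatypeProperty", "owl:ObjectProperty"}

def predicates_without_explicit_kind(triples, ignore=frozenset({"rdf:type"})):
    """Return predicates used in the data that are not declared as a
    datatype property or an object property."""
    used = {p for _, p, _ in triples}
    declared = {
        s for s, p, o in triples
        if p == "rdf:type" and o in OWL_PROPERTY_KINDS
    }
    return sorted(used - declared - ignore)

# Hypothetical sample data: ex:knows is used but never declared.
triples = [
    ("ex:name", "rdf:type", "owl:DatatypeProperty"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:knows", "ex:bob"),
]
print(predicates_without_explicit_kind(triples))
```

The ignore parameter keeps built-in vocabulary terms such as rdf:type out of the result; whether the original query does the same is not stated.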

Created 3 months ago, 1 version

Data quality aspect

There is sometimes an inconsistency between the defined domain for properties (rdfs:domain) and the subject terms that are used with those properties in the data.
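A minimal Python sketch of a domain consistency check. It assumes in-memory (subject, predicate, object) string tuples, compares only asserted rdf:type values (no subclass reasoning), and requires a subject to carry every declared domain of the property; the function name is hypothetical.

```python
from collections import defaultdict

def domain_violations(triples):
    """Return data triples whose subject lacks a declared domain class."""
    domains, types = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == "rdfs:domain":
            domains[s].add(o)   # property -> declared domain classes
        elif p == "rdf:type":
            types[s].add(o)     # node -> asserted classes
    return [
        (s, p, o)
        for s, p, o in triples
        if p in domains and not domains[p] <= types.get(s, set())
    ]

# Hypothetical sample data: ex:rex uses ex:name but is not an ex:Person.
triples = [
    ("ex:name", "rdfs:domain", "ex:Person"),
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:rex", "rdf:type", "ex:Dog"),
    ("ex:rex", "ex:name", "Rex"),
]
print(domain_violations(triples))
```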

Related

  • The same consistency check can be done for ranges: query

Created 3 months ago, 1 version

Data quality aspect

There is sometimes an inconsistency between the defined range for properties (rdfs:range) and the object terms that are used with those properties in the data.

Implementation

This query identifies the use of properties in the data ([ ?p ?o].) and identifies the classes of the corresponding object terms. These classes can either be asserted through rdf:type for IRIs, or be part of the term itself (extracted with datatype/1) for literals. rdfs:Resource is used as a fallback if no class is specified in the data.
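The class lookup described above can be sketched in Python. The sketch assumes IRIs are plain strings and literals are instances of a small hypothetical Literal type that carries its datatype; it only collects the observed object classes per property, which could then be compared against the declared rdfs:range.

```python
from collections import defaultdict
from typing import NamedTuple

class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    value: str
    datatype: str

def object_classes(triples):
    """Map each predicate to the classes observed for its object terms."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
    observed = defaultdict(set)
    for _, p, o in triples:
        if isinstance(o, Literal):
            observed[p].add(o.datatype)  # class is part of the term itself
        else:
            # asserted classes, or rdfs:Resource as the fallback
            observed[p] |= types.get(o) or {"rdfs:Resource"}
    return dict(observed)

# Hypothetical sample data: ex:carol has no asserted class.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:age", Literal("42", "xsd:integer")),
]
print(object_classes(triples))
```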

Related

  • The same consistency check can be done for domains: query

Created 3 months ago, 1 version

Data quality aspect

consistency > range > syntax

Shows predicates that are defined in the vocabulary as datatype properties (owl:DatatypeProperty), but that have IRIs in the object position of at least some statements.
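A sketch of this check in Python, assuming IRIs are plain strings and literals are instances of a small hypothetical Literal type. It is an illustration, not the query from this collection.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    value: str
    datatype: str = "xsd:string"

def datatype_properties_with_iri_objects(triples):
    """Return declared datatype properties that occur with a non-literal object."""
    datatype_props = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "owl:DatatypeProperty"
    }
    return sorted({
        p for _, p, o in triples
        if p in datatype_props and not isinstance(o, Literal)
    })

# Hypothetical sample data: ex:homepage is used with an IRI object.
triples = [
    ("ex:name", "rdf:type", "owl:DatatypeProperty"),
    ("ex:homepage", "rdf:type", "owl:DatatypeProperty"),
    ("ex:alice", "ex:name", Literal("Alice")),
    ("ex:alice", "ex:homepage", "https://example.org/alice"),
]
print(datatype_properties_with_iri_objects(triples))
```

The mirrored check for object properties would select owl:ObjectProperty declarations and flip the isinstance test.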

Created 3 months ago, 1 version

Data quality issue

Correctness > syntax > null

In traditional data paradigms it was often required to enter a value, even if the value was not present for a certain object. In linked data there is no reason to use null values anymore, and the use of null values is often merely a byproduct of old data sources and/or old habits.

Purpose

This query enumerates the empty literals that appear in a dataset.
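As an illustration, the enumeration could look like this in Python. The sketch assumes IRIs are plain strings and literals are instances of a small hypothetical Literal type, and it treats only the zero-length string as empty.

```python
from typing import NamedTuple

class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    value: str
    datatype: str = "xsd:string"

def empty_literals(triples):
    """Return triples whose object is a literal with an empty lexical form."""
    return [
        (s, p, o)
        for s, p, o in triples
        if isinstance(o, Literal) and o.value == ""
    ]

# Hypothetical sample data: the nickname was exported as an empty string.
triples = [
    ("ex:alice", "ex:name", Literal("Alice")),
    ("ex:alice", "ex:nickname", Literal("")),
    ("ex:alice", "ex:knows", "ex:bob"),
]
print(empty_literals(triples))
```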

Created 3 months ago, 1 version

Data quality issue

consistency > range > syntax

Query purpose

This query shows the predicates that are defined in the vocabulary as object properties (owl:ObjectProperty), but that have literals in the object position of at least some data triples.

Created 3 months ago, 1 version

Data quality issue

Incorrectness > semantic > term > numeric

Datasets sometimes define their numeric data incorrectly at the term level. There is an important distinction between decimal numbers (including integers) and floating-point numbers. Both are defined in XML Schema 1.1: Datatypes. It is especially common to represent decimal numeric data using floating-point numbers.

Purpose

This query gives an overview of the properties that are likely to use floating-point numbers to represent decimal numeric data.

Implementation

This is done by automatically converting each double (xsd:double) to an integer (xsd:integer), and back to a double again. If no information was lost, the double could have been modeled as an integer.
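The round trip can be sketched in Python, where float plays the role of xsd:double and int the role of xsd:integer. The Literal type, the function names, and the choice to report only properties whose doubles all survive the round trip are assumptions made for this illustration.

```python
import math
from collections import defaultdict
from typing import NamedTuple

class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    value: str
    datatype: str

def survives_integer_round_trip(lexical):
    """True if double -> integer -> double loses no information."""
    value = float(lexical)
    return math.isfinite(value) and float(int(value)) == value

def properties_with_integral_doubles(triples):
    """Return predicates whose xsd:double objects all survive the round trip."""
    results = defaultdict(list)
    for _, p, o in triples:
        if isinstance(o, Literal) and o.datatype == "xsd:double":
            results[p].append(survives_integer_round_trip(o.value))
    return sorted(p for p, flags in results.items() if all(flags))

# Hypothetical sample data: floor counts stored as doubles.
triples = [
    ("ex:a", "ex:floors", Literal("3.0E0", "xsd:double")),
    ("ex:b", "ex:floors", Literal("12", "xsd:double")),
    ("ex:a", "ex:height", Literal("10.5", "xsd:double")),
]
print(properties_with_integral_doubles(triples))
```

Special values such as INF and NaN have no integer counterpart, so the sketch treats them as not convertible.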

Created 3 months ago, 2 versions

This query enumerates datatype properties that have a relatively small number of unique values. Such properties might be better modeled as object properties, and their values as IRIs. This is computed heuristically, based on the ratio between unique and non-unique literal occurrences.
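A sketch of one such heuristic in Python: the number of distinct literal values divided by the total number of literal occurrences, per property. The thresholds, the Literal type, and the function name are assumptions for illustration; the query's actual cut-off values are not stated here.

```python
from collections import defaultdict
from typing import NamedTuple

class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    value: str
    datatype: str = "xsd:string"

def low_cardinality_properties(triples, max_ratio=0.2, min_occurrences=10):
    """Map predicates to their distinct/total ratio when it is at most max_ratio."""
    values = defaultdict(list)
    for _, p, o in triples:
        if isinstance(o, Literal):
            values[p].append(o.value)
    return {
        p: len(set(v)) / len(v)
        for p, v in values.items()
        if len(v) >= min_occurrences and len(set(v)) / len(v) <= max_ratio
    }

# Hypothetical sample data: 20 items, two status values, 20 unique ids.
triples = [
    (f"ex:item{i}", "ex:status", Literal("open" if i % 2 else "closed"))
    for i in range(20)
]
triples += [(f"ex:item{i}", "ex:id", Literal(str(i))) for i in range(20)]
print(low_cardinality_properties(triples))
```

The min_occurrences guard keeps rarely used properties from being flagged on the basis of a handful of values.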

Created 3 months ago, 1 version

Encoding issues are introduced when text is saved with an encoding other than Unicode (UTF-8).
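How the query detects such issues is not described here. As a swapped-in illustration, one common symptom can be tested in Python: text whose UTF-8 bytes were at some point decoded as Latin-1, or that contains the Unicode replacement character. The function name and both heuristics are assumptions.

```python
def looks_like_encoding_issue(text):
    """Heuristic: True if text contains U+FFFD or looks like UTF-8 bytes
    that were decoded as Latin-1."""
    if "\ufffd" in text:  # replacement character left by a lossy decode
        return True
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False      # not reinterpretable, so probably fine
    return repaired != text

# "Caf\u00c3\u00a9" is the word "Café" after a wrong Latin-1 decode.
print(looks_like_encoding_issue("Caf\u00c3\u00a9"))
print(looks_like_encoding_issue("Caf\u00e9"))
```

The heuristic can produce false positives for legitimate text that happens to form valid UTF-8 byte sequences, so matches are candidates for review rather than confirmed errors.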

Created 3 months ago, 1 version