Data Quality

data-quality
Created on Aug 9th, 2020

This is where we collect approaches for detecting and resolving data quality issues.

Members

human-readable-no-label

This query enumerates nodes with no human-readable label.

It is a best practice for each node to have a human-readable label.
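
As a rough illustration of this check outside a query language, the following Python sketch works on an in-memory list of triples. It assumes that IRIs are plain strings, that literals are instances of a hypothetical Literal tuple, and that rdfs:label is the only property that counts as a human-readable label; none of these choices come from the query itself.

```python
from typing import NamedTuple

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (not part of the original query)."""
    lexical: str
    datatype: str = "http://www.w3.org/2001/XMLSchema#string"


def nodes_without_label(triples, label_properties=(RDFS_LABEL,)):
    """Return the IRIs that occur as subject or object but never carry a label."""
    nodes, labelled = set(), set()
    for s, p, o in triples:
        nodes.add(s)
        if isinstance(o, str):  # plain strings are IRIs in this sketch
            nodes.add(o)
        if p in label_properties and isinstance(o, Literal):
            labelled.add(s)
    return sorted(nodes - labelled)


triples = [
    ("ex:amsterdam", RDFS_LABEL, Literal("Amsterdam")),
    ("ex:amsterdam", "ex:capitalOf", "ex:netherlands"),
]
unlabelled = nodes_without_label(triples)  # only ex:netherlands lacks a label
```

Other label properties (for example skos:prefLabel) can be passed through label_properties if a dataset uses them.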

Created 4 years ago, 1 version

disconnected-class

Enumerates classes that are not specified as the domain and/or range of any property.
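
A minimal Python sketch of the same idea, assuming triples are tuples of strings and that classes are recognised by an explicit rdf:type owl:Class or rdfs:Class declaration (the query may identify classes differently):

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def disconnected_classes(triples):
    """Return declared classes that are never the object of rdfs:domain or rdfs:range."""
    declared = {
        s for s, p, o in triples
        if p == RDF_TYPE and o in (OWL_CLASS, RDFS + "Class")
    }
    connected = {
        o for s, p, o in triples
        if p in (RDFS + "domain", RDFS + "range")
    }
    return sorted(declared - connected)


triples = [
    ("ex:Person", RDF_TYPE, OWL_CLASS),
    ("ex:Pet", RDF_TYPE, OWL_CLASS),
    ("ex:knows", RDFS + "domain", "ex:Person"),
]
isolated = disconnected_classes(triples)  # ex:Pet is not used by any property
```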

Created 4 years ago, 1 version

completeness-no-range

Enumerates properties that do not have a range specified.
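
The check can be sketched in Python as a set difference. The sketch assumes triples are tuples of strings and that properties are recognised by an explicit declaration as rdf:Property, owl:ObjectProperty or owl:DatatypeProperty. The constraint parameter is hypothetical: passing rdfs:domain instead of rdfs:range gives the corresponding domain check.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
PROPERTY_CLASSES = {
    RDF + "Property",
    OWL + "ObjectProperty",
    OWL + "DatatypeProperty",
}


def properties_missing(triples, constraint=RDFS + "range"):
    """Return declared properties that have no value for the given constraint."""
    declared = {
        s for s, p, o in triples
        if p == RDF + "type" and o in PROPERTY_CLASSES
    }
    constrained = {s for s, p, o in triples if p == constraint}
    return sorted(declared - constrained)


triples = [
    ("ex:name", RDF + "type", OWL + "DatatypeProperty"),
    ("ex:name", RDFS + "range", "http://www.w3.org/2001/XMLSchema#string"),
    ("ex:knows", RDF + "type", OWL + "ObjectProperty"),
    ("ex:knows", RDFS + "domain", "ex:Person"),
]
no_range = properties_missing(triples)
no_domain = properties_missing(triples, constraint=RDFS + "domain")
```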

Created 4 years ago, 1 version

completeness-no-domain

Enumerates properties that do not have a domain specified.

Created 4 years ago, 1 version

disconnected-triples

Data quality aspect

A dataset sometimes contains statements (i.e., triples) that are isolated from the rest of the knowledge graph.

Purpose

This query enumerates such isolated triples. These are not connected to the rest of the graph in any way.

Implementation

We check for the following:

  • Does the subject node have inlinks?
  • Does the object node have outlinks?
  • Are there other triples within this same record?

If none of the above is true, the triple is shown in this query as a disconnected edge.
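
The three checks can be sketched in Python with two degree counters. This is only an approximation under stated assumptions: triples are tuples of strings without duplicates, and "record" is read as the set of triples that share the same subject, which the description does not define.

```python
from collections import Counter


def disconnected_triples(triples):
    """Return triples that fail all three connectivity checks."""
    out_degree = Counter(s for s, _, _ in triples)
    in_degree = Counter(o for _, _, o in triples)
    return [
        (s, p, o)
        for s, p, o in triples
        if in_degree[s] == 0       # the subject node has no inlinks
        and out_degree[o] == 0     # the object node has no outlinks
        and out_degree[s] == 1     # no other triples in the same record (assumed: same subject)
    ]


triples = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:x", "ex:note", "ex:y"),
]
isolated = disconnected_triples(triples)  # only the ex:x triple is isolated
```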

Created 4 years ago, 2 versions

incompleteness-node-type

Data quality aspect

Implicitly, every linked data node is an instance of rdfs:Resource. However, it is a best practice to make the instance-of assertion explicit. Also, a node is rarely a direct instance of rdfs:Resource, so making the assertion explicit often prompts the data publisher to think about a more descriptive class hierarchy.

Created 4 years ago, 3 versions

incompleteness-property-type

Data quality aspect

Terms that are used in the predicate position of at least one triple are implicitly instances of rdf:Property. However, it is a best practice to make explicit whether a property is a datatype property (instance of owl:DatatypeProperty) or object property (instance of owl:ObjectProperty).
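
A small Python sketch of this check, assuming triples are tuples of strings. The ignore parameter is a hypothetical convenience that keeps rdf:type itself out of the result; it is not taken from the query.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL = "http://www.w3.org/2002/07/owl#"
PROPERTY_KINDS = {OWL + "DatatypeProperty", OWL + "ObjectProperty"}


def untyped_properties(triples, ignore=frozenset({RDF_TYPE})):
    """Return predicates in use that are not declared as datatype or object property."""
    used = {p for _, p, _ in triples}
    typed = {
        s for s, p, o in triples
        if p == RDF_TYPE and o in PROPERTY_KINDS
    }
    return sorted(used - typed - ignore)


triples = [
    ("ex:name", RDF_TYPE, OWL + "DatatypeProperty"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:knows", "ex:bob"),
]
missing = untyped_properties(triples)  # ex:knows is used but never typed
```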

Created 4 years ago, 1 version

inconsistency-domain-usage

Data quality aspect

There is sometimes an inconsistency between the defined domain for properties (rdfs:domain) and the subject terms that are used with those properties in the data.

Related

  • The same consistency check can be done for ranges: see inconsistent-range-usage.

Created 4 years ago, 1 version

inconsistent-range-usage

Data quality aspect

There is sometimes an inconsistency between the defined range for properties (rdfs:range) and the object terms that are used with those properties in the data.

Implementation

This query identifies the use of properties in the data ([ ?p ?o].) and identifies the classes of the corresponding object terms. These classes can either be asserted through rdf:type for IRIs, or be part of the term itself (extracted with datatype/1) for literals. rdfs:Resource is used as a fallback if no class is specified in the data.
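
The steps above can be approximated in Python as follows. The sketch assumes IRIs are plain strings and literals are instances of a hypothetical Literal tuple. It compares classes by exact match only (no subclass reasoning) and keeps a single declared range per property, which may be stricter than the query.

```python
from typing import NamedTuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
XSD = "http://www.w3.org/2001/XMLSchema#"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    lexical: str
    datatype: str


def range_mismatches(triples):
    """Return (property, declared range, observed class) tuples that disagree."""
    ranges = {s: o for s, p, o in triples if p == RDFS + "range"}
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    mismatches = set()
    for s, p, o in triples:
        if p not in ranges:
            continue
        if isinstance(o, Literal):
            observed = {o.datatype}  # the class is part of the term itself
        else:
            # asserted classes, with rdfs:Resource as the fallback
            observed = types.get(o, {RDFS + "Resource"})
        if ranges[p] not in observed:
            mismatches.update((p, ranges[p], c) for c in observed)
    return sorted(mismatches)


triples = [
    ("ex:age", RDFS + "range", XSD + "integer"),
    ("ex:employer", RDFS + "range", "ex:Organization"),
    ("ex:acme", RDF_TYPE, "ex:Organization"),
    ("ex:alice", "ex:age", Literal("42", XSD + "string")),
    ("ex:alice", "ex:employer", "ex:acme"),
    ("ex:bob", "ex:employer", "ex:unknown"),
]
report = range_mismatches(triples)
```

In this example the string-typed age and the untyped employer ex:unknown are reported, while the employer ex:acme matches its declared range.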

Related

  • The same consistency check can be done for domains: see inconsistency-domain-usage.

Created 4 years ago, 1 version

datatype-properties-with-iris

Data quality aspect

consistency > range > syntax

Shows predicates that are defined in the vocabulary as datatype properties (owl:DatatypeProperty), but that have IRIs in the object position of at least some statements.
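
A minimal Python sketch of this check, assuming IRIs are plain strings and literals are instances of a hypothetical Literal tuple (blank nodes are not modeled):

```python
from typing import NamedTuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_DATATYPE_PROPERTY = "http://www.w3.org/2002/07/owl#DatatypeProperty"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    lexical: str
    datatype: str = "http://www.w3.org/2001/XMLSchema#string"


def datatype_properties_with_iris(triples):
    """Return datatype properties that have an IRI as object in at least one triple."""
    datatype_props = {
        s for s, p, o in triples
        if p == RDF_TYPE and o == OWL_DATATYPE_PROPERTY
    }
    return sorted({
        p for s, p, o in triples
        if p in datatype_props and isinstance(o, str)  # plain strings are IRIs here
    })


triples = [
    ("ex:homepage", RDF_TYPE, OWL_DATATYPE_PROPERTY),
    ("ex:name", RDF_TYPE, OWL_DATATYPE_PROPERTY),
    ("ex:alice", "ex:name", Literal("Alice")),
    ("ex:alice", "ex:homepage", "https://example.org/alice"),
]
offenders = datatype_properties_with_iris(triples)
```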

Created 4 years ago, 1 version

empty-lexical-forms

Data quality issue

Correctness > syntax > null

In traditional data paradigms it was often required to enter a value, even if the value was not present for a certain object. In linked data there is no reason to use null values anymore, and the use of null values is often merely a byproduct of old data sources and/or old habits.

Purpose

This query enumerates the empty literals that appear in a dataset.
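
In Python terms the check is a simple filter. The sketch below assumes literals are instances of a hypothetical Literal tuple. The ignore_whitespace option is an addition of this sketch and also treats whitespace-only values as empty.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    lexical: str
    datatype: str = "http://www.w3.org/2001/XMLSchema#string"


def empty_literals(triples, ignore_whitespace=False):
    """Return the triples whose object is a literal with an empty lexical form."""
    result = []
    for s, p, o in triples:
        if isinstance(o, Literal):
            text = o.lexical.strip() if ignore_whitespace else o.lexical
            if text == "":
                result.append((s, p, o))
    return result


triples = [
    ("ex:a", "ex:comment", Literal("")),
    ("ex:b", "ex:comment", Literal("   ")),
    ("ex:c", "ex:comment", Literal("fine")),
]
strictly_empty = empty_literals(triples)
blank_or_empty = empty_literals(triples, ignore_whitespace=True)
```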

Created 4 years ago, 1 version

object-properties-with-literals

Data quality issue

consistency > range > syntax

Query purpose

This query shows the predicates that are defined in the vocabulary as object properties (owl:ObjectProperty), but that have literals in the object position of at least some statements.

Created 4 years ago, 1 version

doubles-that-could-be-integers

Data quality issue

Incorrectness > semantic > term > numeric

Datasets sometimes define their numeric data incorrectly at the term level. There is an important distinction between decimal numbers (including integers) and floating-point numbers. Both are defined in XML Schema 1.1: Datatypes. It is especially common to represent decimal numeric data using floating-point numbers.

Purpose

This query gives an overview of the properties that are likely to be using floating-point numbers to represent decimal numeric data.

Implementation

This is done by automatically converting each double (xsd:double) to an integer (xsd:integer), and back to a double again. If no information was lost, the double could have been modeled as an integer.
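
The round trip can be illustrated in Python, whose float type is a double-precision floating-point number. The function name is hypothetical, and the special values INF, -INF and NaN are treated as non-convertible, which is a choice of this sketch.

```python
import math


def could_be_integer(lexical: str) -> bool:
    """Check whether a double survives the double -> integer -> double round trip."""
    value = float(lexical)  # parse the xsd:double lexical form
    if not math.isfinite(value):
        return False  # INF, -INF and NaN have no integer counterpart
    return float(int(value)) == value


checks = {form: could_be_integer(form) for form in ["3.0E0", "1.0E2", "2.5", "NaN"]}
```

Here "3.0E0" and "1.0E2" survive the round trip, while "2.5" loses its fractional part and is therefore not a candidate.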

Created 4 years ago, 2 versions

datatypes-that-could-be-objects

This query enumerates datatype properties that have a relatively small number of unique values. Such properties might be better modeled as object properties, and their values as IRIs. This is computed heuristically, based on the ratio between unique and non-unique literal occurrences.
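
One possible reading of this heuristic, sketched in Python: for each property, divide the number of distinct literal values by the total number of literal occurrences, and report properties with a low ratio. The exact ratio used by the query is not stated, so this formula, the max_ratio threshold and the min_uses cutoff are all assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    lexical: str
    datatype: str = "http://www.w3.org/2001/XMLSchema#string"


def low_cardinality_properties(triples, max_ratio=0.25, min_uses=3):
    """Return {property: distinct/total ratio} for properties with few unique values."""
    values = defaultdict(list)
    for s, p, o in triples:
        if isinstance(o, Literal):
            values[p].append(o)
    result = {}
    for p, vals in values.items():
        ratio = len(set(vals)) / len(vals)
        if len(vals) >= min_uses and ratio <= max_ratio:
            result[p] = ratio
    return result


statuses = ["open"] * 6 + ["closed"] * 4
triples = []
for i, status in enumerate(statuses):
    triples.append((f"ex:issue{i}", "ex:status", Literal(status)))
    triples.append((f"ex:issue{i}", "ex:title", Literal(f"Issue {i}")))

candidates = low_cardinality_properties(triples)
```

In this example ex:status has 2 distinct values over 10 occurrences and is reported, while ex:title has a distinct value per occurrence and is not.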

Created 4 years ago, 1 version

encoding-issues

Encoding issues are introduced when text is saved with an encoding other than Unicode (UTF-8).
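
How the query detects such issues is not described here. As a hedged illustration, the Python sketch below looks for two common symptoms: the Unicode replacement character (U+FFFD), and text that turns into different valid text when its characters are re-encoded as Windows-1252 and decoded as UTF-8. The function name and the choice of Windows-1252 are assumptions, and the heuristic can miss cases or, rarely, flag legitimate text.

```python
def looks_misencoded(text: str) -> bool:
    """Heuristically flag text that shows signs of an encoding mix-up."""
    if "\ufffd" in text:
        return True  # a decoder already gave up on some bytes
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return False  # not representable, or not valid UTF-8 after re-encoding
    return repaired != text


samples = {
    "caf\u00c3\u00a9": looks_misencoded("caf\u00c3\u00a9"),  # UTF-8 bytes read as Windows-1252
    "caf\u00e9": looks_misencoded("caf\u00e9"),              # correctly encoded text
    "caf\ufffd": looks_misencoded("caf\ufffd"),              # replacement character
    "plain": looks_misencoded("plain"),
}
```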

Created 4 years ago, 1 version

percent-encoding-in-iris

Created 4 years ago, 1 version

true-and-false

Created 4 years ago, 1 version

date-time-outliers

Created 4 years ago, 1 version