Overview
This article will explain how to query ordered lists in SPARQL.
Introduction
Knowledge graphs sometimes need to store lists that contain elements in a meaningful order. For example, a cooking recipe may require performing the following steps:
- Mix the dough.
- Heat the oven.
- Put the dough inside the oven.
When we query the steps of a recipe like this, we must preserve their order. After all, things would go wrong if we put the dough inside an unheated oven, or mix the dough only after it has been warmed in a heated oven.
Lists in knowledge graphs
The RDF standard introduces the rdf:List
class for representing ordered lists. Since rdf:List
is the most widely used class for this purpose, we will focus on its use for representing lists in this article.
Instances of the rdf:List
class follow a linked list structure. This means that every sublist is represented by a node in the knowledge graph with two outgoing edges:
rdf:first
points to the first element in the sublist.rdf:rest
points to the rest of the sublist.
Notice that for finite lists we will reach the end of the list at some point. In a linked list, the end of the list is always represented by the empty list. The RDF standard includes the rdf:nil
list for this purpose.
We can use the RDF standard as explained above to represent our recipe example. Notice that we use one node to represent the recipe itself (recipe:123
), 4 nodes to represent the (sub)lists (list:1
, list:2
, list:3
, rdf:nil
), and 3 nodes to represent the steps:
We can encode this representation in the Turtle or TriG serialization formats as follows:
id:recipe123
a def:Recipe;
def:steps list:1.
list:1
a rdf:List;
rdf:first 'Mix the dough.'@en;
rdf:rest list:2.
list:2
a rdf:List;
rdf:first 'Heat the oven.'@en;
rdf:rest list:3.
list:3
a rdf:List;
rdf:first 'Put the dough inside the oven.'@en;
rdf:rest rdf:nil.
Notice that the linked list structure is very generic. For instance, an element can itself be a list, allowing arbitrarily nested lists to be represented with ease.
But the linked list format is also a little verbose: it requires us to write the links between the sublists explicitly. For this reason, the Turtle and TriG formats support an abbreviated list notation. When we use the abbreviated list notation, we can write Snippet 1 as follows:
id:recipe123
a def:Recipe;
def:steps
( 'Mix the dough.'@en
'Heat the oven.'@en
'Put the dough inside the oven.'@en ).
Notice that Snippet 2 is a bit easier to read and write than Snippet 1. But it also gives us less control over how the sublists are named. Specifically, arbitrary Skolem IRIs will be used to represent the sublists, whereas previously we were able to use the purposefully chosen IRIs list:123
, list:23
, and list:3
to represent the sublists.
Querying lists naively
Let's assume we have a knowledge graph that contains recipes that use lists to represent their required steps. One may naively think that such lists can now be queried with SPARQL in the following way:
select ?step {
recipe:123 def:steps/rdf:rest*/rdf:first ?step.
}
Let's unpack what happens in this query:
- We want to retrieve the steps required by a specific recipe. We therefore include variable
?step
in the projection (first line) and specify the IRIrecipe:123
in the second line. - We use a property path to traverse the various properties that are used in the list representation (see Diagram 1). Notice that we first traverse the
def:steps
property, then therdf:rest
property, and finally therdf:first
property. The/
-operator is used to compose this sequence of properties into one path. - Depending on the depth of the elements in the list, the
rdf:rest
property must be traversed zero or more times. This is indicated by the*
-operator.
Unfortunately, Query 1 does not work. Or worse: it may seem to work for some lists, but may not work for others. The SPARQL standard does not guarantee the order in which property paths are evaluated. This means that a standard-conforming SPARQL engine may return the last list element first, or the middle element first, or the first element first.
Let us briefly reflect on why the SPARQL standard does not mandate an evaluation order for property paths. An important task of SPARQL implementations is to evaluate SPARQL queries quickly. Speed is achieved by reordering the query so that the fastest operation is executed first. For Query 1 this means that the SPARQL implementation may rewrite the query in the following way:
select ?step {
recipe:123 def:steps ?x.
?x rdf:rest* ?y.
?y rdf:first ?step.
}
Now let's suppose that the SPARQL implementation determines that the last triple pattern is the cheapest of the 3 to execute. It will then reorder the query as follows:
select ?step {
?y rdf:first ?step.
?x rdf:rest* ?y.
recipe:123 def:steps ?x.
}
Notice that the SPARQL implementation will now start by evaluating the triple pattern with property rdf:first
. This may return any of the steps in our dataset, not necessarily the first step. Since SPARQL implementations must be given the freedom to evaluate queries in the most optimal way possible, the SPARQL standard cannot mandate an evaluation order. This is unproblematic for data that is unordered but causes problems whenever data is ordered (as is the case in RDF lists).
Querying lists correctly
In this section we show how you can correctly query ordered lists in SPARQL.
Staying with our example list, let's support we want to generate HTML widgets that enumerate the steps required for a specific recipe. We want to build a function in SPARQL that has the following signature: (rdf:List
) → rdf:HTML
. We can define this signature in SHACL Advanced:
prefix listf: <https://triplydb.com/Triply/list/model/func/>
prefix list-param: <https://triplydb.com/Triply/list/model/param/>
listf:elements
a sh:SPARQLFunction;
sh:parameter list-param:list;
sh:returnType rdf:HTML;
sh:select '...'.
list-param:list
a sh:Parameter;
sh:class rdf:List;
sh:order 1;
sh:path list-param:list.
Notice the following details:
- We define an instance of
sh:SPARQLFuncton
. - We define a parameter list with
sh:parameter
. - We define individual parameters as instances of
sh:Parameter
. - We define the return type with
sh:returnType
.
The only thing that remains to be specified is a SPARQL query string (sh:select
) that implements this function signature. The essence of our query is the insight that the index at which an element occurs in a list is equal to the number of intermediate sublists to the start of the list (see Diagram 1). The number of sublists can be counted using the aggregate function count
. Ordering the elements based on this count will return the list in the correct order.
select (strdt(concat('<ol><li>',group_concat(str(?element);separator='</li><li>'),'</li></ol>'),rdf:HTML) as ?elements) {
{
select ?element {
$list rdf:rest* ?sublist.
?sublist rdf:rest*/rdf:first ?element.
}
group by ?element
order by (count(?sublist))
}
}
Notice the following details:
- We use function
strdt
to create a type literal with datatyperdf:HTML
. - We use function
concat
and aggregate functiongroup_concat
in the projection to create a valid ordered list serialization in HTML5. - We use a sub-select to order the elements internally.
- We use the
order by
clause to perform the ordering. - We use property path notation to traverse the linked list structure (using the
*
-operator and the/
-operator).
Most importantly, notice that we only used standardized features from RDF, SPARQL, and SHACL.
When we insert Query 1 into the sh:select
literal of Snippet 3, we finalize our implementation of the SPARQL Function. When we load this snippet into a standards-conforming SPARQL implementation, we can retrieve the ordered list of elements as follows:
prefix listf: <https://triplydb.com/Triply/list/model/func/>
select (listf:elements(?stepsList) as ?stepsHtml) {
recipe:123 def:steps ?stepsList.
}
Conclusion
Ordered lists are a common requirement in knowledge representation. We saw that it is possible to represent ordered lists using the RDF and SPARQL standards. Some SPARQL implementations introduce proprietary extensions that allow lists to be queried in a supposedly simpler way. However, these extensions are not really necessary. The standardized linked data features are sufficient for querying ordered lists.