IRI strategy

Linked data uses IRIs for universally unique identification. The IRI strategy is a specification that determines how new IRIs will be created to denote objects. It is part of the Triply approach to specify the IRI strategy before the first IRI is created.

This chapter introduces the Triply IRI Strategy Template. Triply users can instantiate this template in order to generate their own IRI strategy in a simple way. Triply guarantees that the IRI Strategy Template follows linked data best practices and avoids many of the pitfalls that are commonly encountered by linked data users that roll their own IRI strategy, or that have no IRI strategy at all.

1. Why does my organization need an IRI strategy?

Assigning names to things is one of the most difficult human tasks. The assignment of names is considered an important conceptualization step in scientific discovery and conceptualizing the world in general. The already difficult task of assigning names to things is especially difficult in linked data, where names must be universally unique. Before the invention of the Internet it was not common to think in terms of universal identification. This means that when we create linked data, we often have to come up with a unique name for things that previously had a non-unique name or no name at all.

If your organization uses linked data, it must also have an organization-wide IRI strategy.

It is a linked data best practice for IRIs to not change over time. Once linked data is being published, users rely on the continued availability of the IRIs that appear in it. Changing IRIs later in the process will result in broken applications and angry users.

Not having an organization-wide IRI strategy means accepting that linked data will never be used in serious applications. If your organization does not have an IRI strategy, then you must seriously consider whether the publication of linked data is a good idea at all.

2. Why would I follow the Triply Approach?

The Triply approach states that an IRI strategy should be created at the start of a linked data project.

Creating the IRI strategy early in the process allows it to be used throughout the linked data publication lifecycle. An important component of this is the automatic validation that newly created IRIs follow the IRI strategy. This ensures that wrong IRIs cannot be published in the first place.

Linked data projects that are not run by Triply often create the IRI strategy later in the process. (Sometimes even after some linked data has been published.) The IRI strategy then becomes more of a retrospective description of how IRIs have been created to denote objects in the past. When data has already been published, such a description must then also act as a specification for how future IRIs must be created. Depending on the consistency with which IRIs have been created without the guidance of an IRI strategy, this may or may not be possible. Changing an IRI strategy later, or detecting that an IRI strategy was not followed consistently in the past, is one of the most costly mistakes that an organization can make when applying linked data.

3. Why would I use the Triply IRI Strategy?

Triply has a lot of experience with running linked data projects. The Triply Approach requires that an IRI strategy is created at the beginning of each production-focussed linked data project.

Triply observes that it is difficult for organizations that are new to linked data to create an IRI strategy that is consistent and that avoids common pitfalls.

For this reason, Triply maintains the IRI Strategy Template. This template can be instantiated by entering a small number of text fields. After these text fields have been entered, a full-fledged and organization-specific IRI Strategy is created.

4. IRI strategy components

When we specify an IRI strategy, we specify 6 distinct components. Decomposing the difficult task of creating an IRI strategy into 6 subtasks is observed to simplify the process in practice. The 6 components compose the IRI structure as follows:

{scheme}://{domain}/[{context}/]{type}/{subtype}/{reference}

Notice that the context component ({context}) is the only optional component.

(Notice that the above template presumes that we will follow the HTTP(S) syntax for the IRI string: for example, the ://-part is HTTP(S) specific. This simplification is applied because it is very uncommon to publish linked data under schemes other than HTTP(S).)
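The composition of the six components can be sketched in a few lines of Python. This is only an illustration of the template above, not part of the Triply IRI Strategy Template itself: the function name `build_iri`, its parameter names, and the example values are all assumptions made for this sketch.

```python
from typing import Optional


def build_iri(scheme: str, domain: str, type_: str, subtype: str,
              reference: str, context: Optional[str] = None) -> str:
    """Compose an IRI from the six components; only the context is optional."""
    segments = [type_, subtype, reference]
    if context:
        # A context may span several path segments, e.g. "Triply/example".
        segments.insert(0, context.strip("/"))
    return f"{scheme}://{domain}/" + "/".join(segments)


# Without a context component (hypothetical domain).
print(build_iri("https", "example.com", "id", "building", "123"))
# With a context component.
print(build_iri("https", "triplydb.com", "model", "def", "Building",
                context="Triply/example"))
```

Keeping the context as an optional keyword argument mirrors the fact that it is the only component that may be left out.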

We will address each of the 6 components in a separate subsection.

4.0.1 Blank node-replacing well-known IRIs

There is one kind of IRI that deviates from this generic pattern: blank node-replacing well-known IRIs. These are IRIs that are not used as names, but are used to denote unnamed things. Because these IRIs do not name things, it is preferred to keep them as unreadable as possible, to avoid incorrect human interpretation based on the IRI alone.

The pattern for blank node-replacing well-known IRIs is as follows:

{scheme}://{domain}/.well-known/genid/{reference}
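One way to obtain an unreadable reference is to use a random identifier. The sketch below assumes a random UUID is acceptable as the {reference} value; the text above does not prescribe how the reference is generated, and the function name `mint_genid_iri` is hypothetical.

```python
import uuid


def mint_genid_iri(scheme: str, domain: str) -> str:
    """Return a fresh well-known IRI with an opaque, meaningless reference."""
    # Assumption: a random UUID (32 hexadecimal characters) serves as reference.
    reference = uuid.uuid4().hex
    return f"{scheme}://{domain}/.well-known/genid/{reference}"


print(mint_genid_iri("https", "example.com"))
```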

4.1 The scheme component ({scheme})

The purpose of the scheme component is to determine the way in which the rest of the IRI should be parsed and processed. Specifically for the FTP/FTPS and the HTTP/HTTPS schemes, the scheme also determines whether connecting to the IRI requires establishing a secure TLS connection (TLS is the successor of SSL).

Triply always uses the HTTPS scheme (value: https).

The difference between the HTTP scheme and the HTTPS scheme is that the latter makes use of the TLS security protocol, while the former does not. The transition away from HTTP and towards HTTPS can be broadly observed:

  • Modern web browsers (Chrome, Firefox, Edge, Safari) still support visiting HTTP IRIs, but have started to show warnings when HTTP IRIs are visited. No such warnings are shown when HTTPS IRIs are visited.
  • Modern search engines (Google, Bing) still show HTTP IRIs in their search results, but have lowered their rankings in favour of HTTPS IRIs.
  • Web services that are accessed at a secure HTTPS IRI are not allowed to use insecure HTTP IRIs in their operations. Specifically, it is not allowed to access an HTTP endpoint from a secure HTTPS web service. (Doing so would violate the secure HTTPS context.)

Linked datasets not created by Triply often still use the HTTP scheme (value: http). These datasets effectively introduce two IRIs to denote the same object: an HTTP IRI and its corresponding HTTPS IRI. The cost of maintaining two IRIs is obviously larger than the cost of maintaining one IRI. For example, the identity between HTTP IRIs and HTTPS IRIs must be made explicit in the linked dataset itself, and all APIs that process IRIs must be able to process both the HTTP and the HTTPS variants. Triply does not believe that such a dual HTTP/HTTPS IRI strategy is future-proof.

It is common practice to serve both HTTP and HTTPS IRIs, by letting the former redirect to the latter. While the redirect approach can be applied, and often is applied by Triply, only the HTTPS IRIs are part of the IRI strategy and are used to denote objects. The redirect from HTTP to HTTPS IRIs is only there for transitional purposes and is not part of the IRI strategy. Once the use of HTTP IRIs becomes even less common and less supported, such redirects may be disabled altogether.

The Web is very quickly moving away from HTTP and moving towards HTTPS. The Triply IRI strategy reflects this trend.

The {scheme} component should not include uppercase letters.

4.2 The domain component ({domain})

The purpose of the domain is to relate the online resource published at the IRI to an organization in the physical world. For example, the Dutch government owns the domain overheid.nl. This means that all IRIs that use a subdomain of overheid.nl are officially tied to the Dutch government. In this way, data is tied to its source of trust. Specifically, it is not possible for some other, potentially malicious, organization to publish information under an overheid.nl subdomain.

Examples of domains that were used in IRI strategies created by Triply:

bgt.basisregistraties.overheid.nl
triplydb.com

The {domain} component should not include uppercase letters.
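The lowercase requirement for the {scheme} and {domain} components can be checked mechanically. The following is a minimal sketch that assumes IRIs of the form {scheme}://{domain}/...; the function name is hypothetical. It works on the raw string on purpose, because some IRI parsers silently lowercase the scheme and would hide the problem.

```python
def has_lowercase_scheme_and_domain(iri: str) -> bool:
    """Check that scheme and domain of an HTTP(S)-style IRI have no uppercase letters."""
    scheme, separator, rest = iri.partition("://")
    if not separator:
        return False  # not of the form {scheme}://{domain}/...
    domain = rest.split("/", 1)[0]
    return scheme == scheme.lower() and domain == domain.lower()


# Uppercase letters later in the path (here: "Triply") are not affected by this rule.
print(has_lowercase_scheme_and_domain("https://triplydb.com/Triply/pokemon"))
print(has_lowercase_scheme_and_domain("https://TriplyDB.com/Triply/pokemon"))
```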

4.3 The context component ({context})

The context component is the only component that is optional in an IRI strategy.

Sometimes the domain component ({domain}) alone is not sufficient to set the context in which linked data is created. We illustrate this with examples:

  • The domain component identifies the organization, but the linked data must be associated with a specific department of that organization. The context component allows the department to be identified as part of the IRI strategy.

  • The linked data must be specifically associated with a dataset name. The context component allows the dataset name to be identified as part of the IRI strategy.

  • The {domain} is also used to host the regular organization web site. Data IRIs should not conflict with URLs in the non-data parts of the web site. The context component can be used to distinguish data IRIs from regular, non-data URLs.

For example, Triply datasets that use the default IRI strategy identify the account and the dataset in the context component. The following template is used for IRIs that are part of the Pokemon dataset published by the Triply organization in the TriplyDB.com catalog:

https://triplydb.com/Triply/pokemon/{type}/{subtype}/{reference}

4.4 The type component ({type})

The utility of the type component is to determine what kind of resource is exposed at the IRI.

Triply uses the following types:

id
Used for individuals that are not concepts, classes, properties, shapes, or graphs.
model
Used for classes, collections, concepts, properties, and shapes.

4.5 The subtype component ({subtype})

For some types we want to specify a subtype.

Triply specifies subtypes for the types id and model.

4.5.1 The subtype component for type id

The utility of the subtype component is to partition the space of instance IRIs (type: id). This partition is often necessary, because the reference component for instances is very often not unique across instances of different subtypes.

For example, a dataset may contain records for roads, buildings, and neighborhoods. Every road, building, and neighborhood has a numeric identifier. But because these records are traditionally stored in different relational tables, the same identifier may be used in each of the three tables. By using the subtype components building, neighborhood and road we have a generic approach for creating unique IRIs:

{scheme}://{domain}/id/building/123
{scheme}://{domain}/id/neighborhood/123
{scheme}://{domain}/id/road/123
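The effect of the subtype component can be made concrete with a small sketch. The rows and the base IRI below are hypothetical; the point is only that the same local identifier collides without a subtype and stays unique with one.

```python
# Hypothetical rows: (table name, local numeric identifier).
rows = [("building", 123), ("neighborhood", 123), ("road", 123)]
BASE = "https://example.com"  # assumed {scheme}://{domain}

# Without a subtype, all three rows collapse into one IRI.
without_subtype = {f"{BASE}/id/{local_id}" for _, local_id in rows}
# With the table name as subtype, every row keeps its own IRI.
with_subtype = {f"{BASE}/id/{table}/{local_id}" for table, local_id in rows}

print(len(without_subtype))  # 1
print(len(with_subtype))  # 3
```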

While unique identification is the main reason why Triply applies the subtype component, it also makes low-level RDF notation more readable to a linked data engineer:

prefix building: <{scheme}://{domain}/id/building/>
prefix geo: <http://www.opengis.net/ont/geosparql#>
prefix neighborhood: <{scheme}://{domain}/id/neighborhood/>

building:123 geo:sfWithin neighborhood:123.

From the above it follows that model IRIs, such as definition IRIs (subtype: def) and shape IRIs (subtype: shp), are not partitioned further in this way. Definitions and shapes are already unique and they also often have human-readable reference components.

To simplify the IRI strategy, Triply uses a standardized approach for determining the values for the subtype components:

  1. Find the main class IRI for each instance IRI. Since at least one class must be specified for each instance, such a main class always exists.

  2. Take the local name of the main class IRI and convert its first character to lowercase.
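The two steps above can be sketched as follows. The sketch assumes that the main class IRI is already known and that it uses a slash before the local name; the function name `subtype_from_class` is hypothetical.

```python
def subtype_from_class(class_iri: str) -> str:
    """Derive the subtype from the local name of the main class IRI."""
    # Step 2: take the local name (the part after the last slash) ...
    local_name = class_iri.rstrip("/").rsplit("/", 1)[-1]
    # ... and convert only its first character to lowercase.
    return local_name[:1].lower() + local_name[1:]


print(subtype_from_class("https://example.com/model/def/Building"))  # building
```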

Examples of IRIs that Triply has created, where the subtype is specified according to the above approach:

prefix building: <{scheme}://{domain}/id/building/>
prefix def: <{scheme}://{domain}/model/def/>
prefix neighborhood: <{scheme}://{domain}/id/neighborhood/>

building:123 a def:Building.
neighborhood:123 a def:Neighborhood.

4.5.2 The subtype component for type model

col
IRIs that denote SKOS collections. Used in folksonomies.
con
IRIs that denote SKOS concepts and RDF properties. Used in folksonomies and value lists.
def
IRIs that denote OWL classes and OWL properties. Used in ontologies.
func
IRIs that denote SHACL functions.
rule
IRIs that denote SHACL rules.
scheme
IRIs that denote SKOS concept schemes. Used in folksonomies and value lists.
shp
IRIs that denote SHACL node and property shapes.
target
IRIs that denote SHACL targets.

4.5.3 Well-known IRIs

The subtype for well-known IRIs is genid.

4.6 The reference component ({reference})

The utility of the reference component is to uniquely identify a particular thing.

4.6.1 Reference components with type model

For classes, collections, concepts, properties, and shapes, a human-readable reference component is strongly preferred. This makes it easier to type such IRIs in SPARQL queries.

4.6.1.1 Use of camelCase and CamelCase

The following conventions are widely applied for the reference component of model IRIs:

  • The reference component for classes and node shapes starts with an uppercase letter and applies CamelCase.

  • The reference component for properties and property shapes starts with a lowercase letter and applies camelCase.

When CamelCase and camelCase are applied, each acronym or abbreviation is treated as one word. In the absence of whitespace or punctuation, a sequence of two or more acronyms or abbreviations (an unlikely construct) would otherwise become unidentifiable.
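The conventions above can be sketched as a small helper. This is one reading of the acronym rule, namely that an acronym is capitalized like any other word; the function name and the assumption that the input is a label with words separated by whitespace or punctuation are both hypothetical.

```python
import re


def to_camel_case(label: str, upper_first: bool) -> str:
    """Turn a label into CamelCase (classes) or camelCase (properties)."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    # Every word, acronyms included, keeps only its first letter in uppercase.
    name = "".join(word[:1].upper() + word[1:].lower() for word in words)
    if not upper_first:
        name = name[:1].lower() + name[1:]
    return name


print(to_camel_case("HTTP URL parser", upper_first=True))  # HttpUrlParser
print(to_camel_case("has creator", upper_first=False))  # hasCreator
```

With this treatment the two acronyms in "HTTP URL parser" remain distinguishable, which would not be the case for HTTPURLParser.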

The difference between using CamelCase for classes and camelCase for properties is often useful because a class and a property that links to instances of that class often have the same human-readable name. The following example illustrates this:

prefix album: <{scheme}://{domain}/{context}/id/album/>
prefix def: <{scheme}://{domain}/{context}/model/def/>
prefix person: <{scheme}://{domain}/{context}/id/person/>

album:petSounds a def:Album.
album:petSounds def:hasCreator person:brianWilson.

Notice how the class (def:Album) can be typographically distinguished from the property (def:hasCreator).

4.6.1.2 Equivalent reference components

For node shapes (shp:), the {reference} component is identical to the {reference} component of the corresponding class (def:) or concept (con:) IRI.

For property shapes (shp:), the {reference} component is identical to the {reference} component of the corresponding RDF (con:) or OWL (def:) property IRI.

If a property shape is uniquely connected to one node shape, then the {reference} component of the node shape IRI is also part of the property shape {reference} component.

The following example uses one node shape IRI and one property shape IRI to illustrate this principle:

shp:Person
  a sh:NodeShape;
  sh:closed true;
  sh:ignoredProperties ( rdf:type );
  sh:property
    shp:Person_age,
    #...
    shp:Person_name;
  sh:targetClass foaf:Person.

shp:Person_name
  a sh:PropertyShape;
  #...
  sh:path foaf:name.
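The naming rule for a property shape that belongs to exactly one node shape can be sketched as follows. The underscore separator is inferred from the example above (shp:Person_name) and is not stated as a rule in the text; the function name is hypothetical.

```python
def property_shape_reference(node_shape_reference: str, property_reference: str) -> str:
    """Combine the node shape reference with the property reference."""
    # Assumption: an underscore separates the two parts, as in shp:Person_name.
    return f"{node_shape_reference}_{property_reference}"


print(property_shape_reference("Person", "name"))  # Person_name
```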

4.6.2 Reference components with type id

The reference component for an IRI with type id is the local identifier of a specific thing. Here are some examples, where the local identifier for each thing is "123":

{scheme}://{domain}/id/building/123
{scheme}://{domain}/id/neighborhood/123
{scheme}://{domain}/id/road/123

4.6.2.1 Reference components with type id and subtype graph

IRIs that denote graphs are treated similarly to other IRIs that denote specific things:

prefix graph: <{scheme}://{domain}/id/graph/>

The reference component should be chosen in a way that makes the graph easily identifiable for data management tasks. For a simple dataset, the following values for the reference component are commonly used:

graph:metadata
The one graph that contains dataset metadata.
graph:model
The one graph that contains the folksonomy and/or ontology.
graph:instances
The one graph that contains records describing instance data.

If a dataset is more complex, there may be multiple graphs per category (metadata, model, or instances). In such cases, the reference component should be determined in a dataset-specific way:

{scheme}://{domain}/id/graph/model-core
{scheme}://{domain}/id/graph/model-plus
{scheme}://{domain}/id/graph/instances-2020
{scheme}://{domain}/id/graph/instances-2021
{scheme}://{domain}/id/graph/instances-2022
{scheme}://{domain}/id/graph/instances-2023

4.6.2.1.1 Linkset graphs

A special kind of graph is a linkset. This is a graph that only contains links from a source dataset to a target dataset. Linkset graphs should use the following naming scheme for their {reference} component: graph:{source}-2-{target}

For example, a linkset graph that relates persons in one dataset to Wikidata instances may look as follows: graph:person-2-wikidata
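The linkset naming scheme can be sketched as a small helper that builds the full graph IRI. The function name and the example domain are assumptions made for this illustration.

```python
def linkset_graph_iri(scheme: str, domain: str, source: str, target: str) -> str:
    """Build the IRI of a linkset graph following {source}-2-{target}."""
    return f"{scheme}://{domain}/id/graph/{source}-2-{target}"


print(linkset_graph_iri("https", "example.com", "person", "wikidata"))
```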

4.7 The path component

The path component is the concatenation of the context, the type, the subtype, and the reference components.

This section documents properties of the IRI strategy that are generic to all these components.

  • Superfluous percent-encoding should not be applied.
  • The path component should not be completely empty (use / instead).
  • The dot (/./) and double dot (/../) path segments should not be used.
  • When percent-encoding is used, hexadecimal letters should be in uppercase.
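The four rules above lend themselves to an automated check. The sketch below is an approximation under one stated assumption: "superfluous" percent-encoding is read as the encoding of characters that never need it (ASCII letters, digits, hyphen, dot, underscore, and tilde). The function name and the problem messages are hypothetical.

```python
import re
from typing import List

# Assumption: these characters never need percent-encoding.
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")


def path_problems(path: str) -> List[str]:
    """Return the violations of the generic path rules found in a path."""
    problems = []
    if path == "":
        problems.append("empty path (use / instead)")
    segments = path.split("/")
    if "." in segments or ".." in segments:
        problems.append("dot segment")
    for match in re.finditer(r"%([0-9A-Fa-f]{2})", path):
        hex_digits = match.group(1)
        if hex_digits != hex_digits.upper():
            problems.append(f"lowercase hex in %{hex_digits}")
        if chr(int(hex_digits, 16)) in UNRESERVED:
            problems.append(f"superfluous percent-encoding %{hex_digits}")
    return problems


print(path_problems("/id/building/123"))  # []
print(path_problems("/id/../building/%41"))
```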

5. Automatic validation

Because the IRI strategy is so important, we want to guarantee that every IRI that is created follows the strategy. This avoids expensive mistakes: when non-conforming IRIs are published and used in applications, supporting or deprecating them will result in technical debt.

The Triply IRI Strategy Template guarantees that all IRIs can be validated with SHACL shapes such as the following, which checks the IRIs of all instances of sdo:Person:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix sdo: <https://schema.org/>
prefix sh: <http://www.w3.org/ns/shacl#>
prefix shp: <{scheme}://{domain}/model/shp/>

shp:node_Person
  a sh:NodeShape;
  sh:property shp:node_Person_iriStrategy;
  sh:targetNode sdo:Person.
shp:node_Person_iriStrategy
  a sh:PropertyShape;
  sh:path [ sh:inversePath rdf:type ];
  sh:pattern '^{scheme}://{domain}/id/person/\\S*'.
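The same pattern check can be approximated outside SHACL, for example in a unit test of an ETL pipeline. The sketch below is not a replacement for a SHACL validator: it only instantiates the regular expression from the shape above and applies it with Python's `re` module. The function name and the example domain are assumptions, and unlike the shape, the sketch escapes the dots in the domain so that they match literally.

```python
import re
from typing import Pattern


def id_pattern(scheme: str, domain: str, subtype: str) -> Pattern[str]:
    """Instantiate the IRI pattern for instances of one subtype."""
    return re.compile(
        rf"^{re.escape(scheme)}://{re.escape(domain)}/id/{re.escape(subtype)}/\S*"
    )


person_pattern = id_pattern("https", "example.com", "person")
for iri in ["https://example.com/id/person/123",
            "https://example.com/id/building/123",
            "http://example.com/id/person/123"]:
    # search() with a leading ^ only accepts IRIs that start with the prefix.
    print(iri, bool(person_pattern.search(iri)))
```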

6. Alignment with other initiatives

The Triply IRI strategy implements the requirements of the URI strategy for the Dutch government (https://www.pldn.nl/wiki/Boek/URI-strategie).

Appendix A. Typical prefix declarations

This section contains examples of an imaginary IRI strategy with the following component choices:

  • scheme: https
  • domain: triplydb.com
  • context: Triply/example

prefix bnode: <https://triplydb.com/.well-known/genid/>
prefix col: <https://triplydb.com/Triply/example/model/col/>
prefix con: <https://triplydb.com/Triply/example/model/con/>
prefix def: <https://triplydb.com/Triply/example/model/def/>
prefix graph: <https://triplydb.com/Triply/example/id/graph/>
prefix shp: <https://triplydb.com/Triply/example/model/shp/>

prefix cat: <https://triplydb.com/Triply/example/id/cat/>
prefix mat: <https://triplydb.com/Triply/example/id/mat/>

Appendix B. Never use a hash (#)

Some IRI strategies use a hash (#) instead of a slash (/) before the last part (local name) of the IRI. Triply thinks that this approach is wrong: a slash should be used in all cases and a hash should never be used. This appendix explains why.

What does a hash in an IRI mean?

The IRI standard allows a fragment component to be used. The start of a fragment component can be identified syntactically by a hash character (#). For HTTP(S) IRIs the fragment component is processed in a special way. Let us take as an example IRI https://example.com/model/def#someProperty as requested by some user in a popular client (e.g. a web browser):

  1. Before doing anything else, the client removes the fragment component (#someProperty) from the IRI and sends the resulting truncated IRI (https://example.com/model/def) to the server.
  2. The server only receives the truncated IRI and never sees the fragment component. It is fundamentally unable to determine that the user is interested in 'someProperty'. (This is not a property of modern clients / web browsers but is behavior that is mandated by the HTTP(S) standard.)
  3. The server does the only thing it can do: it returns the representation associated with the truncated IRI. Before there were triple stores, the server was a plain file server most of the time. In those historic instances, the server would return the file denoted by https://example.com/model/def. This file would typically be a text file in a standardized RDF format that encoded the full dataset, including all hash-containing IRIs that occur in that dataset, and therefore also IRI https://example.com/model/def#someProperty.
  4. Once the client retrieves the full dataset from the server, it must present the full dataset to the user. This is independent of the size of the full dataset.
  5. After the client has presented the full dataset to the user, the client is recommended to focus on that part of the full dataset that contains the content that corresponds with the fragment component. In modern web browsers, this is typically only implemented for content encoded in the HTML serialization format (in such cases, web browsers typically scroll to the part of the HTML document where the fragment component occurs). In modern web browsers and in the vast majority of other clients, absolutely nothing is done in case the content is not encoded in the HTML serialization format. Specifically, if the full dataset is an RDF text file, the top of the RDF text file is shown to the user.
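The first step of this process can be reproduced with Python's standard library, which splits an IRI into the part that is sent to the server and the fragment that stays on the client. This is only a sketch of the client-side truncation; no request is actually made.

```python
from urllib.parse import urldefrag

requested = "https://example.com/model/def#someProperty"
sent_to_server, fragment = urldefrag(requested)

print(sent_to_server)  # https://example.com/model/def
print(fragment)  # someProperty
```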

What are the downsides of using a hash in an IRI?

The use of a hash in IRIs introduces a large set of assumptions. This set of assumptions is inconsistent with the vast majority of linked data applications today. These are the introduced assumptions:

  1. Using a hash in IRIs assumes that the IRI without the hash serves the full content of the dataset, containing full descriptions of all hash-containing IRIs with that prefix. This assumption is often true for RDF stored in a single file on a plain file server. This assumption is almost always false for any other data publication system, including all sophisticated data publishing systems that use a column store, relational store, triple store, or document store. This means that an IRI strategy that uses hashes significantly narrows down the possible implementations of a linked data system that publishes data according to that IRI strategy.
  2. Using a hash in IRIs assumes that the server does not return results in pages (pagination). A server that returns paginated results will often not return the full content of a dataset in the first request. This immediately violates the implementation of the fragment component as standardized by HTTP(S), which requires that the full dataset is returned in the first successful reply.
  3. Using a hash in IRIs assumes that the dataset in which all hash-containing IRIs occur is small enough to be downloaded from a server all at once and fully searched through by an unsophisticated client. Specifically, the client cannot be assumed to use advanced forms of indexing to ensure speedy retrieval of the fragment. This means that retrieval may be a little bit faster for small vocabularies that are cached well by the client and whose terms are often looked up by the user, and that retrieval is slower or impossible in all other scenarios.
  4. Using a hash in IRIs assumes that the small dataset in which all hash-containing IRIs occur will remain small forever. If new hash-containing IRIs are regularly added to the dataset, retrieval times will increase over time.

Why not mix the use of hashes and slashes?

Some linked data experts claim that even though hashes should not be used for all IRIs in a dataset, hashes should still be used for some IRIs in a dataset. Specifically, if a dataset contains a relatively small number of properties and classes, then these should use hashes in their IRIs, and all other IRIs should use slashes. This allows the only benefit of hashes in IRIs (point 3 above, which is truly minor and maybe not even measurable) to be used without immediately introducing scalability, standardization, and implementation issues. Triply is also opposed to this approach, for the following reasons:

  1. Documentation about the collection of datasets must include an explanation of when a hash is used and when a slash is used. Issuers of new IRIs must be aware of this documentation and must not accidentally create slash-containing IRIs for a namespace that was previously used with hash-containing IRIs, or create hash-containing IRIs for a namespace that was previously used with slash-containing IRIs.
  2. It is common practice to match, extract, replace, or otherwise process the part of an IRI that occurs after the last slash or hash ('local name'). Any code that does this for slashes and hashes will be more complex than the equivalent code that only does this for slashes.
  3. An infrastructure for publishing data for some hash-containing IRIs and some slash-containing IRIs is inherently more complex and limiting than an infrastructure that publishes all and only slash-containing IRIs. The parts of a dataset that use hash-containing IRIs must be served as full files. The parts of a dataset that use slash-containing IRIs often cannot be served as full files. This almost inevitably results in a split of the infrastructure implementation into two largely disconnected technical components. The design, development, maintenance, and documentation of these two technological components will increase the cost of publishing linked data enormously.
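The second reason can be illustrated with a sketch of local name extraction. Both function names are hypothetical. The slash-only variant is a single split, while the mixed variant has to look for two different delimiters and pick the later one.

```python
def local_name_slash_only(iri: str) -> str:
    """Local name when the IRI strategy only uses slashes."""
    return iri.rsplit("/", 1)[-1]


def local_name_mixed(iri: str) -> str:
    """Local name when hashes and slashes are mixed."""
    # The local name starts after whichever delimiter occurs last.
    cut = max(iri.rfind("/"), iri.rfind("#"))
    return iri[cut + 1:]


print(local_name_slash_only("https://example.com/model/def/someProperty"))
print(local_name_mixed("https://example.com/model/def#someProperty"))
# Applying the simple variant to a hash IRI gives the wrong answer.
print(local_name_slash_only("https://example.com/model/def#someProperty"))
```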

Are there any benefits to using hashes in IRIs?

Triply is not aware of any benefits of using hashes in IRI strategies.