Matching-Norway-data

Matching data from RailML Norway to the ERA knowledge graph.

The goal is to map the linked data from the RailML Norway dataset to the ERA knowledge graph. Four methods are available, listed here in order of decreasing accuracy.

  • Administrative linking, using an administrative code or number that links two entities from different datasets. This is the strongest link.
  • String matching, using strings such as labels. This can be effective because it leaves less room for small differences than the two methods below, but it is sensitive to typos and to different spellings of names.
  • Geo matching, using geocoordinates. This can work, but differences in coordinate systems and in accuracy can cause objects to be mismatched.
  • Circumstantial evidence. This method can be effective, but it is usually time-consuming and hard to maintain. It is also error-prone, and there is no standard way of linking based on circumstantial evidence.

At the moment the RailML Norway dataset does not offer an administrative link, because the designators are not available in it. We therefore skip that matching strategy.

Instead, we try the other three matching strategies.

String matching

First up is string matching. We match operational points by comparing their names. For example, if a station is named 'Berkåk St' in the RailML knowledge graph, we look for an operational point with the identical name 'Berkåk St' in the ERA knowledge graph. Exact string matching may return no results, because some words are abbreviated in the RailML knowledge graph and written out in full in the ERA knowledge graph. In this case the RailML knowledge graph writes station names as 'Berkåk St', while the ERA knowledge graph writes them as 'Berkåk stasjon'. To solve this, we add a preprocessing step that replaces 'stasjon' and then match on the processed string.
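The preprocessing step could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names, the abbreviation table, and the choice to rewrite 'stasjon' to 'st' (rather than the other way round) are hypothetical and not taken from the actual pipeline.

```python
# Hypothetical abbreviation table: expanded word -> short form.
# Only the 'stasjon' / 'St' pair comes from the text; extend as needed.
ABBREVIATIONS = {"stasjon": "st"}


def normalize(name: str) -> str:
    """Lowercase the name and rewrite known expanded words to their short form."""
    words = name.strip().lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def match_by_name(railml_names, era_names):
    """Return (railml_name, era_name) pairs whose normalized names are equal."""
    # Index the ERA names by normalized form so each lookup is a dict access.
    era_index = {}
    for era_name in era_names:
        era_index.setdefault(normalize(era_name), []).append(era_name)

    matches = []
    for railml_name in railml_names:
        for era_name in era_index.get(normalize(railml_name), []):
            matches.append((railml_name, era_name))
    return matches


if __name__ == "__main__":
    print(match_by_name(["Berkåk St"], ["Berkåk stasjon"]))
```

Normalizing both sides to the same form keeps the comparison an exact match, so the method stays strict while tolerating the known abbreviation.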

The table below shows the matching score.

We also visualize the matched data on a map. Red dots are the RailML operational points, green dots are the operational points found in the ERA knowledge graph, and blue lines connect the two points of each linked pair. Matched points should lie very close together on the map. Where they do not, the long blue lines make the outliers easy to spot.

First 10 results of string matching

Geo matching

The second method is geo matching. We collect the geocoordinates of the operational points and compute the distance from each element of the first knowledge graph to the elements of the second knowledge graph. There are then two options. The first is to take the closest candidate and suggest it as the match. The second is to use the distance as a gradient and match only objects that are relatively close together. We use the second option in the visuals.
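As a rough sketch of distance-based matching, the example below computes great-circle distances with the haversine formula and keeps the nearest candidate only when it lies within a cut-off. Several things here are assumptions and not part of the original work: both datasets are taken to use latitude and longitude in decimal degrees on the same datum, the 500 m default threshold is arbitrary, and the function names and input shape (dicts of id to coordinate pair) are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def match_by_distance(railml_points, era_points, max_distance_m=500.0):
    """Pair each RailML point with its nearest ERA point, if close enough.

    Both arguments map an identifier to a (lat, lon) tuple. The result is a
    list of (railml_id, era_id, distance_m); the distance is kept so it can
    be used as a gradient when visualizing or ranking the matches.
    """
    matches = []
    for r_id, (r_lat, r_lon) in railml_points.items():
        candidates = (
            (haversine_m(r_lat, r_lon, e_lat, e_lon), e_id)
            for e_id, (e_lat, e_lon) in era_points.items()
        )
        best = min(candidates, default=None)
        if best is not None and best[0] <= max_distance_m:
            matches.append((r_id, best[1], best[0]))
    return matches
```

This compares every pair of points, which is fine for a few thousand operational points. A spatial index would be the usual next step for larger datasets.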

The table below shows the matching score.

We also visualize the matched data on a map. Red dots are the RailML operational points, green dots are the operational points found in the ERA knowledge graph, and blue lines connect the two points of each linked pair. Matched points should lie very close together on the map. Where they do not, the long blue lines make the outliers easy to spot.

First 10 results of geo matching

Circumstantial evidence linking

The third method is circumstantial evidence matching, in which we try to create a match based on circumstantial evidence. Such evidence can take many forms, but for now we use the kilometer signage. This mapping has two problems. First, it is costly: its maintenance burden is higher than that of the other methods, because the mapping relies on custom vocabulary. Second, it does not make use of generic mapping principles and has to be constructed by hand.

The table below shows the matching score.

We also visualize the matched data on a map. Red dots are the RailML operational points, green dots are the operational points found in the ERA knowledge graph, and blue lines connect the two points of each linked pair. Matched points should lie very close together on the map. Where they do not, the long blue lines make the outliers easy to spot. Notice that the blue line here is quite long for one of the possible matches, which makes this a weaker match type than we expected.

First 10 results of circumstantial evidence matching

Combining the matching methods

In some cases data can be matched better by combining methods and using a weighted average to confirm or dismiss a match. To examine whether that is the case here, we combine the three methods described above and use the result to match objects.

We use a combined matching in which each of the three matching systems (geospatial, string matching, and circumstantial evidence) gives a score of 1 if its partial match is successful and 0 if it is not. We then add up the scores, and only objects with a total of 2 or more are considered matches: a score of 3 is an exact match and a score of 2 is a close match.
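The scoring rule can be written down in a few lines. The sketch below assumes each method has already produced a yes/no verdict for a candidate pair. The function names and the "no match" label for scores below 2 are illustrative, not taken from the actual implementation.

```python
def combined_score(string_match: bool, geo_match: bool,
                   circumstantial_match: bool) -> int:
    """Sum the three partial results, each counting 1 on success and 0 otherwise."""
    return int(string_match) + int(geo_match) + int(circumstantial_match)


def classify(score: int) -> str:
    """Translate a combined score into the match categories described above."""
    if score == 3:
        return "exact match"
    if score == 2:
        return "close match"
    return "no match"


if __name__ == "__main__":
    # A pair confirmed by string and geo matching, but not by kilometer signage.
    print(classify(combined_score(True, True, False)))
```

All three methods carry equal weight here. A weighted variant would multiply each partial result by a per-method weight before summing.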

Combined matching strategies results