Standards for linking geospatial data

Thijs Brentjens and Linda van den Brink, Geonovum February 24, 2014


Summary

In our view four things are needed in order to integrate geo-information successfully with the Semantic Web, and a fifth, optional topic should also be addressed.

  1. A compact vocabulary, which could be based on GeoJSON. It has a basic data model and supports a decent set of geometry types. This is probably sufficient for most use cases, except very specialized ones. There should be discussion on whether more is needed, possibly from GeoSPARQL, NeoGeo, and Location Core Vocabulary.
  2. An encoding, which could be GeoJSON for most use cases. Again, it should be discussed if more (e.g. OGC GML, WKT) is needed for advanced use cases.
  3. Topological / spatial operations. GeoSPARQL defines these, but offers a choice between three mathematical definitions of topology. This is overly complex for most cases, and one of the three should be chosen.
  4. Discussion on whether support for different coordinate reference systems is needed. This would hinder interoperability (datasets using different CRSs cannot be easily combined) but is perhaps needed for use cases where a high precision of coordinates is important.
  5. Optionally a fifth topic should also be addressed: Is there a need for compression and/or performance optimization of geometry? In other words, do large sets of coordinates cause problems for geo data on the web and how can this be addressed?

These five things are described in more detail below.

Vocabulary

For the sake of interoperability there should be one standard vocabulary for linked geo data, published by W3C. W3C Basic Geo is too limited because it supports only point geometries; leveraging all the geo data that is out there requires more than points. Several existing vocabularies already offer this. Well-known ones are the GeoSPARQL vocabulary, OGC GML, OGC KML, NeoGeo, Location Core Vocabulary and GeoJSON. They have a lot in common. GeoSPARQL, NeoGeo, GML, KML and GeoJSON describe more than geometry types: they also share a class of objects called Feature, with the same or a similar meaning, and they all have a property called ‘geometry’. Location Core Vocabulary is a bit broader. Where Features are generally objects that have a geometry as a property, Location Core defines the class Location (in fact it reuses Dublin Core Location), which is a spatial region or a named place. It defines properties not only for geometry but also for place names, place identifiers and addresses.

In our opinion GeoJSON is a good starting point. It has a basic data model and supports a decent set of geometry types: "Point", "MultiPoint", "LineString", "MultiLineString", "Polygon", "MultiPolygon", and "GeometryCollection".
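As an illustration of how small this data model is, the following Python sketch writes down the seven geometry types and a minimal Feature. The place name, the coordinates and the helper has_known_geometry are made up for this example and are not part of GeoJSON itself.

```python
# The seven geometry types named in the GeoJSON specification.
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
}

# A minimal Feature: one geometry plus free-form properties.
# The name and the coordinates are illustrative values only.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [5.387, 52.155]},
    "properties": {"name": "Example place"},
}


def has_known_geometry(obj):
    """Return True if obj is a Feature whose geometry type is one of the seven."""
    geometry = obj.get("geometry") or {}
    return obj.get("type") == "Feature" and geometry.get("type") in GEOMETRY_TYPES
```

A geometry type outside this set, such as an arc, would be rejected by the check, which is exactly the limitation discussed next.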

We estimate this is sufficient for all but very advanced use cases. One such advanced case is the Dutch national dataset for large scale topography (which is now being created): it uses arcs, a geometry type that GeoJSON does not support. There should be discussion on whether more than what is currently in GeoJSON is needed. Elements from GeoSPARQL (e.g. more geometry types), NeoGeo, and Location Core Vocabulary (e.g. named places, addresses) should be considered. This should be a joint W3C/OGC activity. The ideal outcome in our eyes would be a W3C vocabulary covering most use cases, and possibly an OGC vocabulary (like GML / GeoSPARQL) covering special, advanced use cases.

Encoding

On January 16th, 2014, W3C published JSON-LD as a Recommendation. JSON-LD is a JSON encoding for Linked Data. Since (Geo)JSON is becoming increasingly popular in the geospatial domain and Linked Data is also a hot topic (at least in the Netherlands), the question arises whether JSON-LD and GeoJSON can be used together.

We tried this in an experimental setup and successfully combined GeoJSON with JSON-LD. GeoJSON-LD could be the web-encoding for linked geospatial data; it could be direct output of (Geo)SPARQL or be used as output of specific APIs.

JSON-LD allows for the use of any vocabulary to encode geometry. Instead of combining it with GeoJSON, it could also be combined with, for example, GeoSPARQL for more specialized GIS use cases. Via GeoSPARQL any GML geometry type can be used. Location Core Vocabulary also takes the approach of offering different possibilities: a geometry may be encoded as a WKT (Well Known Text, see ISO 19125-1) string literal, GML or KML, a GeoSPARQL or Basic Geo geometry class, schema.org RDF, or a geocoded URI. This allows people to select whatever fits them best, but does not help much with gaining interoperability between datasets.

GeoJSON, an extension of JSON for geometry, is in our opinion good enough for a large number of use cases. In its favour, it is a lightweight encoding and less verbose than XML encodings like GML. Support in existing software and platforms is good. After adding a JSON-LD @context to GeoJSON in our experiment, we found that GIS applications with no understanding of JSON-LD could still use the data.
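The idea of adding an @context to GeoJSON can be sketched in a few lines of Python. This is not the setup used in our experiment; the namespace http://example.org/geo-vocab# and the term mappings are hypothetical and only show the mechanism.

```python
import json

# Hypothetical vocabulary namespace, used only for illustration.
VOCAB = "http://example.org/geo-vocab#"


def add_context(feature):
    """Return a copy of a GeoJSON Feature with a JSON-LD @context added."""
    context = {
        "type": "@type",  # alias the GeoJSON "type" member to the JSON-LD keyword
        "Feature": VOCAB + "Feature",
        "Point": VOCAB + "Point",
        "geometry": VOCAB + "geometry",
        "properties": VOCAB + "properties",
        "name": VOCAB + "name",
        # "coordinates" is deliberately left unmapped in this sketch.
    }
    return {"@context": context, **feature}


feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [5.387, 52.155]},
    "properties": {"name": "Example place"},
}
linked = add_context(feature)
print(json.dumps(linked, indent=2))
```

All original GeoJSON members are left untouched, so software that ignores the unknown @context member still sees an ordinary Feature.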

In some cases there is a need to embed geometry as the string value of a property. GeoSPARQL uses this approach: a geometry can be encoded as a GML or WKT literal in an RDF triple. Encoding geometry as WKT could also be useful for embedding geometry in HTML pages, using for example RDFa. WKT is preferred here, since GML is an XML encoding, which can more easily cause encoding problems in HTML if not handled properly.
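To show what such a string looks like, here is a minimal Python sketch that turns a GeoJSON geometry into WKT. The function name to_wkt is our own; it handles only two-dimensional Point, LineString and Polygon geometries and is not a complete converter.

```python
def _coords(sequence):
    """Format a list of [x, y] positions as 'x y, x y, ...'."""
    return ", ".join(f"{x} {y}" for x, y in sequence)


def to_wkt(geometry):
    """Convert a 2D GeoJSON Point, LineString or Polygon to a WKT string."""
    kind = geometry["type"]
    coords = geometry["coordinates"]
    if kind == "Point":
        return f"POINT ({coords[0]} {coords[1]})"
    if kind == "LineString":
        return f"LINESTRING ({_coords(coords)})"
    if kind == "Polygon":
        # Each ring (outer boundary first, then holes) gets its own parentheses.
        rings = ", ".join(f"({_coords(ring)})" for ring in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"unsupported geometry type in this sketch: {kind}")


print(to_wkt({"type": "Point", "coordinates": [5.387, 52.155]}))
# prints: POINT (5.387 52.155)
```

The resulting string contains no markup characters, which is why it can be placed in an HTML attribute or an RDF literal without further escaping.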

Topological / spatial operations

OGC GeoSPARQL, as an extension of W3C SPARQL, defines a vocabulary for asserting and querying topological relations between spatial objects. This is very useful as it allows you to assert / query whether two spatial objects cross each other, one lies within the other, is near another, etc. However the topology clause in GeoSPARQL is parameterized to allow the use of different families of topological relations (Simple Features, RCC8, and Egenhofer; this goes back to different mathematical definitions of what, for example, an intersection is precisely). This seems overly complex and could hinder wide implementation as well as interoperability.

In other standards where these topological relations are used, such as OGC Filter Encoding, there is no such parameterization. We recommend selecting just one of these families of topological relations. Which one it should be still needs to be determined. OGC Filter Encoding (ISO 19143) uses Simple Features (ISO 19125-1); NeoGeo uses RCC8. OGC Filter Encoding is probably the best starting point.
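The effect of the parameterization can be made concrete by listing the relation names that GeoSPARQL defines for each family. The Python structure below is only a sketch for illustration; the helper relations_for is our own, and the geo: prefix is assumed to stand for the GeoSPARQL ontology namespace.

```python
# Topological relation properties defined by GeoSPARQL, grouped by family.
TOPOLOGY_FAMILIES = {
    "Simple Features": [
        "sfEquals", "sfDisjoint", "sfIntersects", "sfTouches",
        "sfCrosses", "sfWithin", "sfContains", "sfOverlaps",
    ],
    "Egenhofer": [
        "ehEquals", "ehDisjoint", "ehMeet", "ehOverlap",
        "ehCovers", "ehCoveredBy", "ehInside", "ehContains",
    ],
    "RCC8": [
        "rcc8eq", "rcc8dc", "rcc8ec", "rcc8po",
        "rcc8tppi", "rcc8tpp", "rcc8ntpp", "rcc8ntppi",
    ],
}


def relations_for(family):
    """Return the prefixed property names a client must know for one family."""
    return [f"geo:{name}" for name in TOPOLOGY_FAMILIES[family]]


total = sum(len(names) for names in TOPOLOGY_FAMILIES.values())
print(total, "relation names across", len(TOPOLOGY_FAMILIES), "families")
```

A dataset that asserts relations from one family and a client that queries with another will not match without additional reasoning, which is the interoperability risk described above.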

Coordinate reference systems

Coordinate reference systems (CRS) are to geo-information what character encodings are to text. If you don’t know which CRS is used, you can’t use the coordinates. Different CRSs exist for a reason: localized CRSs provide more precise coordinates for a certain part of the globe. It is not possible for a global CRS to be as precise, for example because the continental plates move a few centimetres every year. For large scale data and applications this continental drift could be very relevant over time. Take for example the boundary of cadastral parcels. If this drift is not taken into account, there could be issues if parcel boundaries that were established e.g. 10 years ago are overlaid over recently acquired aerial imagery with high accuracy (e.g. 10 cm). There could be visual differences, while the actual situation did not change.
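The size of this effect can be estimated with simple arithmetic. The drift rate of 2.5 cm per year below is an assumed value standing in for "a few centimetres"; the other numbers come from the example above.

```python
# Assumed plate motion; the text only says "a few centimetres every year".
drift_cm_per_year = 2.5
years_since_survey = 10      # parcel boundaries established 10 years ago
imagery_accuracy_cm = 10     # accuracy of the recent aerial imagery

# Accumulated offset if the drift is not taken into account.
offset_cm = drift_cm_per_year * years_since_survey
print(f"offset {offset_cm} cm vs imagery accuracy {imagery_accuracy_cm} cm")
```

Under this assumption the accumulated offset is larger than the accuracy of the imagery, so the mismatch would be visible in an overlay.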

Discussion is necessary on whether support for different coordinate reference systems, geographic as well as projected ones, is needed in linked geo data standards. The possibility to use different CRSs hinders interoperability, because datasets using different CRSs cannot be combined without a complex transformation. On the other hand, this option is perhaps needed for use cases where a high precision of coordinates is important.

Even if this turns out to be necessary, the default should be WGS84 (lat/lon).

In GeoSPARQL it is possible to refer to a CRS, but the reference is part of the geometry literal. If it were a separate property, it would be easier to use the CRS as a selection criterion. This is desirable, for example, when displaying data on a map: data that use different CRSs cannot be combined on a map as they are. In the GeoJSON object model a member ‘crs’ is defined. In GML there is a similar property, ‘srsName’, to indicate the coordinate reference system used. These are good examples of how it should be done.
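When the CRS is a separate member, selecting on it takes only a few lines. The Python sketch below groups GeoJSON datasets by their named ‘crs’ member; the dataset labels and the helper functions are hypothetical, and linked CRS objects are not handled.

```python
from collections import defaultdict

# Assumed default when no "crs" member is present: WGS84 longitude/latitude.
DEFAULT_CRS = "urn:ogc:def:crs:OGC:1.3:CRS84"


def crs_name(dataset):
    """Return the CRS name of a GeoJSON object, or the default if none is given."""
    crs = dataset.get("crs")
    if crs is None:
        return DEFAULT_CRS
    if crs.get("type") == "name":
        return crs["properties"]["name"]
    raise ValueError("only named CRS objects are handled in this sketch")


def group_by_crs(datasets):
    """Map each CRS name to the labels of the datasets that use it."""
    groups = defaultdict(list)
    for label, dataset in datasets.items():
        groups[crs_name(dataset)].append(label)
    return dict(groups)


datasets = {
    "parcels": {
        "type": "FeatureCollection",
        "features": [],
        "crs": {"type": "name", "properties": {"name": "urn:ogc:def:crs:EPSG::28992"}},
    },
    "points_of_interest": {"type": "FeatureCollection", "features": []},
}
print(group_by_crs(datasets))
```

With the CRS inside a geometry literal, the same selection would require parsing every literal first.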

Performance optimization

Geometries, especially lines and polygons, may contain many coordinates. For example, a municipal boundary could easily contain more than 1500 coordinate pairs. Compared to non-geometric properties, this can result in large amounts of data to transfer and process. For an object with a polygon geometry, the coordinates can easily make up 95% of all its data. The question arises whether there is a need for performance optimization and/or compression techniques for large amounts of coordinates. If so, there could also be a need to standardize such a technique, similar to the PNG format for encoding images.
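The share of coordinates in the total size can be checked with a synthetic example. The Python sketch below builds a polygon with 1500 generated coordinate pairs and a few made-up properties, then measures how much of the serialized JSON consists of coordinates; real datasets will give different numbers.

```python
import json
import math


def synthetic_ring(n, cx=5.0, cy=52.0, radius=0.1):
    """Generate a closed ring of n distinct positions on a circle (illustrative data)."""
    points = [
        [round(cx + radius * math.cos(2 * math.pi * i / n), 6),
         round(cy + radius * math.sin(2 * math.pi * i / n), 6)]
        for i in range(n)
    ]
    points.append(points[0])  # close the ring
    return points


feature = {
    "type": "Feature",
    "geometry": {"type": "Polygon", "coordinates": [synthetic_ring(1500)]},
    # Hypothetical attribute values for a municipality.
    "properties": {"name": "Example municipality", "code": "0000", "population": 12345},
}

total_size = len(json.dumps(feature))
coordinate_size = len(json.dumps(feature["geometry"]["coordinates"]))
share = coordinate_size / total_size
print(f"coordinates take {share:.1%} of {total_size} characters")
```

With only a handful of attributes per object, the coordinates dominate the payload, which is the situation described above.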

There are several examples of coordinate compression techniques. The Google Maps API defines an algorithm to compress the coordinate values of a polyline into a single string. Also, the human-readable Well Known Text (WKT) representation of geometry has a binary counterpart, Well Known Binary (WKB), which is much more compact. Both are defined in ISO 19125-1 and used for storage and exchange of geometries. WKT is referenced in several OGC standards.
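As an impression of how such a technique works, the following Python sketch implements the encoded polyline algorithm as we understand it from the Google Maps documentation: coordinates are rounded to five decimals, stored as differences from the previous point, and written as printable ASCII characters in 5-bit chunks. The function name encode_polyline is our own.

```python
def encode_polyline(points, precision=5):
    """Encode a list of (lat, lon) pairs as an encoded polyline string."""
    factor = 10 ** precision
    encoded = []
    prev_lat = prev_lon = 0
    for lat, lon in points:
        lat_i = round(lat * factor)
        lon_i = round(lon * factor)
        # Store each value as the difference from the previous point.
        for delta in (lat_i - prev_lat, lon_i - prev_lon):
            value = delta << 1
            if delta < 0:
                value = ~value  # fold the sign into the lowest bit
            # Emit 5-bit chunks, lowest first; 0x20 marks "more chunks follow".
            while value >= 0x20:
                encoded.append(chr((0x20 | (value & 0x1F)) + 63))
                value >>= 5
            encoded.append(chr(value + 63))
        prev_lat, prev_lon = lat_i, lon_i
    return "".join(encoded)


line = [(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]
print(encode_polyline(line))
# prints: _p~iF~ps|U_ulLnnqC_mqNvxq`@
```

Three coordinate pairs become a string of 27 characters. The encoding is lossy, because precision is limited to five decimals, and it assumes latitude/longitude values.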

References