OSGeodata Discovery

From GISpunkt HSR

See also: Geospatial Search (geodata and geo-webservices search) / Raumbezogene Suche (Geodaten- und Geodienste-Suche)

Need for geographic metadata exchange

Geoinformation content needs to be published and disseminated before users can find it.

The following are some thoughts on supporting the discovery and search of georesources such as datasets and WxS.

Names such as 'geographic catalog' (catalog service) or 'inventory' are rather data-provider-centric, so we prefer a user-centric approach to the management of geodata and geometadata. See also OSGeodata for further discussion.

First of all, we need a geographic metadata exchange model. Then we need a geographic metadata exchange protocol.

Keywords: Open access to and dissemination of geographic data (geodata) and information; Metadata; Finding, harvesting or discovery of geodata and web map services; Interoperability; Integration; Service binding; Spatial data infrastructure; Standards.

Vision and Scenario

"We don’t know what people will want to do, or how they will want to do it, but we may know how to allow people to do what they want to do when they want to do it". (ockham.org)

User needs

  • End users need search services to discover geographic information.
  • Geodata owners need a metadata management tool (which implies an internal metadata model).
  • Geodata owners and service providers need (i) a metadata exchange model, (ii) an encoding of it, as well as (iii) a protocol for the exchange, dissemination and sharing of metadata.

Geo-metadata network vision

On the web and the internet there are...

  • Web accessible datasets.
  • Web accessible metadata about datasets and/or WxS (WxS imply pointers to data access points).
  • Web accessible datasets and metadata about datasets and/or WxS.
  • A lot of other documents, like WCS, etc.

Given these georesources, on the web and the internet we need...

  • Desktop and Web GIS: Access local datasets, remote datasets and remote WxS through direct launch or "(Remote) Resource open...". Requires online queries to search services.
  • Meta-search engine: A list of georesources, especially ranked metadata records. Quality measures are an obvious way to boost ranking. Requires web crawling.
  • Metadata directory/registry: A (sorted) list of publicly accessible metadata (MD) catalogs, including their records. Allows consolidated searching across catalogs/repositories. Harvesters update their indexes. Quality measures are possible. Requires registration of catalogs/repositories.
  • Metadata catalog (repository): A set of each owner's *own* metadata about datasets and/or WxS. Typically, metadata catalogs include a metadata editor; they are tightly coupled with georesources (aka registration of georesources).

Protocols and information models

The following protocols are under discussion in the context of geo-metadata:

  1. Between GIS clients as requesters and georesource owners, catalogs, registries or search engines, an online query protocol is needed. Candidates are WFS basic or SRU (= ISOxxx = lightweight Z39.50). Both protocols include a bbox filter. A simple information model for metadata would do, but an extension mechanism could probably be built in (as opposed to DC, including DClite4G, which has no clear extension rules).
  2. Between catalogs, directories and search engines as requesters and similar components, a harvesting protocol is needed to move full sets of metadata around. OAI-PMH exists for this. A kind of WFS basic (without filter but with incremental update!) could be a viable alternative. A fixed minimal information model like DClite4G could be a solution. A bbox query is not really a necessity here, as harvesting is done in a pre-processing step, but it could be an option.
  3. Between GIS clients as requesters and georesource owners, large datasets need to be moved around. HTTP/FTP and WFS basic (perhaps with GeoRSS simple *or* GML), WFS (original) or file sharing protocols like BitTorrent could do. These sometimes imply a registry of their own but still lack bbox queries.
  4. Between web crawlers (from search engines) and any georesource there is HTTP/HTTPS with discovery/'boosting' mechanisms.
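
The online query case in item 1 can be illustrated with a minimal Python sketch: a small fixed record with an open extension slot, and a query that combines a text match with a bbox filter. The `MetadataRecord` class, its field names and the sample records are hypothetical and are not taken from WFS, SRU or DClite4G.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Bounding box as (west, south, east, north) in decimal degrees.
BBox = Tuple[float, float, float, float]


@dataclass
class MetadataRecord:
    """Hypothetical minimal exchange record: fixed core fields plus extensions."""
    identifier: str
    title: str
    bbox: BBox
    extensions: Dict[str, str] = field(default_factory=dict)


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch (antimeridian crossing is ignored)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(records: List[MetadataRecord], text: str, bbox: BBox) -> List[MetadataRecord]:
    """Return records whose title contains `text` and whose bbox intersects `bbox`."""
    needle = text.lower()
    return [
        r for r in records
        if needle in r.title.lower() and bbox_intersects(r.bbox, bbox)
    ]


# Made-up sample records with approximate extents.
records = [
    MetadataRecord("ch-001", "Swiss hiking trails", (5.9, 45.8, 10.5, 47.8)),
    MetadataRecord("no-001", "Norwegian hiking trails", (4.5, 57.9, 31.2, 71.2)),
]
hits = query(records, "hiking", (8.0, 47.0, 9.0, 47.5))
print([r.identifier for r in hits])
```

The `extensions` dictionary stands in for the extension mechanism mentioned above; a real model would need rules for which keys are allowed.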

Issues:

  • How does a user search for remote georesources?
  • Is the discovery component (as opposed to a directory) part of a desktop GIS?
  • Where does the searchable geo-metadata reside?
  • Repository managers should be able to disaggregate and re-aggregate metadata records (e.g. about the same dataset) as required.

A preliminary scenario

  • Users search or browse through metadata records. They use a web app or a search component within a desktop GIS (remark: users don't search services per se).
  • 'Search service providers' enable the discovery of geodata and 'filter services', like transformation services (note: WMS is a 'data access service' and belongs to geodata, not to filter services).
  • 'Search service providers' gather (harvest) their information from 'data/filter service providers' and need a protocol to do this.
  • 'Data/filter service providers' offer metadata over this protocol. They typically also implement 'data access services' (WMS, WFS) or they offer 'filter services'.


Architecture: Geodata, metadata and protocols.
Diagram: Architectural sketch of the relationships between a 'metadata exchange model' (as part of a metadata exchange protocol, data provider and search tools) and a 'metadata management model' (as part of metadata management and GIS tools).

Possible realization plan

  1. All: Let's first identify the use cases and then sort out (minimal) metadata for exchange and (exhaustive) metadata for own management.
  2. Service providers and data owners (= all): Let's first decide quickly on the metadata exchange model, together with an incremental specification process for future adaptations. This model probably borrows from Dublin Core with spatial and geo-webservice extensions.
  3. Data owners (and providers) only: Let's decide internally, within an organization, on the metadata management model while adhering to minimal metadata exchange protocols. This model probably borrows from ISO 19115/119 Core, FGDC and of course the minimal metadata exchange model (Dublin Core).
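
As a rough illustration of step 2, the following Python sketch encodes one exchange record with Dublin Core elements plus a spatial extent and a pointer to a geo-webservice. The `record` wrapper element, the use of `dc:relation` for the service URL and the example values are assumptions made for illustration; the extent string follows the DCMI Box style of `northlimit`/`southlimit`/`eastlimit`/`westlimit` components.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def encode_record(identifier, title, west, south, east, north, service_url=None):
    """Serialize a minimal exchange record to an XML string."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in (("identifier", identifier), ("title", title)):
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    # Spatial extension: bounding box written in DCMI Box style.
    coverage = ET.SubElement(record, f"{{{DC_NS}}}coverage")
    coverage.text = (
        f"northlimit={north}; southlimit={south}; "
        f"eastlimit={east}; westlimit={west}"
    )
    # Geo-webservice extension (assumption): link to a data access point.
    if service_url is not None:
        ET.SubElement(record, f"{{{DC_NS}}}relation").text = service_url
    return ET.tostring(record, encoding="unicode")


xml_text = encode_record(
    "ch-001", "Swiss hiking trails", 5.9, 45.8, 10.5, 47.8,
    service_url="https://example.org/wms",  # placeholder URL
)
print(xml_text)
```

A richer internal management model (step 3) could map onto this record by dropping everything that is not needed for exchange.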

Discovery of geodata and services

We are aware of two types of 'protocols', both already implemented and ready to go (given information encoded according to a well-specified model):

  1. Registry
  2. Autodiscovery

Registry

Action is required from geodata owners (either proactive or without indication) to register with service providers (web crawlers, harvesters). Search services can then do more focused crawling. Examples are OAI-PMH, Google Sitemaps or SOAP.

  • Push principle.
  • More load on resource owner, less load in indexing and filtering on service providers.
  • If resource owners register proactively, their agreement is ensured.
  • Examples: OAI-PMH, RSS pinging search services...

Autodiscovery

Geodata owners publish XML files of geometadata (exchange model) once on the web (=HTTP).

  • Pull principle.
  • Requires at least one relation: a URL or a tag like a class.
  • Uses the UI to point to a content-rich, XML-encoded feed and therefore crosses from human mode to machine mode in a way that’s easy for both humans and machines.
  • Autonomous/independent resource owner; no further action is required from them.
  • Possible use of links/relations to point to a (master) whole collection of resources (including pointers to these, like microformats.org).
  • Examples: (X)HTML pages, Extended and XML encoded Dublin Core.
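
The pull principle can be sketched in Python as follows: a crawler reads an (X)HTML page and follows a `<link>` relation to the XML-encoded metadata. The relation value `meta`, the MIME type and the URLs are assumptions chosen for illustration; no convention for geometadata autodiscovery is implied.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetadataLinkFinder(HTMLParser):
    """Collect hrefs of <link> elements that advertise XML metadata."""

    def __init__(self, base_url, rel="meta", mime="application/xml"):
        super().__init__()
        self.base_url = base_url
        self.rel = rel      # assumed relation value, not a standard
        self.mime = mime    # assumed MIME type of the metadata document
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if (self.rel in rels
                and attributes.get("type") == self.mime
                and attributes.get("href")):
            # Resolve relative links against the page URL.
            self.found.append(urljoin(self.base_url, attributes["href"]))


page = """<html><head><title>Geodata</title>
<link rel="stylesheet" type="text/css" href="style.css">
<link rel="meta" type="application/xml" href="/metadata/geodata.xml">
</head><body></body></html>"""

finder = MetadataLinkFinder("https://example.org/data/index.html")
finder.feed(page)
print(finder.found)
```

The resource owner only has to publish the page and the XML file once; everything else is done by whoever crawls the page.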

There are some ideas around like embedding hints into the geometadata exchange model for augmented autodiscovery. See also THUMP from IETF which leads to something between 1. and 2.

About search services

Search services include web-based databases, catalogs and search engines.

A search engine (for geodata discovery) either has a general purpose, with data from a brute-force search (horizontal, breadth first; size is key), or it is based on structured, high-quality topical data (depth first; small is beautiful). The former is a market led by Google (so it's probably not our business), which does not exclude specialized businesses. The latter is originally a specialized market, but this does not exclude that it merges into large ones (tourism, etc.). It takes its input e.g. from focused crawling (cf. Chakrabarti) and/or is built upon partnerships, and it contains authoritative and reliable sources. Both try to provide access to resources from a single entry point. Google cannot do this because it caters for a broad, mainstream user base and covers just about anything on the internet. Its content is mostly unstructured and not vetted (quality proofed).

Why not just letting crawlers find XML data?

(...Or, why a new protocol at all?) Commercial web crawlers are estimated to index only approximately 16% of the total "surface web" and the size of the "deep web" [...] is estimated to be up to 550 times as large as the surface web. These problems are due in part to the extremely large scale of the web. To increase efficiency, a number of techniques have been proposed such as more accurately estimating web page creation and updates and more efficient crawling strategies. [...] All of these approaches stem from the fact that http does not provide semantics to allow web servers to answer questions of the form "what resources do you have?" and "what resources have changed since 2004-12-27?" (from: (Nelson and Van de Sompel et al. 2005)).
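
OAI-PMH answers the second question with a selective harvesting request. As a minimal sketch, the Python function below builds a `ListRecords` request URL with a `from` date; the base URL is a placeholder, and `oai_dc` is the Dublin Core metadata format that OAI-PMH repositories are required to support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    from_date limits the harvest to records changed since that date
    (YYYY-MM-DD); set_spec restricts it to one set of the repository.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# "What resources have changed since 2004-12-27?"
url = list_records_url("https://example.org/oai", from_date="2004-12-27")
print(url)
```

A harvester would remember the date of its last successful run and pass it as `from_date` on the next one, so that only changed records are transferred.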

How to help discovering resources

Both approaches, registry and autodiscovery, need awareness. So in both cases, means of guiding web crawlers to XML-encoded content or services are helpful. What about:

  • Icons on top level home pages with links pointing to metadata?
  • A 'friend' attribute in a (yet to be re-defined ISO 19115) metadata model as described in some OAI guidelines? Note that this attribute is not at the level of a metadata instance/record but at the level of sets (of metadata records).

Distributed Searching vs. Harvesting

Distributed (cross) searching vs. harvesting. There are two possible approaches: (from: Implementing the OAI-PMH - an introduction).

  • Distributed (cross, parallel) searching of multiple archives based on a protocol like Z39.50.
  • Harvesting metadata from federated meta-databases into one or more 'central' services, i.e. bulk-moving the data to the user interface.
  • US digital library experience in this area (e.g. NCSTRL) indicated that cross searching is not the preferred approach: distributed searching of N nodes is viable, but only for small values of N (< 100). It has advantages if you don't want to give away your metadata.

There are the following alternatives to harvesting (remark: all are considered either inappropriate, not established, or at least 'painful' compared to OAI-PMH):

  • Distributed searching:
    • WFS - the OGC thing - designed for geographic vector data exchange services
    • Z39.50 and SRW - the bibliographic data exchange protocol things (SRW is the more lightweight and 'modern' successor of Z39.50)
  • 'Find and bind':
    • UDDI - the SOAP thing

Problems with distributed searching: (from: Implementing the OAI-PMH - an introduction)

  • Collection description: How do you know which targets to search?
  • Query-language problem: Syntax varies and drifts over time between the various nodes.
  • Rank-merging problem: How do you meaningfully merge multiple result sets?
  • Performance: tends to be limited by slowest target; difficult to build browse interface
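
Two of these problems, performance and rank merging, can be made concrete with a small Python simulation. The two targets, their delays and their score scales are made up, and dividing by each target's best score is only one simple heuristic (it assumes positive scores), not a method recommended by the source.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_target(name, delay_s, scored_ids):
    """Build a fake search target (hypothetical stand-in for a remote catalog)."""
    def search(query):
        time.sleep(delay_s)  # simulated network and processing latency
        return [(name, record_id, score) for record_id, score in scored_ids]
    return search


def cross_search(targets, query):
    """Query all targets in parallel; returns only when the slowest has answered."""
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        return list(pool.map(lambda target: target(query), targets))


def merge_raw(result_sets):
    """Naive merge: sort by raw score, although the scales are not comparable."""
    merged = [hit for result_set in result_sets for hit in result_set]
    return sorted(merged, key=lambda hit: hit[2], reverse=True)


def merge_normalized(result_sets):
    """Heuristic merge: divide each score by the best score of its own target."""
    merged = []
    for result_set in result_sets:
        if not result_set:
            continue
        top = max(score for _, _, score in result_set)
        merged.extend((src, rid, score / top) for src, rid, score in result_set)
    return sorted(merged, key=lambda hit: hit[2], reverse=True)


targets = [
    make_target("A", 0.01, [("a1", 0.9), ("a2", 0.3)]),    # scores in 0..1
    make_target("B", 0.05, [("b1", 80.0), ("b2", 40.0)]),  # scores in 0..100
]
start = time.perf_counter()
result_sets = cross_search(targets, "hiking")
elapsed = time.perf_counter() - start

raw_order = [rid for _, rid, _ in merge_raw(result_sets)]
normalized_order = [rid for _, rid, _ in merge_normalized(result_sets)]
print(raw_order, normalized_order, f"{elapsed:.3f}s")
```

In the raw merge every hit from target B outranks target A's best hit simply because B scores on a larger scale, and the total response time is bounded below by the slower target. Harvesting avoids both effects by indexing all records in one place beforehand.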

Weblinks

Geographic or geospatial information discovery and search services: