OSGeodata Discovery

Need for geographic metadata exchange

Geographic catalog (catalog service) or inventory are rather data-provider-centric names, so we prefer a user-centric approach to the management of geodata and geometadata. See also OSGeodata for further discussion.

First of all we need a geographic metadata exchange model. Then we need a geographic metadata exchange protocol.
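
As an illustration only (the exchange model itself is still to be defined), such a record might be encoded as simple Dublin Core XML. A minimal sketch in Python; the element choice, the sample values, and the bounding-box convention in dc:coverage are assumptions, not a specification:

    # Illustrative sketch: encode one geometadata record as Dublin Core XML.
    # Element choice and the 'dc:coverage' bounding-box convention are assumed.
    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    record = ET.Element("record")
    for name, value in [
        ("title", "Cadastral map of Rapperswil"),  # made-up sample data
        ("creator", "HSR Geoinformation Lab"),
        ("type", "Dataset"),
        ("coverage", "8.80 47.21 8.84 47.23"),     # assumed WGS84 bounding box
        ("identifier", "http://example.org/geodata/rapperswil-cadastre"),
    ]:
        ET.SubElement(record, "{%s}%s" % (DC, name)).text = value

    print(ET.tostring(record, encoding="unicode"))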

Thoughts: Links serve three purposes: (1) discovery (what pages exist), (2) reputation (how important a page is), and (3) annotation (what a page is about).

Keywords: Open access to and dissemination of geographic data (geodata) and information; Metadata; Finding, harvesting or discovery of geodata and web map services; Interoperability; Integration; Service binding; Spatial data infrastructure; Standards.

Vision and Scenario

"We don’t know what people will want to do, or how they will want to do it, but we may know how to allow people to do what they want to do when they want to do it". (ockham.org)

  • Users need search services to discover geographic information.
  • Geodata owners need a metadata management tool (which implies an internal metadata model).
  • Geodata owners and service providers need (i) a metadata exchange model, (ii) an encoding of it, as well as (iii) a protocol for the exchange, dissemination and sharing of metadata.

A preliminary scenario:

  • Users search or browse through metadata records. They use a web app or a search component of a desktop GIS (remark: users don't search for services per se).
  • 'Search service providers' enable the discovery of geodata and 'filter services', such as transformation services (note: WMS is a 'data access service' and belongs to geodata, not to filter services).
  • 'Search service providers' gather (harvest) their information from 'data/filter service providers' and need a protocol to do this.
  • 'Data/filter service providers' offer metadata over this protocol. They typically also implement 'data access services' (WMS, WFS) or they offer 'filter services'.

Discovery of geodata and services

We are aware of two types of 'protocols', both already implemented and ready to go (given information encoded according to a well-specified model):

  1. Registry
  2. Autodiscovery

Registry

Action is required from geodata owners (either proactive or without indication) to register with service providers (web crawlers, harvesters). Search services can then do better-focused crawling. Examples are OAI-PMH, Google Sitemaps or SOAP. A minimal harvesting sketch follows the list below.

  • Push principle.
  • More load on resource owner, less load in indexing and filtering on service providers.
  • If registered proactively, agreement from resource owners can be ensured.
  • Examples: OAI-PMH, RSS pinging search services...
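
A minimal harvesting client, as a sketch: it pages through a repository's ListRecords responses, following resumption tokens. The base URL is a placeholder, not a real repository:

    # Sketch of an OAI-PMH harvester: fetch all Dublin Core records,
    # following resumptionToken paging. The base URL is a placeholder.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url):
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                tree = ET.fromstring(resp.read())
            for rec in tree.iter(OAI + "record"):
                yield rec
            token = tree.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break  # no more pages
            # Follow-up requests carry only the verb and the token.
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    for record in harvest("http://example.org/oai"):  # placeholder repository
        print(record.findtext(OAI + "header/" + OAI + "identifier"))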

Autodiscovery

Geodata owners publish XML files of geometadata (following the exchange model) once on the web (i.e., over HTTP). A sketch of the client side follows the list below.

  • Pull principle.
  • Requires at least one relation - a URL or a tag such as a class.
  • Uses the UI to point to a content-rich, XML-encoded feed, and therefore crosses from human mode to machine mode in a way that is easy for both humans and machines.
  • Autonomous/independent resource owner; no further action required from him.
  • Possible use of links/relations to point to a (master) collection of all resources (including pointers to these), as at microformats.org.
  • Examples: (X)HTML pages, extended and XML-encoded Dublin Core.
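
A sketch of the client side of autodiscovery: scan a published (X)HTML page for a <link> element that advertises an XML-encoded metadata document, much as RSS/Atom autodiscovery does. The accepted rel/type values are one plausible convention, not a fixed standard:

    # Sketch: find <link> elements pointing to XML-encoded geometadata.
    # The accepted rel/type values are an assumption, not a standard.
    from html.parser import HTMLParser

    XML_TYPES = {"application/xml", "application/rss+xml", "text/xml"}

    class MetadataLinkFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.found = []

        def handle_starttag(self, tag, attrs):
            if tag == "link":
                a = dict(attrs)
                if a.get("rel") == "alternate" and a.get("type") in XML_TYPES:
                    self.found.append(a.get("href"))

    page = """<html><head>
    <link rel="alternate" type="application/xml"
          href="http://example.org/geometadata.xml">
    </head><body>...</body></html>"""

    finder = MetadataLinkFinder()
    finder.feed(page)
    print(finder.found)  # ['http://example.org/geometadata.xml']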

There are some ideas around, like embedding hints into the geometadata exchange model for augmented autodiscovery. See also THUMP from the IETF, which leads to something between approach 1 (registry) and approach 2 (autodiscovery).

Why not just let crawlers find XML data (or: why a new protocol at all)?

Commercial web crawlers are estimated to index only approximately 16% of the total "surface web" and the size of the "deep web" [...] is estimated to be up to 550 times as large as the surface web. These problems are due in part to the extremely large scale of the web. To increase efficiency, a number of techniques have been proposed such as more accurately estimating web page creation and updates and more efficient crawling strategies. [...] All of these approaches stem from the fact that http does not provide semantics to allow web servers to answer questions of the form "what resources do you have?" and "what resources have changed since 2004-12-27?" (from: Nelson and Van de Sompel et al. 2005).
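
OAI-PMH supplies exactly these two missing questions on top of plain HTTP. As a sketch, each maps onto a single GET request (the base URL is a placeholder):

    # Sketch: the two questions from the quotation as OAI-PMH requests.
    from urllib.parse import urlencode

    base = "http://example.org/oai"  # placeholder repository

    # "What resources do you have?"
    everything = base + "?" + urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"})

    # "What resources have changed since 2004-12-27?"
    changed = base + "?" + urlencode(
        {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc",
         "from": "2004-12-27"})

    print(everything)
    print(changed)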

How to help in discovering resources

Both approaches, registry and autodiscovery, require awareness. So in both cases, means of guiding web crawlers to XML-encoded content or services are helpful. What about:

  • Icons on top-level home pages with links pointing to metadata?
  • A 'friend' attribute in a (yet to be redefined ISO 19115) metadata model, as described in some OAI guidelines? Note that this attribute is not at the level of a metadata instance/record but at the level of sets (of metadata records); see the sketch after this list.
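
The OAI guidelines describe such set-level pointers as a 'friends' container inside the Identify response. A sketch of reading it; the response shown is a trimmed, made-up example:

    # Sketch: extract 'friend' repositories (set-level pointers to other
    # repositories) from a trimmed OAI-PMH Identify response.
    import xml.etree.ElementTree as ET

    FRIENDS = "{http://www.openarchives.org/OAI/2.0/friends/}"

    identify = """<Identify>
      <description>
        <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
          <baseURL>http://example.org/other-repository/oai</baseURL>
        </friends>
      </description>
    </Identify>"""

    tree = ET.fromstring(identify)
    print([e.text for e in tree.iter(FRIENDS + "baseURL")])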

Distributed Searching vs. Harvesting

The following are alternatives to harvesting (remark: all are considered either inappropriate, not established, or at least 'painful' compared to OAI-PMH):

  • Query models:
    • WFS - the OGC thing - being designed for geographic vector data exchange services
    • Z39.50 and SRW - the bibliographic data exchange protocols (SRW is the more lightweight and 'modern' successor of Z39.50); a sample query sketch follows the list
  • 'Find and bind':
    • UDDI - the SOAP thing
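
For comparison, what one query in the SRW/SRU style looks like; a sketch, with the endpoint and the CQL index name assumed:

    # Sketch: one searchRetrieve request in SRU (the URL binding of SRW).
    # Endpoint and CQL index name are assumptions for illustration.
    from urllib.parse import urlencode

    endpoint = "http://example.org/sru"  # placeholder SRU endpoint
    print(endpoint + "?" + urlencode({
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": 'dc.title = "rapperswil"',  # CQL query
        "maximumRecords": "10",
    }))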

Why a central index?

  • Cross-searching vs. harvesting. There are two possible approaches (from: Implementing the OAI-PMH - an introduction):
    • cross, parallel searching of multiple archives based on a protocol like Z39.50
    • harvesting metadata from federated meta-databases into one or more 'central' services - bulk-moving the data toward the user interface
    • US digital library experience in this area (e.g. NCSTRL) indicated that cross-searching is not the preferred approach: distributed searching of N nodes is viable, but only for small values of N (< 100)
  • Problems with distributed searching (from: Implementing the OAI-PMH - an introduction):
    • Collection description: How do you know which targets to search?
    • Query-language problem: Syntax varies and drifts over time between the various nodes.
    • Rank-merging problem: How do you meaningfully merge multiple result sets? (See the sketch after this list.)
    • Performance: tends to be limited by the slowest target; it is also difficult to build a browse interface.
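
To see why rank merging is hard: relevance scores from different nodes are generally not comparable, so a merge must either normalize them somehow or fall back on something crude such as interleaving. An illustrative sketch with made-up scores:

    # Illustration of the rank-merging problem: two nodes return locally
    # ranked hits whose scores live on different scales.
    from itertools import chain, zip_longest

    node_a = [("map-1", 0.91), ("map-2", 0.42)]  # scores in [0, 1]
    node_b = [("map-9", 17.0), ("map-3", 5.2)]   # raw tf-idf-like scores

    # Wrong: raw-score sorting lets node_b dominate as a scale artifact.
    naive = sorted(chain(node_a, node_b), key=lambda hit: hit[1], reverse=True)

    # Crude fallback: interleave the locally ranked lists.
    merged = [hit for pair in zip_longest(node_a, node_b)
              for hit in pair if hit is not None]

    print(naive)
    print(merged)  # [('map-1', 0.91), ('map-9', 17.0), ('map-2', 0.42), ('map-3', 5.2)]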

Weblinks

Geographic information search services: