OSGeodata Discovery

From Geoinformation HSR
Thoughts: Links are for three things: (1) discovery (what pages exist), (2) reputation (how important a page is) and (3) annotation (what a page is about).

Version of 11 September 2006, 07:06

Need for geographic metadata exchange

'Geographic catalog' (catalog service) and 'inventory' are rather data-provider-centric names, so we prefer a user-centric approach to the management of geodata and geometadata. See also OSGeodata for further discussion.

First of all, we need a geographic metadata exchange model. Then we need a geographic metadata exchange protocol.

Keywords: Open access to and dissemination of geographic data (geodata) and information; Metadata; Finding, harvesting or discovery of geodata and web map services; Interoperability; Integration; Service binding; Spatial data infrastructure; Standards.

Vision and Scenario

"We don’t know what people will want to do, or how they will want to do it, but we may know how to allow people to do what they want to do when they want to do it". (ockham.org)

Vision:

  • Users need search services to discover geographic information.
  • Geodata owners need a metadata management tool (which implies an internal metadata model).
  • Geodata owners and service providers need (i) a metadata exchange model, (ii) an encoding of it, as well as (iii) a protocol for the exchange, dissemination and sharing of metadata.
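The three needs (i)-(iii) can be illustrated together in a small sketch: a minimal exchange record and one possible XML encoding of it. All field and element names below are assumptions for illustration (loosely Dublin Core-like, plus a bounding box), not a fixed standard.

```python
# Illustrative sketch: a minimal metadata exchange record (Dublin Core-like
# fields plus a bounding box) and one possible XML encoding of it.
# All field and element names here are assumptions, not a fixed standard.
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class GeoMetadataRecord:
    title: str
    creator: str
    date: str            # ISO 8601 date
    identifier: str      # e.g. a stable URL
    bbox: tuple          # (west, south, east, north) in WGS84

    def to_xml(self) -> str:
        rec = ET.Element("record")
        for tag in ("title", "creator", "date", "identifier"):
            ET.SubElement(rec, tag).text = getattr(self, tag)
        ET.SubElement(rec, "bbox").text = " ".join(map(str, self.bbox))
        return ET.tostring(rec, encoding="unicode")

r = GeoMetadataRecord("Cadastral map", "Canton ZH", "2006-09-11",
                      "http://example.org/data/42", (8.4, 47.2, 8.8, 47.5))
print(r.to_xml())
```

The exchange protocol (iii) would then transport documents like this between providers and search services.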

A preliminary scenario:

  • Users search or browse through metadata records. They use a web application or a search component within a desktop GIS (remark: users don't search services per se).
  • 'Search service providers' enable the discovery of geodata and 'filter services', such as transformation services (note: WMS is a 'data access service' and belongs to geodata, not to filter services).
  • 'Search service providers' gather (harvest) their information from 'data/filter service providers' and need a protocol to do this.
  • 'Data/filter service providers' offer metadata over this protocol. They typically also implement 'data access services' (WMS, WFS) or they offer 'filter services'.

Architecture Diagram:

                     ...... Intranet/Web (User) & Anonymous web (User)........

                                       
  SS) Search Service-   [geometa.info] | [NGDI portals] [Google and others]
  Provider:                            |

                     ====== Metadata exchange protocol =======================

  CS) Catalog Services/                |
  Meta-Data Providers (DP):  [GMDB]    | [e.g. geocat.ch]  [other DPs, e.g. cantons]
                                       |

                     ====== Full metadata access (internal, possibly lossless) =====

  FVS) Filter/Value-adding
  Services:                 Coord.Transformation, Visualization, etc.
  D)   Data Access-Provider 
  Services:                 PostGIS, ArcGIS, GeoShop, WMS, WFS(!)
Diagram: Architectural sketch of the relationships between a 'metadata exchange model' (as part of a metadata exchange protocol, used by data providers and search tools) and a 'metadata management model' (as part of metadata management and GIS tools).

Possible realization plan

  1. All: Let's first identify the use cases and then sort out (minimal) metadata for exchange and (exhaustive) metadata for internal management.
  2. Service providers and data owners (= all): Let's first decide quickly on the metadata exchange model, together with an incremental specification process for future adaptations. This model will probably borrow from Dublin Core with spatial and geo-web-service extensions.
  3. Data owners (and providers) only: Let's decide internally, within each organization, on the metadata management model while adhering to the minimal metadata exchange protocols. This model will probably borrow from ISO 19115/19119 Core, FGDC and, of course, the minimal metadata exchange model (Dublin Core).
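The split in step 1 between an exhaustive internal (management) record and a minimal exchange record can be sketched as a simple projection. The field lists below are illustrative assumptions, not the agreed models themselves.

```python
# Sketch of step 1: an exhaustive internal record is projected onto a
# minimal exchange record. The field lists are illustrative assumptions.
EXCHANGE_FIELDS = {"title", "creator", "date", "identifier", "subject", "bbox"}

def to_exchange_record(internal: dict) -> dict:
    """Keep only the minimal, Dublin Core-like fields for exchange."""
    return {k: v for k, v in internal.items() if k in EXCHANGE_FIELDS}

internal_record = {
    "title": "Orthophoto 2006", "creator": "HSR", "date": "2006-09-11",
    "identifier": "http://example.org/ortho/2006", "bbox": "8.4 47.2 8.8 47.5",
    # exhaustive, management-only fields (ISO 19115-style) stay internal:
    "lineage": "Aerial survey, 25 cm GSD", "contact_phone": "internal only",
}
print(to_exchange_record(internal_record))
```

The internal model can grow as rich as an organization needs, as long as this projection onto the minimal exchange model stays well-defined.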

Discovery of geodata and services

We are aware of two types of 'protocols', both already implemented and ready to go (given information encoded according to a well-specified model):

  1. Registry
  2. Autodiscovery

Registry

Action is required from geodata owners (either proactively or without indication) to register with service providers (web crawlers, harvesters). Search services can then do better focused crawling. Examples are OAI-PMH, Google Sitemaps or SOAP.

  • Push principle.
  • More load on resource owner, less load in indexing and filtering on service providers.
  • If registered proactively, agreement from resource owners can be ensured.
  • Examples: OAI-PMH, RSS pinging search services...
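The push principle can be sketched as a simple registration (ping) request that a geodata owner sends to a search service. The endpoint and parameter name below are assumptions for illustration, not a real service API.

```python
# Sketch of the push principle: a geodata owner proactively registers
# (pings) its metadata URL at a search service. The endpoint and the
# 'register' parameter are assumptions, not a real service API.
from urllib.parse import urlencode

def build_ping_url(search_service: str, metadata_url: str) -> str:
    """Compose the registration request the owner would send via HTTP GET."""
    return search_service + "?" + urlencode({"register": metadata_url})

url = build_ping_url("http://search.example.org/ping",
                     "http://geodata.example.org/metadata.xml")
print(url)
# A real harvester would then fetch the registered URL on its own schedule.
```

This places the one-time effort on the resource owner, which is exactly the trade-off listed above.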

Autodiscovery

Geodata owners publish XML files of geometadata (exchange model) once on the web (=HTTP).

  • Pull principle.
  • Requires at least one relation, i.e. a URL or a tag such as a class.
  • Uses the UI to point to a content-rich, XML-encoded feed, thereby crossing from human mode to machine mode in a way that's easy for both humans and machines.
  • Autonomous/independent resource owner; no further action is required from them.
  • Possible use of links/relations to point to the whole (master) collection of resources (including pointers to these), as microformats.org does.
  • Examples: (X)HTML pages, Extended and XML encoded Dublin Core.
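Autodiscovery in (X)HTML typically works like RSS feed autodiscovery: the human-readable page carries a machine-readable <link> pointing at the XML geometadata. A minimal sketch follows; the rel value "geometadata" is an assumption, since any agreed-upon relation would do.

```python
# Sketch of autodiscovery: a crawler scans an ordinary (X)HTML page for a
# <link> element pointing to the XML-encoded geometadata. The rel value
# "geometadata" is an assumed convention, not an established one.
from html.parser import HTMLParser

class MetadataLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "geometadata":
            self.links.append(a.get("href"))

page = """<html><head>
  <link rel="geometadata" type="application/xml"
        href="http://geodata.example.org/metadata.xml"/>
</head><body>Human-readable home page</body></html>"""

finder = MetadataLinkFinder()
finder.feed(page)
print(finder.links)
```

This is the "crossing from human mode to machine mode" mentioned above: the same page serves both audiences.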

There are some ideas around, such as embedding hints into the geometadata exchange model for augmented autodiscovery. See also THUMP from the IETF, which leads to something between 1. and 2.

About search services

Search services include web-based databases, catalogs and search engines.

A search engine (for geodata discovery) is either general-purpose, with data from a brute-force search (horizontal, breadth-first; size is key), or it is based on structured and qualitatively good topical data (depth-first; small is beautiful). The former is a market led by Google (so it's probably not our business), though it does not exclude specialized businesses. The latter is originally a specialized market but does not exclude merging into large ones (tourism, etc.). It takes its input e.g. from focused crawling (cf. Chakrabarti) and/or is built upon partnerships, and it contains authoritative and reliable sources. Both try to provide access to resources from a single entry point. Google cannot do this because it caters to a broad and mainstream user base and covers just about anything on the internet; its content is mostly unstructured and not vetted (quality-proofed).

Why not just let crawlers find XML data?

...Or, why a new protocol at all? "Commercial web crawlers are estimated to index only approximately 16% of the total 'surface web', and the size of the 'deep web' [...] is estimated to be up to 550 times as large as the surface web. These problems are due in part to the extremely large scale of the web. To increase efficiency, a number of techniques have been proposed, such as more accurately estimating web page creation and updates and more efficient crawling strategies. [...] All of these approaches stem from the fact that HTTP does not provide semantics to allow web servers to answer questions of the form 'what resources do you have?' and 'what resources have changed since 2004-12-27?'" (Nelson, Van de Sompel et al. 2005).
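The question "what resources have changed since 2004-12-27?" is exactly what OAI-PMH answers with selective harvesting: a ListRecords request with a 'from' datestamp. The verb and parameter names below come from the OAI-PMH specification; the repository URL is a made-up example.

```python
# Sketch of OAI-PMH selective harvesting: ask a repository only for records
# changed since a given date. verb, metadataPrefix and from are defined by
# the OAI-PMH specification; the repository URL is a made-up example.
from urllib.parse import urlencode

def list_records_url(repository: str, since: str) -> str:
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "from": since}
    return repository + "?" + urlencode(params)

u = list_records_url("http://geodata.example.org/oai", "2004-12-27")
print(u)
```

This is precisely the semantics plain HTTP lacks, and why a harvesting protocol is proposed at all.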

How to help discovering resources

Both approaches, registry and autodiscovery, need awareness. So in both cases, means to guide web crawlers to XML-encoded content or services are helpful. What about:

  • Icons on top level home pages with links pointing to metadata?
  • A 'friend' attribute in a (yet to be re-defined ISO 19115) metadata model, as described in some OAI guidelines? Note that this attribute is not at the level of a metadata instance/record but at the level of sets (of metadata records).
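A third means of guiding crawlers, already mentioned above, is Google Sitemaps: publish a sitemap listing the metadata documents with their last-modified dates. The element names below follow the sitemaps.org protocol; the URLs and dates are examples.

```python
# Sketch of a sitemap that guides crawlers to metadata documents.
# <urlset>, <url>, <loc> and <lastmod> follow the sitemaps.org protocol;
# the listed URL and date are illustrative examples.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: iterable of (url, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_sitemap(
    [("http://geodata.example.org/metadata.xml", "2006-09-11")])
print(sitemap_xml)
```

The lastmod date also partially answers the "what has changed since...?" question, though less precisely than a harvesting protocol.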

Distributed Searching vs. Harvesting

Distributed (cross) searching vs. harvesting: there are two possible approaches (from: Implementing the OAI-PMH - an introduction).

  • distributed (cross, parallel) searching of multiple archives based on a protocol such as Z39.50
  • harvesting metadata from federated meta-databases into one or more 'central' services, i.e. bulk-moving data towards the user interface
  • US digital library experience in this area (e.g. NCSTRL) indicated that cross-searching is not the preferred approach: distributed searching of N nodes is viable, but only for small values of N (< 100). It gives advantages if you don't want to give away your metadata.

There are the following alternatives to harvesting (remark: all are considered either inappropriate, not established, or at least 'painful' compared to OAI-PMH):

  • Distributed searching:
    • WFS - the OGC thing - designed for geographic vector data exchange services
    • Z39.50 and SRW - the bibliographic data exchange protocol things (SRW is the more lightweight and 'modern' successor of Z39.50)
  • 'Find and bind':
    • UDDI - the SOAP thing

Problems with distributed searching (from: Implementing the OAI-PMH - an introduction):

  • Collection description: How do you know which targets to search?
  • Query-language problem: Syntax varies and drifts over time between the various nodes.
  • Rank-merging problem: How do you meaningfully merge multiple result sets?
  • Performance: tends to be limited by the slowest target; it is also difficult to build a browse interface.
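The rank-merging problem above can be made concrete with a toy sketch: each target returns scores on its own incomparable scale, so a merger must normalize before interleaving. The min-max normalization used here is one naive choice among many, not a standard solution.

```python
# Toy illustration of the rank-merging problem: two targets score on
# incomparable scales, so raw scores cannot be interleaved directly.
# Min-max normalization per target is one naive remedy, not a standard.
def merge_results(result_sets):
    """result_sets: list of [(doc_id, raw_score), ...], one list per target."""
    merged = []
    for results in result_sets:
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for uniform scores
        merged.extend((doc, (s - lo) / span) for doc, s in results)
    return sorted(merged, key=lambda item: -item[1])

a = [("a1", 950.0), ("a2", 120.0)]   # target A: scores on a 0..1000 scale
b = [("b1", 0.80), ("b2", 0.10)]     # target B: scores on a 0..1 scale
print(merge_results([a, b]))
```

Even this toy merger shows why the problem is hard: normalization discards how confident each target really was, so the merged ranking is at best a guess.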

Weblinks

Geographic information search services: