OAI-PMH
Revision as of 10 August 2006, 08:48
'Open Archives Initiative Protocol for Metadata Harvesting'. Part of the discussion about open source geodata.
Description
- Lightweight:
- A low barrier interoperability specification
- RESTful: HTTP based, GET / POST requests, XML responses
- Clear concepts:
- Based around metadata harvesting model (metadata aggregation and syndication)
- Metadata about resources, mandates unqualified Dublin Core as default
- Not a search protocol! Searching could instead be covered by, for example, enhanced or merged versions of OpenSearch/GeoRSS and SRU/SRW, besides WFS.
- Extensible: has an extension mechanism for more domain-specific metadata models
- Stable: OAI has committed to making subsequent revisions of the protocol backwards compatible.
- Established: Implemented - among others - by Google, CiteSeer and MSN as well as by dozens of open source tools, like OAICat (Java) or mod_oai (C, installed at 500 sites); see more tools in the Weblinks section below.
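To make the "XML responses" and "unqualified Dublin Core" points above more concrete, here is a minimal sketch of reading such a response with Python's standard library. The sample record, its identifier and its field values are invented for illustration; only the three namespace URIs (OAI-PMH 2.0, oai_dc, Dublin Core elements) are the standard ones. The helper name `parse_records` is an assumption, not part of any OAI tool.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs used by OAI-PMH 2.0 responses with oai_dc metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical GetRecord response; all values are made up for this sketch.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-08-10T08:48:00Z</responseDate>
  <request verb="GetRecord" identifier="oai:example.org:map-1"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:map-1</identifier>
        <datestamp>2006-08-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical topographic map</dc:title>
          <dc:creator>Example Survey Office</dc:creator>
          <dc:type>Dataset</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Return one dict per <record>, with header data and Dublin Core fields."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        header = record.find("oai:header", NS)
        fields = {}
        # Dublin Core elements are repeatable, so collect values in lists.
        for element in record.iterfind("oai:metadata/oai_dc:dc/*", NS):
            name = element.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            fields.setdefault(name, []).append(element.text)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "fields": fields,
        })
    return records


if __name__ == "__main__":
    for item in parse_records(SAMPLE_RESPONSE):
        print(item["identifier"], item["datestamp"], item["fields"])
```

A domain-specific (e.g. geographic) metadata format would presumably appear in the same `<metadata>` container under its own namespace and metadata prefix; the parsing pattern stays the same.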
The following alternatives exist (remark: all are considered either inappropriate, not established, or at least 'painful' compared to OAI-PMH):
- Z39.50 and SRW - the more lightweight and 'modern' successor of Z39.50
- UDDI - the SOAP thing
- WFS - the OGC thing
Discussion
Why not just let crawlers find XML data (or, why a new protocol at all)?
- Commercial web crawlers are estimated to index only approximately 16% of the total "surface web" and the size of the "deep web" [...] is estimated to be up to 550 times as large as the surface web. These problems are due in part to the extremely large scale of the web. To increase efficiency, a number of techniques have been proposed such as more accurately estimating web page creation and updates and more efficient crawling strategies. [...] All of these approaches stem from the fact that http does not provide semantics to allow web servers to answer questions of the form "what resources do you have?" and "what resources have changed since 2004-12-27?" (from: Nelson and Van de Sompel et al. 2005).
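The two questions at the end of the quote map fairly directly onto OAI-PMH requests. The following sketch builds the corresponding GET URLs; the base URL `http://example.org/oai` and the helper `build_request` are hypothetical, while the verb `ListIdentifiers` and the arguments `metadataPrefix` and `from` come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

BASE_URL = "http://example.org/oai"  # hypothetical data provider endpoint


def build_request(base_url, verb, **arguments):
    """Build an OAI-PMH GET URL from a verb and its arguments.

    A trailing underscore is stripped from argument names so that the
    protocol argument "from" (a Python keyword) can be passed as from_.
    """
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# "What resources do you have?"
everything = build_request(BASE_URL, "ListIdentifiers", metadataPrefix="oai_dc")

# "What resources have changed since 2004-12-27?"
changed = build_request(
    BASE_URL, "ListIdentifiers", metadataPrefix="oai_dc", from_="2004-12-27"
)

if __name__ == "__main__":
    print(everything)
    print(changed)
```

The datestamp filter is what lets a harvester fetch only the changes since its previous run instead of re-crawling everything.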
Why a harvesting protocol?
- Cross/distributed searching vs. harvesting. There are two possible approaches (from: Implementing the OAI-PMH - an introduction):
- cross searching multiple archives based on a protocol like Z39.50
- harvesting metadata into one or more 'central' services (bulk move of data to the user interface)
- The US digital library experience in this area (e.g. NCSTRL) indicated that cross searching is not the preferred approach: distributed searching of N nodes is viable, but only for small values of N (< 100).
- Problems with distributed searching (from: Implementing the OAI-PMH - an introduction):
- Collection description: How do you know which targets to search?
- Query-language problem: Syntax varies and drifts over time between the various nodes.
- Rank-merging problem: How do you meaningfully merge multiple result sets?
- Performance: tends to be limited by slowest target; difficult to build browse interface
- OK, what's left to do?
- We need a (lightweight) information model for geographic metadata. => See OSGeodata for a discussion.
- We need implementations of OAI-PMH data providers. => See steps 1 to 8 in this tutorial. Open source tools exist for many languages (see Weblinks).
Architecture
OAI-PMH distinguishes two logical groups of services and uses them for its client/server model (data provider = server, service provider = client) (from: Implementing the OAI-PMH - an introduction):
- Data Providers (Open Archives, Repositories)
- handle deposit/publishing of resources in archive
- expose metadata about resources in archive
- are entities that possess data/metadata and are willing to share it with others (internally or externally) via well-defined protocols (e.g. database servers or simple XML files)
- normally: free access to metadata
- not necessarily: free access to full texts / resources
- Service Providers
- are entities that harvest and store data from Data Providers through OAI interfaces in order to provide higher-level services to users (e.g. search engines); no live requests to the Data Providers at query time!
- offer a single user interface across all harvested metadata
- offer (value-added) service on the basis of the metadata
- may select certain subsets from Data Providers (set hierarchy, date stamp)
- may enrich metadata
Note: A Data Provider may also be responsible for the human-oriented (i.e. Web) interface to the archive; both functions may be offered by the same 'service'.
Figure: OAI-PMH architecture (OAI-PMH_Architecture.png)
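The Service Provider role described above (harvest, optionally restricted by set and date stamp, then store locally) can be sketched as a small harvesting loop. This is a minimal sketch, not a reference implementation: the function names, the injectable `fetch` parameter and any base URL passed in are assumptions for illustration. The verb `ListIdentifiers`, the arguments `metadataPrefix`, `set`, `from` and `resumptionToken`, and the error code `noRecordsMatch` are taken from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 namespace


def http_fetch(url):
    """Default transport: plain HTTP GET returning the response body."""
    with urlopen(url) as response:
        return response.read()


def harvest_identifiers(base_url, metadata_prefix="oai_dc", set_spec=None,
                        from_date=None, fetch=http_fetch):
    """Yield (identifier, datestamp) pairs from a Data Provider.

    set_spec and from_date select a subset (set hierarchy, date stamp).
    Large lists arrive in chunks; each chunk may carry a resumptionToken
    that has to be sent back to obtain the next chunk.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        error = root.find(OAI + "error")
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # nothing new since from_date: not a failure
            raise RuntimeError("OAI-PMH error: %s" % error.get("code"))
        for header in root.iter(OAI + "header"):
            yield (header.findtext(OAI + "identifier"),
                   header.findtext(OAI + "datestamp"))
        token = root.findtext(".//" + OAI + "resumptionToken")
        if not token:
            return  # a missing or empty token marks the last chunk
        # A resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
```

A Service Provider would typically run such a loop periodically, store the results in its own index, and answer user queries from that local copy. Passing a different `fetch` function makes the loop usable without network access, for example in tests.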
Specifications
- OAI-PMH 2.0 Spec. - Specification for Open Archives Initiative Protocol for Metadata Harvesting 2.0
- OAI-PMH Static Repository 2.0 Spec. - Specification for an OAI Static Repository and an OAI Static Repository Gateway 2.0
For Tools and Demos see Weblinks section below.
Weblinks
- Tutorials and short explanations:
- Concise technical description of the protocol
- OAI for beginners, OAI and OAI-PMH for absolute beginners by Philip Hunter, 2005.
- Tutorial: Implementing the OAI-PMH - an introduction by Uwe Müller, 2004
- Papers:
- "OAI-PMH Based Interoperation For Spatial Metadata" by Haixia Mao (2005). In: ISPRS Workshop on Service and Application of Spatial Data Infrastructure, XXXVI(4/W6), Oct.14-16 2005, Hangzhou, China.
- "Use of OAI to disclose metadata as an output from the portal" From: Go-Geo! Geo-Data Portal Project
- "Open Content and Access for Digital Scholarship" by Gerry McKiernan, 2004 (describes use of map resources)
- "Metadata aggregation and automated digital libraries: A retrospective on the NSDL experience" by Lagoze, Carl et al (2006).
- Software and Tools:
- OAI-PMH Data Provider Components (open source):
- OAI tools list from OAI-Home, tools from HU Berlin
- Java: OAICat
- PHP: hu-berlin.de, ...
- OAI-PMH Static Repository Gateways (open source):
- srepod - OAI-PMH Static Repository Gateway (C)
- Digital Repository System Software (open source):
- DSpace (Java)
- Demos and examples:
- Demo of OAICat - The OAICat Open Source project is a Java Servlet web application providing an OAI-PMH v2.0 repository framework.
- OAIster - Most often mentioned collection of freely available, previously difficult-to-access, academically-oriented digital resources that are easily searchable by anyone.
- PKP Open Archives Harvester - Free metadata indexing system developed by the Public Knowledge Project through its federally funded efforts to expand and improve access to research.
- OJAX Demo - Federated Metasearch Service using Ajax/Lucene