OAI-PMH: Difference between revisions

From Geoinformation HSR
K (Comparison between WFS and OAI-PMH)
K (Weblinks)
 
(87 intermediate revisions by 3 users not shown)
Line 1: Line 1:
'Open Archives Initiative Protocol for Metadata Harvesting'. This is part of discussion about [[OSGeodata | open source geodata]].
+
'Open Archives Initiative Protocol for Metadata Harvesting' (OAI-PMH). A protocol for collecting and further processing metadata.
 +
* '''English documentation''' follows below.
 +
* See the maintained '''[[HowTo OAI-PMH]]''', which includes a short summary (German).
 +
 
 +
See also:
 +
* [http://de.wikipedia.org/wiki/OAI-PMH OAI-PMH article on Wikipedia] (German).
 +
* [http://www.openarchives.org OAI home page] (English).
  
We definitively need a [[OSGeodataMetadataModel| metadata information model]] as well as a [[OSGeodataMetadataProtocol| metadata exchange protocol]]!
 
  
 
== Description ==
 
== Description ==
  
OAI-PMH is an metadata exchange protocol of type '''[[OSGeodataMetadataProtocol| harvesting procotol]]''' as compared to '''search protocols'''.
+
This is part of a discussion about '''[[OSGeodata | open source geodata]]''' (OSGeodata).
 +
 
 +
OAI-PMH is among the first on our short list for a '''[[OSGeodata metadata exchange protocol| metadata exchange protocol]]''' for the '''[[OSGeodata_Discovery| discovery]]''' of [[OSGeodata| geodata]]. For OAI-PMH there exist quite some open source tools (see Weblinks).
 +
 
 +
What's left to do anyway is:
 +
# Define a (lightweight) '''[[OSGeodata metadata exchange model| metadata exchange model]]'''.
 +
# Make '''implementations''' for data providers. => See weblinks below or this [http://indico.cern.ch/getFile.py/access?resId=1&materialId=0&contribId=s2t3&sessionId=2&subContId=2&confId=a035925 tutorial] (steps 1 to 8).
 +
 
 +
OAI-PMH is an metadata exchange protocol of type '''[[OSGeodata metadata exchange protocol| harvesting procotol]]''' as compared to '''search protocols'''. The protocol combines two so called providers: A data provider which serves XML metadata and a search provider (a harvester) which saves downloaded XML metadata in a searchable index.
  
Being an flexible protocol it does not depend on a single information model. The default, unqualified Dublic Core, can be extended or replaced by domain specific information models (see [[OSGeodataMetadataModel]]).  
+
Being an flexible protocol it does not depend on a single information model. The default, unqualified Dublic Core, can be extended or replaced by domain specific information models (see [[OSGeodata metadata exchange model]]).  
  
 
* Leightweight:  
 
* Leightweight:  
Line 20: Line 33:
 
* Established: Implemented - among others - by Google, CiteSeer and MSN as well as by dozens of open source tools, like OAICat (Java) or mod_oai (C, installed at 500 sites); see more tools in the Weblinks section below.
 
* Established: Implemented - among others - by Google, CiteSeer and MSN as well as by dozens of open source tools, like OAICat (Java) or mod_oai (C, installed at 500 sites); see more tools in the Weblinks section below.
  
There are following alternatives (Remarks: all considered either inappropriate or not established or at least 'painfull' compared to OAI-PMH):
+
Sizes compared to SOAP: A OAI-PMH record is about 2,4 kB; a set of 100 records 18 kB. In SOAP a record is about 4,7 kB; a set of 100 records is 179 kB.
* Query models:
 
** WFS - the OGC thing - being designed for geographic vector data exchange services
 
** Z39.50 and SRW - the bibliographic data exchange protocol things (SRW is the more leightweight and 'modern' successor of Z39.50)
 
* 'Find and bind':
 
** UDDI - the SOAP thing
 
 
 
== Discussion ==
 
Why not just letting crawlers find XML data (or, why a new protocol at all)?
 
 
 
:Commercial web crawlers are estimated to index only approximately 16% of the total "surface web" and the size of the "deep web" [...] is estimated to be up to 550 times as large as the surface web. These problems are due in part to the extremely large scale of the web. To increase efficiency, a number of techniques have been proposed such as more accurately estimating web page creation and updates and more efficient crawling strategies. [...] All of these approaches stem from the fact that http does not provide semantics to allow web servers to answer questions of the form "what resources do you have?" and "what resources have changed since 2004-12-27?" (from: [http://arxiv.org/abs/cs.DL/0503069 (Nelson and Van de Sompel et al. 2005)]).
 
  
Why a harvesting protocol?
+
See [[OSGeodata_Discovery]] for a discussion of alternatives.
* Cross/distributed Searching vs. Harvesting. There are two possible approaches: (from: [http://indico.cern.ch/getFile.py/access?resId=1&materialId=0&contribId=s2t3&sessionId=2&subContId=2&confId=a035925 Implementing the OAI-PMH - an introduction]).
 
** cross searching multiple archives based on protocol like Z39.50
 
** harvesting metadata into one or more 'central' services – bulk move data to the user-interface
 
** US digital library experience in this area (e.g. NCSTRL) indicated that cross searching not preferred approach - distributed searching of N nodes viable, but only for small values of N (< 100)
 
  
* Problems distributed searching: (from: [http://indico.cern.ch/getFile.py/access?resId=1&amp;materialId=0&amp;contribId=s2t3&amp;sessionId=2&amp;subContId=2&amp;confId=a035925 Implementing the OAI-PMH - an introduction])
+
== Architecture ==
** Collection description: How do you know which targets to search?
 
** Query-language problem: Syntax varies and drifts over time between the various nodes.
 
** Rank-merging problem: How do you meaningfully merge multiple result sets?
 
** Performance: tends to be limited by slowest target; difficult to build browse interface
 
  
* Ok, what's left to do?
+
OAI-PMH does harvesting and indexing before hand; so it scales and search functionality is up innovative search services as compared to distributed search in CAT/CSW (see below).
# We need an (lightweight) information model for geographic medatata. => See [[OSGeodata]] for a discussion.
 
# We need implementations of OAI-PMH data providers. => See steps 1 to 8 in this [http://indico.cern.ch/getFile.py/access?resId=1&amp;materialId=0&amp;contribId=s2t3&amp;sessionId=2&amp;subContId=2&amp;confId=a035925 tutorial]. There exist open source tools for many languages (see Weblinks).
 
  
=== Architecture ===
 
 
OAI-PMH uses following denotations of two logical groups of services and uses these for its client/server model (data=server, service=client): (from: [http://indico.cern.ch/getFile.py/access?resId=1&amp;materialId=0&amp;contribId=s2t3&amp;sessionId=2&amp;subContId=2&amp;confId=a035925 Implementing the OAI-PMH - an introduction])
 
OAI-PMH uses following denotations of two logical groups of services and uses these for its client/server model (data=server, service=client): (from: [http://indico.cern.ch/getFile.py/access?resId=1&amp;materialId=0&amp;contribId=s2t3&amp;sessionId=2&amp;subContId=2&amp;confId=a035925 Implementing the OAI-PMH - an introduction])
 
* Data Providers (Open Archives, Repositories)
 
* Data Providers (Open Archives, Repositories)
Line 68: Line 60:
 
[[Bild: OAI-PMH_Architecture.png]]
 
[[Bild: OAI-PMH_Architecture.png]]
  
=== Comparison between WFS and OAI-PMH ===
+
== Comparison of WFS, CAT/CSW and OAI-PMH ==
 +
 
 +
=== OAI-PMH vs. WFS or CSW ===
 +
Why a comparison of OAI-PMH with WFS and not with CSW 2.1? CAT/CSW is based on a distributed query architecture and there's redundant spec. and therefore code to WFS. So, why another protocol when there is WFS?
  
==== Relative (biased?) comparison ====
+
* [[OAI-PMH]] is a simple and strictly RESTful harvesting protocol based on Dublin Core (since 1995) whereas WFS is a query protocol which allows a RESTful or SOAP client to retrieve geospatial data encoded in GML (since 2000). See [[OSGeodata_Discovery| geodata discovery]] for more information about these two approaches.
* [[OAI-PMH]] is a simple and strictly RESTful harvesting protocol based on Dublin Core (since 1995) whereas WFS is a query protocol which allows a RESTful or SOAP client to retrieve geospatial data encoded in GML (since 2000).
 
 
* WFS though originated in GIS covers a smaller community than [[OAI-PMH]] which today has a larger user base and is supported even by Google and Yahoo!.
 
* WFS though originated in GIS covers a smaller community than [[OAI-PMH]] which today has a larger user base and is supported even by Google and Yahoo!.
* Both have in common that they can be made (WFS) or are ([[OAI-PMH]]) RESTful and both can be profiled to respond an metadata information model (to be defined, see [[OSGeodataMetadataModel]]).
+
* As compared to CSW which does disctributed searching, OAI-PMH does gather/harvest metadata. So search in WFS/CSW is at best limited to the slowest server and to a least denominator of implemented specs because each server needs to implement exactly the same query functionality.
 +
* Both have in common that they can be made (WFS) or are ([[OAI-PMH]]) RESTful and both can be profiled to respond an metadata information model (to be defined, see [[OSGeodataMetadataModel | metadata information model]]).
 
* WFS needs to be profiled (spec. size is ~200 p.) whereas [[OAI-PMH]] needs probably to be extended (spec. size is ~50 p.) - so WFS seems more complex and more costly to implement or adapt but much depends on the needs (make sophisticated queries vs. harvesting?) and underlying the architecture (distributed online services vs. local indexing).
 
* WFS needs to be profiled (spec. size is ~200 p.) whereas [[OAI-PMH]] needs probably to be extended (spec. size is ~50 p.) - so WFS seems more complex and more costly to implement or adapt but much depends on the needs (make sophisticated queries vs. harvesting?) and underlying the architecture (distributed online services vs. local indexing).
 
* Both need a [[OSGeodataMetadataModel | metadata information model]].
 
* Both need a [[OSGeodataMetadataModel | metadata information model]].
  
  
==== WFS ====
+
=== CAT/CSW ===
 +
 
 +
CAT stands for 'Catalogue Services Specification' and CSW is the HTTP binding part of it. Distributed query like in CAT/CSW means that search is at best limited to the slowest server and to a least denominator of implementations. That is because in a distributed query each server needs to implement exactly the same (OGC filter) functionality.
 +
 
 +
 
 +
=== WFS ===
 
* WFS is short for 'Web Feature Service'.  
 
* WFS is short for 'Web Feature Service'.  
 
* Spec. size: About 100 pages plus mandatory reference to Filter spec. of about 40 pages, plus reference to GML 2 (60 pages).
 
* Spec. size: About 100 pages plus mandatory reference to Filter spec. of about 40 pages, plus reference to GML 2 (60 pages).
Line 100: Line 100:
 
** ...
 
** ...
  
 
+
=== OAI-PMH ===
==== OAI-PMH ====
+
* [[OAI-PMH]] is short for 'Open Archives Initiative Protocol for Metadata Harvesting'. Note this is not a search protocol - that could be for example enhanced or merged versions of OpenSearch/GeoRSS and SRU/SRW besides WFS.
* [[OAI-PMH]] is short for 'Open Archives Initiative Protocol for Metadata Harvesting'. See [[OAI-PMH]] for a short description and architectural diagrams. Note this is not a search protocol - that could be for example enhanced or merged versions of OpenSearch/GeoRSS and SRU/SRW besides WFS.
 
 
* Spec. size: Version 2.0 still is < 50 pages(!)
 
* Spec. size: Version 2.0 still is < 50 pages(!)
 
* Background: Was initiated by libraries, universities, museums and galleries to 'open access' (OA) free online availability of digital content.
 
* Background: Was initiated by libraries, universities, museums and galleries to 'open access' (OA) free online availability of digital content.
* Request operations:  
+
* Request operations (* denotes an operation which is minimally needed for a first implementation):  
** ''Identify'' - describe an archive (similar to WFS' GetCapabilities)
+
** '''''Identify''''' (*) - describe an archive (similar to WFS' GetCapabilities)
** ''ListMetadataFormats'' - retrieve available metadata formats from archive
+
** '''''ListMetadataFormats''''' (*) - retrieve available metadata formats from archive
 
** ''ListIdentifiers'' - abbreviated form of ListRecords, retrieving only headers
 
** ''ListIdentifiers'' - abbreviated form of ListRecords, retrieving only headers
** ''ListRecords'' - harvest records from a repository (similar to WFS' GetFeature)
+
** '''''ListRecords''''' (*) - harvest records from a repository (similar to WFS' GetFeature)
 
** ''GetRecord'' - retrieve individual metadata record from a repository (also similar to WFS' GetFeature)
 
** ''GetRecord'' - retrieve individual metadata record from a repository (also similar to WFS' GetFeature)
 
** ''ListSets'' - retrieve set structure of a repository (optional)
 
** ''ListSets'' - retrieve set structure of a repository (optional)
Line 119: Line 118:
 
** For identifiers URI must be used (similar requirement to WFS).
 
** For identifiers URI must be used (similar requirement to WFS).
 
** Has no version negotiation (but see operation Identify)
 
** Has no version negotiation (but see operation Identify)
** Knows incremental update  
+
** Knows incremental update (through a resumption token) but which is only sensibel when a response contains more than 1000 records.
 
** Another spec. was released called 'OAI Static Repository and an OAI Static Repository Gateway'
 
** Another spec. was released called 'OAI Static Repository and an OAI Static Repository Gateway'
 
** [[OAI-PMH]] may respond results in compressed form which is handled at the HTTP-level (how RESTful!)
 
** [[OAI-PMH]] may respond results in compressed form which is handled at the HTTP-level (how RESTful!)
Line 132: Line 131:
 
** ...?
 
** ...?
  
== Specifications ==
+
== Specifications ==
 +
 
 +
The best thing to do in order to understand OAI-PMH check out the OAI Repository Explorer here at the official [http://www.openarchives.org/pmh/tools/ OAI-PMH tools page]. More Tools and Demos see Weblinks section below.
 +
 
 +
Specifications:
 
* [http://www.openarchives.org/OAI/openarchivesprotocol.html OAI-PMH 2.0 Spec.] - Specification for Open Archives Initiative Protocol for Metadata Harvesting 2.0
 
* [http://www.openarchives.org/OAI/openarchivesprotocol.html OAI-PMH 2.0 Spec.] - Specification for Open Archives Initiative Protocol for Metadata Harvesting 2.0
 
** [http://www.openarchives.org/OAI/2.0/guidelines.htm OAI-PMH Implementation Guidelines]
 
** [http://www.openarchives.org/OAI/2.0/guidelines.htm OAI-PMH Implementation Guidelines]
Line 138: Line 141:
 
** [http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm Specification for an OAI Static Repository and an OAI Static Repository Gateway]
 
** [http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm Specification for an OAI Static Repository and an OAI Static Repository Gateway]
  
For Tools and Demos see Weblinks section below.
+
See also [[HowTo OAI-PMH]] (German).
  
== OAI-PMH Implementations ==
+
== OAI-PMH implementations ==
  
 
In the domain of scholar knowledge dissemination they discuss similar issues like we have here. See e.g. [http://epub.mimas.ac.uk/papers/dlsr200603/iesr-dlsr200603_summary.html Introduction to the IESR: A Registry of Collections and Services], by Hill and Ann Apps, MIMAS, Univ. of Manchester, 2006. ([http://iesr.ac.uk/metadata/mappings/oaidcmap.html IESR's Extension to OAI-PMH Dublin Core])  
 
In the domain of scholar knowledge dissemination they discuss similar issues like we have here. See e.g. [http://epub.mimas.ac.uk/papers/dlsr200603/iesr-dlsr200603_summary.html Introduction to the IESR: A Registry of Collections and Services], by Hill and Ann Apps, MIMAS, Univ. of Manchester, 2006. ([http://iesr.ac.uk/metadata/mappings/oaidcmap.html IESR's Extension to OAI-PMH Dublin Core])  
 +
 +
Data Provider Services, demos and examples:
 +
* GIS specific:
 +
** [[Geometa-Editor]], open source
 +
** GeoShop, product from [http://www.infogrips.ch infoGrips GmbH].
 +
** [http://geonetwork-opensource.org GeoNetwork] v2.2.0 (From the List of changes: "... Added OAI-PMH server protocol - Added OAI-PMH harvesting type.").
 +
* General
 +
** [http://pkp.sfu.ca/?q=harvester PKP Open Archives Harvester] - Free metadata indexing system developed by the Public Knowledge Project through its federally funded efforts to expand and improve access to research.
 +
** [http://www.language-archives.org/tools.html OLAC: Open Language Archives Community]
 +
** [http://alcme.oclc.org/oaicat/ Demo of OAICat] - The OAICat Open Source project is a Java Servlet web application providing an OAI-PMH v2.0 repository framework.
 +
 +
Service Providers:
 +
* [http://oaister.umdl.umich.edu/o/oaister/ OAIster] - Most often mentioned collection of freely available, previously difficult-to-access, academically-oriented digital resources that are easily searchable by anyone.
 +
* [http://pkp.sfu.ca/harvester2/demo/ Service Provider Demo]
 +
* [http://arc.cs.odu.edu/friends.html OAI Service providers list]
 +
* [http://ojax.sourceforge.net/#Demo OJAX Demo] - Federated Metasearch Service using Ajax/Lucene
 +
* [http://eprints.hsr.ch/ ePrints HSR]
  
 
Software and Tools:
 
Software and Tools:
* OAI-PMH Data Provider Components (open source):
+
* Lists / resources (open source):
** [http://www.openarchives.org/tools/tools.html OAI tools list] from OAI-Home, [http://edoc.hu-berlin.de/oai tools from HU Berlin]
+
** [http://www.openarchives.org/tools/tools.html OAI tools list] of data providers, search providers (harvesters) and gateways
** Java: [http://www.oclc.org/research/software/oai/cat.htm OAICat]
+
** [http://www.openarchives.org/pipermail/oai-implementers/ OAI Implementers List w. Archive]
** PHP: [http://edoc.hu-berlin.de/oai/ hu-berlin.de], ...
+
** [http://edoc.hu-berlin.de/oai OAI tools from HU Berlin] and from [http://www.isn-oldenburg.de/projects/OAD/software.html Univ. Oldenburg]
* OAI-PMH Static Repository Gateways (open source):
+
** [http://uilib-oai.sourceforge.net/ UIUC OAI Metadata Harvesting Project]
 +
* Featured projects:
 +
** [http://sourceforge.net/project/showfiles.php?group_id=47963&release_id=183302 JSP OAI 2.0 Data Provider for database] by Univ. Illinois Urbana Champaign; Version/Date: 1.3, 08-09-2003; Framework: Java, JSP; License: NCSA Open Source License.
 +
** [http://www.oclc.org/research/software/oai/cat.htm OAICat Data Provider for single XML files, SRU or database] by OCLC.org; Version/Date: v1.5.46, 05-07-2006; Framework: Java Servlet; License: OCLC (GPL alike).
 +
** [http://edoc.hu-berlin.de/oai/ OAI-PMH2 HU Berlin for database] by HU Berlin; Version/Date: 17-12-2002; Language: PHP4; License: n/a.
 +
** [http://physnet.uni-oldenburg.de/oai/ OAI-PMH2 Uni Oldenburg for database] by Uni Oldenburg; Version/Date: v1.8.0, 20-06-2005; Language: PHP4; License: n/a.
 +
** [http://simba.cs.uct.ac.za/~tiki/tiki-index.php?page=XMLFile OAI-PMH2 XMLFile for XML files] by H. Suleman Virginia Tech ([http://www.dlib.vt.edu/projects/OAI/software/xmlfile/xmlfile.html old site]); Version/Date v2.21, 31.08.2005; Language Perl; License: Own (LGPL alike). 
 
** [http://srepod.sourceforge.net/ srepod] - OAI-PMH Static Repository Gateway (C)
 
** [http://srepod.sourceforge.net/ srepod] - OAI-PMH Static Repository Gateway (C)
 
* Digital Repository System Software (open source):
 
* Digital Repository System Software (open source):
 
** [http://www.dspace.org/ DSpace] (Java)
 
** [http://www.dspace.org/ DSpace] (Java)
 
+
** [http://www.dlese.org/libdev/dcs_overview.html DLESE Collection System] offers a metadata repository and editor and supports OAI-PMH as data provider and harvester
Demos and examples:
+
* OAI-PMH Protocol Validation
* [http://alcme.oclc.org/oaicat/ Demo of OAICat] - The OAICat Open Source project is a Java Servlet web application providing an OAI-PMH v2.0 repository framework.
+
** http://purl.org/net/oai_explorer (tip from XMLFile pages)
* [http://oaister.umdl.umich.edu/o/oaister/ OAIster] - Most often mentioned collection of freely available, previously difficult-to-access, academically-oriented digital resources that are easily searchable by anyone.
 
* [http://pkp.sfu.ca/?q=harvester PKP Open Archives Harvester] - Free metadata indexing system developed by the Public Knowledge Project through its federally funded efforts to expand and improve access to research.
 
* [http://ojax.sourceforge.net/#Demo OJAX Demo] - Federated Metasearch Service using Ajax/Lucene
 
 
 
  
 
== Weblinks ==
 
== Weblinks ==
* [http://www.openarchives.org/ Open Archives Initiative (OAI) Home])
+
* About
* [http://www.oaforum.org/documents/ OAI forum] and [http://www.openarchives.org/mailman/listinfo/OAI-general/ OAI general mailing list]  
+
** [http://www.openarchives.org/ Open Archives Initiative (OAI) Home])
 +
** [http://www.oaforum.org/documents/ OAI forum], [http://www.openarchives.org/mailman/listinfo/OAI-general/ OAI general mailing list] and [http://www.openarchives.org/pipermail/oai-implementers/ OAI implementers mailing list]
 +
** [http://en.wikipedia.org/wiki/Open_Archives_Initiative Wikipedia (en) article]
  
 
* Tutorials and short explanations:
 
* Tutorials and short explanations:
Line 169: Line 193:
 
** [http://www.oaforum.org/tutorial/english/intro.htm OAI for beginners], [http://eprints.rclis.org/archive/00005512/ OAI and OAI-PMH for absolute beginners] by Philip Hunter, 2005.
 
** [http://www.oaforum.org/tutorial/english/intro.htm OAI for beginners], [http://eprints.rclis.org/archive/00005512/ OAI and OAI-PMH for absolute beginners] by Philip Hunter, 2005.
 
** [http://indico.cern.ch/getFile.py/access?resId=1&materialId=0&contribId=s2t3&sessionId=2&subContId=2&confId=a035925 Tutorial: Implementing the OAI-PMH - an introduction] by Uwe Müller, 2004
 
** [http://indico.cern.ch/getFile.py/access?resId=1&materialId=0&contribId=s2t3&sessionId=2&subContId=2&confId=a035925 Tutorial: Implementing the OAI-PMH - an introduction] by Uwe Müller, 2004
 +
** [http://www.icbl.hw.ac.uk/perx/advocacy/exposingmetadata.htm 'Marketing' with Metadata] - How Metadata Can Increase Exposure and Visibility of Online Content. Version 1.0 8th March 2006.
 +
** [http://www.language-archives.org/sr Introduction to Static Repository Gateways] with examples from OLAC
 +
** [http://iesr.ac.uk/use/oaipmh/ IESR OAI-PMH Access] (for modeling services)
 +
 
* Papers:
 
* Papers:
 
** [http://www.commission4.isprs.org/workshop_hangzhou/papers/77-82%20Haixia%20Mao-A150.pdf "OAI-PMH Based Interoperation For Spatial Metadata"] by Haixia Mao (2005). In: ISPRS Workshop on Service and Application of Spatial Data Infrastructure, XXXVI(4/W6), Oct.14-16 2005, Hangzhou, China.
 
** [http://www.commission4.isprs.org/workshop_hangzhou/papers/77-82%20Haixia%20Mao-A150.pdf "OAI-PMH Based Interoperation For Spatial Metadata"] by Haixia Mao (2005). In: ISPRS Workshop on Service and Application of Spatial Data Infrastructure, XXXVI(4/W6), Oct.14-16 2005, Hangzhou, China.
Line 174: Line 202:
 
** [http://www.public.iastate.edu/~gerrymck/OpenContent.ppt "Open Content and Access for Digital Scholarship"] by Gerry McKiernan, 2004 (describes use of map resources)
 
** [http://www.public.iastate.edu/~gerrymck/OpenContent.ppt "Open Content and Access for Digital Scholarship"] by Gerry McKiernan, 2004 (describes use of map resources)
 
** [http://arxiv.org/abs/cs.DL/0601125 "Metadata aggregation and automated digital libraries: A retrospective on the NSDL experience"] by Lagoze, Carl et al (2006).
 
** [http://arxiv.org/abs/cs.DL/0601125 "Metadata aggregation and automated digital libraries: A retrospective on the NSDL experience"] by Lagoze, Carl et al (2006).
 +
** [http://indico.cern.ch/conferenceDisplay.py?confId=a035925 Implementing the benefits of OAI (OAI3)], CERN Workshop Series on Innovations in Scholarly Communication (2004).
 +
 +
[[Kategorie:English]]
 +
[[Kategorie:Abkürzungen]]
 +
[[Kategorie:Webservice]]

Latest revision as of 14 October 2012, 17:52

'Open Archives Initiative Protocol for Metadata Harvesting' (OAI-PMH). A protocol for collecting and further processing metadata.

  • English documentation follows below.
  • See the maintained HowTo OAI-PMH, which includes a short summary (German).

See also:

  • OAI-PMH article on Wikipedia (German).
  • OAI home page (English).


Description

This is part of a discussion about open source geodata (OSGeodata).

OAI-PMH is among the first on our short list for a metadata exchange protocol for the discovery of geodata. Quite a few open source tools exist for OAI-PMH (see Weblinks).

What remains to be done:

  1. Define a (lightweight) metadata exchange model.
  2. Implement data providers (see the Weblinks below or steps 1 to 8 of this tutorial).

OAI-PMH is a metadata exchange protocol of the harvesting protocol type, as opposed to search protocols. The protocol involves two kinds of so-called providers: a data provider, which serves XML metadata, and a search provider (a harvester), which stores the downloaded XML metadata in a searchable index.

Being a flexible protocol, it does not depend on a single information model. The default, unqualified Dublin Core, can be extended or replaced by domain-specific information models (see OSGeodata metadata exchange model).

  • Lightweight:
    • A low-barrier interoperability specification
    • HTTP based, GET / POST requests, XML responses, therefore RESTful, XML machinery only as needed.
  • Clear concepts:
    • Designed around metadata harvesting model; no live requests!
    • Metadata about resources, mandates unqualified Dublin Core (oai_dc) as default
    • Not a search protocol! A search protocol could be, for example, an enhanced or merged version of OpenSearch/GeoRSS and SRU/SRW, besides WFS.
  • Extensible: Has an extension mechanism for more domain-specific metadata models, like the yet-to-be-defined OSGeodataMetadataModel
  • Stable: OAI has committed to making subsequent revisions of the protocol backwards compatible.
  • Established: Implemented - among others - by Google, CiteSeer and MSN as well as by dozens of open source tools, like OAICat (Java) or mod_oai (C, installed at 500 sites); see more tools in the Weblinks section below.
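
Because requests are plain HTTP GET or POST calls with a `verb` argument, a request URL can be assembled with standard library tools alone. The following is a minimal sketch, not a reference implementation: the base URL `http://example.org/oai` and the helper name `build_request` are assumptions made for illustration, while the verb names and the `metadataPrefix` and `from` arguments come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_request(base_url, verb, args=None):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError("unknown OAI-PMH verb: " + verb)
    params = {"verb": verb}
    # Arguments are passed as a dict because 'from' is a Python keyword.
    params.update(args or {})
    return base_url + "?" + urlencode(params)


# Ask a (hypothetical) repository for all Dublin Core records
# changed since 2004-12-27.
url = build_request(
    "http://example.org/oai",
    "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2004-12-27"},
)
print(url)
```

The resulting URL answers exactly the kind of question plain HTTP cannot express on its own: which resources have changed since a given date.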

Sizes compared to SOAP: An OAI-PMH record is about 2.4 kB; a set of 100 records is 18 kB. In SOAP, a record is about 4.7 kB; a set of 100 records is 179 kB.

See OSGeodata_Discovery for a discussion of alternatives.

Architecture

OAI-PMH does harvesting and indexing beforehand, so it scales, and search functionality is left to innovative search services, as opposed to distributed search in CAT/CSW (see below).

OAI-PMH distinguishes two logical groups of services and uses them for its client/server model (data provider = server, service provider = client): (from: Implementing the OAI-PMH - an introduction)

  • Data Providers (Open Archives, Repositories)
    • handle deposit/publishing of resources in archive
    • expose metadata about resources in archive
    • refer to entities who possess data/metadata and are willing to share this with others (internally or externally) via well-defined protocols (e.g. database servers or simple XML files).
    • normally: free access of metadata
    • not necessarily: free access to full texts / resources
  • Service Providers
    • are entities who harvest and store data from Data Providers through OAI interfaces in order to provide higher-level services to users (e.g. search engines)
    • offer single user-interface across all harvested metadata
    • offer (value-added) service on the basis of the metadata
    • may select certain subsets from Data Providers (set hierarchy, date stamp)
    • may enrich metadata

Note: A data provider may also be responsible for the human-oriented (i.e. Web) interface to the archive; both functions may be offered by the same 'service'.
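
The service provider role described above (harvest metadata, store it, offer search on top) can be sketched roughly as follows. This is an illustration under stated assumptions: the sample response, the record identifiers and the function name `index_records` are invented for the example, and the "searchable index" is reduced to a simple in-memory word index. Only the XML namespaces and element names follow the OAI-PMH 2.0 and Dublin Core schemas.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented, heavily shortened ListRecords response from a data provider.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Swiss lakes vector dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Alpine rivers dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def index_records(xml_text, index):
    """Add every title word of every harvested record to the word index."""
    root = ET.fromstring(xml_text)
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        for title in record.iterfind(".//dc:title", NS):
            for word in (title.text or "").lower().split():
                index[word].add(identifier)


index = defaultdict(set)
index_records(SAMPLE_RESPONSE, index)
print(sorted(index["dataset"]))
```

Searches then run against the local index only, so no data provider is contacted at query time.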

OAI-PMH Architecture.png

Comparison of WFS, CAT/CSW and OAI-PMH

OAI-PMH vs. WFS or CSW

Why compare OAI-PMH with WFS and not with CSW 2.1? CAT/CSW is based on a distributed query architecture, and its specification (and therefore its code) is redundant with WFS. So why another protocol when there is WFS?

  • OAI-PMH is a simple and strictly RESTful harvesting protocol based on Dublin Core (since 1995) whereas WFS is a query protocol which allows a RESTful or SOAP client to retrieve geospatial data encoded in GML (since 2000). See geodata discovery for more information about these two approaches.
  • WFS, though it originated in GIS, covers a smaller community than OAI-PMH, which today has a larger user base and is supported even by Google and Yahoo!.
  • In contrast to CSW, which does distributed searching, OAI-PMH gathers (harvests) metadata. Search in WFS/CSW is therefore at best limited by the slowest server and by the lowest common denominator of implemented specs, because each server needs to implement exactly the same query functionality.
  • Both have in common that they can be made (WFS) or are (OAI-PMH) RESTful, and both can be profiled to deliver a metadata information model (to be defined, see metadata information model).
  • WFS needs to be profiled (spec. size is ~200 p.) whereas OAI-PMH probably needs to be extended (spec. size is ~50 p.) - so WFS seems more complex and more costly to implement or adapt, but much depends on the needs (sophisticated queries vs. harvesting?) and the underlying architecture (distributed online services vs. local indexing).
  • Both need a metadata information model.


CAT/CSW

CAT stands for 'Catalogue Services Specification' and CSW is its HTTP binding. A distributed query as in CAT/CSW means that search is at best limited by the slowest server and by the lowest common denominator of implementations, because in a distributed query each server needs to implement exactly the same (OGC filter) functionality.
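
The two limits named here can be made concrete with a toy model. The numbers, feature names and function names below are invented for illustration and do not describe any real CSW deployment. The sketch only assumes that all servers are queried in parallel and that a query may use only features every server implements.

```python
def distributed_search_latency(server_latencies):
    """Merged results are ready only when the slowest server has answered."""
    return max(server_latencies)


def common_query_features(feature_sets):
    """Only filter features implemented by every server can be used."""
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


# Three hypothetical catalogue servers: response times in seconds
# and the filter operators each one supports.
latencies = [0.2, 0.5, 3.0]
features = [
    {"bbox", "like", "within"},
    {"bbox", "like"},
    {"bbox"},
]

print(distributed_search_latency(latencies))
print(sorted(common_query_features(features)))
```

With harvesting, neither limit applies at query time, because the search runs on one local index with whatever query features the service provider chooses to offer.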


WFS

  • WFS is short for 'Web Feature Service'.
  • Spec. size: About 100 pages plus mandatory reference to Filter spec. of about 40 pages, plus reference to GML 2 (60 pages).
  • Background: A de-facto standard in GIS world owned by OGC
  • Request operations (Basic WFS, read only WFS):
    • GetCapabilities - describe service capabilities;
    • DescribeFeatureType - describe structure of a feature type
    • GetFeature - retrieve feature instances
  • Query:
    • Standard query (c.f. Filter) includes spatial constraints (but no temporal?)
  • Response:
    • Must be GML 2 and may be other XML encodings.
  • Peculiarities:
    • Knows version negotiation (which isn't implemented yet...)
    • Uses URN as unique identifier
  • Availability and outreach of services, tools and (open source) software:
  • Expected changes to spec. (profile or extension):
    • Profile to KVP binding
    • Profile Filter to Boolean and Wildcard matches as well as 'within'
    • ...

OAI-PMH

  • OAI-PMH is short for 'Open Archives Initiative Protocol for Metadata Harvesting'. Note that this is not a search protocol; a search protocol could be, for example, an enhanced or merged version of OpenSearch/GeoRSS and SRU/SRW, besides WFS.
  • Spec. size: Version 2.0 still is < 50 pages(!)
  • Background: Was initiated by libraries, universities, museums and galleries to promote 'open access' (OA), i.e. the free online availability of digital content.
  • Request operations (* denotes an operation which is minimally needed for a first implementation):
    • Identify (*) - describe an archive (similar to WFS' GetCapabilities)
    • ListMetadataFormats (*) - retrieve available metadata formats from archive
    • ListIdentifiers - abbreviated form of ListRecords, retrieving only headers
    • ListRecords (*) - harvest records from a repository (similar to WFS' GetFeature)
    • GetRecord - retrieve individual metadata record from a repository (also similar to WFS' GetFeature)
    • ListSets - retrieve set structure of a repository (optional)
  • Query:
    • Standard query includes temporal constraints (but no spatial yet).
  • Response:
    • Either an encoded error or unqualified Dublin Core XML or another format announced with ListMetadataFormats.
  • Peculiarities:
    • Identifiers must be URIs (similar requirement to WFS).
    • Has no version negotiation (but see operation Identify)
    • Supports incremental updates (datestamp-based selective harvesting) and paging of large responses through a resumption token, the latter only being sensible when a response contains more than 1000 records.
    • Another spec. was released called 'OAI Static Repository and an OAI Static Repository Gateway'
    • OAI-PMH may return results in compressed form, which is handled at the HTTP level (how RESTful!)
  • Availability and outreach of services, tools and (open source) software:
    • Open source software: see OAI-PMH tools
    • Gateways are available for publishing files to others
    • Google reads it as a Sitemaps file (how to submit); Yahoo! and the Internet Archive use it, among others.
  • Expected changes to spec. (profile or extension):
    • Special treatment of IDs?
    • Needs to be extended if spatial constraints are required
    • Add version negotiation?
    • ...?
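
How the temporal constraint and the resumption token interact during a harvest can be sketched as follows. This is a simplified illustration rather than a complete harvester: `harvest_identifiers`, `fake_fetch` and the two canned response pages are hypothetical, and error responses are not handled. The argument names (`verb`, `metadataPrefix`, `from`, `resumptionToken`) and the rule that a resumption token is sent without the other arguments follow the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc", from_date=None):
    """Yield record identifiers, following resumption tokens page by page.

    'fetch' is any callable that takes the request arguments as a dict
    and returns the XML response as text (for example an HTTP GET).
    """
    args = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        args["from"] = from_date  # temporal constraint: changed since ...
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(".//" + OAI + "resumptionToken")
        if not token:  # missing or empty token: the list is complete
            break
        # A resumption token replaces all other arguments except the verb.
        args = {"verb": "ListIdentifiers", "resumptionToken": token}


def _page(identifier, token):
    """Build one canned response page (test data, not a real repository)."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        f"<header><identifier>{identifier}</identifier></header>"
        f"<resumptionToken>{token}</resumptionToken>"
        "</ListIdentifiers></OAI-PMH>"
    )


def fake_fetch(args):
    """Stand-in for an HTTP request so the sketch runs offline."""
    if args.get("resumptionToken") == "page2":
        return _page("oai:example.org:2", "")
    return _page("oai:example.org:1", "page2")


print(list(harvest_identifiers(fake_fetch, from_date="2004-12-27")))
```

Passing the fetch function in keeps the paging logic independent of any particular HTTP library.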

Specifications

The best way to understand OAI-PMH is to check out the OAI Repository Explorer at the official OAI-PMH tools page. For more tools and demos, see the Weblinks section below.

Specifications:

See also HowTo OAI-PMH (German).

OAI-PMH implementations

In the domain of scholarly knowledge dissemination, issues similar to ours are being discussed. See e.g. Introduction to the IESR: A Registry of Collections and Services, by Hill and Ann Apps, MIMAS, Univ. of Manchester, 2006. (IESR's Extension to OAI-PMH Dublin Core)

Data Provider Services, demos and examples:

Service Providers:

Software and Tools:

Weblinks