HSR Texas Geo Database Benchmark

The HSR Texas Spatial Database Benchmark - A Proposal

Status: Call for Comments.

Inhaltsverzeichnis

1 Introduction
2 Methodology
3 Queries
4 Datasets

Introduction

This is a propoal for a spatial database benchmark from the University of Applied Sciences Rapperswil (HSR). A study of existing database benchmarks revealed that there exists no information publicly available which compares spatial database systems regarding their performance. Spatial databases typically form the persistence layer of a geographic information system (GIS).

Thus the Institute for Software at the University of Applied Sciences Rapperswil (HSR) decided to propose such a benchmark. This benchmark was called The HSR Texas Spatial Database Benchmark. It was conducted and verified a first time on two database management systems software (DBMS), one commercially available an one under an Open Source license.

The benchmark is based on a predefined set of queries. These queries consist of simple spatial queries, defined in the OpenGIS(tm) 'Simple Features Interface Standard (SFS)'. The queries are performed on different-sized data sets for monitoring the behavior on various loads as well as on different hardware systems.

Draft of the benchmark.

Methodology

The benchmark follows the below defined rules of engagement.

Each test runs three times in a row, the results of the third run are used for the comparison: this benchmark assumes full system caches ('hot' benchmark).
Each test takes place on the same machine.
The other DBMS are shut down while the tests are running.
The respective default installation of one DBMS is used.
There is no tuning.
Each DBMS has the same data.
The spatial records used are spherical (geographical data) and come from real world data sets.
Each test uses bounding box or point variables which are the same for the respective systems in the test. These variables are chosen at random from within the subset space.

Queries

The following queries are selected from the SQL functions defined by the SFS. The placeholder '{dataset}' will be substituted by the respective table names. The variable @bbox complies with randomly selected rectangular polygons in the geographical range of each data set, but which are the same for every system. The variable @point corresponds to a random point which is located in the area of the subsets. "geo" identifies the table column with geographical data. The statements are given in pseudo SQL code.

Query 1. Loading the data

Create tiered data subsets by sub-dividing with a given bounding box @bbox, creating the indices.

 SELECT * INTO {dataset}
 FROM {original dataset}
 WHERE ST_Intersects(@bbox, geo);

 CREATE SPATIAL INDEX idx ON {dataset} ([geo])

Query 2. A Non-spatial Selection: Count

Count all railroads.

 SELECT Count(*)
 FROM {dataset lines} l 
 WHERE l.roadflg='Y'

Query 3. Spatial Selection I: Intersect Point, Line and Polygons

Count all a) points, b) lines, c) polygons that intersect with a given bounding box '@bbox'.

 a)
 SELECT Count(*)
 FROM {dataset points} p
 WHERE ST_Intersects(@bbox, p.geo) ;

 b)
 SELECT Count(*)
 FROM {dataset lines} l 
 WHERE ST_Intersects(@bbox, l.geo) ;

 c)
 SELECT Count(*)
 FROM {dataset polygons} pg 
 WHERE ST_Intersects(@bbox, pg.geo) ;

Query 4. Spatial Selection II: Distance within

Count all a) points, b) polygons, c) lines that are within 1000 meters from a given point '@point'.

 a)
 SELECT COUNT(*)
 FROM {dataset points} p 
 WHERE ST_Distance(@point, geo) <= 1000

 b)
 SELECT COUNT(*)
 FROM {dataset polygons} pg 
 WHERE ST_Distance(@point, geo) <= 1000

 c)
 SELECT COUNT(*)
 FROM {dataset lines} l 
 WHERE ST_Distance(@point, geo) <= 1000

Query 5. Spatial Selection III: Intersect/Join Lines and Polygons

Count all railroads that intersect with a water area.

 SELECT Count(*)
 FROM {dataset lines} l, areawater_full pg
 WHERE ST_Intersects(pg.geom, l.geom) = 1 and l.railflg = 'Y'

Datasets

At the Free and Open Source Software for Geospatial Conference (FOSS4G) 2009 in Sydney, a benchmark was performed, which compared the open source GIS software products GeoServer and MapServer in terms of performance. There, PostgreSQL/PostGIS and Oracle have been tested too. Several sets from the TIGER shapefiles of Texas that can be downloaded from the U.S. Census Bureau where used as data basis ().

This data is proposed to use within this benchmark containing rivers, roads, railroads, points-of-interest and water areas. The data originate from the following shape files:

Table: Data sets from the TIGER shapefiles)
Shape file name	edges_merge	gnis_names09	areawater_merge
Type	MULTILINESTRING	POINT	MULTIPOLYGON
Number of records	Over 5 M	103,000	380,000
SRID	EPSG:4326	EPSG:4326	EPSG:4326
Source	TIGER 2008	GNIS database	TIGER 2008
Description	All line elements (rivers and roads) from the TIGER 2008 dataset for the state of Texas.	All locations and point of interest for the state of Texas.	The TIGER set of polygons describing water surface for the state of Texas.

These data sets are divided by a bounding-box procedure to subsets. Using these subsets the behavior of the respective DBMS can be observed in different-sized data sets. Figure 3 shows the gnis_names09 data set (left) and the areawater_merge data set (right).

Visualization of the point (left) and polygon (right) data set.

HSR Texas Geo Database Benchmark

Inhaltsverzeichnis

Introduction

Methodology

Queries

Query 1. Loading the data

Query 2. A Non-spatial Selection: Count

Query 3. Spatial Selection I: Intersect Point, Line and Polygons

Query 4. Spatial Selection II: Distance within

Query 5. Spatial Selection III: Intersect/Join Lines and Polygons

Datasets

Navigationsmenü

Meine Werkzeuge

Namensräume

Varianten

Ansichten

Mehr

Suche

Navigation

Weblinks

Werkzeuge