vmx

the blllog.

FOSS4G 2009: “Geodata and CouchDB” presentation is online

2009-11-17 22:35

The final wrap-up of the FOSS4G 2009, my presentation on “Geodata and CouchDB” is available online in several formats. It should also be of interest for people who are new to CouchDB as huge parts of the talk are an introduction into CouchDB.

Categories: en, CouchDB, Python, geo

Benchmarking is not easy

2009-09-23 22:35

There are so many ways to have a play with CouchDB. This time I thought about using CouchDB as a TileCache storage. Sounds easy, so it was.

What is a tilecache

Everyone knows Google Maps and its small images, called tiles. Rendering those tiles for the whole world for every zoom level can be quite time consuming, therefore you can render them on demand and cache them once they are rendered. This is the business of a tilecache.

You can use the tilecache as a proxy to a remote tile server as well, that's what I did for this benchmark.

Coding

The implementation looks quite similar to the memcache one. I haven't implemented locking as I was just after something working, not a full-fledged backend.

When I finished coding, it was time to find out how it performs. That should be easy, as there's a tilecache_seeding script bundled with TileCache to fill the cache. So you fill the cache, then you switch the remote server off and test how long it takes if all requests are hits without any fails (i.e. all tiles are in your cache and don't need to be requested from a remote server).

The two contestants for the benchmark are the CouchDB backend and the one that stores the tiles directly on the filesystem.

Everyone loves numbers

We keep it simple and measure the time for seeding with time. How long will it take to request 780 tiles? The first number is the average (in seconds), the one in brackets the standard deviation.

  • Filesystem:

    real 0.35 (0.04)
    user 0.16 (0.02)
    sys  0.05 (0.01)
    
  • CouchDB:

    real 3.03 (0.18)
    user 0.96 (0.05)
    sys  0.21 (0.03)
    

Let's say CouchDB is 10 times slower that the file system based cache. Wow, CouchDB really sucks! Why would you use it as tile storage? Although you could:

  • easily store metadata with every tile, like a date when it should expire.
  • keep a history of tiles and show them as "travel through time layers" in your mapping application
  • easy replication to other servers

You just don't want such a slow hog. And those CouchDB people try to tell me that CouchDB would be fast. Pha!

Really??

You might already wonder, where the details are, the software version numbers, the specification of the system and all that stuff? These things are missing with a good reason. This benchmark just isn't right, even if I would add these details. The problem lies some layers deeper.

This benchmark is way to far away from a real-life usage. You would request much more tiles and not the same 780 ones with every run. When I was benchmarking the filesystem cache, all tiles were already in the system's cache, therefore it was that fast.

Simple solution: clear the system cache and run the tests again. Here are the results after as echo 3 > /proc/sys/vm/drop_caches

  • Filesystem:

    real 8.36 (0.71)
    user 0.29 (0.04)
    sys  0.18 (0.03)
    
  • CouchDB:

    real 6.64 (0.15)
    user 1.13 (0.07)
    sys  0.29 (0.06)
    

Wow, the CouchDB cache is faster than the filesystem cache. Too nice to be true. The reason is easy: loading the CouchDB database file, thus one file access on the disk, is way faster that 780 accesses.

Does it really matter?

Let's take the first benchmark, if CouchDB would be that much slower, but isn't it perhaps fast enough? Even with those measures (ten times slower than the filesystem cache) it would mean your cache can take 250 requests per second. Let's say a user requests 9 tiles per second it would be about 25 users at the same time. With every user staying 2 minutes on the map it would mean 18 000 users per day. Not bad.

Additionally you gain some nice things you won't have with other caches (as outlined above). And if you really need more performance you could always dump the tiles to the filesystem with a cron job.

Conclusion

  1. Benchmarking is not easy, but easy to get wrong.
  2. Slow might be fast enough.
  3. Read more about benchmarking on Jan's blog.

Categories: en, CouchDB, Python, TileCache, geo

GeoCouch: New release (0.10.0)

2009-09-19 22:35

Notice: This blog post is outdated, please move on :)

It has been way to long since the initial release, but it’s finally there: a new release of GeoCouch. For all first time visitors, GeoCouch is an extension for CouchDB to support geo-spatial queries like bounding box or polygon searches.

I keep this blog entry relatively short and only outline the highlights and requirements for the new release as GeoCouch finally has a real home at http://gitorious.org/geocouch/. Feel free to contribute to the wiki or fork the source.

Highlights

  • Many geometries are supported: points, lines, polygons (using Shapely).
  • Queries are largely along the lines of the OpenSearch-Geo extension draft. Currently supported are bounding box and polygon searches.
  • Adding new backends (in addition to SpatiaLite) is easily possible.

Requirements

Other versions might work.

Download

If you don’t like Git, you can download GeoCouch 0.10.0 here.

Categories: en, CouchDB, Python, geo

CouchDB: Returning all design documents with Python

2009-08-21 22:35

I just wanted to get all design documents of a CouchDB database with couchdb-python. I couldn’t find any hints how to do it, it took longer to find out than expected. Therefore this blog entry, perhaps I save someone a few minutes of research.

from couchdb.client import Server
couch_server = Server('http://localhost:5984/')
for designdoc in couch_server['yourdatabase']\
        .view('_all_docs', startkey='_design', endkey='_design0'):
    print 'designdoc: %s' % designdoc

Update: even simpler with slicing:

from couchdb.client import Server
couch_server = Server('http://localhost:5984/')
for designdoc in couch_server['yourdatabase']\
        .view('_all_docs')['_design':'_design0']:
    print 'designdoc: %s' % designdoc

Categories: en, CouchDB, Python

FOSS4G 2009: I'm speaking

2009-07-21 22:35

I did it! I'll speak on the FOSS4G Conference 2009 (Free and Open Source Software for Geospatial Conference), 20th–23rd October in Sydney about “CouchDB and Geodata”. More information is available at the official website.

Categories: en, CouchDB, geo

Poor man’s bounding box queries with CouchDB

2009-07-19 22:35

Several people store geographical points within CouchDB and would like to make a bounding box query on them. This isn’t possible with plain CouchDB _views. But there’s light at the end of the tunnel. One solution will be GeoCouch (which can do a lot more than simple bounding box queries), once there’s a new release, the other one is already there: you can use a the list/show API (Warning: the current wiki page (as at 2009-07-19) applies to CouchDB 0.9, I use the new 0.10 API).

You can either add a _list function as described in the documentation or use my futon-list branch which includes an interface for easier _list function creation/editing.

Your data

The _list function needs to match your data, thus I expect documents with a field named location which contains an array with the coordinates. Here’s a simple example document:


{
   "_id": "00001aef7b72e90b991975ef2a7e1fa7",
   "_rev": "1-4063357886",
   "name": "Augsburg",
   "location": [
       10.898333,
       48.371667
   ],
   "some extra data": "Zirbelnuss"
}

The _list function

We aim at creating a _list function that returns the same response as a normal _view would return, but filtered with a bounding box. Let’s start with a _list function which returns the same results as plain _view (no bounding box filtering, yet). The whitespaces of the output differ slightly.

function(head, req) {
    var row, sep = '\n';

    // Send the same Content-Type as CouchDB would
    if (req.headers.Accept.indexOf('application/json')!=-1)
      start({"headers":{"Content-Type" : "application/json"}});
    else
      start({"headers":{"Content-Type" : "text/plain"}});

    send('{"total_rows":' + head.total_rows +
         ',"offset":'+head.offset+',"rows":[');
    while (row = getRow()) {
        send(sep + toJSON(row));
        sep = ',\n';
    }
    return "\n]}";
};

The _list API allows to you add any arbitrary query string to the URL. In our case that will be bbox=west,south,east,north (adapted from the OpenSearch Geo Extension). Parsing the bounding box is really easy. The query parameters of the request are stored in the property req.query as key/value pairs. Get the bounding box, split it into separate values and compare it with the values of every row.

var row, location, bbox = req.query.bbox.split(',');
while (row = getRow()) {
    location = row.value.location;
    if (location[0]>bbox[0] && location[0]<bbox[2] &&
            location[1]>bbox[1] && location[1]<bbox[3]) {
        send(sep + toJSON(row));
        sep = ',\n';
    }
}

And finally we make sure that no error message is thrown when the bbox query parameter is omitted. Here’s the final result:

function(head, req) {
    var row, bbox, location, sep = '\n';

    // Send the same Content-Type as CouchDB would
    if (req.headers.Accept.indexOf('application/json')!=-1)
      start({"headers":{"Content-Type" : "application/json"}});
    else
      start({"headers":{"Content-Type" : "text/plain"}});

    if (req.query.bbox)
        bbox = req.query.bbox.split(',');

    send('{"total_rows":' + head.total_rows +
         ',"offset":'+head.offset+',"rows":[');
    while (row = getRow()) {
        location = row.value.location;
        if (!bbox || (location[0]>bbox[0] && location[0]<bbox[2] &&
                      location[1]>bbox[1] && location[1]<bbox[3])) {
            send(sep + toJSON(row));
            sep = ',\n';
        }
    }
    return "\n]}";
};

An example how to access your _list function would be: http://localhost:5984/geodata/_design/designdoc/_list/bbox/viewname?bbox=10,0,120,90&limit=10000

Now you should be able to filter any of your point clouds with a bounding box. The performance should be alright for a reasonable number of points. A usual use-case would something like displaying a few points on a map, where you don’t want to see zillions of them anyway.

Stay tuned for a follow-up posting about displaying points with OpenLayers.

Categories: en, CouchDB, JavaScript, geo

List function editing in Futon

2009-07-19 22:35

Futon is the graphical administration interface for CouchDB. It’s nice and slick for browsing and editing views, but there is one feature missing: you can’t edit _list functions in similar fashion. You need to edit them as JSON strings.

As I wanted to play a bit with _list, I’ve created a branch which implements such an interface. Its usage should be quite self-explanatory. Just select a _view, from there you can switch to the "List" tab to create or edit a _list function.

You can get my futon-list branch at GitHub. Instead of using git, you can also download the share/wwww directory (click on the download button within the ‘share/www’ directory) and unpack it over your current source.

In case you wonder why your _list function doesn’t work, the API has changed for CouchDB 0.10.

Screenshot of the _list interface in Futon

Screenshot of the _list interface in Futon

Categories: en, CouchDB, JavaScript

CouchDB _mix branch: Intersection of _view and _external

2009-04-21 22:35

In CouchDB it’s possible to query an external service (I’ll call it _external from now on) which returns an HTTP response directly to the client that made the request. Although this is already quite nice, it wasn’t possible to combine such _external requests with a classical _view.

The need for an intersection of _view and _external

Sometimes you’d like to exclude documents in a more dynamic fashion than a CouchDB _view supports it. Examples would be geospatial queries, a simple search like “exclude all documents that don’t contain a certain string in the title” or even fulltext searching. Therefore I’ve created a new handler called “_mix”.

The problem

As _external already exists quite a long time, it was clear that I would reuse the available functionality. The basic idea is simple: take all documents from a _view and all from _external, intersect them and finally output the result.

The problem is that CouchDB can be used for huge data sets, where you don’t want to keep a complete _view in memory to perform an intersection. The goals were:

  • The output needs to be streamable
  • Don’t keep all documents in memory
  • Use the existing functionality

The implementation

Over the past few months I had lengthy discussions with Paul Davis to find a suitable solution for the problem. We were going through all our ideas over and over again. The way I’ve implemented it now works for me so far, but it is definitely not the ultimate one and only solution, it’s just some solution.

As most of the functionality already exists, the current API of _view and _external is used. The difference is that it is POSTed as JSON to the mix handler instead of a GET request. Here’s an example with curl:

curl -d '{"design": "designdoc", "view": {"name": "viewname", "query": {"limit": "11"}}, "external": {"name": "minimal", "query": {"bbox": "[23,42,46,89]"}, "include_docs": false}}' http://localhost:5984/yourdb/_mix

At the moment most of the code is just copy and pasted from couch_httpd_view.erl and couch_httpd_external_* with some additional parsing of the POSTed JSON. The only new thing is that there’s an _external request before every document of a _view is outputted. This requests contains either the document ID or the whole document (if “include_docs” is set to “true”) and needs to return “true” if the document should be outputted (or resp. “false” if not).

I’ve included a sample _external script which excludes documents randomly (it can be found at src/contrib/minimal_external.py). To have a play with it, you just need to enable _external and add that script. How to do that can be found in the CouchDB Wiki.

Get it

All you need to do to have some fun with it is checking out my _mix branch at github.

Final words

And finally I’d like to thank Paul Davis for his time to discuss the issues with the intersection of _view and _external. Another “thank you” goes out to Adam Groves, he discovered a lot of annoyances with the parsing of the queries.

Categories: en, CouchDB

GeoCouch: Geospatial queries with CouchDB

2008-10-26 22:35

Notice: This blog post is outdated, please move on :)

Update (2009-09-19): There's a new GeoCouch release. More information at GeoCouch: New release 0.10.0.

After almost six months of silence I finally managed to get a prototype done (thanks Jan for keeping me motivated).

What do you get?

You get some code to play around with, to get a slight idea of how such a geospatial extension for CouchDB could look like. The code base isn’t polished yet, but it’s good enough to get it out of the door. The current version only supports one geometry type (POINT), and one operation (a bounding box search).

As CouchDB doesn’t allow an intersection of results gathered from an external service, the result of the bounding box search will be plain text document IDs and their coordinates.

How does it work?

GeoCouch consists of two parts, the indexer and the query processor. Both are connected through stdin/out with CouchDB.

Indexer (geostore)

In order to make the indexer understand which fields in the document contain geometries, a special design document is needed. As soon as a database has such a document, the database is geo-enabled and the indexer will store the geometries in a spatial index, which is a SpatiaLite database at the moment.

Everytime a database in CouchDB is altered (create, delete, update) the indexer gets notified and will act accordingly to keep the spatial index up to date with CouchDB.

Query processor (geoquery)

To process queries with an external service is possible with Paul Joseph Davis’ excellent external2 CouchDB branch. Queries to CouchDB can get passed along to an external service.

At the moment the result is the output of this service, it’s plain text in our case. In the future the external service will only return document IDs which will be passed back to the view. The result will be an intersection of document IDs of the view and the document IDs the external service returned.

How do I use it?

When everything is installed correctly it’s quite easy to get started.

Setting things up

  • Create a new database named geodata (could be anything).
  • Add a document named myhome, there you’ll store all the information of your home including the coordinates. As we are only interested in a bounding box search it’s enough to have a location:
    {
      "_id": "myhome",
      "_rev": "3358484250",
      "location": [ 151.208333, -33.869444 ]
    }
  • Add as many other documents like this, make sure all of them have a field called location with the coordinates as array. As for the database, the name of the field could be anything, but has to be the same in all documents.
  • Now we come to the interesting part, the special design view that geo-enables the database. The document has to be named “_design/_geocouch”. After creating it also needs some special fields and will look like this:

    {
      "_id": "_design/_geocouch",
      "_rev": "610069068",
      "srid": 4326,
      "loc": {
        "type": "POINT",
        "x": "location[0]",
        "y": "location[1]"
      }
    }

    The coordinate system that should be used is specified by an SRID. If you don’t know which value to use for srid, use 4326. It’s assumed that all geometries in your document belong to the same coordinate system.

    The other field is the information where to find the geometry in the documents. The name you choose will be used for the bounding box queries, I’ve chosen loc. It defines the type (POINT), and where to find the x/y coordinate (this will probably be changed to lat/lon in the future).

    The way to specify where to find the field is comparable to XPath, but much simpler. As JSON consists of nested dictionaries and arrays, you can get a property within an array with the index (e.g. location[0] is the first element in an Array called location). If it is a dictionary you specify it separated by a dot (e.g. location.x is a property named x within another one called location). It can of course be nested much deeper, the path always starts at the root of the document (e.g. bike.stolen.found[0]).

Bounding box search

And finally you can make a bounding box search. Simply browse a URL like this one (this is a bounding box that encloses the whole world):

http://localhost:5984/geodata/_external/geo?q={"geom":"loc","bbox":[-180,-90,180,90]}

The expected result is:

myhome 151.208333 -33.869444

Requirements

You’d like to give it a try? Here is a list of the software and their versions I used to get it work on my system, but others might work as well. GeoCouch includes installation/configuration instructions.

Download GeoCouch

Get SpacialCouch now! It’s new, it’s free (MIT licensed).

What’s next?

The current version is meant to play with, many things are not possible, many things needs to be improved. But with the power of SpatiaLite (and the underlying libraries) it shouldn’t be too hard.

Therefore I hope this will only be start and will end up in a discussion on what should be done, what other things might be possible. I’d love to hear your use cases for a geospatially enabled CouchDB.

Categories: en, CouchDB, Python, geo

CouchDB and geodata?

2008-05-03 22:35

Let me introduce the two protagonists. If you know them already, just skip this part.

CouchDB

From the official website:

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.

The word database is often connected to RDBMS, but CouchDB is way different. You don’t store your data in predefined tables and fields with certain data types like INTEGER or VARCHAR, but every database record is stored on it’s own (in so-called documents).

In RDBMS you build relations between several tables to store and receive the data; in a document-oriented DB (DODB) one record is stored after another (these records can, of course, be splitted into several documents that might even reference each other through their ID). The structure of these documents doesn’t matter for their storage. The big advantage is that if a new property is needed, just add it to the document. There’s no need to change any global context (like schema definitions of tables in RDBMS).

Geodata

I haven’t found a good definition for geodata, so here’s my own:

Geodata is data with a spatial reference.

This data is not restricted to the spatial reference only. Far more important is the actual (meta)data that is connected to this spatial reference. This data describes what it is all about. It could be a house with information about its number, age, size or a measuring station that monitors the temperature.

Are you serious?

Why should someone want to put his geodata into a big mess of thousands of documents instead of a nicely structured RDBMS? You don’t have to be a computer scientist to know that retrieving data out of a RDBMS is damn fast and a DODB approach sounds like a slow, “I grep through a long list of files”.

This might partly be true, but high performance shouldn’t be a use case for DODBs. Their flexibility and ease of usage is what they make them perform great. You have the choice between being fast or being flexible.

The use case

Flexibility over performance for geodata services has a use case when it comes to interoperability between different data sources.

Imagine you are the governor of a big country that consists of several smaller territories. Each of these have a smart guy that developed (independent of all the others) a system to collect data about how many bicycles topple over per day. It’s a geo-spational system, as the exact location where it happend is stored in the database.

All territories use a RDBMS, but from different manufacturers. In addition they store the information about the bikes differently. One territory distinguishes between bicycle for children, youth and adults; another one stores the size of the felly instead. Those information could be mapped very easily to a uniform one, but the territories don’t want to give up the infrastructures of their current systems. They still want to collect their data in their way.

What you really want is a solution to be able exchange the data easily between the territories and have uniform way to access the data country wide.

Solution I

To exchange their data they set a new layer of transformation above the current DB. The output will be a new format they both agreed on. This sounds like a good solution for the problem, but there are a few downsides:

  • The transformation could be very difficult to express with SQL. This could lead to huge slow downs. This isn’t such a big problem if you just exchange the data, but a big advantage, the speed of RDBMS, gets lost.
  • The transformation layer needs to support for DBs of different manufacturers.
  • Queries across territory borders seem difficult. Will all servers serve all data? Will you need to query multiple servers to get the data of two territories?
  • Heterogeneous environments lead to higher maintenance costs than homogeneous ones.

Solution II

All territories store their data in a new shiny type of DB, a DODB. If they collect the data, it’s currently transformed somehow to fit into a RDBMS. They could either change this and store it directly into the new DB (long term goal) or transform their current data to make it fit.

So what’s the difference between transforming the data from the RDBMS to another RDBMS or to a DODB?

  • Transforming to a DODB is more like a dump of the data, thus easy.
  • Probably you can’t convert to another existing DB schema, as this will lead to a lost of information. So a new DB schema needs to be created/an existing one altered (every time something new occurs).

Characteristics:

  • All data can be stored in one big database, queries across territories are easy (simple “if’s”)
  • A single database can be replicated easily.
  • Queries are slow compared to plain SQL queries on RDBMS, probably not suitable for real-time applications

Solution III

Follow the approach of Solution II but using one gigantic RDBMS that stores the DB schemas of all territories. That would work, too. The difference it that RDBMS wasn’t meant for such things.

Forecast

I think Solution II shows that CouchDB has a big potential in that area. At the moment it's more an idea than a solution, there a still a view contradictions, but these will hopefully be solved.

One crux are speedy retrievals of features within a certain bounding box, this issue will be the spotlight of a future post.

Categories: en, geo, CouchDB

By Volker Mische

Powered by Kukkaisvoima version 7