vmx

the blllog.

GeoCouch: The future is now

2010-05-03 22:35

Update: This blog entry is outdated and kept for historical reasons. Please always check for newer blog posts. The up-to-date information on how to install and use GeoCouch can be found in its README.

An idea has become reality. Exactly two years after the blog post with the initial vision, a new version of GeoCouch is finished. It's a huge step forward. The first time the dependencies were narrowed down to CouchDB itself. No Python, no SpatiaLite any longer, it's pure Erlang. GeoCouch is tightly integrated with CouchDB, so you'll get all the nice features you love about CouchDB.

Current implementation

Thanks to the feedback after the FOSS4G 2009 and the "GeoCouch: The future" blog entry, it was clear that people prefer a simple, yet powerful and tightly integrated approach over having too many external dependencies (which were a showstopper for quite a few people).

I implemented an R-tree (I call it vtree as the implementation is subject to change a lot) from scratch. The reasons why I haven't used the already existing R-tree implementation available at Github are that I needed something to learn Erlang with, that it doesn't contain tests or examples, and that it is always a good idea to implement a data structure yourself to understand the details/problems. My implementation is far from perfect, but it works well enough for now. The vtree is implemented as an append-only data structure, just as CouchDB's B-trees are. Currently it doesn't support bulk insertion.
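To give a rough idea of what an R-tree does, here is a minimal Python sketch. It is not the Erlang vtree and leaves out everything hard (splitting, balancing, append-only storage). It only shows the core idea: every inner node stores the bounding box that encloses all of its children, so a search can skip whole subtrees that don't touch the query box. All names (`Leaf`, `Node`, `search`) and the sample point "berlin" are made up for the illustration.

```python
from dataclasses import dataclass

# Bounding boxes are (west, south, east, north) tuples.


def union(a, b):
    """Smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def intersects(a, b):
    """True if the two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Leaf:
    doc_id: str
    bbox: tuple


class Node:
    def __init__(self, children):
        self.children = children
        # The node's box is the union of all child boxes
        box = children[0].bbox
        for child in children[1:]:
            box = union(box, child.bbox)
        self.bbox = box


def point(doc_id, lon, lat):
    """A point is stored as a box with zero width and height."""
    return Leaf(doc_id, (lon, lat, lon, lat))


def search(node, query):
    """Return the IDs of all leaves whose box intersects the query box."""
    if not intersects(node.bbox, query):
        return []  # prune: nothing below this node can match
    if isinstance(node, Leaf):
        return [node.doc_id]
    hits = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits


europe = Node([point("augsburg", 10.898333, 48.371667),
               point("berlin", 13.405, 52.52)])
america = Node([point("oakland", -122.270833, 37.804444)])
root = Node([europe, america])

print(search(root, (0, 0, 180, 90)))
```

With the query box covering the north-eastern quarter of the world, the `america` subtree is rejected after a single box comparison.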

If you want to know details on how to create your own indexer, have a look at my Indexer tutorial.

Feature set

Following the "Release early, release often" philosophy, only points can be inserted at the moment, and the only supported query is a bounding box search. Other geometries should follow soon.

Using GeoCouch

GeoCouch is now hosted at Github. Giving GeoCouch a go is easy:

git clone http://github.com/vmx/couchdb.git
cd couchdb
./bootstrap
./configure
make dev
./utils/run

Trying the spatial features once it's up and running is easy as well. Just add a spatial property with a named function to your Design Document, as you would do for show or list functions:

function(doc) {
    if (doc.loc) {
        emit(doc._id, {
            type: "Point",
            coordinates: [doc.loc[0], doc.loc[1]]
        });
    }
};

All you need to do is emit GeoJSON as the value (remember that a point is the only supported geometry at the moment); the key is currently ignored.

curl -X PUT http://127.0.0.1:5984/places
curl -X PUT -d '{"spatial":{"points":"function(doc) {\n    if (doc.loc) {\n        emit(doc._id, {\n            type: \"Point\",\n            coordinates: [doc.loc[0], doc.loc[1]]\n        });\n    }};"}}' http://127.0.0.1:5984/places/_design/main

Before a bounding box query can return anything, you need to insert Documents that contain a location.

curl -X PUT -d '{"loc": [-122.270833, 37.804444]}' http://127.0.0.1:5984/places/oakland
curl -X PUT -d '{"loc": [10.898333, 48.371667]}' http://127.0.0.1:5984/places/augsburg

And finally you can make a bounding box request:

curl -X GET 'http://localhost:5984/places/_design/main/_spatial/points/%5B0,0,180,90%5D'

This one should return only augsburg:

{"query1":[{"id":"augsburg","loc":[10.898333,48.371667]}]}
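Why only augsburg? The following Python sketch replays the bounding box test locally. It assumes that the four numbers in the request are ordered west, south, east, north and that `loc` holds longitude first, then latitude; both are read off the example above rather than taken from a specification. The function name `in_bbox` is made up.

```python
def in_bbox(point, bbox):
    """True if the [lon, lat] point lies inside [west, south, east, north]."""
    lon, lat = point
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


# The two documents inserted above
docs = {
    "oakland": [-122.270833, 37.804444],
    "augsburg": [10.898333, 48.371667],
}
bbox = [0, 0, 180, 90]

matches = [doc_id for doc_id, loc in docs.items() if in_bbox(loc, bbox)]
print(matches)
```

Oakland has a negative longitude, so it falls west of the box and is dropped.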

Next steps

The development of GeoCouch was quite slow in the past, but it is picking up speed as my diploma thesis (comparable to a master's thesis) will be about GeoCouch. Additionally, Couchio kindly supports the development.

The next steps are (in no particular order):

  • Better R-tree (better splitting algorithm, bulk operations)
  • Supporting more geometries
  • Polygon search
  • Improving CouchDB's plugin capabilities

Thanks

I'd like to thank all the people that kept me motivated over the past two years with their tremendous feedback. Special thanks go to Jan Lehnardt for getting me onto the Couch, Cameron Shorter for introducing me to the geospatial open source business and all people from Couchio for the great two weeks in Oakland.

Categories: en, CouchDB, Python, Erlang, geo

Processing PDF files: Auto advance

2010-02-23 22:35

Sometimes you need a PDF file that auto advances (auto flip, slide show) pages after a certain number of seconds, for example for presenting a Lightning Talk the Ignite way. There are several ways to achieve this. Today I spent hours finding the best way.

You could just hope that your favourite PDF viewer supports changing slides automatically in a certain interval (Evince doesn't). But you never know which viewer will be used when you rely on other people's computers. The next step is obvious, try to get the PDF file itself to auto advance. It is possible as Adobe Acrobat supports such a setting (it seems that even Acrobat Reader does, though I can't find that option in my one under Linux), I just need to find out how.

After some further research I found out that Latex' hyperref package supports it as well (no, I don't speak Czech). So I made some minimal Latex Beamer presentation to give it a try. The important notice that the \hypersetup{pdfpageduration=n} must be the first item within a \begin{frame} was found in some presentation guidelines. Guess what? It even works with Evince (tex file, PDF file).

I'm getting closer. My problem, though, is that I create my slides with Inkscape (resp. Inkscape Slide), so I can't really use Latex Beamer for it. But the previously mentioned presentation guidelines also mention the /Dur entry in the PDF page object. So it should be easy to add it manually. And it really is. A quick search through the PDF file generated by Latex shows that /Dur occurs close to /MediaBox. After adding /Dur 2 to my original presentation PDF file right after /MediaBox, it auto flipped every 2 seconds.

I could have written a simple script that adds it to the PDF at the right place, but that sounds pretty fragile. A better approach would be to use a PDF library that is meant for manipulating PDF files. As my favourite programming language is Python at the moment, I came across pyPdf. A quick look at the internals showed that it contains everything I need.

Here's my final solution for the problem of creating auto advancing PDF slides. A small script that does exactly what I need (and not more). I've used the Python 3 version of pyPdf, but the script should look similar for Python 2.x.

#!/usr/bin/env python3.1
# Copyright (c) 2010 Volker Mische (http://vmx.cx/)
# Licensed under MIT.

import sys
from pyPdf import PdfFileWriter, PdfFileReader
from pyPdf.generic import NameObject, NumberObject

def main(argv=None):
    if argv is None:
        argv = sys.argv

    if len(argv) != 4:
        print('Usage: setduration.py [duration-in-seconds] [input-pdf]',
              '[output-pdf]')
        return 1

    # Both files stay open until the output has been written
    with open(argv[2], "rb") as instream, open(argv[3], "wb") as outstream:
        pdfin = PdfFileReader(instream)
        pdfout = PdfFileWriter()

        for page in pdfin.pages:
            # /Dur is the display duration of the page in seconds
            page[NameObject('/Dur')] = NumberObject(argv[1])
            pdfout.addPage(page)

        pdfout.write(outstream)

if __name__ == '__main__':
    sys.exit(main())

Categories: en, Python

GeoCouch: The future

2009-12-20 22:35

GeoCouch started as a proof of concept and was heavily rewritten for the 0.10 release. As more and more people got interested, I got feedback to see what people really want/need. And now it's time to determine the future of GeoCouch. It's your chance to shape the future. In this blog entry I'll explain my ideas for the future, but I'm more than happy to get further ideas/complaints from you. So please check if my ideas match your use-cases for GeoCouch.

Stripping it down

GeoCouch needs an external spatial index. At the moment I use SpatiaLite for it, but a PostGIS backend would be easily possible. My initial idea was that it is better to use the existing power of spatial databases, rather than reinventing the wheel. I thought I could use all the power they have, that I could even use them for complex analytics, but I can't. As I only store the geometries, I need to “ask” CouchDB for the attributes (no, I don't want to store attributes in my spatial index).

If I don't use the full power of the spatial databases, but only a small fraction, there might be a better solution. Therefore I propose that GeoCouch will use a simple spatial index for storing the geometries, not a full-blown spatial database. I haven't decided yet which one it'll be, but I'm seriously thinking about moving this part to Erlang (I know that quite a few people would love that move).

You will lose functionality like reprojection. The spatial index won't know anything about projections. So GeoCouch won't be projection aware anymore, but your application still can be. For example, if you want to return your data in a different projection than it was stored in, you do the transformation after you've queried GeoCouch.
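As an illustration of reprojecting on the application side, here is a minimal Python sketch that converts a longitude/latitude pair (WGS84, SRID 4326) to spherical Web Mercator after the query. The target projection is picked only for the example, and the function name is made up; for anything beyond this simple case a projection library would be the better choice.

```python
import math

EARTH_RADIUS = 6378137.0  # WGS84 semi-major axis in metres


def wgs84_to_web_mercator(lon, lat):
    """Convert degrees lon/lat to spherical Web Mercator x/y in metres."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


# Pretend this coordinate came back from a spatial query
lon, lat = 10.898333, 48.371667
x, y = wgs84_to_web_mercator(lon, lat)
print(x, y)
```

The spatial index never sees the projected values; it keeps working with whatever coordinates were stored.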

You would also lose fancy things for geometries, like boolean operations on them. But this is something I'd call complex analytics, and not simple querying.

GeoCouch would only support three simple queries: bounding box search, polygon search and radius/distance search. If the search should be within a union of polygons, let's say all countries of the European Union, you would simply perform the union operation before you query GeoCouch.

Complex analytics

What I call “complex analytics” are things like: “return all apple trees that are located within a 10km range around buildings that are over 100m high, but only in countries with a population over 50 million people”. This is not possible with GeoCouch, as you would need the attribute values as well. Those are stored in CouchDB, so you would need to request them. What GeoCouch supports is only a simple: give me all IDs within a bounding box/polygon/radius.

Conclusion

Simple requests are needed for everyday use, thus they should be incredibly fast. Complex analytics don't necessarily need to handle thousands of requests per second, in most cases they don't even need to be processed in real-time. I'd like to see some layer built on top of GeoCouch, so CouchDB can even be used for analytics (which is a thing I wanted to have right from the start).

This means that GeoCouch will be mainly for high performance and massive sized projects that need some simple spatial bits, which is what I think the majority of users need.

If you think you really need only those simple queries, but want them to be fast, or if you think this is wrong and that you need dynamic reprojection, I can only invite you to leave a comment below or drop a mail to volker.mische@gmail.com. Thanks.

Categories: en, CouchDB, Python, geo

FOSS4G 2009: “Geodata and CouchDB” presentation is online

2009-11-17 22:35

As the final wrap-up of the FOSS4G 2009, my presentation on “Geodata and CouchDB” is available online in several formats. It should also be of interest to people who are new to CouchDB, as large parts of the talk are an introduction to CouchDB.

Categories: en, CouchDB, Python, geo

Benchmarking is not easy

2009-09-23 22:35

There are so many ways to have a play with CouchDB. This time I thought about using CouchDB as a TileCache storage. Sounds easy, so it was.

What is a tilecache

Everyone knows Google Maps and its small images, called tiles. Rendering those tiles for the whole world for every zoom level can be quite time consuming, therefore you can render them on demand and cache them once they are rendered. This is the business of a tilecache.

You can use the tilecache as a proxy to a remote tile server as well, that's what I did for this benchmark.

Coding

The implementation looks quite similar to the memcache one. I haven't implemented locking as I was just after something working, not a full-fledged backend.

When I finished coding, it was time to find out how it performs. That should be easy, as there's a tilecache_seeding script bundled with TileCache to fill the cache. So you fill the cache, then you switch the remote server off and test how long it takes if all requests are hits without any fails (i.e. all tiles are in your cache and don't need to be requested from a remote server).

The two contestants for the benchmark are the CouchDB backend and the one that stores the tiles directly on the filesystem.

Everyone loves numbers

We keep it simple and measure the time for seeding with time. How long will it take to request 780 tiles? The first number is the average (in seconds), the one in brackets the standard deviation.

  • Filesystem:

    real 0.35 (0.04)
    user 0.16 (0.02)
    sys  0.05 (0.01)
    
  • CouchDB:

    real 3.03 (0.18)
    user 0.96 (0.05)
    sys  0.21 (0.03)
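For reference, numbers in this "average (standard deviation)" form can be produced with a few lines of Python. The timings in the sketch are made up for the illustration, they are not the measurements behind the table, and it assumes the sample standard deviation is meant.

```python
import statistics


def summarize(timings):
    """Return (mean, sample standard deviation), rounded to two decimals."""
    return (round(statistics.mean(timings), 2),
            round(statistics.stdev(timings), 2))


# Hypothetical "real" times in seconds from three runs of `time`
filesystem_runs = [0.31, 0.35, 0.39]

mean, deviation = summarize(filesystem_runs)
print("real %.2f (%.2f)" % (mean, deviation))
```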
    

Let's say CouchDB is 10 times slower than the file system based cache. Wow, CouchDB really sucks! Why would you use it as tile storage? Although you could:

  • easily store metadata with every tile, like a date when it should expire.
  • keep a history of tiles and show them as "travel through time layers" in your mapping application
  • easy replication to other servers

You just don't want such a slow hog. And those CouchDB people try to tell me that CouchDB would be fast. Pha!

Really??

You might already wonder where the details are: the software version numbers, the specification of the system and all that stuff. These things are missing for a good reason. This benchmark just isn't right, even if I added these details. The problem lies some layers deeper.

This benchmark is way too far away from real-life usage. You would request many more tiles, and not the same 780 ones with every run. When I was benchmarking the filesystem cache, all tiles were already in the system's cache, therefore it was that fast.

Simple solution: clear the system cache and run the tests again. Here are the results after an echo 3 > /proc/sys/vm/drop_caches

  • Filesystem:

    real 8.36 (0.71)
    user 0.29 (0.04)
    sys  0.18 (0.03)
    
  • CouchDB:

    real 6.64 (0.15)
    user 1.13 (0.07)
    sys  0.29 (0.06)
    

Wow, the CouchDB cache is faster than the filesystem cache. Too nice to be true. The reason is easy: loading the CouchDB database file, thus one file access on the disk, is way faster than 780 accesses.

Does it really matter?

Let's take the first benchmark. Even if CouchDB were that much slower, isn't it perhaps fast enough? Even with those measures (ten times slower than the filesystem cache) it would mean your cache can take 250 requests per second. Let's say a user requests 9 tiles per second, that would be about 25 users at the same time. With every user staying 2 minutes on the map, it would mean 18 000 users per day. Not bad.
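The back-of-the-envelope calculation can be replayed in a few lines of Python. The inputs are the numbers from the text and the first benchmark; the variable names are made up.

```python
TILES = 780
COUCHDB_SECONDS = 3.03  # mean "real" time of the CouchDB run

# Throughput of the CouchDB backend, roughly 257 requests per second,
# which the text rounds down to 250
requests_per_second = TILES / COUCHDB_SECONDS

# Each user requests 9 tiles per second: about 27 users at once,
# which the text rounds down to 25
concurrent_users = 250 // 9

# A day has 720 slots of 2 minutes, each slot serves 25 users
slots_per_day = 24 * 60 // 2
users_per_day = 25 * slots_per_day

print(round(requests_per_second), concurrent_users, users_per_day)
```

Both roundings go in the pessimistic direction, so 18 000 users per day is a conservative estimate.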

Additionally you gain some nice things you won't have with other caches (as outlined above). And if you really need more performance you could always dump the tiles to the filesystem with a cron job.

Conclusion

  1. Benchmarking is not easy, but easy to get wrong.
  2. Slow might be fast enough.
  3. Read more about benchmarking on Jan's blog.

Categories: en, CouchDB, Python, TileCache, geo

GeoCouch: New release (0.10.0)

2009-09-19 22:35

Notice: This blog post is outdated, please move on :)

It has been way too long since the initial release, but it’s finally here: a new release of GeoCouch. For all first time visitors, GeoCouch is an extension for CouchDB to support geo-spatial queries like bounding box or polygon searches.

I keep this blog entry relatively short and only outline the highlights and requirements for the new release as GeoCouch finally has a real home at http://gitorious.org/geocouch/. Feel free to contribute to the wiki or fork the source.

Highlights

  • Many geometries are supported: points, lines, polygons (using Shapely).
  • Queries are largely along the lines of the OpenSearch-Geo extension draft. Currently supported are bounding box and polygon searches.
  • Adding new backends (in addition to SpatiaLite) is easily possible.

Requirements

Other versions might work.

Download

If you don’t like Git, you can download GeoCouch 0.10.0 here.

Categories: en, CouchDB, Python, geo

CouchDB: Returning all design documents with Python

2009-08-21 22:35

I just wanted to get all design documents of a CouchDB database with couchdb-python. I couldn’t find any hints on how to do it, and it took longer to find out than expected. Hence this blog entry, perhaps it saves someone a few minutes of research.

from couchdb.client import Server
couch_server = Server('http://localhost:5984/')
for designdoc in couch_server['yourdatabase']\
        .view('_all_docs', startkey='_design', endkey='_design0'):
    print 'designdoc: %s' % designdoc

Update: even simpler with slicing:

from couchdb.client import Server
couch_server = Server('http://localhost:5984/')
for designdoc in couch_server['yourdatabase']\
        .view('_all_docs')['_design':'_design0']:
    print 'designdoc: %s' % designdoc
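Why `_design0` as the end of the range? All design documents have IDs that start with `_design/`, and `0` is the character that directly follows `/` in ASCII. The range therefore covers every possible `_design/...` ID and nothing else. The Python sketch below replays this with plain string comparison; it assumes that `_all_docs` orders IDs bytewise, and the document IDs are made up.

```python
# "0" is the character right after "/" in ASCII
assert ord("/") + 1 == ord("0")

# Hypothetical document IDs of a database
doc_ids = ["Zurich", "_design/main", "_design/places", "augsburg", "oakland"]

# Same range as startkey='_design', endkey='_design0'
design_ids = [doc_id for doc_id in sorted(doc_ids)
              if "_design" <= doc_id <= "_design0"]
print(design_ids)
```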

Categories: en, CouchDB, Python

pythonpath: access nested data structures easily

2008-11-08 22:35

With pythonpath you can access values within nested Python data structures easily. This is especially useful if you use JSON. If you know which element you want, you can access it directly with a string (a pythonpath).

I created pythonpath for GeoCouch as I needed a way to point to a specific element within a JSON structure. But I think it might be useful for others as well, hence this blog entry.

Take the following data structure:

{ "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [151.21, -33.87]}
}

If you want to get the type of the geometry, you'd normally access it by

type = geo['geometry']['type'].

With pythonpath it would be:

type = pythonpath.get_item('geometry.type', geo)

Accessing array elements is just as easy:

lat = pythonpath.get_item('geometry.coordinates[1]', geo)

The syntax is quite simple, a dot (.propertyname) for dictionary element, array notation for an array element ([position]). The escape character is backslash (\).

Here's the code:

# Copyright (c) 2008 Volker Mische (http://vmx.cx/)
# Licensed under MIT.

import operator
import re

def parser(path):
    items = []

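    # Assumed reconstruction: the split pattern and the bracket detection
    # were garbled, the two regular expressions below are a best guess.
    # Split the path on dots that are not escaped with a backslash.
    for prop in re.split(r'(?<!\\)\.', path):
        # Trailing array notation, e.g. "[1]" or "[0][2]"
        match = re.search(r'(?:\[\d+\])+$', prop)
        brackets = match.group(0) if match else ''
        if len(brackets) > 0:
            if len(prop) > len(brackets):
                items.append(prop[:-len(brackets)])

            for index in re.findall(r'\[(\d+)\]', brackets):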
                items.append(int(index))

            continue

        items.append(prop)

    return items


def get_item(path, data):
    itemgetters = map(operator.itemgetter, parser(path))
    for getit in itemgetters:
        data = getit(data)
    return data

You can also download this code together with some tests.

Categories: en, Python

GeoCouch: Geospatial queries with CouchDB

2008-10-26 22:35

Notice: This blog post is outdated, please move on :)

Update (2009-09-19): There's a new GeoCouch release. More information at GeoCouch: New release 0.10.0.

After almost six months of silence I finally managed to get a prototype done (thanks Jan for keeping me motivated).

What do you get?

You get some code to play around with, to get a slight idea of what such a geospatial extension for CouchDB could look like. The code base isn’t polished yet, but it’s good enough to get it out of the door. The current version only supports one geometry type (POINT), and one operation (a bounding box search).

As CouchDB doesn’t allow an intersection of results gathered from an external service, the result of the bounding box search will be plain text document IDs and their coordinates.

How does it work?

GeoCouch consists of two parts, the indexer and the query processor. Both are connected through stdin/out with CouchDB.

Indexer (geostore)

In order to make the indexer understand which fields in the document contain geometries, a special design document is needed. As soon as a database has such a document, the database is geo-enabled and the indexer will store the geometries in a spatial index, which is a SpatiaLite database at the moment.

Every time a database in CouchDB is altered (create, delete, update), the indexer gets notified and will act accordingly to keep the spatial index up to date with CouchDB.

Query processor (geoquery)

Processing queries with an external service is possible with Paul Joseph Davis’ excellent external2 CouchDB branch. Queries to CouchDB can get passed along to an external service.

At the moment the result is the output of this service, it’s plain text in our case. In the future the external service will only return document IDs which will be passed back to the view. The result will be an intersection of document IDs of the view and the document IDs the external service returned.

How do I use it?

When everything is installed correctly it’s quite easy to get started.

Setting things up

  • Create a new database named geodata (could be anything).
  • Add a document named myhome, there you’ll store all the information of your home including the coordinates. As we are only interested in a bounding box search it’s enough to have a location:
    {
      "_id": "myhome",
      "_rev": "3358484250",
      "location": [ 151.208333, -33.869444 ]
    }
  • Add as many other documents like this as you like, and make sure all of them have a field called location with the coordinates as an array. As for the database, the name of the field could be anything, but it has to be the same in all documents.
  • Now we come to the interesting part, the special design document that geo-enables the database. The document has to be named “_design/_geocouch”. After creating it, it also needs some special fields and will look like this:

    {
      "_id": "_design/_geocouch",
      "_rev": "610069068",
      "srid": 4326,
      "loc": {
        "type": "POINT",
        "x": "location[0]",
        "y": "location[1]"
      }
    }

    The coordinate system that should be used is specified by an SRID. If you don’t know which value to use for srid, use 4326. It’s assumed that all geometries in your document belong to the same coordinate system.

    The other field is the information where to find the geometry in the documents. The name you choose will be used for the bounding box queries, I’ve chosen loc. It defines the type (POINT), and where to find the x/y coordinate (this will probably be changed to lat/lon in the future).

    The way to specify where to find the field is comparable to XPath, but much simpler. As JSON consists of nested dictionaries and arrays, you can get a property within an array with the index (e.g. location[0] is the first element in an Array called location). If it is a dictionary you specify it separated by a dot (e.g. location.x is a property named x within another one called location). It can of course be nested much deeper, the path always starts at the root of the document (e.g. bike.stolen.found[0]).

Bounding box search

And finally you can make a bounding box search. Simply browse a URL like this one (this is a bounding box that encloses the whole world):

http://localhost:5984/geodata/_external/geo?q={"geom":"loc","bbox":[-180,-90,180,90]}

The expected result is:

myhome 151.208333 -33.869444
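A browser encodes the JSON in the `q` parameter for you. If you build the URL in a script, you have to do it yourself. Here is a minimal Python sketch; the function name `bbox_query_url` is made up, and the URL layout is simply copied from the example above.

```python
import json
from urllib.parse import quote, unquote


def bbox_query_url(base, geom, bbox):
    """Build the query URL with the JSON parameter percent-encoded."""
    query = json.dumps({"geom": geom, "bbox": bbox}, separators=(",", ":"))
    return base + "?q=" + quote(query, safe="")


url = bbox_query_url("http://localhost:5984/geodata/_external/geo",
                     "loc", [-180, -90, 180, 90])
print(url)

# Decoding the parameter gives back the JSON from the example
print(unquote(url.split("?q=", 1)[1]))
```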

Requirements

You’d like to give it a try? Here is a list of the software and the versions I used to get it working on my system, but others might work as well. GeoCouch includes installation/configuration instructions.

Download GeoCouch

Get SpacialCouch now! It’s new, it’s free (MIT licensed).

What’s next?

The current version is meant to be played with. Many things are not possible, many things need to be improved. But with the power of SpatiaLite (and the underlying libraries) it shouldn’t be too hard.

Therefore I hope this will only be the start and will end up in a discussion on what should be done and what other things might be possible. I’d love to hear your use cases for a geospatially enabled CouchDB.

Categories: en, CouchDB, Python, geo

By Volker Mische
