geo | vmx - the blllog.

About me

My name is Volker Mische and I'm an open source enthusiast and hacker. You can reach me via email, or Bluesky (@vmx.cx). Find me also on GitHub.

Non-validating WKT parser for Erlang

2010-05-14 22:35

The upcoming OpenSearch Geo specification will add support for querying with WKT (Well-Known Text). As I plan to support this specification in GeoCouch, I needed a WKT parser written in Erlang. I tried several ways to write this parser, but ended up writing it by hand, based on the ideas of the fabulous MochiWeb JSON2 parser.

The parser is meant to be fast, so it is non-validating. This means that it parses valid WKT, but also accepts strings that merely look like WKT and are not valid. The grammar is simplified to (in EBNF as used for the XML spec):

wkt ::= item | string  '(' space* item (comma item)* ')'
item ::= string (geom | list | nested_list | item | 'EMPTY')
nested_list ::= space* '(' list (comma list)* ')' | '(' nested_list+ ')'
list ::= '(' geom (comma geom)* ')'
geom ::= space* '(' coord (comma coord)* ')'
coord ::= space* number (space+ number)*
number ::= integer | float 
integer ::=  ('-' | '+')? [0-9]+
float ::= ('-' | '+')? [0-9]+ '.' [0-9]+ exponent?
exponent ::= 'E' ('-' | '+')? [0-9]+
string ::= [a-zA-Z]+ (space* [a-zA-Z])*
space ::= #x20
comma ::= ',' space*

I hope I got the grammar right; leave a comment if not. It also means that a string like this(is(10 20), a test EMPTY) is parsed to:

{this,[{is,[{10,20}]},{'a test',[]}]}

A validating parser would be much slower as it would also need to perform checks on the geometry, e.g. for polygons whether interiors are really within the exterior ring or not.

The general rule is: a coordinate is transformed to a tuple, a list of coordinates to a list. The geometry name becomes an atom. Here's an example for a polygon:

wkt:parse("POLYGON ((102 103, 204 205, 306 107, 102 103),
                    (12 13, 24 25, 36 17, 12 13),
                    (62 63, 74 75, 86 67, 62 63))").
{polygon,[[{102,103},{204,205},{306,107},{102,103}],
          [{12,13},{24,25},{36,17},{12,13}],
          [{62,63},{74,75},{86,67},{62,63}]]}
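
To illustrate the mapping rule outside of Erlang, here is a minimal JavaScript sketch. It is not the wkt module: parseRings is a hypothetical helper that only handles the polygon-like shape from the example, and since JavaScript has no tuples or atoms, coordinates become two-element arrays and the name becomes a lowercase string. Like the real parser, it validates nothing.

```javascript
// Hypothetical helper, not part of the Erlang wkt module: parses a
// polygon-like WKT string of the form NAME ((ring), (ring), ...).
function parseRings(wkt) {
    var open = wkt.indexOf('(');
    var name = wkt.slice(0, open).trim().toLowerCase();
    // Strip the outermost pair of parentheses.
    var body = wkt.slice(open + 1, wkt.lastIndexOf(')'));
    // Every "(...)" group inside the body is one ring.
    var rings = (body.match(/\([^()]*\)/g) || []).map(function (ring) {
        return ring.slice(1, -1).split(',').map(function (coord) {
            // One coordinate: numbers separated by whitespace.
            return coord.trim().split(/\s+/).map(Number);
        });
    });
    return {name: name, rings: rings};
}

var result = parseRings('POLYGON ((102 103, 204 205, 306 107, 102 103),' +
                        ' (12 13, 24 25, 36 17, 12 13))');
console.log(JSON.stringify(result));
```

The regular expression shortcut works here only because rings never contain nested parentheses; a general WKT parser needs the recursive structure described by the grammar.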

In case you're getting excited now, the source is available at GitHub, released under the MIT License.

If someone plans to write a validating WKT parser for Erlang (please let me know), I propose using neotoma. It's a really nice "packrat parser-generator for Erlang for Parsing Expression Grammars (PEGs)".

No comments

Categories: en, GeoCouch, Erlang, geo

GeoCouch: The future is now

2010-05-03 22:35

Update: This blog entry is outdated and kept for historical reasons. Please always check for newer blog posts. The up-to-date information on how to install and use GeoCouch can be found in its README.

An idea has become reality. Exactly two years after the blog post with the initial vision, a new version of GeoCouch is finished. It's a huge step forward. For the first time the dependencies are narrowed down to CouchDB itself. No Python, no SpatiaLite any longer, it's pure Erlang. GeoCouch is tightly integrated with CouchDB, so you'll get all the nice features you love about CouchDB.

Current implementation

Thanks to the feedback after the FOSS4G 2009 and the "GeoCouch: The future" blog entry, it was clear that people prefer a simple, yet powerful and tightly integrated approach, rather than having too many external dependencies (which was a showstopper for quite a few people).

I implemented an R-tree (I call it vtree as the implementation is subject to change a lot) from scratch. The reasons why I haven't used the already existing R-tree implementation available at GitHub are that I needed something to learn Erlang with, that it doesn't contain tests or examples, and that it is always a good idea to implement a data structure yourself to understand the details/problems. My implementation is far from perfect but works well enough for now. The vtree is implemented as an append-only data structure, just as CouchDB's B-trees are. Currently it doesn't support bulk insertion.

If you want to know details on how to create your own indexer, have a look at my Indexer tutorial.

Feature set

Following the "Release early, release often" philosophy, currently only points can be inserted and the only supported query is a bounding box search. Other geometries should follow soon.
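
For readers new to R-trees, a bounding box search can be sketched as follows. This is a minimal JavaScript illustration under simplifying assumptions, not the vtree code: the tree is a hand-built nested object, and intersects, search, leaf and the node layout are names made up for this example.

```javascript
// Bounding boxes are [west, south, east, north].
function intersects(a, b) {
    return a[0] <= b[2] && a[2] >= b[0] && a[1] <= b[3] && a[3] >= b[1];
}

// Inner nodes carry the bounding box (mbr) of everything below them,
// leaves carry a single point. Subtrees whose mbr misses the query
// bounding box are skipped without looking at their children.
function search(node, bbox, found) {
    found = found || [];
    if (!intersects(node.mbr, bbox)) {
        return found;
    }
    if (node.children) {
        node.children.forEach(function (child) {
            search(child, bbox, found);
        });
    } else {
        found.push(node.id);
    }
    return found;
}

// A point is stored as a degenerate bounding box.
function leaf(id, x, y) {
    return {id: id, mbr: [x, y, x, y]};
}

var tree = {
    mbr: [-122.270833, 37.804444, 10.898333, 48.371667],
    children: [
        leaf('oakland', -122.270833, 37.804444),
        leaf('augsburg', 10.898333, 48.371667)
    ]
};

console.log(search(tree, [0, 0, 180, 90]));
```

A real R-tree additionally has to keep the tree balanced and decide how to split full nodes on insertion, which is where most of the complexity lives.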

Using GeoCouch

GeoCouch is now hosted at Github. Giving GeoCouch a go is easy:

git clone http://github.com/vmx/couchdb.git
cd couchdb
./bootstrap
./configure
make dev
./utils/run

Trying the spatial features once it's up and running is easy as well. Just add a spatial property and a named function to your Design Document as you would do for show or list functions:

function(doc) {
    if (doc.loc) {
        emit(doc._id, {
            type: "Point",
            coordinates: [doc.loc[0], doc.loc[1]]
        });
    }
};

All you need to do is emit GeoJSON as the value (remember that Point is the only supported geometry at the moment); the key is currently ignored.

curl -X PUT http://127.0.0.1:5984/places
curl -X PUT -d '{"spatial":{"points":"function(doc) {\n    if (doc.loc) {\n        emit(doc._id, {\n            type: \"Point\",\n            coordinates: [doc.loc[0], doc.loc[1]]\n        });\n    }};"}}' http://127.0.0.1:5984/places/_design/main

Before a bounding box query can return anything, you need to insert Documents that contain a location.

curl -X PUT -d '{"loc": [-122.270833, 37.804444]}' http://127.0.0.1:5984/places/oakland
curl -X PUT -d '{"loc": [10.898333, 48.371667]}' http://127.0.0.1:5984/places/augsburg

And finally you can make a bounding box request:

curl -X GET 'http://localhost:5984/places/_design/main/_spatial/points/%5B0,0,180,90%5D'

This one should return only augsburg:

{"query1":[{"id":"augsburg","loc":[10.898333,48.371667]}]}
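
The %5B and %5D in the request URL are simply the URL-encoded square brackets of the bounding box array [west, south, east, north]. A small JavaScript sketch of how a client could build that URL; spatialUrl is a hypothetical helper, and the path layout is copied from the curl example above:

```javascript
// Hypothetical helper: builds the URL for a bounding box request.
// bbox is [west, south, east, north].
function spatialUrl(base, db, designDoc, name, bbox) {
    // encodeURI escapes the square brackets but keeps the commas.
    var encoded = encodeURI('[' + bbox.join(',') + ']');
    return base + '/' + db + '/_design/' + designDoc +
           '/_spatial/' + name + '/' + encoded;
}

var url = spatialUrl('http://localhost:5984', 'places', 'main', 'points',
                     [0, 0, 180, 90]);
console.log(url);
```

Oakland has a negative longitude (-122.270833), so it lies west of the bounding box, which is why only augsburg is returned.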

Next steps

The development of GeoCouch was quite slow in the past, but it is picking up speed as my diploma thesis (comparable to a master's thesis) will be about GeoCouch. Additionally, Couchio kindly supports the development.

The next steps are (in no particular order):

Better R-tree (better splitting algorithm, bulk operations)
Supporting more geometries
Polygon search
Improving CouchDB's plugin capabilities

Thanks

I'd like to thank all the people that kept me motivated over the past two years with their tremendous feedback. Special thanks go to Jan Lehnardt for getting me onto the Couch, Cameron Shorter for introducing me to the geospatial open source business and all the people from Couchio for the great two weeks in Oakland.

12 Comments

Categories: en, CouchDB, Python, Erlang, geo

GeoCouch: The future

2009-12-20 22:35

GeoCouch started as a proof of concept and was heavily rewritten for the 0.10 release. As more and more people got interested, I got feedback to see what people really want/need. And now it's time to determine the future of GeoCouch. It's your chance to shape the future. In this blog entry I'll explain my ideas for the future, but I'm more than happy to get further ideas/complaints from you. So please check if my ideas match your use-cases for GeoCouch.

Stripping it down

GeoCouch needs an external spatial index. At the moment I use SpatiaLite for it, but a PostGIS backend would be easily possible. My initial idea was that it is better to use the existing power of spatial databases, rather than reinventing the wheel. I thought I could use all the power they have, that I could even use them for complex analytics, but I can't. As I only store the geometries, I need to “ask” CouchDB for the attributes (no, I don't want to store attributes in my spatial index).

If I don't use the full power of the spatial databases, but only a small fraction, there might be a better solution. Therefore I propose that GeoCouch will use a simple spatial index for storing the geometries, not a full-blown spatial database. I haven't decided yet which one it'll be, but I'm really thinking about moving this part to Erlang (I know that quite a few people would love that move).

You will lose functionality like reprojection. The spatial index won't know anything about projections. So GeoCouch won't be projection aware anymore, but your application still can be. For example, if you want to return your data in a different projection than it was stored in, you do the transformation after you've queried GeoCouch.
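
As an illustration of doing the transformation in the application, here is a minimal JavaScript sketch that converts longitude/latitude in degrees to spherical Web Mercator metres after a query. The choice of Web Mercator, the function name toWebMercator and the sample coordinates are assumptions made for this example; a real application would more likely use a projection library.

```javascript
// Radius of the sphere used by spherical Web Mercator, in metres.
var R = 6378137;

// Hypothetical helper: [lon, lat] in degrees -> [x, y] in metres.
function toWebMercator(lonLat) {
    var lon = lonLat[0] * Math.PI / 180;
    var lat = lonLat[1] * Math.PI / 180;
    return [R * lon, R * Math.log(Math.tan(Math.PI / 4 + lat / 2))];
}

// Pretend these coordinates came back from a GeoCouch query.
var stored = [[10.898333, 48.371667], [-122.270833, 37.804444]];
var projected = stored.map(toWebMercator);
console.log(projected);
```

The point is only where the work happens: the index returns coordinates exactly as they were stored, and the application reprojects the (usually small) result set.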

You would also lose fancy things for geometries, like boolean operations on them. But this is something I'd call complex analytics, and not simple querying.

GeoCouch would only support three simple queries: bounding box search, polygon search and radius/distance search. If the search were within a union of polygons, let's say all countries of the European Union, you would simply perform the union operation before you query GeoCouch.

Complex analytics

What I call “complex analytics” is things like: “return all apple trees that are located within a 10km range around buildings that are over 100m high, but only in countries with a population over 50 million people”. This is not possible with GeoCouch alone, as you would need the attribute values as well. Those are stored in CouchDB, so you would need to request them. What GeoCouch supports is only the simple case: give me all IDs within a bounding box/polygon/radius.

Conclusion

Simple requests are needed for everyday use, thus they should be incredibly fast. Complex analytics don't necessarily need to handle thousands of requests per second, in most cases they don't even need to be processed in real-time. I'd like to see some layer built on top of GeoCouch, so CouchDB can even be used for analytics (which is a thing I wanted to have right from the start).

This means that GeoCouch will mainly be for high-performance and massive-sized projects that need some simple spatial bits, which is what I think the majority of users need.

If you think you really need only those simple queries, but want them to be fast, or if you think this is wrong and that you need dynamic reprojection, I can only invite you to leave a comment below or drop a mail to volker.mische@gmail.com. Thanks.

10 Comments

Categories: en, CouchDB, Python, geo

FOSS4G 2009: “Geodata and CouchDB” presentation is online

2009-11-17 22:35

As the final wrap-up of the FOSS4G 2009, my presentation on “Geodata and CouchDB” is available online in several formats. It should also be of interest to people who are new to CouchDB, as large parts of the talk are an introduction to CouchDB.

The raw slides as PDF (licensed under CC-BY-3.0-de).
The slides with comments as HTML (licensed under CC-BY-3.0-de).
The slides with audio (or at blib.tv). It’s the recording of the actual talk at the conference. Thanks Alex and FOSSLC for recording it (licensed under CC-BY-SA-3.0).

No comments

Categories: en, CouchDB, Python, geo

Drag as long as you want

2009-11-11 22:35

It has been a very long-standing bug (officially it was a missing feature) in OpenLayers that has annoyed me since the first time I used OpenLayers. I’m talking about ticket #39: “Allow pan-dragging while outside map until mouseup”.

Normally when you drag the map in OpenLayers it will stop dragging as soon as you hit the edge of the map viewport (the div that contains the map). Whenever you have a small map, but a huge window and a loooong way to drag, it can get quite annoying, as the maximum distance you can drag at once is the size of that viewport.

But yesterday it finally happened. A patch to fix it landed in trunk. A first rough cut was made at the OpenLayers code sprint at the FOSS4G. Andreas Hocevar reviewed the code and made a more unobtrusive version of it (thanks, again).

Try these two examples to see the difference. Click on the map and drag it a long way to the right and back to the left again (you might need to zoom it a bit to see the full effect):

As it is a new feature, it isn’t enabled by default (and only available on current SVN trunk, it will be available in OpenLayers 2.9). To enable it on your map, just use the following code to add the documentDrag parameter to the DragPan control (you obviously need a recent SVN checkout).

Update (2009-11-18): It got even easier with r9805:

// Use default controls but with documentDrag enabled.
var controls = [
    new OpenLayers.Control.Navigation({documentDrag: true}),
    new OpenLayers.Control.PanZoom(),
    new OpenLayers.Control.ArgParser(),
    new OpenLayers.Control.Attribution()];
map = new OpenLayers.Map('map', {controls: controls});

For a full working version have a look at the source of the documentDrag example.

No comments

Categories: en, OpenLayers, JavaScript, geo

FOSS4G 2009: It was great

2009-10-25 22:35

The FOSS4G 2009 (Free and Open Source Software for Geospatial Conference) is over now, and it was great. I've finally met many people that I had previously only chatted with or had discussions with on mailing lists.

Organisation and venue

The Sydney Convention & Exhibition Centre Darling Harbour really is an amazing venue and Arinex did a great job as well. We had good food, the technicians were keeping everything up and running, even the wireless internet didn't break down and performed well.

The Organising Committee did an excellent job (especially Mark), too. I exclude myself a bit, I was more the code monkey before the conference, rather than keeping that conference running smoothly. But because of that I had the chance to visit quite a few presentations.

Presentations

Probably the most favoured presentation was Paul Ramsey's keynote speech. It was just incredibly insightful and entertaining (watch it at YouTube).

There were two other excellent presentations as well. First the Mapping interviews with open source technologies by Chris McDowall. He is using a video projector and a Wii remote control in order to map locations people are pointing at during an interview (just watch this video to get a better idea).

And second the Visualising animal movements in ‘near’ real time by Ben Madin. It was about a project where they try to track the movements of cows in Southeast Asia. The idea is to place GSM transmitters in one of the cow's stomachs to track their position. But they are facing problems like "How to get a GSM signal through 40cm of meat". Really interesting.

Geodata and CouchDB

So how did my talk go? I'm very happy with it. I hadn't expected so much positive feedback and so many good conversations about CouchDB and GeoCouch afterwards and during the next days.

After show parties

After the talks it's time to socialise while having a few beers. It was again great, every single night.

One outstanding event was the Ignite Spatial on Wednesday. 10 high paced talks with 20 slides displayed 15 secs each. My favourite one was the Pie charts are evil talk by Glen Bell. Another result of the night is that I'll always think about short green skirts whenever someone is mentioning Google Wave.

The code sprint

I was code sprinting on OpenLayers. It was well organised and we got some cool new stuff in. Sadly, I didn't reach my goal of fixing Ticket 39, but hopefully soon (or next year in Barcelona). But I discussed the implementation details of the abstraction of the UI in OpenLayers with Roald de Wit and Andreas Hocevar (that idea was discussed in the OpenLayers BOF).

Final words

Yes, it really was great. I hope to see you all again in Barcelona at the FOSS4G 2010.

No comments

Categories: en, geo

Benchmarking is not easy

2009-09-23 22:35

There are so many ways to have a play with CouchDB. This time I thought about using CouchDB as a TileCache storage. Sounds easy, and so it was.

What is a tilecache

Everyone knows Google Maps and its small images, called tiles. Rendering those tiles for the whole world for every zoom level can be quite time consuming, therefore you can render them on demand and cache them once they are rendered. This is the business of a tilecache.

You can use the tilecache as a proxy to a remote tile server as well, that's what I did for this benchmark.

Coding

The implementation looks quite similar to the memcache one. I haven't implemented locking as I was just after something working, not a full-fledged backend.

When I finished coding, it was time to find out how it performs. That should be easy, as there's a tilecache_seeding script bundled with TileCache to fill the cache. So you fill the cache, then you switch the remote server off and test how long it takes if all requests are hits without any fails (i.e. all tiles are in your cache and don't need to be requested from a remote server).

The two contestants for the benchmark are the CouchDB backend and the one that stores the tiles directly on the filesystem.

Everyone loves numbers

We keep it simple and measure the time for seeding with time. How long will it take to request 780 tiles? The first number is the average (in seconds), the one in brackets the standard deviation.

Filesystem:

real 0.35 (0.04)
user 0.16 (0.02)
sys  0.05 (0.01)

CouchDB:

real 3.03 (0.18)
user 0.96 (0.05)
sys  0.21 (0.03)

Let's say CouchDB is 10 times slower than the file system based cache. Wow, CouchDB really sucks! Why would you use it as tile storage? Although you could:

easily store metadata with every tile, like a date when it should expire
keep a history of tiles and show them as "travel through time layers" in your mapping application
easily replicate the tiles to other servers

You just don't want such a slow hog. And those CouchDB people try to tell me that CouchDB would be fast. Pha!

Really??

You might already wonder where the details are: the software version numbers, the specification of the system and all that stuff. These things are missing for a good reason. This benchmark just isn't right, even if I added these details. The problem lies some layers deeper.

This benchmark is way too far from real-life usage. You would request many more tiles and not the same 780 ones with every run. When I was benchmarking the filesystem cache, all tiles were already in the system's cache, therefore it was that fast.

Simple solution: clear the system cache and run the tests again. Here are the results after an echo 3 > /proc/sys/vm/drop_caches:


Filesystem:

real 8.36 (0.71)
user 0.29 (0.04)
sys  0.18 (0.03)

CouchDB:

real 6.64 (0.15)
user 1.13 (0.07)
sys  0.29 (0.06)

Wow, the CouchDB cache is faster than the filesystem cache. Too nice to be true. The reason is easy: loading the CouchDB database file, thus one file access on the disk, is way faster than 780 accesses.


Does it really matter?

Let's take the first benchmark. Even if CouchDB were that much slower, isn't it perhaps fast enough? Even with those measurements (ten times slower than the filesystem cache) your cache can take 250 requests per second. Let's say a user requests 9 tiles per second; that would be about 25 users at the same time. With every user staying 2 minutes on the map, it would mean 18 000 users per day. Not bad.
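
The back-of-the-envelope numbers above, spelled out as a small JavaScript calculation. The inputs are the measurements and assumptions from the text; the variable names are mine. Note that the text rounds the intermediate results to 250 requests per second and about 25 users, and the last step uses those rounded figures.

```javascript
var tiles = 780;
var seconds = 3.03;             // CouchDB "real" time from the first benchmark
var tilesPerUserPerSecond = 9;
var minutesPerVisit = 2;

// Measured throughput, a bit above the 250 used in the text.
var requestsPerSecond = tiles / seconds;

// Users that can be served at the same time with the rounded throughput.
var concurrentUsers = 250 / tilesPerUserPerSecond;

// With "about 25" concurrent users, each staying 2 minutes:
var visitsPerSlotPerDay = 24 * 60 / minutesPerVisit;
var usersPerDay = 25 * visitsPerSlotPerDay;

console.log(requestsPerSecond.toFixed(1));
console.log(concurrentUsers.toFixed(1));
console.log(usersPerDay);
```

The estimate assumes the load is spread evenly over 24 hours, which real traffic is not, so it is an upper bound rather than a prediction.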

Additionally you gain some nice things you won't have with other
caches (as outlined above). And if you really need more performance you could
always dump the tiles to the filesystem with a cron job.


Conclusion

Benchmarking is not easy, but easy to get wrong.
Slow might be fast enough.
Read more about benchmarking on Jan's blog.

No comments

Categories: en, CouchDB, Python, TileCache, geo



GeoCouch: New release (0.10.0)


2009-09-19 22:35


Notice: This blog post is outdated, please move on :)


It has been way too long since the initial release, but it’s finally here: a new release of GeoCouch. For all first-time visitors, GeoCouch is an extension for CouchDB to support geo-spatial queries like bounding box or polygon searches.

I'll keep this blog entry relatively short and only outline the highlights and requirements for the new release, as GeoCouch finally has a real home at http://gitorious.org/geocouch/. Feel free to contribute to the wiki or fork the source.


Highlights

Many geometries are supported: points, lines, polygons (using Shapely).
Queries are largely along the lines of the OpenSearch-Geo extension draft. Currently supported are bounding box and polygon searches.
Adding new backends (in addition to SpatiaLite) is easily possible.


Requirements

  Linux 2.6.26
  CouchDB 0.10.0
  Python 2.6.0
  couchdb-python 0.6.x (0.6.0 doesn't work)
  Shapely 1.0.12
  APSW - Another Python SQLite Wrapper 3.5.9-r2
  SpatiaLite 2.3.1

Other versions might work.

Download

If you don’t like Git, you can download GeoCouch 0.10.0 here.

9 Comments

Categories: en, CouchDB, Python, geo




FOSS4G 2009: I'm speaking


2009-07-21 22:35


I did it! I'll speak at the FOSS4G Conference 2009 (Free and Open Source Software for Geospatial Conference), 20th–23rd October in Sydney, about “CouchDB and Geodata”. More information is available at the official website.

No comments

Categories: en, CouchDB, geo




Poor man’s bounding box queries with CouchDB


2009-07-19 22:35


Several people store geographical points within CouchDB and would like to make a bounding box query on them. This isn’t possible with plain CouchDB _views. But there’s light at the end of the tunnel. One solution will be GeoCouch (which can do a lot more than simple bounding box queries), once there’s a new release. The other one is already there: you can use the list/show API (Warning: the current wiki page (as of 2009-07-19) applies to CouchDB 0.9, I use the new 0.10 API).

You can either add a _list function as described in the documentation or use my futon-list branch, which includes an interface for easier _list function creation/editing.


Your data

The _list function needs to match your data, thus I expect documents with
a field named location which contains an array with the
coordinates. Here’s a simple example document:


  

{
   "_id": "00001aef7b72e90b991975ef2a7e1fa7",
   "_rev": "1-4063357886",
   "name": "Augsburg",
   "location": [
       10.898333,
       48.371667
   ],
   "some extra data": "Zirbelnuss"
}




The _list function

We aim at creating a _list function that returns the same response as a normal _view would return, but filtered with a bounding box. Let’s start with a _list function which returns the same results as a plain _view (no bounding box filtering, yet). Only the whitespace of the output differs slightly.


  
function(head, req) {
    var row, sep = '\n';

    // Send the same Content-Type as CouchDB would
    if (req.headers.Accept.indexOf('application/json')!=-1)
      start({"headers":{"Content-Type" : "application/json"}});
    else
      start({"headers":{"Content-Type" : "text/plain"}});

    send('{"total_rows":' + head.total_rows +
         ',"offset":'+head.offset+',"rows":[');
    while (row = getRow()) {
        send(sep + toJSON(row));
        sep = ',\n';
    }
    return "\n]}";
};



The _list API allows you to add any arbitrary query string to the URL. In our case that will be bbox=west,south,east,north (adapted from the OpenSearch Geo Extension). Parsing the bounding box is really easy. The query parameters of the request are stored in the property req.query as key/value pairs. Get the bounding box, split it into separate values and compare them with the location of every row.


  
var row, location, bbox = req.query.bbox.split(',');
while (row = getRow()) {
    location = row.value.location;
    if (location[0]>bbox[0] && location[0]<bbox[2] &&
            location[1]>bbox[1] && location[1]<bbox[3]) {
        send(sep + toJSON(row));
        sep = ',\n';
    }
}

And finally we make sure that no error message is thrown when the
bbox query parameter is omitted. Here’s the final result:


  
function(head, req) {
    var row, bbox, location, sep = '\n';

    // Send the same Content-Type as CouchDB would
    if (req.headers.Accept.indexOf('application/json')!=-1)
      start({"headers":{"Content-Type" : "application/json"}});
    else
      start({"headers":{"Content-Type" : "text/plain"}});

    if (req.query.bbox)
        bbox = req.query.bbox.split(',');

    send('{"total_rows":' + head.total_rows +
         ',"offset":'+head.offset+',"rows":[');
    while (row = getRow()) {
        location = row.value.location;
        if (!bbox || (location[0]>bbox[0] && location[0]<bbox[2] &&
                      location[1]>bbox[1] && location[1]<bbox[3])) {
            send(sep + toJSON(row));
            sep = ',\n';
        }
    }
    return "\n]}";
};

An example of how to access your _list function would be:
http://localhost:5984/geodata/_design/designdoc/_list/bbox/viewname?bbox=10,0,120,90&limit=10000
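
One detail worth knowing: req.query.bbox.split(',') yields strings, and the comparisons in the function only work because JavaScript converts a string to a number when it is compared with a number. A standalone sketch of the same test with an explicit conversion, runnable outside of CouchDB; inBbox is a hypothetical name and not part of the CouchDB API:

```javascript
// Hypothetical helper mirroring the condition in the _list function.
// location is [x, y]; bboxParam is the raw "west,south,east,north" string.
function inBbox(location, bboxParam) {
    // Convert explicitly instead of relying on implicit coercion.
    var bbox = bboxParam.split(',').map(Number);
    return location[0] > bbox[0] && location[0] < bbox[2] &&
           location[1] > bbox[1] && location[1] < bbox[3];
}

console.log(inBbox([10.898333, 48.371667], '10,0,120,90'));
console.log(inBbox([-122.270833, 37.804444], '10,0,120,90'));
```

Because the comparisons are strict, a point lying exactly on the edge of the bounding box is not returned.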

Now you should be able to filter any of your point clouds with a bounding box. The performance should be alright for a reasonable number of points. A usual use-case would be something like displaying a few points on a map, where you don’t want to see zillions of them anyway.

Stay tuned for a follow-up posting about displaying points with
OpenLayers.

4 Comments

Categories: en, CouchDB, JavaScript, geo







By Volker Mische
Powered by Kukkaisvoima version 7
