CouchDB and geodata?

About me

My name is Volker Mische and I'm an open source enthusiast and hacker. You can reach me via email or on Bluesky (@vmx.cx). You can also find me on GitHub.

2008-05-03 22:35

Let me introduce the two protagonists. If you know them already, just skip this part.

CouchDB

From the official website:

Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.

The word database is often associated with RDBMS, but CouchDB is very different. You don’t store your data in predefined tables and fields with certain data types like INTEGER or VARCHAR; instead, every database record is stored on its own (in so-called documents).

In an RDBMS you build relations between several tables to store and retrieve the data; in a document-oriented DB (DODB) one record is stored after another (these records can, of course, be split into several documents that might even reference each other through their IDs). The structure of these documents doesn’t matter for their storage. The big advantage is that if a new property is needed, you just add it to the document. There’s no need to change any global context (like schema definitions of tables in an RDBMS).
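To make the schema-free idea concrete, here is a minimal sketch in Python that models documents as plain dictionaries serialized to JSON. The IDs and field names are made up for illustration, and nothing here talks to a real CouchDB server.

```python
import json

# Two documents in the same database; they do not share a schema.
# All IDs and field names below are hypothetical.
docs = {
    "house-1": {"type": "house", "number": 12, "age": 80},
    "station-7": {"type": "station", "temperature": 21.5},
}

# A new property is needed: add it to the one document that needs it.
# No table definition and no other document has to change.
docs["house-1"]["size_sqm"] = 140

# Documents can reference each other through their IDs.
docs["station-7"]["mounted_on"] = "house-1"

for doc_id, doc in docs.items():
    print(doc_id, json.dumps(doc, sort_keys=True))
```

The point is only that the second document never had to know about the new property of the first.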

Geodata

I haven’t found a good definition for geodata, so here’s my own:

Geodata is data with a spatial reference.

This data is not restricted to the spatial reference only. Far more important is the actual (meta)data that is connected to this spatial reference. This data describes what it is all about. It could be a house with information about its number, age, size or a measuring station that monitors the temperature.
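As a rough illustration of that split between spatial reference and metadata, the following Python sketch describes a measuring station and a house. The GeoJSON-style layout (a "geometry" plus "properties") is my assumption here, not something this post prescribes, and all coordinates and values are invented.

```python
# Hypothetical geodata records: a spatial reference plus the metadata
# that says what the record is about. Coordinates and values are made up.
station = {
    "geometry": {"type": "Point", "coordinates": [10.9, 48.37]},  # lon, lat
    "properties": {"kind": "measuring station", "temperature_c": 21.5},
}

house = {
    "geometry": {"type": "Point", "coordinates": [10.89, 48.36]},  # lon, lat
    "properties": {"kind": "house", "number": 12, "age_years": 80, "size_sqm": 140},
}


def describe(feature):
    """Return a one-line summary: what the feature is and where it is."""
    lon, lat = feature["geometry"]["coordinates"]
    return f'{feature["properties"]["kind"]} at lon={lon}, lat={lat}'


print(describe(station))
print(describe(house))
```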

Are you serious?

Why should someone want to put their geodata into a big mess of thousands of documents instead of a nicely structured RDBMS? You don’t have to be a computer scientist to know that retrieving data out of an RDBMS is damn fast, while a DODB approach sounds like a slow “I grep through a long list of files”.

This might partly be true, but high performance shouldn’t be a use case for DODBs. Their flexibility and ease of use are what make them great. You have the choice between being fast or being flexible.

The use case

Flexibility over performance for geodata services has a use case when it comes to interoperability between different data sources.

Imagine you are the governor of a big country that consists of several smaller territories. Each of these has a smart guy who developed (independently of all the others) a system to collect data about how many bicycles topple over per day. It’s a geospatial system, as the exact location where it happened is stored in the database.

All territories use an RDBMS, but from different vendors. In addition, they store the information about the bikes differently. One territory distinguishes between bicycles for children, youth and adults; another one stores the size of the rim instead. This information could be mapped very easily to a uniform format, but the territories don’t want to give up the infrastructure of their current systems. They still want to collect their data in their own way.

What you really want is a solution that lets the territories exchange the data easily and gives you a uniform way to access the data country-wide.

Solution I

To exchange their data they put a new transformation layer on top of the current DB. The output will be a new format they all agreed on. This sounds like a good solution for the problem, but there are a few downsides:

The transformation could be very difficult to express with SQL. This could lead to huge slowdowns. This isn’t such a big problem if you just exchange the data, but a big advantage of RDBMS, their speed, gets lost.
The transformation layer needs to support DBs from different vendors.
Queries across territory borders seem difficult. Will all servers serve all data? Will you need to query multiple servers to get the data of two territories?
Heterogeneous environments lead to higher maintenance costs than homogeneous ones.

Solution II

All territories store their data in a new shiny type of DB, a DODB. When they collect the data, it is currently transformed somehow to fit into an RDBMS. They could either change this and store it directly in the new DB (long-term goal) or transform their current data to make it fit.

So what’s the difference between transforming the data from the RDBMS to another RDBMS or to a DODB?

Transforming to a DODB is more like a dump of the data, thus easy.
Converting to another existing RDBMS schema probably isn’t possible, as this would lead to a loss of information. So a new DB schema needs to be created or an existing one altered (every time something new occurs).
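The "dump" idea can be sketched in a few lines of Python. The example below uses an in-memory SQLite database as a stand-in for one territory's RDBMS; the table name, column names and territory label are hypothetical. Every row becomes one document with all of its columns kept as they are, so no target schema has to be designed first.

```python
import sqlite3

# In-memory stand-in for one territory's existing RDBMS.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read rows by column name
conn.execute(
    "CREATE TABLE incidents "
    "(id INTEGER PRIMARY KEY, lon REAL, lat REAL, bike_class TEXT)"
)
conn.executemany(
    "INSERT INTO incidents (lon, lat, bike_class) VALUES (?, ?, ?)",
    [(10.9, 48.37, "adult"), (10.8, 48.4, "child")],
)


def dump_as_documents(connection, table, territory):
    """Turn every row of `table` into one document, keeping all columns."""
    # `table` is interpolated directly, so only pass trusted table names.
    rows = connection.execute(f"SELECT * FROM {table} ORDER BY id")
    documents = []
    for row in rows:
        doc = dict(row)               # column name -> value, whatever the columns are
        doc["territory"] = territory  # remember where the record came from
        documents.append(doc)
    return documents


documents = dump_as_documents(conn, "incidents", "north")
for doc in documents:
    print(doc)
```

The function never names a column, which is what makes it a dump and not a schema mapping: a territory with completely different columns could use it unchanged.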

Characteristics:

All data can be stored in one big database; queries across territories are easy (simple “ifs”).
A single database can be replicated easily.
Queries are slow compared to plain SQL queries on an RDBMS and probably not suitable for real-time applications.
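Here is a small Python sketch of what such a query with simple ifs could look like once the documents of two territories sit in one database. In CouchDB itself this logic would live in a view; Python is used here only for illustration. The field names and the rim-size threshold are assumptions made up for the example.

```python
# Documents from two territories in one database. One territory records
# a bike class, the other a rim size; all values are made up.
documents = [
    {"territory": "north", "lon": 10.9, "lat": 48.37, "bike_class": "child"},
    {"territory": "north", "lon": 10.8, "lat": 48.40, "bike_class": "adult"},
    {"territory": "south", "lon": 11.2, "lat": 47.90, "rim_inches": 16},
    {"territory": "south", "lon": 11.3, "lat": 47.80, "rim_inches": 28},
]

# Hypothetical threshold: treat rims up to 20 inches as children's bikes.
CHILD_RIM_MAX_INCHES = 20


def is_childrens_bike(doc):
    """Answer one question across both territory formats with simple ifs."""
    if "bike_class" in doc:
        return doc["bike_class"] == "child"
    if "rim_inches" in doc:
        return doc["rim_inches"] <= CHILD_RIM_MAX_INCHES
    return False  # unknown format: leave it out rather than guess


# A full scan over every document, which is why this is slow on large data.
child_incidents = [doc for doc in documents if is_childrens_bike(doc)]
print([doc["territory"] for doc in child_incidents])
```

Each territory keeps its own way of recording the data, and the differences are handled at query time, at the price of looking at every document.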

Solution III

Follow the approach of Solution II, but use one gigantic RDBMS that stores the DB schemas of all territories. That would work, too. The difference is that an RDBMS wasn’t meant for such things.

Forecast

I think Solution II shows that CouchDB has big potential in that area. At the moment it's more an idea than a solution; there are still a few contradictions, but these will hopefully be resolved.

One crux is the speedy retrieval of features within a certain bounding box; this issue will be the focus of a future post.

Categories: en, geo, CouchDB


By Volker Mische
