CouchDB _mix branch: Intersection of _view and _external
2009-04-21 22:35
In CouchDB it’s possible to query an external service (I’ll call it _external from now on) which returns an HTTP response directly to the client that made the request. Although this is already quite nice, it wasn’t possible to combine such _external requests with a classical _view.
The need for an intersection of _view and _external
Sometimes you’d like to exclude documents in a more dynamic fashion than a CouchDB _view supports. Examples would be geospatial queries, a simple search like “exclude all documents that don’t contain a certain string in the title”, or even full-text searching. Therefore I’ve created a new handler called “_mix”.
The problem
As _external has already existed for quite a long time, it was clear that I would reuse the available functionality. The basic idea is simple: take all documents from a _view and all from _external, intersect them and finally output the result.
The problem is that CouchDB can be used for huge data sets, where you don’t want to keep a complete _view in memory to perform an intersection. The goals were:
- The output needs to be streamable
- Don’t keep all documents in memory
- Use the existing functionality
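To illustrate the first two goals, here is a minimal Python sketch of a streaming intersection. It is not the code from the branch (which is Erlang); the function name mix, the keep callback and the sample rows are all hypothetical. The point is only that each row is checked and emitted on its own, so the complete _view never has to sit in memory.

```python
from typing import Callable, Iterable, Iterator


def mix(view_rows: Iterable[dict], keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield view rows one at a time, keeping only those the filter accepts."""
    for row in view_rows:
        # keep() is a hypothetical stand-in for the per-row _external check.
        if keep(row):
            yield row


# Hypothetical rows produced lazily; a real view would be streamed from disk.
rows = ({"id": f"doc{i}", "key": i} for i in range(6))

# Keep only rows with an even key, consuming the generator row by row.
result = list(mix(rows, lambda row: row["key"] % 2 == 0))
```

Because mix is a generator, a caller could write each accepted row to the HTTP response as soon as it is produced instead of collecting them in a list as done here.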
The implementation
Over the past few months I had lengthy discussions with Paul Davis to find a suitable solution to the problem. We went through all our ideas over and over again. The way I’ve implemented it works for me so far, but it is definitely not the one and only solution; it’s just one solution.
As most of the functionality already exists, the current API of _view and _external is used. The difference is that the query is POSTed as JSON to the _mix handler instead of being sent as a GET request. Here’s an example with curl:
curl -d '{"design": "designdoc", "view": {"name": "viewname", "query": {"limit": "11"}}, "external": {"name": "minimal", "query": {"bbox": "[23,42,46,89]"}, "include_docs": false}}' http://localhost:5984/yourdb/_mix
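The same request can be put together from Python, which makes the structure of the JSON body easier to see. This is only a sketch: the helper build_mix_request is a made-up name, and the host, database, design document and view names are the placeholders from the curl example.

```python
import json
import urllib.request

# Same body as in the curl example: one part for the _view,
# one part for the _external process.
payload = {
    "design": "designdoc",
    "view": {"name": "viewname", "query": {"limit": "11"}},
    "external": {
        "name": "minimal",
        "query": {"bbox": "[23,42,46,89]"},
        "include_docs": False,
    },
}


def build_mix_request(base_url: str, db: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the _mix handler."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(f"{base_url}/{db}/_mix", data=data, method="POST")


request = build_mix_request("http://localhost:5984", "yourdb", payload)

# Sending it needs a CouchDB built from the _mix branch:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```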
At the moment most of the code is just copied and pasted from couch_httpd_view.erl and couch_httpd_external_* with some additional parsing of the POSTed JSON. The only new thing is that there’s an _external request before every document of a _view is output. This request contains either the document ID or the whole document (if “include_docs” is set to “true”) and needs to return “true” if the document should be output (or “false” if not).
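As a rough idea of what such a filter could look like, here is a Python sketch that implements the “string in the title” example from above. Several things are assumptions and not taken from the branch: that requests arrive as one JSON object per line on stdin (as with regular _external processes), that the whole document sits under a “doc” key when include_docs is true, and the names should_output and respond.

```python
import json
import sys


def should_output(request: dict, needle: str = "couchdb") -> bool:
    """Decide whether the document in this request should be output."""
    # ASSUMPTION: with include_docs set to true, the document is under "doc".
    doc = request.get("doc") or {}
    return needle in str(doc.get("title", "")).lower()


def respond(line: str) -> str:
    """Turn one JSON request line into the answer "true" or "false"."""
    return json.dumps(should_output(json.loads(line)))


def main() -> None:
    # readline() instead of iterating over stdin avoids read-ahead buffering,
    # so every request is answered as soon as it arrives.
    for line in iter(sys.stdin.readline, ""):
        sys.stdout.write(respond(line) + "\n")
        sys.stdout.flush()


# Call main() to run this as an external process.
```

If only the document ID is sent, this sketch has no title to look at and answers “false”, so a filter like this would need include_docs set to true.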
I’ve included a sample _external script which excludes documents randomly (it can be found at src/contrib/minimal_external.py). To have a play with it, you just need to enable _external and add that script. How to do that can be found in the CouchDB Wiki.
Get it
All you need to do to have some fun with it is to check out my _mix branch on GitHub.
Final words
And finally I’d like to thank Paul Davis for taking the time to discuss the issues with the intersection of _view and _external. Another “thank you” goes out to Adam Groves, who discovered a lot of annoyances with the parsing of the queries.