the blllog.

WebAssembly multi-value return in today's Rust without wasm-bindgen

2021-01-29 15:00

The goal was to run some WebAssembly within different host languages. I needed a WASM file that is independent of the host language, hence I decided to code the FFI manually, without using any tooling like wasm-bindgen, which is JavaScript-specific. It needed a bit of custom tooling, but in the end I succeeded in producing a WASM binary with a multi-value return, generated with today's Rust compiler, without using wasm-bindgen annotations.


In my case I wanted to pass some bytes into the WASM module, do some processing and return some other bytes. I found all the information I needed in the excellent "A practical guide to WebAssembly memory" from radu. There he mentions the WebAssembly multi-value proposal and links to a blog post from 2019 called "Multi-Value All The Wasm!", which explains its implementation for the Rust ecosystem.

As it's from 2019, I just went ahead and assumed I could use multi-value returns in Rust.

The journey

My function signature for the FFI looks like this:

pub extern "C" fn decode(data_ptr: *const u8, data_len: usize) -> (*const u8, usize) { … }

When I compiled it, I got this warning:

warning: `extern` fn uses type `(*const u8, usize)`, which is not FFI-safe
 --> src/lib.rs:2:67
2 | pub extern "C" fn decode(data_ptr: *const u8, data_len: usize) -> (*const u8, usize) {
  |                                                                   ^^^^^^^^^^^^^^^^^^ not FFI-safe
  = note: `#[warn(improper_ctypes_definitions)]` on by default
  = help: consider using a struct instead
  = note: tuples have unspecified layout

Multi-value returns are certainly not meant for C APIs, but for WASM it might still work, I thought. Running wasm2wat shows:

  (type (;0;) (func (param i32 i32 i32)))
  (func $decode (type 0) (param i32 i32 i32)

This clearly isn't a multi-value return. It doesn't have a return value at all, and it takes 3 parameters instead of the 2 the function definition has. The extra parameter is presumably a pointer to the memory where the function writes its result. I found an issue called Multi value Wasm compilation #73755 and was puzzled why it doesn't work. Is this a regression? Why did it work in that blog post from 2019? I gave the Multi-Value All The Wasm! blog post another read, and it turns out it explains all this in detail (look at the wasm-bindgen section). Back then it wasn't supported by the Rust compiler directly, but by wasm-bindgen.

So perhaps I could just use the wasm-bindgen command-line tool to transform my compiled WASM binary into a multi-value return one. There is an environment variable called WASM_BINDGEN_MULTI_VALUE=1 that enables this transformation. Sadly that doesn't really work, as it needs some interface types to be present in the WASM binary (which I don't have).

Thanks to open source, the blog post about the implementation of the transformation feature and some trial and error, I was able to extract the pieces I needed and created a tool called wasm-multi-value-reverse-polyfill. I didn't need to do any of the hard parts, just some wiring up. I was now able to transform my WASM binary into a multi-value return one simply by running:

$ multi-value-reverse-polyfill ./target/wasm32-unknown-unknown/release/wasm_multi_value_retun_in_rust.wasm 'decode i32 i32'
Make `decode` function return `[I32, I32]`.

The WAT disassembly now looks like this:

  (type (;0;) (func (param i32 i32) (result i32 i32)))
  (type (;1;) (func (param i32 i32 i32)))
  (func $decode_multivalue_shim (type 0) (param i32 i32) (result i32 i32)

There you go. There is now a shim function that has the multi-value return and calls the original function. I can now use my newly created WASM binary with WebAssembly runtimes that support multi-value returns (like Wasmer or Node.js).
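To see what a multi-value return looks like from the host side, here is a minimal sketch for Node.js. The module bytes are a hand-assembled stand-in and not the output of the tool: they define a function exported as `decode` with the signature `(param i32 i32) (result i32 i32)` that simply returns its two arguments. The argument values are arbitrary, and a Node.js version with multi-value support is assumed.

```javascript
// Hand-assembled stand-in module (an assumption for illustration, not the
// real binary):
//   (func (export "decode") (param i32 i32) (result i32 i32)
//     local.get 0
//     local.get 1)
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic and version
  0x01, 0x08, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x02, 0x7f, 0x7f, // type section
  0x03, 0x02, 0x01, 0x00, // function section
  0x07, 0x0a, 0x01, 0x06, 0x64, 0x65, 0x63, 0x6f, 0x64, 0x65, 0x00, 0x00, // export "decode"
  0x0a, 0x08, 0x01, 0x06, 0x00, 0x20, 0x00, 0x20, 0x01, 0x0b // code section
]);

const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));

// A multi-value result arrives in JavaScript as an array
const [ptr, len] = instance.exports.decode(1024, 16);
console.log(ptr, len);
```

With a real binary, the two values would be the pointer and the length of the returned bytes, which the host then reads from the module's exported memory.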


With wasm-multi-value-reverse-polyfill I'm now able to create multi-value return functions with the current Rust compiler without depending on all the magic wasm-bindgen is doing.

Categories: en, WASM, Rust

When npm link fails

2019-08-01 22:35

There are cases where linking local packages doesn't produce the same result as installing all packages from the registry. Here I'd like to tell the story of one of those real world cases and conclude with a solution to those problems.

The problem

When you do an npm install, npm performs heavy module deduplication and hoisting, which doesn't behave the same way in all cases. For example if you npm link a package, the resulting node_modules tree is different. This may lead to unexpected runtime errors.

It happened to me recently, so I thought I'd use exactly this real world example to illustrate the problem and a possible solution to it.

Real world example


Start with cloning the js-ipfs-mfs and js-ipfs-unixfs-importer repositories:

$ git clone https://github.com/ipfs/js-ipfs-mfs --branch v0.12.0 --depth 1
$ git clone https://github.com/ipfs/js-ipfs-unixfs-importer --branch v0.39.11 --depth 1

Our main module is js-ipfs-mfs. Let's say you want to make local changes to js-ipfs-unixfs-importer, which is a direct dependency of js-ipfs-mfs.

First of all you of course make sure that currently the tests pass (we just run a subset, to get to the actual issue faster). I'm sorry that the installation takes so long and so much space, the dev dependencies are quite heavy.

$ cd js-ipfs-mfs
$ npm install
$ npx mocha test/write.spec.js
  53 passing (4s)
  1 pending

Ok, all tests passed.

Reproducing the issue

Before we even start modifying js-ipfs-unixfs-importer, we link it and check that the tests still pass.

$ cd js-ipfs-unixfs-importer
$ npm link
$ cd ../js-ipfs-mfs
$ npm link ipfs-unixfs-importer
$ npx mocha test/write.spec.js
  37 passing (2s)
  1 pending
  16 failing

Oh, no. The tests failed. But why? The reason is deep down in the code. The root cause is in the hamt-sharding module and it's not even a bug. It just checks if something is a Bucket:

static isBucket (o) {
  return o instanceof Bucket
}

instanceof only works if the object and the class we check against come from the exact same copy of the module. Let's see who is importing the hamt-sharding module:
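The effect can be reproduced without any packages. The following sketch simulates loading the same module source twice: `loadHamtSharding` is a hypothetical stand-in for evaluating the module from two different physical locations, and the `Bucket` class is reduced to the check shown above.

```javascript
// Simulates loading the same module source from two physical locations:
// each call evaluates the class definition again and yields a new class.
const loadHamtSharding = () => {
  class Bucket {
    static isBucket (o) {
      return o instanceof Bucket;
    }
  }
  return Bucket;
};

// Hypothetical stand-ins for the two copies on disk
const BucketFromMfs = loadHamtSharding();
const BucketFromImporter = loadHamtSharding();

const bucket = new BucketFromImporter();
console.log(BucketFromImporter.isBucket(bucket)); // true
console.log(BucketFromMfs.isBucket(bucket)); // false
```

Both classes come from identical source code, yet they are distinct objects, so an instance of one is not an instance of the other.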

$ npm ls hamt-sharding
ipfs-mfs@0.12.0 /home/vmx/misc/protocollabs/blog/when-npm-link-fails/js-ipfs-mfs
├── hamt-sharding@0.0.2
├─┬ ipfs-unixfs-exporter@0.37.7
│ └── hamt-sharding@0.0.2  deduped
└─┬ UNMET DEPENDENCY ipfs-unixfs-importer@0.39.11
  └── hamt-sharding@0.0.2  deduped

npm ERR! missing: ipfs-unixfs-importer@0.39.11, required by ipfs-mfs@0.12.0

Here we see that ipfs-mfs has a direct dependency on it, and an indirect dependency through ipfs-unixfs-exporter and ipfs-unixfs-importer. All of them use the same version (0.0.2), hence it's deduped and the instanceof call should work. But there's also an error about an UNMET DEPENDENCY, the ipfs-unixfs-importer module we linked to.

To make it clear what's happening inside Node.js: when you require('hamt-sharding') from the ipfs-mfs code base, it will load the module from the physical location js-ipfs-mfs/node_modules/hamt-sharding. When you require it from ipfs-unixfs-importer, it will be loaded from js-ipfs-mfs/node_modules/ipfs-unixfs-importer/node_modules/hamt-sharding, or rather from js-ipfs-unixfs-importer/node_modules/hamt-sharding, as js-ipfs-mfs/node_modules/ipfs-unixfs-importer is just a symlink to a symlink to that directory.
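As a rough illustration, the lookup can be sketched like this. The function is a simplified approximation of how Node.js walks up the directory tree looking for node_modules directories, not its actual implementation, and the `/home/me` path is hypothetical. By default Node.js starts from the real path of the requiring file, i.e. after resolving symlinks.

```javascript
const path = require('path');

// Sketch of the directories Node.js searches for `require('some-package')`,
// starting at the directory of the requiring file and walking up to the root.
const lookupPaths = (fromDir) => {
  const paths = [];
  let dir = path.posix.resolve(fromDir);
  while (true) {
    // Nothing is appended to a directory that is itself called node_modules
    if (path.posix.basename(dir) !== 'node_modules') {
      paths.push(path.posix.join(dir, 'node_modules'));
    }
    const parent = path.posix.dirname(dir);
    if (parent === dir) break;
    dir = parent;
  }
  return paths;
};

// Hypothetical real location of the linked package
const importerPaths = lookupPaths('/home/me/js-ipfs-unixfs-importer/src');
console.log(importerPaths);
```

Note that js-ipfs-mfs/node_modules never shows up in that list, so the linked package cannot see the modules installed there.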

When you do a normal installation without linking, you won't have this issue as hamt-sharding will be properly deduplicated and only loaded once from js-ipfs-mfs/node_modules/hamt-sharding.

Possible workarounds that do not work

You'd still like to change ipfs-unixfs-importer locally and test those changes with ipfs-mfs without breaking anything, though. I had several ideas on how to work around this. I'll start with the ones that didn't work:

  1. Just delete the js-ipfs-unixfs-importer/node_modules/hamt-sharding directory. The module should still be found in the resolve paths of ipfs-mfs. No it doesn't. Tests fail because hamt-sharding can't be found.
  2. Global linking runs an npm install when you run the initial npm link. What if we remove the js-ipfs-unixfs-importer/node_modules completely and symlink to the module manually? That also doesn't work, the hamt-sharding module can't be found either.
  3. Install ipfs-unixfs-importer directly with a relative path (npm install ../js-ipfs-unixfs-importer). No, that doesn't work either, it will still have its own node_modules/hamt-sharding, it won't be properly deduplicated.

There must be a way to make local changes to a module and test them without publishing it each time. Luckily there really is.

Working workaround

I'd like to thank my colleague Hugo Dias for this workaround that he has been using for a while already.

You can just replicate what a normal npm install <package> would be doing. You pack the module and then install that packed package. In our case that means:

$ cd js-ipfs-mfs
$ npm pack ../js-ipfs-unixfs-importer
$ npm install ipfs-unixfs-importer-0.39.11.tgz
+ ipfs-unixfs-importer@0.39.11
added 59 packages from 76 contributors and updated 1 package in 31.698

Now all tests pass.

This is quite a manual process. Luckily Hugo created a module to automate exactly that workflow. It's called connect-deps.
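To give an idea of what such automation involves, here is a small sketch of the two manual steps from above. It is not how connect-deps is implemented; `tarballName` and `packAndInstall` are names made up for this example, and `packAndInstall` shells out to npm, so it is only defined here and not called.

```javascript
const { execFileSync } = require('child_process');
const path = require('path');

// File name `npm pack` produces for a package.json (scoped names lose the
// leading `@` and have the `/` replaced by `-`).
const tarballName = ({ name, version }) => {
  return `${name.replace(/^@/, '').replace('/', '-')}-${version}.tgz`;
};

// Pack the dependency at `depDir` and install the resulting tarball into
// `targetDir`. Runs the same two npm commands as the manual workflow.
const packAndInstall = (depDir, targetDir) => {
  const pkg = require(path.resolve(depDir, 'package.json'));
  execFileSync('npm', ['pack', path.resolve(depDir)], { cwd: targetDir, stdio: 'inherit' });
  execFileSync('npm', ['install', tarballName(pkg)], { cwd: targetDir, stdio: 'inherit' });
};

console.log(tarballName({ name: 'ipfs-unixfs-importer', version: '0.39.11' }));
```

From within js-ipfs-mfs you would call it as `packAndInstall('../js-ipfs-unixfs-importer', '.')` after every change to the dependency.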


Sometimes linking packages doesn't create the same structure of modules and you need to use packing instead. To automate this you can use connect-deps.

Categories: en, JavaScript, npm

Show your own stripes

2019-06-20 22:35

You want to create #ShowYourStripes for the location you live in? Here's how.


Timeline of yearly average temperatures in Augsburg, Germany

When I first saw #ShowYourStripes I immediately fell in love (thanks Stefan Münz for tweeting about it). I think it's a great and simple visualization by Ed Hawkins of what we are currently facing when it comes to climate change. You don't need to scroll through long tables or figure out the axis on some diagram. You can simply see that there's something massively changing.

After playing around a bit with the cities available on the #ShowYourStripes website I wanted to do the same for the city I live in, Augsburg, Germany. I looked at the website's source code first, hoping that it dynamically creates the stripes from some JSON or so. That isn't the case. I then searched Twitter, GitHub and the Web to see whether I could find any related open source project. I didn't want to spend time figuring out the parameters that were used to create those. After all, I wanted mine to look exactly like those.

Luckily I found a Tweet from Zeke Hausfather saying that he could create those. I then asked him if he could please release the source code. And just 7h later he did.

Creating your own stripes

Now it's time for a quick tutorial on how you can create your own #ShowYourStripes with that source code.


I did those steps on a Debian system that had the most common tools installed (like Python3, or Wget). I'm using Pipenv for installing the required Python packages, but you can use any other package management tool for Python.

Let's get the data file with the global temperature values first. It's 200MB so it might take a while.

wget http://berkeleyearth.lbl.gov/auto/Global/Gridded/Complete_TAVG_LatLong1.nc

Now retrieve the source code:

$ wget https://raw.githubusercontent.com/hausfath/scrape_global_temps/master/City%20Warming%20Strips%20.ipynb -O showyourstripes.ipynb

In order to run the script, we need to get a few Python packages first:

$ pipenv install matplotlib nbconvert netcdf4 numpy_indexed pandas

Running the script

The original script is a Jupyter Notebook, so we convert it to a plain Python script (you can ignore the warnings):

$ pipenv run jupyter-nbconvert --to python showyourstripes.ipynb

Next we need to make some changes to the showyourstripes.py file so that it works on your machine and plots the stripes for your location. We work on the current directory, so you can comment out changing the directory:

#os.chdir('/Users/hausfath/Desktop/Climate Science/GHCN Monthly/')

The other change we need is the location the stripes should be plotted for. Here I use the values for Augsburg, Germany. Use your own values there. When I don't know the coordinates of my location, I usually check Wikipedia. In the top right corner of an article you can find the coordinates of a place (if it has them attached). If you click on those you get to the GeoHack page of the article. There on the top right you can find the coordinates in decimals in lat/lon order. In my case it's "48.366667, 10.9".

savename = 'augsburg'

lat = 48.366667
lon = 10.9
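If you only have coordinates in degrees and minutes, the conversion to decimal degrees is simple arithmetic. Here is a small sketch, with `toDecimal` being a helper name made up for this example. 48°22′N 10°54′E corresponds to the decimal values used above.

```javascript
// Convert degrees, minutes, seconds plus hemisphere into decimal degrees,
// rounded to six decimal places. South and west become negative.
const toDecimal = (degrees, minutes, seconds, hemisphere) => {
  const value = degrees + minutes / 60 + seconds / 3600;
  const sign = hemisphere === 'S' || hemisphere === 'W' ? -1 : 1;
  return Number((sign * value).toFixed(6));
};

console.log(toDecimal(48, 22, 0, 'N')); // 48.366667
console.log(toDecimal(10, 54, 0, 'E')); // 10.9
```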

Now you're ready to run the script:

pipenv run python showyourstripes.py

Now you should have an output file called augsburg.png in the same directory which contains the stripes.


Have fun creating your own #ShowYourStripes. Thanks again Zeke Hausfather for making and publishing the source code so quickly.

Categories: en, climatechange, tutorial

EU copyright reform: a look back

2019-04-23 22:35

The EU copyright reform was finally adopted on April 15th. Unfortunately the contentious parts, such as the changes to the ancillary copyright for press publishers and the looming upload filters, could not be prevented.

First a bit of background on the copyright reform. It contains several changes, and by no means all of them are bad; this blog post by Julia Reda gives a good overview of the topic. Before the vote there was also what I consider a very good guest article by Dorothee Bär in the Main Post, which covers the negative effects of the reform.

In addition, David Kaye, the UN Special Rapporteur on freedom of expression, warned that the reform will lead to restrictions on freedom of expression.

Just one day after the European Parliament approved the reform on March 26th, the French Minister of Culture Franck Riester announced that France wants to invest in filter technology. So upload filters will probably come despite Germany's protocol statement.

What really bothered me throughout the debate was how uninformed many of the participants were. I made a mistake too, but corrected it right away. It simply helps to also listen to the arguments of the other side. Especially the debate in the European Parliament right before the vote (directly as video) made it clear how many members didn't really understand what exactly this is about, or what consequences the reform has. There were even crude attacks that had nothing to do with the actual matter. The supporters apparently didn't realise that opponents like me also want a copyright reform; the disagreement is only about how it is implemented. The speech by Julia Reda was (as so often) excellent. She briefly summarises the facts once more and also describes the frustration of those who took part in the mass protests. There is also a very good commentary on tagesschau.de about disenchantment with politics as a possible consequence.

Another interesting question is who the actual winners of this reform are. The supporters always put forward the creators as the winners of the reform. However, this was always about creators who are represented by collecting societies. It was ignored that, especially in the age of the Internet, there are many other ways of being a creator. There are two nice stories about this: one is an attempt to get represented as a private individual for photos, the other as a video producer on YouTube. Neither is currently possible.

I also took part in calling the Brussels offices of the members of parliament. The feedback varied a lot. It ranged from a friendly “I will pass this on”, through “the member votes like her colleague, against the majority of her parliamentary group, or rather the federal party group”, to “so many people are calling that I won't talk to you, but the member will form a well-founded opinion” (and in the end she didn't vote at all).

Finally I'd like to thank everyone who fought so hard against it and organised demonstrations, and of course the many people who took part.

Categories: de, EU, politics, copyright

Why I am against the EU Copyright Directive

2019-03-17 22:35

Update 2019-03-19: The argumentation below is wrong. A forum won't be considered an "online content sharing service provider" according to the definition of Article 2 (5) (page 51 of the full text of the final version). I'm sorry for this misinformation. I keep the text below for reference so that others can see what I got wrong.

There are many arguments against the EU Copyright Directive (more correctly, the Directive on Copyright in the Digital Single Market). Some I agree with, some I don't. Hence, here's my take on why I think that directive should be stopped. The short version is: it strengthens the big platforms and weakens/destroys the small ones.

My hope is that this blog post will get more people interested in the topic and hopefully make you join the Europe-wide protests on Saturday March 23rd 2019. If you want to join, there's an interactive map of all known protests created by the folks from stopACTA2.


It is confusing that platforms like YouTube are against the directive. It sounds like they have a lot to lose, hence they try everything they can against it. For me, this is normally a sign that such a directive does exactly what it should do.

But in this case, it doesn't. YouTube will surely have its own reasons for being against it. But what is more important to me is that if the directive is approved by the European Parliament, the small platforms will have almost no chance to survive.

Why small platforms will die

There are exceptions in the directive for some platforms. You can find those in the full text of the final version at paragraph (38b), page 36. But those exceptions won't help all smaller platforms. For example a discussion board which is older than 3 years and has advertising to cover the server costs wouldn't be excluded. They would be liable for every copyright infringement.

It could be as small as a profile picture, let's say yours is Luke Skywalker. That platform could block custom profile pictures, but that still won't be enough. Someone could post some copyrighted text. But how would you make a discussion board without the users being able to post text? So the only way to not be liable would be to check for all infringements (how would you do that?), or to close the platform.


I intentionally tried to keep it short and highlight the issue that matters most to me. Of course there are a lot more issues regarding the EU Copyright Directive, so if you want to know more, go to websites like savetheinternet.info, stopACTA2 or the website of Julia Reda, who is a Member of the European Parliament and puts a lot of effort into explaining and spreading the word on why the directive should be stopped (also follow her on Twitter). Thanks a lot Julia for doing such amazing work!

Categories: en, EU, politics, copyright

Joining Protocol Labs

2018-01-24 22:35

I’m pumped to announce that I’m joining Protocol Labs as a software engineer. Those following me on Twitter or looking at my GitHub activity might have already got some hints.

Short term

My main focus is currently on IPLD (InterPlanetary Linked Data). I’ll smooth things out and also work on the IPLD specs, mostly on IPLD Selectors. Those IPLD Selectors will be used to make the underlying graph more efficient to traverse (especially for IPFS). That’s a lot of buzzwords, I hope it will get clearer the more I’ll blog about this.

To get started I’ve done the JavaScript IPLD implementations for Bitcoin and Zcash. Those are the basis to make easy traversal through the Bitcoin and Zcash blockchains possible.

Longer term

In the longer term I’ll be responsible for bringing IPLD to Rust. That’s especially exciting with Rust’s new WebAssembly backend. You’ll get a high performance Rust implementation, but also one that works in Browsers.

What about Noise?

Many of you probably know that I’ve been working full-time on Noise for the past 1.5 years. It shapes up nicely and is already quite usable. Of course I don’t want to see this project vanish and it won’t. At the moment I only work part-time at Protocol Labs, to also have some time for Noise. In addition to that there’s also interest within Protocol Labs to use Noise (or parts of it) for better query capabilities. So far it’s only rough ideas I mentioned briefly at the end of my talk about Noise at the Lisbon IPFS Meetup two weeks ago. But what’s the distributed web without search?

What about geo?

I’m also part of the OSGeo community and FOSS4G movement. So what’s the future there? I see a lot of potential in the Sneakernet. If geo-processing workflows are based around IPFS, you could use the same tools/scripts whether the data is stored somewhere in the cloud, or you access your local mirror/dump if your Internet connection isn’t that fast/reliable.

I expect unreliable connectivity to be a hot topic at the FOSS4G 2018 conference in Dar es Salaam, Tanzania.


I’m super excited. It’s a great team and I’m looking forward to pushing the distributed web a bit forward.

Categories: en, ProtocolLabs, IPLD, IPFS, JavaScript, Rust, geo

Introduction to Noise’s Node.js API

2017-12-21 22:35

In the previous blog post about Noise we imported data with the help of some already prepared scripts. This time it’s an introduction to using Noise’s Promise-based Node.js API directly yourself.

The dataset we use is not a ready to use single file, but one that consists of several ones. The data is the “Realized Cost Savings and Avoidance” for US government agencies. I’m really excited that such data gets openly published as JSON. I wish Germany were that advanced in this regard. If you want to know more about the structure of the data, there’s documentation about the JSON Schema. They even have an “OFCIO JSON User Guide for Realized Cost Savings” on how to produce the data out of Excel.

I’ve prepared a repository containing the final code and the data. But feel free to follow along this tutorial by yourself and just point to the data directory of that repository when running the script.

Let’s start with the boilerplate code for reading in those files and parsing them as JSON. But first create a new package:

mkdir noise-cost-savings
cd noise-cost-savings
npm init --force

You can use --force here as you probably won’t publish this package anyway. Put the boilerplate code below into a file called index.js. Please note that the code is kept as simple as possible, for a real world application you surely want better error handling.

#!/usr/bin/env node
'use strict';

const fs = require('fs');
const path = require('path');

// The only command line argument is the directory where the data files are
const inputDir = process.argv[2];
console.log(`Loading data from ${inputDir}`);

fs.readdir(inputDir, (_err, files) => {
  files.forEach(file => {
    fs.readFile(path.join(inputDir, file), (_err, data) => {
      console.log(file);
      const json = JSON.parse(data);
      processFile(json);
    });
  });
});

const processFile = (data) => {
  // This is where our actual code goes
};

This code should already run. Check out my repository with the data into some directory first:

git clone https://github.com/vmx/blog-introduction-to-noises-nodejs-api

Now run the script from above as:

node index.js <path-to-directory-from-my-repo-mentioned-above>/data

Before we take a closer look at the data, let’s install the Noise module. Please note that you need to have Rust installed (easiest is probably through rustup) before you can install Noise.

npm install noise-search

This will take a while. So let’s get back to code. Load the noise-search module by adding:

const noise = require('noise-search');

A Noise index needs to be opened and closed properly, else your script will hang and not terminate. Opening a new Noise index is easy. Just put this before reading the files:

const index = noise.open('costsavings', true);

This opens an index called costsavings and creates it if it doesn’t exist yet (that’s the boolean true). Closing the index is more difficult due to the asynchronous nature of the code. We can close the index only after all the processing is done. Hence we wrap the fs.readFile(…) call in a Promise. The new code looks like this:

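fs.readdir(inputDir, (_err, files) => {
  const promises = files.map(file => {
    return new Promise((resolve, reject) => {
      fs.readFile(path.join(inputDir, file), (err, data) => {
        if (err) {
          // Reject instead of throwing, a throw in here wouldn't reach the Promise
          return reject(err);
        }
        console.log(file);
        const json = JSON.parse(data);
        resolve(processFile(json));
      });
    });
  });
  Promise.all(promises).then(() => {
    console.log('Done.');
    index.close();
  });
});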

If you run the script now, it should print out the file names as before and terminate with a Done. A directory called costsavings got created when you ran the script. This is where the Noise index is stored.
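The same flow can also be written with the Promise-based `fs.promises` API that newer Node.js versions ship. The following sketch is an alternative, not the code from the repository: `loadAll` is a name made up for this example, and the demo at the end uses a temporary directory and a dummy `processFile` instead of Noise.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Read and parse every file in `inputDir`, hand each parsed document to
// `processFile`, and resolve with the number of files once all are done.
const loadAll = async (inputDir, processFile) => {
  const files = await fs.promises.readdir(inputDir);
  await Promise.all(files.map(async (file) => {
    const data = await fs.promises.readFile(path.join(inputDir, file));
    await processFile(JSON.parse(data));
  }));
  return files.length;
};

// Demo with a throw-away directory and a dummy processing function
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'noise-demo-'));
fs.writeFileSync(path.join(demoDir, 'a.json'), '{"strategies": []}');
const seen = [];
const done = loadAll(demoDir, async (json) => { seen.push(json); });
done.then((count) => console.log(`Done. Processed ${count} file(s)`));
```

In the real script you would close the index in the final `then()`, once `loadAll` has resolved.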

Now let’s have a look at the data files. If you open e.g. the cost savings file from the Department of Commerce (or the JSON Schema), you’ll see that it has a single field called "strategies", which contains an array with all strategies. We are free to pre-process the data as much as we want before we insert it into Noise. So let’s create a separate document for every strategy. Our processFile() function now looks like:

const processFile = async (data) => {
  // A plain loop (instead of forEach) makes sure every insert is awaited
  for (const strategy of data.strategies) {
    // Use auto-generated Ids for the documents
    await index.add(strategy);
  }
};

Now all the strategies get inserted. Make sure you delete the index (the costsavings directory) if you re-run the scripts, else you would end up with duplicated entries, as different Ids will be generated on every run.

To query the index you could use the Noise indexserve script that I’ve also used in the last blog post about Noise. Or we just add a small query at the end of the script after the loading is done. Our query function will do the query and output the result:

const queryNoise = async (query) => {
  const results = await index.query(query);
  for (const result of results) {
    console.log(result);
  }
};

There’s not much to say, except that it’s again a Promise-based API. Now hook up this function after the loading and before the index is closed. For that, replace the Promise.all(…) call with:

Promise.all(promises).then(async () => {
  await queryNoise('find {} return count()');
  index.close();
});

It’s a really simple query, it just returns the number of documents that are in there (644). After all this hard work, it’s time to make a more complicated query on this dataset to show that it was worth doing all this. Let’s return the total net savings of all agencies in 2017. Replace the query find {} return count() with:

find {fy2017: {netOrGross: == "Net"}} return sum(.fy2017.amount)

That’s $845m savings. Not bad at all!
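To make clear what that query computes, here is the same filter and sum in plain JavaScript, without Noise. The documents are hypothetical and the amounts are made up; only the field names are taken from the query.

```javascript
// Hypothetical documents with made-up amounts, shaped like the strategies
const strategies = [
  { fy2017: { netOrGross: 'Net', amount: 1.5 } },
  { fy2017: { netOrGross: 'Gross', amount: 4 } },
  { fy2017: { netOrGross: 'Net', amount: 2.25 } },
  { fy2016: { netOrGross: 'Net', amount: 8 } }
];

// find {fy2017: {netOrGross: == "Net"}}
const matching = strategies.filter((doc) => {
  return doc.fy2017 !== undefined && doc.fy2017.netOrGross === 'Net';
});

// return sum(.fy2017.amount)
const total = matching.reduce((sum, doc) => sum + doc.fy2017.amount, 0);
console.log(total); // 3.75
```

Only documents that have a net value for 2017 contribute to the sum, gross values and other years are skipped.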

You can learn more about the Noise Node.js API from the README at the corresponding repository. If you want to learn more about possible queries, have a look at the Noise Query Language reference.

Happy cost saving!

Categories: en, Noise, Node, JavaScript, Rust

Exploring data with Noise

2017-12-12 22:35

This is a quick introduction on how to explore some JSON data with Noise. We won’t do any pre-processing, but just load the data into Noise and see what we can do with it. Sometimes the JSON you get needs some tweaking before further analysis makes sense. For example you want to rename fields, or numbers are stored as strings. This exploration phase can be used to get a feeling for the data and which parts might need some adjustments.

Finding decent ready to use data that contains some nicely structured JSON was harder than I thought. Most datasets are either GeoJSON or CSV masqueraded as JSON. But I was lucky and found a JSON dump of the CVE database provided by CIRCL. So we’ll dig into the CVEs (Common Vulnerabilities and Exposures) database to find out more about all those security vulnerabilities.

Noise has a Node.js binding to get started easily. I won’t dig into the API for now. Instead I’ve prepared two scripts. One to load the data from a file containing new line separated JSON. And another one for serving up the Noise index over HTTP, so that we can explore the data via curl.


As we use the Node.js binding for Noise, you need to have Node.js, npm and Rust (easiest is probably through rustup) installed.

I’ve created a repository with the two scripts mentioned above plus a subset of the CIRCL CVE dataset. Feel free to download the full dataset from the CIRCL Open Data page (1.2G unpacked) and load it into Noise. Please note that Noise isn’t performance optimised at all yet. So the import takes some time, as the hard work of all the indexing is done at insertion time.

git clone https://github.com/vmx/blog-exploring-data-with-noise
cd blog-exploring-data-with-noise
npm install

Now everything we need should be installed, let’s load the data into Noise and do a query to verify it’s installed properly.

Loading the data and verifying the installation

Loading the data is as easy as:

npx dataload circl-cve.json

For every inserted record one dot will be printed.

To spin up the simple HTTP server, just run:

npx indexserve circl-cve

To verify it does actually respond to queries, try:

curl -X POST -d 'find {} return count()'

If all documents got inserted correctly it should return


Everything is set up properly, now it’s time to actually explore the data.

Exploring the data

We don’t have a clue yet, what the data looks like. So let’s start with looking at a single document:

curl -X POST -d 'find {} return . limit 1'
  "Modified": "2017-01-02 17:59:00.147000",
  "Published": "2017-01-02 17:59:00.133000",
  "_id": "34de83b0d3c547c089635c3a8b4960f2",
  "cvss": null,
  "cwe": "Unknown",
  "id": "CVE-2017-5005",
  "last-modified": {
    "$date": 1483379940147
  "references": [
  "summary": "Stack-based buffer overflow in Quick Heal Internet Security and earlier, Total Security and earlier, and AntiVirus Pro and earlier on OS X allows remote attackers to execute arbitrary code via a crafted LC_UNIXTHREAD.cmdsize field in a Mach-O file that is mishandled during a Security Scan (aka Custom Scan) operation.",
  "vulnerable_configuration": [],
  "vulnerable_configuration_cpe_2_2": []

The query above means: “Find all documents without restrictions and return their full contents. Limit it to a single result”.

You don’t always want to return all documents, but filter based on certain conditions. Let’s start with the word match operator ~=. It matches documents which contain those words in a specific field, in our case "summary". As “buffer overflow” is a common attack vector, let’s search for all documents that contain it in the summary.

curl -X POST -d 'find {summary: ~= "buffer overflow"}'

That’s quite a long list of random characters. Noise assigns Ids to every inserted document if the document doesn’t contain a "_id" field. By default Noise returns such Ids of the matching documents. So no return value is equivalent to return ._id. Let’s return the CVE number of the matching vulnerabilities instead. That field is called "id":

curl -X POST -d 'find {summary: ~= "buffer overflow"} return .id'

If you want to know how many there are, just append a return count() to the query:

curl -X POST -d 'find {summary: ~= "buffer overflow"} return count()'

Or we can of course return the full documents to see if there are further interesting things to look at:

curl -X POST -d 'find {summary: ~= "buffer overflow"} return .'

I won’t post the output here, it’s way too much. If you scroll through the output, you’ll see that some contain a field named "capec", which is probably about the Common Attack Pattern Enumeration and Classification. Let’s have a closer look at one of those, e.g. from “CVE-2015-8388”:

curl -X POST -d 'find {id: == "CVE-2015-8388"} return .capec'
    "id": "15",
    "name": "Command Delimiters",
    "prerequisites": …
    "related_weakness": [
    "solutions": …
    "summary": …

This time we’ve used the exact match operator ==. As the CVEs have a unique Id, it only returned a single document. It’s again a lot of data, we might only care about the CAPEC names, so let’s return those:

curl -X POST -d 'find {id: == "CVE-2015-8388"} return .capec[].name'
  "Command Delimiters",
  "Flash Parameter Injection",
  "Argument Injection",
  "Using Slashes in Alternate Encoding"

Note that it is an array of an array. The reason is that in this case we only return the CAPEC names of a single document, but our filter condition could of course match more documents, like the word match operator did when we were searching for “buffer overflow”.

Let’s find all CVEs that have the CAPEC name “Command Delimiters”.

curl -X POST -d 'find {capec: [{name: == "Command Delimiters"}]} return .id'

The CAPEC data also contains references to related weaknesses as we’ve seen before. Let’s return the related_weakness of all CVEs that have the CAPEC name “Command Delimiters”.

curl -X POST -d 'find {capec: [{name: == "Command Delimiters"}]} return {cve: .id, related: .capec[].related_weakness}'
  "cve": "CVE-2015-8389",
  "related": [
  "cve": "CVE-2015-8388",
  "related": [

That’s not really what we were after. This returns the related weaknesses of all CAPECs and not just the one named “Command Delimiters”. The solution is a so called bind variable. You can store an array element that matches a condition in a variable which can then be re-used in the return value.

Just prefix the array condition with a variable name separated by two colons:

find {capec: commdelim::[{name: == "Command Delimiters"}]}

And use it in the return value like any other path:

return {cve: .id, related: commdelim.related_weakness}

So the full query is:

curl -X POST -d 'find {capec: commdelim::[{name: == "Command Delimiters"}]} return {cve: .id, related: commdelim.related_weakness}'
  "cve": "CVE-2015-8389",
  "related": [
  "cve": "CVE-2015-8388",
  "related": [

The result isn’t that exciting as it’s the same related weaknesses for all CVEs, but of course they could be completely arbitrary. There’s no limitation on the schema.
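To make the difference between the two return values more tangible, here is a rough sketch in plain Rust of what the bind variable changes. It is only an illustration and not how Noise works internally. The structs, the function names and all the data (the CVE Id and the "W-…" weakness Ids) are made up for this example.

```rust
// Simplified stand-ins for the JSON documents; all values below are made up.
struct Capec {
    name: &'static str,
    related_weakness: Vec<&'static str>,
}

struct Cve {
    id: &'static str,
    capec: Vec<Capec>,
}

// Like `.capec[].related_weakness`: collects from every CAPEC of the document.
fn related_of_all(cve: &Cve) -> Vec<Vec<&'static str>> {
    cve.capec.iter().map(|c| c.related_weakness.clone()).collect()
}

// Like `commdelim.related_weakness`: only the elements that matched the condition.
fn related_of_bound(cve: &Cve, name: &str) -> Vec<Vec<&'static str>> {
    cve.capec
        .iter()
        .filter(|c| c.name == name)
        .map(|c| c.related_weakness.clone())
        .collect()
}

// A hypothetical document with two CAPEC entries.
fn sample() -> Cve {
    Cve {
        id: "CVE-0000-0001",
        capec: vec![
            Capec { name: "Command Delimiters", related_weakness: vec!["W-1", "W-2"] },
            Capec { name: "Argument Injection", related_weakness: vec!["W-3"] },
        ],
    }
}

fn main() {
    let cve = sample();
    println!("{}: all   = {:?}", cve.id, related_of_all(&cve));
    println!("{}: bound = {:?}", cve.id, related_of_bound(&cve, "Command Delimiters"));
}
```

Without the bind variable you get the related weaknesses of both CAPEC entries; with it only those of the entry that matched the name.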

We haven’t done any range queries yet. So let’s have a look at all CVEs that were last modified on December 28th, 2016 with a “High” severity rating according to the Common Vulnerability Scoring System. First we need to determine the correct timestamps:

date --utc --date="2016-12-28" "+%s"
date --utc --date="2016-12-29" "+%s"

Please note that the "last-modified" field has timestamps with 13 digits (ours have 10), which means that they are in milliseconds, so we just append three zeros and we’re good. The severity rating is stored in the field "cvss"; “High” severity means a value from 7.0–8.9. We need to put the field name last-modified in quotes as it contains a dash (just as you’d do it in JavaScript). The final query is:

curl -X POST -d 'find {"last-modified": {$date: >= 1482883200000, $date: < 1482969600000}, cvss: >= 7.0, cvss: <=8.9} return .id'
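If you don’t have the date command at hand, the two millisecond timestamps used in the query can also be computed with a few lines of Rust. This is a minimal sketch using only the standard library. The function names are my own, and the day calculation is the well-known “days from civil” algorithm by Howard Hinnant, not something Noise provides.

```rust
/// Days since 1970-01-01 for a Gregorian calendar date
/// (Howard Hinnant's "days from civil" algorithm).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat January and February as the last months of the previous year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400); // [0, 399]
    let shifted_month = (month + 9) % 12; // March = 0, ..., February = 11
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1; // [0, 365]
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Milliseconds since the Unix epoch for midnight UTC of the given date.
fn utc_midnight_millis(year: i64, month: i64, day: i64) -> i64 {
    days_from_civil(year, month, day) * 86_400 * 1_000
}

fn main() {
    // Lower (inclusive) and upper (exclusive) bound of the range query.
    println!("{}", utc_midnight_millis(2016, 12, 28));
    println!("{}", utc_midnight_millis(2016, 12, 29));
}
```

The two printed values are the bounds for the "last-modified" condition: the first with >=, the second with <.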

This was an introduction into basic querying of Noise. If you want to know about further capabilities you can have a look at the Noise Query Language reference or stay tuned for further blog posts.

Happy exploration!

Categories: en, Noise, Node, JavaScript, Rust

Printing panics in Rust

2017-12-05 22:35

This blog post is not about dealing with normal runtime errors; you should really use the Result type for that. This is about the case where some component might panic, but that shouldn’t bring the whole system to a halt.

I was debugging an issue in the Node.js binding for Noise. It uses the noise_search crate, which might panic if there’s an unrecoverable error. The Node.js binding should of course not crash, but handle such a panic more gracefully. Hence it catches the panics.

The existing code was only printing that there was some panic, but it didn’t contain the actual cause. I wanted to improve that.

I thought it would be easy and I could just print the debug version of the panic. So I changed the println!() to:

println!("panic happened: {:?}", result)

But that resulted only in a:

panic happened: Err(Any)

Which isn’t really that meaningful either. In the documentation about catch_unwind I read:

…and will return Err(cause) if the closure panics. The cause returned is the object with which panic was originally invoked.

I didn’t really understand what this meant. Is the object that invokes the panic the function where the panic happens? I wanted the text I was putting into the panic!() call.

Thanks to rkruppe on IRC I learnt that panic!() can take any object, not just strings. Now the documentation made sense. He also mentioned that I can downcast the Any if I know its type. As I always only use strings for panics that was easy:

if let Err(panic) = result {
    match panic.downcast::<String>() {
        Ok(panic_msg) => {
            println!("panic happened: {}", panic_msg);
        }
        Err(_) => {
            println!("panic happened: unknown type.");
        }
    }
}
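For completeness, here is a self-contained sketch of the whole round trip with catch_unwind. The helper name panic_message is made up for this example. One thing to be aware of: a panic!() with a formatted message carries a String, while a panic!() with a plain string literal usually carries a &'static str, so the sketch tries both before giving up.

```rust
use std::any::Any;
use std::panic;

// Hypothetical helper: turn a panic payload into a printable message.
fn panic_message(payload: Box<dyn Any + Send>) -> String {
    match payload.downcast::<String>() {
        // Formatted panic messages arrive as a `String`.
        Ok(msg) => *msg,
        // Plain string literals arrive as a `&'static str`.
        Err(payload) => match payload.downcast::<&'static str>() {
            Ok(msg) => (*msg).to_string(),
            Err(_) => "unknown type".to_string(),
        },
    }
}

fn main() {
    let code = 42;
    // The closure panics; `catch_unwind` turns that into an `Err(payload)`.
    let result = panic::catch_unwind(|| {
        panic!("something went wrong: {}", code);
    });
    if let Err(payload) = result {
        println!("panic happened: {}", panic_message(payload));
    }
}
```

Note that the default panic hook still prints its own message to stderr before catch_unwind returns; the sketch only adds the recovered text on top of that.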

If you want to play a bit around with it, I’ve created a minimal example for the Rust Playground. Happy panicking!

Categories: en, Noise, Rust

Possible future direction for Noise

2017-10-06 22:35

I've applied for a grant from the Prototypefund to get some funding for Noise. It was a great opportunity to put some thought into which direction I might go with Noise. I've already posted my application in German, but I figured it might also be interesting for a bigger audience. Hence here's the translated version of it.

Which open source projects have you worked on before?

What's the relation to the main focus of the third round?

Note: The third round is about diversity.

Noise enables people that aren't computer experts to do data analysis. In my experience such analysis so far has been the privilege of a small group of people – developers – that know how to deal with raw data. Shouldn't the analysis of data be opened to a broader community? For example to people that have basic coding skills, but that don't have a deeper understanding of how databases work, or how to administer them. For those people it should be easy to put the data into the environment they know and to get started with the analysis immediately.

Which social issues do you want to fix with your project?

Thanks to the open data movement there's a democratisation happening in the data world. This has huge potential for freer formation of opinions and more self-determination. Statements and facts can get reproduced and verified. This potential must be tapped more broadly. Having the data available is not enough. The challenge is creating software solutions that make such data analysis more accessible.

How do you want to implement your project?

Noise is a library written in Rust for searching and analysing JSON data. There's already a first working version. On the lowest level it's using Facebook's key-value store RocksDB, which was modified to support spatial queries.

There will be a C-API to integrate with other programming/scripting languages. Then it would also be possible to use it as a backend/driver for projects like GDAL or R. Integrating with programming/scripting languages doesn't stop with the API. Most languages have a full ecosystem including a package manager. Therefore it's important that Noise can be installed through those native mechanisms. This lowers the bar to get started. It already works for Node.js via “npm install noise-search”.

Which similar existing solutions are there and how is your project better?

Apache Lucene is a library for full text search. As it's pretty low-level it mostly isn't used directly, but together with Elasticsearch/Apache Solr. Noise is on a higher level than Apache Lucene and works with JSON. The processing/analysis is done with a simple query language.

Who is the target audience and how will your tool get a hold of them?

The target audience is people with basic programming knowledge. This could be scientists that want to do analysis for their empirical studies. Or it could be citizens from civil society that want to do some fact-finding. With the integration into several programming/scripting languages, Noise is just another dependency/library and can easily be found and installed with the corresponding package manager.

Have you already worked on this idea? If yes, describe the current state and the future advances

The first version already supports basic full text search and it's also possible to query for numeric ranges and do spatial queries on geodata (GeoJSON). The next steps are making the system more robust and adding further interfaces. There could e.g. be a Python API in addition to the already existing Node.js one. Also there should be small projects doing some analysis to demonstrate the possibilities of Noise. Those can then be documented as tutorials for lowering the bar to get started even further.

Do a quick sketch of the most important milestones that you want to achieve during the period of funding

Note: The period of funding is 6 months.

  • C-API: Change the current Node.js API, which is using Rust directly, to a clean C-API
  • Python API: Deep integration like the Node.js one to get an easy installation through the package manager
  • More examples/documentation: Do small demo projects which are documented as tutorials to make the concepts of Noise more accessible
  • Internal improvements: The tightly coupled query parser needs to be refactored, among other things for better error messages
  • Benchmarks: Benchmarks should prevent regressions and make Noise comparable to other systems

Categories: en, Noise, funding

By Volker Mische

Powered by Kukkaisvoima version 7