vmx

the blllog.

Joining Protocol Labs

2018-01-24 17:59

I’m pumped to announce that I’m joining Protocol Labs as a software engineer. Those following me on Twitter or looking on my GiHub activity might have already got some hints.

Short term

My main focus is currently on IPLD (InterPlanetary Linked Data). I’ll smooth things out and also work on the IPLD specs, mostly on IPLD Selectors. Those IPLD Selectors will be used to make the underlying graph more efficient to traverse (especially for IPFS). That’s a lot of buzzwords, I hope it will get clearer the more I’ll blog about this.

To get started I’ve done the JavaScript IPLD implementations for Bitcoin and Zcash. Those are the basis to make easy traversal through the Bitcoin and Zcash blockchains possible.

Longer term

In the longer term I’ll be responsible to bring IPLD to Rust. That’s especially exciting with Rust’s new WebAssembly backend. You’ll get a high performance Rust implementation, but also one that works in Browsers.

What about Noise?

Many of you probably know that I’ve been working full-time on Noise for the past 1.5 years. It shapes up nicely and is already quite usable. Of course I don’t want to see this project vanish and it won’t. At the moment I only work part-time at Protocol Labs, to also have some time for Noise. In addition to that there’s also interest within Protocol Labs to use Noise (or parts of it) for better query capabilities. So far it’s only rough ideas I mentioned briefly at the end of my talk about Noise at the [Lisbon IPFS Meetup] two weeks ago. But what’s the distributed web without search?

What about geo?

I’m also part of the OSGeo community and FOSS4G movement. So what’s the future there? I see a lot of potential in the Sneakernet. If geo-processing workflows are based around IPFS, you could use the same tools/scripts whether it is stored somewhere in the cloud, or access you local mirror/dump if your Internet connection isn’t that fast/reliable.

I expect non-realiable connectivity to be a hot topic at the FOSS4G 2018 conference in Dar es Salaam, Tansania.

Conclusion

I’m super excited. It’s a great team and I’m looking forward to push the distributed web a bit forward.

Categories: en, ProtocolLabs, IPLD, IPFS, JavaScript, Rust, geo

Introduction to Noise’s Node.js API

2017-12-21 11:40

In the previous blog post about Noise we imported data with the help of some already prepared scripts. This time it’s an introduction in how to use Noise‘s Promise-based Node.js API directly yourself.

The dataset we use is not a ready to use single file, but one that consists of several ones. The data is the “Realized Cost Savings and Avoidance” for US government agencies. I’m really excited that such data gets openly published as JSON. I wished Germany would be that advanced in this regard. If you want to know more about the structure of the data, there’s documentation about the [JSON Schmema], they even have a “OFCIO JSON User Guide for Realized Cost Savings” on how to produce the data out of Excel.

I’ve prepared a repository containing the final code and the data. But feel free to follow along this tutorial by yourself and just point to the data directory of that repository when running the script.

Let’s start with the boilerplate code for reading in those files and parsing them as JSON. But first create a new package:

mkdir noise-cost-savings
cd noise-cost-savings
npm init --force

You can use --force here as you probably won’t publish this package anyway. Put the boilerplate code below into a file called index.js. Please note that the code is kept as simple as possible, for a real world application you surely want better error handling.

#!/usr/bin/env node
'use strict';

const fs = require('fs');
const path = require('path');

// The only command line argument is the directory where the data files are
const inputDir = process.argv[2];
console.log(`Loading data from ${inputDir}`);

fs.readdir(inputDir, (_err, files) => {
  files.forEach(file => {
    fs.readFile(path.join(inputDir, file), (_err, data) => {
      console.log(file);
      const json = JSON.parse(data);
      processFile(json);
    });
  });
});

const processFile = (data) => {
  // This is where our actual code goes
};

This code should already run. Checkout my repository with the data into some directory first:

git clone https://github.com/vmx/blog-introduction-to-noises-nodejs-api

Now run the script from above as:

node index.js <path-to-directory-from-my–repo-mentioned-above>/data

Before we take a closer look at the data, let’s install the Noise module. Please note that you need to have Rust installed (easiest is probably through rustup) before you can install Noise.

npm install noise-search

This will take a while. So let’s get back to code. Load the noise-search module by adding:

const noise = require('noise-search');

A Noise index needs to be opened and closed properly, else your script will hang and not terminate. Opening a new Noise index is easy. Just put this before reading the files:

const index = noise.open('costsavings', true);

It means that open an index called costsavings and create it if it doesn’t exist yet (that’s the boolean true). Closing the index is more difficult due to the asynchronous nature of the code. We can close the index only after all the processing is done. Hence we wrap the fs.readFile(…) call in a Promise. So that new code looks like this:

fs.readdir(inputDir, (_err, files) => {
  const promises = files.map(file => {
    return new Promise((resolve, reject) => {
      fs.readFile(path.join(inputDir, file), (err, data) => {
        if (err) {
          reject(err);
          throw err;
        }

        console.log(file);
        const json = JSON.parse(data);
        resolve(processFile(json));
      });
    });
  });
  Promise.all(promises).then(() => {
    console.log("Done.");
    index.close();
  });
});

If you run the script now it should print out the file names as before and terminate with a Done.. There got a directory called costsavings created after you ran the script. This is where the Noise index is stored in.

Now let’s have a look at the data files, e.g. the cost savings file from the Department of Commerce (or the JSON Schema), you’ll see that it has a single field called "strategies", which contains an array with all strategies. We are free to pre-process the data as much as we want before we insert it into Noise. So let’s create a separate document for every strategy. Our processFile() function now looks like:

const processFile = (data) => {
  data.strategies.forEach(async strategy => {
    // Use auto-generated Ids for the documents
    await index.add(strategy);
  });
};

Now all the strategies get inserted. Make sure you delete the index (the costsavings directory) if you re-run the scripts, else you would end up with duplicated entries, as different Ids will be generated on every run.

To query the index you could use the Noise indexserve script that I’ve also used in the last blog post about Noise. Or we just add a small query at the end of the script after the loading is done. Our query function will do the query and output the result:

const queryNoise = async (query) => {
  const results = await index.query(query);
  for (const result of results) {
    console.log(result);
  }
};

There’s not much to say, except it’s again a Promised-based API. And now hook up this function after the loading and before the index is closed. For that, replace the Promise.all(…) call with:

Promise.all(promises).then(async () => {
  await queryNoise('find {} return count()');
  console.log("Done.");
  index.close();
});

It’s a really simple query, it just returns the number of documents that are in there (644). After all this hard work, it’s time to make a more complicated query on this dataset to show that it was worth doing all this. Let’s return the total net savings of all agencies in 2017. Replace the query find {} return count() with:

find {fy2017: {netOrGross: == "Net"}} return sum(.fy2017.amount)

That’s $845m savings. Not bad at all!

You can learn more about the Noise Node.js API from the README at the corresponding repository. If you want to learn more about possible queries, have a look at the Noise Query Language reference.

Happy cost saving!

Categories: en, Noise, Node, JavaScript, Rust

Exploring data with Noise

2017-12-12 13:11

This is a quick introduction on how to explore some JSON data with Noise. We won’t do any pre-processing, but just load the data into Noise and see what we can do with it. Sometimes the JSON you get needs some tweaking before further analysis makes sense. For example you want to rename fields or numbers are stored as string. This exploration phase can be used to get a feeling for the data and which parts might need some adjustments.

Finding decent ready to use data that contains some nicely structured JSON was harder than I thought. Most datasets are either GeoJSON or CSV masqueraded as JSON. But I was lucky and found a JSON dump of the CVE database provided by CIRCL. So we’ll dig into the CVEs (Common Vulnerabilities and Exposures) database to find out more about all those security vulnerabilities.

Noise has a Node.js binding to get started easily. I won’t dig into the API for now. Instead I’ve prepared two scripts. One to load the data from a file containing new line separated JSON. And another one for serving up the Noise index over HTTP, so that we can explore the data via curl.

Prerequisites

As we use the Node.js binding for Noise, you need to have Node.js, npm and Rust (easiest is probably through rustup) installed.

I’ve created a repository with the two scripts mentioned above plus a subset of the CIRCL CVE dataset. Feel free to download the full dataset from the CIRCL Open Data page (1.2G unpacked) and load it into Noise. Please note that Noise isn’t performance optimised at all yet. So the import takes some time as the hard work of all the indexing is done on insertion time.

git clone https://github.com/vmx/blog-exploring-data-with-noise
cd blog-exploring-data-with-noise
npm install

Now everything we need should be installed, let’s load the data into Noise and do a query to verify it’s installed properly.

Loading the data and verify installation

Loading the data is as easy as:

npx dataload circl-cve.json

For every inserted record one dot will be printed.

To spin up the simple HTTP server, just run:

npx indexserve circl-cve

To verify it does actually respond to queries, try:

curl -X POST http://127.0.0.1:3000/query -d 'find {} return count()'

If all documents got inserted correctly it should return

[
1000
]

Everything is set up properly, now it’s time to actually exploring the data.

Exploring the data

We don’t have a clue yet, what the data looks like. So let’s start with looking at a single document:

curl -X POST http://127.0.0.1:3000/query -d 'find {} return . limit 1'
[
{
  "Modified": "2017-01-02 17:59:00.147000",
  "Published": "2017-01-02 17:59:00.133000",
  "_id": "34de83b0d3c547c089635c3a8b4960f2",
  "cvss": null,
  "cwe": "Unknown",
  "id": "CVE-2017-5005",
  "last-modified": {
    "$date": 1483379940147
  },
  "references": [
    "https://github.com/payatu/QuickHeal",
    "https://www.youtube.com/watch?v=h9LOsv4XE00"
  ],
  "summary": "Stack-based buffer overflow in Quick Heal Internet Security 10.1.0.316 and earlier, Total Security 10.1.0.316 and earlier, and AntiVirus Pro 10.1.0.316 and earlier on OS X allows remote attackers to execute arbitrary code via a crafted LC_UNIXTHREAD.cmdsize field in a Mach-O file that is mishandled during a Security Scan (aka Custom Scan) operation.",
  "vulnerable_configuration": [],
  "vulnerable_configuration_cpe_2_2": []
}
]

The query above means: “Find all documents without restrictions and return it’s full contents. Limit it to a single result”.

You don’t always want to return all documents, but filter based on certain conditions. Let’s start with the word match operator ~=. It matches document which contains those words in a specific field, in our case "summary". As “buffer overflow” is a common attack vector, let’s search for all documents that contain it in the summary.

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"}'
[
"34de83b0d3c547c089635c3a8b4960f2",
"8dff5ea0e5594e498112abf1c222d653",
"741cfaa4b7ae43909d1da153747975c9",
…
"b7419042c9464a7b96d3df74451cb4a7",
"d379e9fda704446982cee8638f32e72b"
]

That’s quite a long list of random characters. Noise assigns Ids to every inserted document if the document doesn’t contain a "_id" field. By default Noise returns such Ids of the matching documents. So no return value is equivalent to return ._id. Let’s return the CVE number of the matching vulnerabilities instead. That field is called "id":

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return .id'
[
"CVE-2017-5005",
"CVE-2016-9942",
…
"CVE-2015-2710",
"CVE-2015-2666"
]

If you want to know how many there are, just append a return count() to the query:

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return count()'
[
61
]

Or we can of course return the full documents to see if there are further interesting things to look at:

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return .'
…

I won’t post the output here, it’s way too much. If you scroll through the output, you’ll see that some contain a field named "capec", which is probably about the Common Attack Pattern Enumeration and Classification. Let’s have a closer look at one of those, e.g. from “CVE-2015-8388”:

curl -X POST http://127.0.0.1:3000/query -d 'find {id: == "CVE-2015-8388"} return .capec'
[
[
  {
    "id": "15",
    "name": "Command Delimiters",
    "prerequisites": …
    "related_weakness": [
      "146",
      "77",
      …
    ],
    "solutions": …
    "summary": …
  },
  …

This time we’ve used the exact match operator ==. As the CVEs have a unique Id, it only returned a single document. It’s again a lot of data, we might only care about the CAPEC names, so let’s return those:

curl -X POST http://127.0.0.1:3000/query -d 'find {id: == "CVE-2015-8388"} return .capec[].name'
[
[
  "Command Delimiters",
  "Flash Parameter Injection",
  "Argument Injection",
  "Using Slashes in Alternate Encoding"
]
]

Note that it is an array of an array. The reason is that in this case we only return the CAPEC names of a single document, but our filter condition could of course match more documents, like the word match operator did when we were searching for “buffer overlow”.

Let’s find out all CVEs where the CAPEC name “Directory Traversal”.

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: [{name: == "Command Delimiters"}]} return .id'
[
"CVE-2015-8389",
"CVE-2015-8388",
"CVE-2015-4244",
"CVE-2015-4224",
"CVE-2015-2265",
"CVE-2015-1986",
"CVE-2015-1949",
"CVE-2015-1938"
]

The CAPEC data also contains references to related weaknesses as we’ve seen before. Let’s return the related_weakness of all CVEs that have the CAPEC name “Command Delimiters”.

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: [{name: == "Command Delimiters"}]} return {cve: .id, related: .capec[].related_weakness}'
[
{
  "cve": "CVE-2015-8389",
  "related": [
    [
      "146",
      "77",
      …
    ],
    [
      "184",
      "185",
      "697"
    ],
    …
  ]
},
{
  "cve": "CVE-2015-8388",
  "related": [
  …
  ]
},
…
]

That’s not really what we were after. This returns the related weaknesses of all CAPECs and not just the one named “Command Delimiters”. The solution is a so called bind variable. You can store an array element that matches a condition in a variable which can then be re-used in the return value.

Jut prefix the array condition with a variable name separated by two colons:

find {capec: commdelim::[{name: == "Command Delimiters"}]}

And use it in the return value like any other path:

return {cve: .id, related: commdelim.related_weakness}

So the full query is:

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: commdelim::[{name: == "Command Delimiters"}]} return {cve: .id, related: commdelim.related_weakness}'
[
{
  "cve": "CVE-2015-8389",
  "related": [
    [
      "146",
      "77",
      …
    ]
  ]
},
{
  "cve": "CVE-2015-8388",
  "related": [
    [
      "146",
      "77",
      …
    ]
  ]
},
…
]

The result isn’t that exciting as it’s the same related weaknesses for all CVEs, but of course the could be completely arbitrary. There’s no limitation on the schema.

So far we haven’t done any range requests yet. So let’s have a look at all CVEs that were last modified on December 28th with “High” severity rating according to the Common Vulnerability Scoring System. First we need to determine the correct timestamps:

date --utc --date="2016-12-28" "+%s"
1482883200
date --utc --date="2016-12-29" "+%s"
1482969600

Please note that the "last-modified" field has timestamps with 13 characters (ours have 10), which means that they are in milliseconds, so we just append three zeros and we’re good. The severity rating is stored in the field "cvss”, “High” severity means a value from 7.0–8.9. We need to put the field name last-modified in quotes as it contains a dash (just as you’d do it in JavaScript). The final query is:

curl -X POST http://127.0.0.1:3000/query -d 'find {"last-modified": {$date: >= 1482883200000, $date: < 1482969600000}, cvss: >= 7.0, cvss: <=8.9} return .id'
[
"CVE-2015-4199",
"CVE-2015-4200",
"CVE-2015-4224",
"CVE-2015-4227",
"CVE-2015-4230",
"CVE-2015-4234",
"CVE-2015-4208",
"CVE-2015-4526"
]

This was an introduction into basic querying of Noise. If you want to know about further capabilities you can have a look at the Noise Query Language reference or stay tuned for further blog posts.

Happy exploration!

Categories: en, Noise, Node, JavaScript, Rust

Printing panics in Rust

2017-12-05 16:19

This blog post is not about about dealing with normal runtime errors, you should really use the Result Type for that. This is about the case where some component might panic, but that shouldn’t bring the whole system to halt.

I was debugging some issue in the Node.js binding for Noise. It is using the noise_search crate which might panic if there’s an unrecoverable error. Though the Node.js binding should of course not crash, but handle it in a more graceful way. Hence it is catching the panics.

The existing code was only printing that there was some panic, but it didn’t contain the actual cause. I wanted to improve that.

I thought it would be easy and I could just print the debug version of the panic. So I changed the println!() to:

println!("panic happend: {:?}", result)

But that resulted only in a:

panic happened: Err(Any)

Which isn’t really that meaningful either. In the documentation about catch_unwind I read

…and will return Err(cause) if the closure panics. The cause returned is the object with which panic was originally invoked.

I didn’t really understand what this meant. Is the object that invokes the panic the function where the panic happens? I wanted the text I was putting into the panic!() call.

Thanks to rkruppe on IRC I learnt that panic!() can take any object, not just strings. Now the documentation made sense. He also mentioned that I can downcast Any if I know that type. As I always only use strings for panics that was easy:

if let Err(panic) = result {
    match panic.downcast::<String>() {
        Ok(panic_msg) => {
            println!("panic happened: {}", panic_msg);
        }
        Err(_) => {
            println!("panic happened: unknown type.");
        }
    }
}

If you want to play a bit around with it, I’ve created a minimal example for the Rust Playground. Happy panicking!

Categories: en, Noise, Rust

Introducing Noise

2017-09-19 11:00

I meant to write this blog post for quite some time. It's my view on the new project I'm working on called Noise. I work together with Damien Katz on it full-time for already about a year now. Damien already blogged a bit about the incarnation of Noise.

I can't recall when Damien first told me about the idea, but I surely remember one meeting we had at Couchbase, were plenty of developers were packed in a small room in the Couchbase Mountain View office. Damien was presenting his idea on how flexible JSON indexing should work. It was based on an idea that came up a long time ago at IBM (see Damien's blog post for more information).

Then the years passed without this project actually happening. I've heard again about it when I was visiting Damien while I was in the Bay Area. He told me about his plan actually doing this for real. If I would join early i would become a founder of the project. It wasn't a light-hearted decision, but I eventually decided to leave Couchbase to work full-time on Noise.

Originally Damien created a prototype in C++. But as I was really convinced that Rust is the future for systems programming and databases, I started to port it to Rust before I visited him in the US. Although Damien was skeptical at first, he at least wanted to give it a try and during my stay I convinced him that Rust is the way to go.

Damien did the hard parts on the core of Noise and the Node.js bindings. I mostly spent my time getting an R-tree working on top of RocksDB. It took several attempts, but I think finally I found a good solution. Currently it's a special purpose implementation for Noise, but it could easily be made more generic, or adapted to other specific use cases. If you have such needs, please let me know. At this year's Global FOSS4G conference I presented Noise and its spatial capabilities to a wider audience. I'm happy with the feedback I got. People especially seem to enjoy the query language we came up with.

So now we have a working version which does indexing and has many query features. You can try out Noise online. There's also basic geospatial bounding box query support, which I'll blog more about once I've cleaned up the coded-in-rush-for-a-conference mess and have merged into the master branch.

There are exciting times ahead as now it's time to get some funding for the project. Damien and I don't want to do the venture capital based startup kind of thing, but rather try to find funding through other channels. This will also define the next steps. Noise is a library so it can be the basis for a scaled up distributed system, and/or to scale down into a nice small analytics system that you can run on your local hardware when you don't have access to the cloud.

So in case you read this, tried it out and think that this is exactly what you've been looking for, please tell me about your use case and perhaps you even want to help funding this project.

Categories: en, Noise, RocksDB, Rust, geo

By Volker Mische

Powered by Kukkaisvoima version 7