Exploring data with Noise

About me

My name is Volker Mische and I'm an open source enthusiast and hacker. You can reach me via email, or Bluesky (@vmx.cx). Find me also on GitHub.

2017-12-12 22:35

This is a quick introduction on how to explore some JSON data with Noise. We won’t do any pre-processing, but just load the data into Noise and see what we can do with it. Sometimes the JSON you get needs some tweaking before further analysis makes sense. For example, you might want to rename fields, or numbers might be stored as strings. This exploration phase can be used to get a feeling for the data and for which parts might need some adjustments.

Finding decent, ready-to-use data that contains some nicely structured JSON was harder than I thought. Most datasets are either GeoJSON or CSV masquerading as JSON. But I was lucky and found a JSON dump of the CVE database provided by CIRCL. So we’ll dig into the CVE (Common Vulnerabilities and Exposures) database to find out more about all those security vulnerabilities.

Noise has a Node.js binding to get started easily. I won’t dig into the API for now. Instead I’ve prepared two scripts: one to load the data from a file containing newline-separated JSON, and another one for serving up the Noise index over HTTP, so that we can explore the data via curl.

Prerequisites

As we use the Node.js binding for Noise, you need to have Node.js, npm and Rust (easiest is probably through rustup) installed.

I’ve created a repository with the two scripts mentioned above plus a subset of the CIRCL CVE dataset. Feel free to download the full dataset from the CIRCL Open Data page (1.2G unpacked) and load it into Noise. Please note that Noise isn’t performance optimised at all yet, so the import takes some time, as the hard work of all the indexing is done at insertion time.

git clone https://github.com/vmx/blog-exploring-data-with-noise
cd blog-exploring-data-with-noise
npm install

Now that everything we need should be installed, let’s load the data into Noise and run a query to verify that it works properly.

Loading the data and verifying the installation

Loading the data is as easy as:

npx dataload circl-cve.json

For every inserted record one dot will be printed.
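
The input format is simple enough to sketch. The following is not the dataload script from the repository, just a minimal illustration of what newline-separated JSON means: one complete JSON document per line. The function name parseNewlineJson and the two abbreviated sample records are made up for this example.

```javascript
// Hypothetical helper, not part of Noise or of the dataload script.
// Turns newline-separated JSON text into an array of documents.
function parseNewlineJson(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    // Skip blank lines, e.g. the one after a trailing newline.
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Two heavily abbreviated records for illustration, one per line.
const sample = '{"id": "CVE-2017-5005"}\n{"id": "CVE-2016-9942"}\n';
const docs = parseNewlineJson(sample);
console.log(docs.length); // 2
console.log(docs.map((doc) => doc.id));
```

Each parsed object would then be handed to the index one by one, which is presumably where the dots in the output come from.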

To spin up the simple HTTP server, just run:

npx indexserve circl-cve

To verify it does actually respond to queries, try:

curl -X POST http://127.0.0.1:3000/query -d 'find {} return count()'

If all documents got inserted correctly, it should return:

[
1000
]

Everything is set up properly, so now it’s time to actually explore the data.
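
If you’d rather send queries from a script than from curl, the same request can be sketched in JavaScript. This is only a rough sketch under a few assumptions: the helper name noiseQuery is made up, it relies on the global fetch of recent Node.js versions, and it assumes the server only cares about the raw query text in the POST body (as with curl -d) and always answers with JSON.

```javascript
// Hypothetical helper around the HTTP endpoint used by the curl commands.
// Assumption: the server accepts the raw query text as the POST body
// and responds with a JSON array.
// `fetchImpl` is injectable so the function can be tried without a server.
async function noiseQuery(query, fetchImpl = fetch, url = "http://127.0.0.1:3000/query") {
  const response = await fetchImpl(url, { method: "POST", body: query });
  if (!response.ok) {
    throw new Error(`Query failed with HTTP status ${response.status}`);
  }
  return response.json();
}

// Usage against a running indexserve instance:
// noiseQuery("find {} return count()").then((result) => console.log(result));
```

The rest of this post sticks to curl, as it makes the queries easy to copy and paste.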

Exploring the data

We don’t have a clue yet what the data looks like, so let’s start by looking at a single document:

curl -X POST http://127.0.0.1:3000/query -d 'find {} return . limit 1'
[
{
  "Modified": "2017-01-02 17:59:00.147000",
  "Published": "2017-01-02 17:59:00.133000",
  "_id": "34de83b0d3c547c089635c3a8b4960f2",
  "cvss": null,
  "cwe": "Unknown",
  "id": "CVE-2017-5005",
  "last-modified": {
    "$date": 1483379940147
  },
  "references": [
    "https://github.com/payatu/QuickHeal",
    "https://www.youtube.com/watch?v=h9LOsv4XE00"
  ],
  "summary": "Stack-based buffer overflow in Quick Heal Internet Security 10.1.0.316 and earlier, Total Security 10.1.0.316 and earlier, and AntiVirus Pro 10.1.0.316 and earlier on OS X allows remote attackers to execute arbitrary code via a crafted LC_UNIXTHREAD.cmdsize field in a Mach-O file that is mishandled during a Security Scan (aka Custom Scan) operation.",
  "vulnerable_configuration": [],
  "vulnerable_configuration_cpe_2_2": []
}
]

The query above means: “Find all documents without restrictions and return their full contents. Limit it to a single result”.

You don’t always want to return all documents, but rather filter based on certain conditions. Let’s start with the word match operator ~=. It matches documents that contain the given words in a specific field, in our case "summary". As “buffer overflow” is a common attack vector, let’s search for all documents that contain it in the summary.

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"}'
[
"34de83b0d3c547c089635c3a8b4960f2",
"8dff5ea0e5594e498112abf1c222d653",
"741cfaa4b7ae43909d1da153747975c9",
…
"b7419042c9464a7b96d3df74451cb4a7",
"d379e9fda704446982cee8638f32e72b"
]

That’s quite a long list of random characters. Noise assigns an Id to every inserted document that doesn’t contain an "_id" field. By default Noise returns the Ids of the matching documents, so omitting the return clause is equivalent to return ._id. Let’s return the CVE number of the matching vulnerabilities instead. That field is called "id":

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return .id'
[
"CVE-2017-5005",
"CVE-2016-9942",
…
"CVE-2015-2710",
"CVE-2015-2666"
]

If you want to know how many there are, just append a return count() to the query:

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return count()'
[
61
]

Or we can of course return the full documents to see if there are further interesting things to look at:

curl -X POST http://127.0.0.1:3000/query -d 'find {summary: ~= "buffer overflow"} return .'
…

I won’t post the output here, it’s way too much. If you scroll through the output, you’ll see that some contain a field named "capec", which is probably about the Common Attack Pattern Enumeration and Classification. Let’s have a closer look at one of those, e.g. from “CVE-2015-8388”:

curl -X POST http://127.0.0.1:3000/query -d 'find {id: == "CVE-2015-8388"} return .capec'
[
[
  {
    "id": "15",
    "name": "Command Delimiters",
    "prerequisites": …
    "related_weakness": [
      "146",
      "77",
      …
    ],
    "solutions": …
    "summary": …
  },
  …

This time we’ve used the exact match operator ==. As the CVEs have a unique Id, it only returned a single document. It’s again a lot of data, and we might only care about the CAPEC names, so let’s return those:

curl -X POST http://127.0.0.1:3000/query -d 'find {id: == "CVE-2015-8388"} return .capec[].name'
[
[
  "Command Delimiters",
  "Flash Parameter Injection",
  "Argument Injection",
  "Using Slashes in Alternate Encoding"
]
]

Note that it is an array of an array. The reason is that in this case we only return the CAPEC names of a single document, but our filter condition could of course match more documents, like the word match operator did when we were searching for “buffer overflow”.
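
If you post-process such a result in a script and only want a plain list of names, the extra level of nesting is easy to remove on the client side. A minimal sketch in JavaScript, with the result shape copied from the output above:

```javascript
// Result shape copied from the query above: one inner array per matching document.
const result = [
  [
    "Command Delimiters",
    "Flash Parameter Injection",
    "Argument Injection",
    "Using Slashes in Alternate Encoding",
  ],
];

// Array.prototype.flat() removes one level of nesting.
const names = result.flat();
console.log(names.length); // 4

// Deduplicate, as several matching documents may share a CAPEC name.
const uniqueNames = [...new Set(names)];
console.log(uniqueNames.length); // 4
```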

Let’s find all CVEs that have a CAPEC named “Command Delimiters”.

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: [{name: == "Command Delimiters"}]} return .id'
[
"CVE-2015-8389",
"CVE-2015-8388",
"CVE-2015-4244",
"CVE-2015-4224",
"CVE-2015-2265",
"CVE-2015-1986",
"CVE-2015-1949",
"CVE-2015-1938"
]

The CAPEC data also contains references to related weaknesses as we’ve seen before. Let’s return the related_weakness of all CVEs that have the CAPEC name “Command Delimiters”.

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: [{name: == "Command Delimiters"}]} return {cve: .id, related: .capec[].related_weakness}'
[
{
  "cve": "CVE-2015-8389",
  "related": [
    [
      "146",
      "77",
      …
    ],
    [
      "184",
      "185",
      "697"
    ],
    …
  ]
},
{
  "cve": "CVE-2015-8388",
  "related": [
  …
  ]
},
…
]

That’s not really what we were after. This returns the related weaknesses of all CAPECs and not just the one named “Command Delimiters”. The solution is a so-called bind variable. You can store an array element that matches a condition in a variable, which can then be re-used in the return value.

Just prefix the array condition with a variable name separated by two colons:

find {capec: commdelim::[{name: == "Command Delimiters"}]}

And use it in the return value like any other path:

return {cve: .id, related: commdelim.related_weakness}

So the full query is:

curl -X POST http://127.0.0.1:3000/query -d 'find {capec: commdelim::[{name: == "Command Delimiters"}]} return {cve: .id, related: commdelim.related_weakness}'
[
{
  "cve": "CVE-2015-8389",
  "related": [
    [
      "146",
      "77",
      …
    ]
  ]
},
{
  "cve": "CVE-2015-8388",
  "related": [
    [
      "146",
      "77",
      …
    ]
  ]
},
…
]

The result isn’t that exciting as it’s the same related weaknesses for all CVEs, but of course they could be completely arbitrary. There’s no limitation on the schema.
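
To make the difference between the two return values concrete, here is a plain JavaScript sketch that mimics both on a hand-made document. It says nothing about how Noise evaluates queries internally. The document, its Id, the name of the second CAPEC entry and the helper names are invented for illustration, and the weakness lists are shortened.

```javascript
// Hand-made, shortened document for illustration only (not real CVE data).
const doc = {
  id: "CVE-0000-0000",
  capec: [
    { name: "Command Delimiters", related_weakness: ["146", "77"] },
    { name: "Some Other Pattern", related_weakness: ["184", "185", "697"] },
  ],
};

// Mimics `related: .capec[].related_weakness`: every CAPEC entry contributes.
function relatedOfAll(record) {
  return record.capec.map((entry) => entry.related_weakness);
}

// Mimics `commdelim::[{name: == "Command Delimiters"}]` combined with
// `commdelim.related_weakness`: only the entries that matched contribute.
function relatedOfMatching(record, name) {
  return record.capec
    .filter((entry) => entry.name === name)
    .map((entry) => entry.related_weakness);
}

console.log(JSON.stringify(relatedOfAll(doc)));
// [["146","77"],["184","185","697"]]
console.log(JSON.stringify(relatedOfMatching(doc, "Command Delimiters")));
// [["146","77"]]
```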

So far we haven’t done any range requests yet. So let’s have a look at all CVEs that were last modified on December 28th, 2016, with a “High” severity rating according to the Common Vulnerability Scoring System. First we need to determine the correct timestamps:

date --utc --date="2016-12-28" "+%s"
1482883200
date --utc --date="2016-12-29" "+%s"
1482969600

Please note that the "last-modified" field has timestamps with 13 digits (ours have 10), which means that they are in milliseconds, so we just append three zeros and we’re good. The severity rating is stored in the field "cvss"; “High” severity means a value from 7.0–8.9. We need to put the field name last-modified in quotes as it contains a dash (just as you’d do in JavaScript). The final query is:

curl -X POST http://127.0.0.1:3000/query -d 'find {"last-modified": {$date: >= 1482883200000, $date: < 1482969600000}, cvss: >= 7.0, cvss: <=8.9} return .id'
[
"CVE-2015-4199",
"CVE-2015-4200",
"CVE-2015-4224",
"CVE-2015-4227",
"CVE-2015-4230",
"CVE-2015-4234",
"CVE-2015-4208",
"CVE-2015-4526"
]
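
The timestamp arithmetic can also be done in a script instead of with date. The following JavaScript sketch computes the millisecond range for one UTC day and assembles a query text equivalent to the one above. Both helper names (utcDayRange and lastModifiedQuery) are made up for this example and are not part of Noise.

```javascript
// Hypothetical helper: millisecond range [start, end) covering one UTC day.
function utcDayRange(day) {
  const start = Date.parse(`${day}T00:00:00Z`);
  if (Number.isNaN(start)) {
    throw new Error(`Not a valid YYYY-MM-DD date: ${day}`);
  }
  const millisPerDay = 24 * 60 * 60 * 1000;
  return { start, end: start + millisPerDay };
}

// Hypothetical helper: builds the range query for one day and a CVSS interval.
function lastModifiedQuery(day, minCvss, maxCvss) {
  const { start, end } = utcDayRange(day);
  return (
    `find {"last-modified": {$date: >= ${start}, $date: < ${end}}, ` +
    `cvss: >= ${minCvss.toFixed(1)}, cvss: <= ${maxCvss.toFixed(1)}} return .id`
  );
}

console.log(utcDayRange("2016-12-28"));
// { start: 1482883200000, end: 1482969600000 }
console.log(lastModifiedQuery("2016-12-28", 7.0, 8.9));
```

JavaScript timestamps are already in milliseconds, so there is no need to append the three zeros by hand.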

This was an introduction to basic querying with Noise. If you want to know about further capabilities, you can have a look at the Noise Query Language reference or stay tuned for further blog posts.

Happy exploration!

Categories: en, Noise, Node, JavaScript, Rust

By Volker Mische