vmx

the blllog.

JSON, CBOR and numeric types

2026-04-20 14:05

CBOR makes sense as a canonical representation format, as that’s very hard to do with JSON (search for “canonical JSON” and you’ll find a lot of prior art). But having a JSON representation often makes sense, e.g. at HTTP API boundaries. Sending the data in a human-readable string format that almost every programming language natively supports is a big win. While the JSON doesn’t necessarily need to be canonical, the data must survive a full CBOR -> JSON -> CBOR round-trip. One core problem is numeric types, and that’s exactly what this blog post is about.

Numbers in CBOR

We restrict the CBOR to a subset called DAG-CBOR/DRISL. The supported numeric types are integers and floats. Integers must always be encoded in the smallest representation possible (8, 16, 32 and 64 bits are supported). Floats are always in IEEE 754 double-precision binary floating-point format.
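To make the minimal-representation rule concrete, here is a sketch of how an unsigned integer could be encoded as CBOR major type 0. The function name is made up and it stops at 32 bits; a real DAG-CBOR library also handles the 64-bit case and negative integers:

```javascript
// Encode an unsigned integer (as BigInt) into the smallest CBOR
// major type 0 representation. Illustration only, up to 32 bits.
const encodeCborUint = (n) => {
  if (n < 24n) return Uint8Array.of(Number(n))
  if (n < 256n) return Uint8Array.of(0x18, Number(n))
  if (n < 65536n) {
    return Uint8Array.of(0x19, Number(n >> 8n), Number(n & 0xffn))
  }
  if (n < 4294967296n) {
    return Uint8Array.of(0x1a,
      Number((n >> 24n) & 0xffn), Number((n >> 16n) & 0xffn),
      Number((n >> 8n) & 0xffn), Number(n & 0xffn))
  }
  // 64-bit case elided; DAG-CBOR stops at 64 bits.
  throw new Error('only a sketch up to 32 bits')
}

console.log(encodeCborUint(10n))   // Uint8Array [ 10 ]
console.log(encodeCborUint(500n))  // Uint8Array [ 25, 1, 244 ]
```

The point is that the encoder picks the width based on the value, not on the type it happens to be stored in, which is what makes the encoding canonical.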

Numbers in JSON

Converting from JSON to CBOR is similar to converting it into the native types of a programming language. There is the same problem of having only a single numeric type in JSON. But most programming languages follow the same convention to distinguish between floats and integers: if the number consists of digits only, it’s considered an integer, else it’s a float.

We can use the same convention going from JSON to CBOR as well. Suddenly we have support for integers and floats in JSON.
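The convention is easy to state in code. A sketch, where the function name is made up and a leading minus sign still counts as digits-only:

```javascript
// Classify a JSON number token: digits only (optionally signed) means
// integer, anything with a decimal point or exponent means float.
const isIntegerToken = (token) => /^-?\d+$/.test(token)

console.log(isIntegerToken('2'))    // true  -> CBOR integer
console.log(isIntegerToken('2.0'))  // false -> CBOR float
console.log(isIntegerToken('1e3'))  // false -> CBOR float
```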

Numbers in JavaScript

When converting between JSON and CBOR, there’s usually an intermediate step in your favourite programming language that converts it to its native types. So most of the time it’s really JSON -> programming language native types -> CBOR and vice versa.

In most languages that’s not a problem, as their native types also support integers and floats. The elephant in the room is JavaScript. Historically it has only a single numeric type, which is a float (an IEEE 754 double). So a naive JSON -> JSON.parse() -> CBOR approach would fail, as all numbers would end up as floats in CBOR.

There are two ways out of this. One is to use JavaScript’s BigInt type, which was introduced in 2020. You would use BigInt for integers and the native JavaScript number type for floats. For JSON parsing, a custom reviver would need to be used that outputs a BigInt for every number that consists of digits only:

const reviver = (_, value, context) => {
  // When it's a number and the original source text is digits only.
  if (typeof value === 'number' && /^\d+$/.test(context.source)) {
    // Construct the BigInt from the source text, not from the already
    // parsed value, so integers beyond Number.MAX_SAFE_INTEGER survive.
    return BigInt(context.source)
  }
  return value
}

For encoding it as CBOR again, you would need a CBOR library that treats BigInts as CBOR integers, encodes them into their minimal representation, and encodes all native numbers as 64-bit floats. That library would then also be responsible for the CBOR -> native JavaScript types conversion using BigInts.

For native JavaScript types to JSON, you would need to make sure that floats are always represented with a decimal point. By default a number like 2.0 becomes just 2. This can be done with a custom replacer:

const replacer = (_, value) => {
  if (typeof value === 'bigint') {
    return JSON.rawJSON(value.toString())
  }

  if (typeof value === 'number') {
    // NaN / Infinity are not valid JSON.
    if (!Number.isFinite(value)) {
      return null
    }

    // Force a decimal point even when the value is mathematically
    // integral. Very large values stringify in exponential notation
    // (e.g. 1e21), which can't be mistaken for an integer anyway,
    // so skip those.
    if (Number.isInteger(value) && !`${value}`.includes('e')) {
      return JSON.rawJSON(`${value}.0`)
    }
  }

  return value
}

Another way, which might be simpler, is to skip the native JavaScript object part and use a custom JSON tokenizer that creates the CBOR directly (and vice versa). cborg can be used for that.

Conclusion

It’s possible to have a well-defined, clean round-trip between JSON and CBOR for numeric types. Integers and floats can be distinguished in JSON by treating digit-only numbers as integers and all others as floats. No additional schema information is needed.

Categories: en, ATProto

Atmospheric data portals reply

2026-04-02 23:02

This is a reply to David Gasquez’ blog post Atmospheric Data Portals. As there’s so much in it, and much of it overlaps with my future plans, I thought it made sense to write a proper public reply instead of following up in a private conversation.

First of all, read his blog post and follow the many links; there is so much to discover.

One recurring thing in the documents linked from the “issues on the earlier stages of the Open Data pipeline” section is that for most portals a static site should be sufficient. I fully agree with that. When done properly, an automated rebuild of some parts when new data is added should work well. These days even powerful client-side search is possible.

It’s a bit off-topic, but David’s Barefoot Data Platforms page links to Maggie Appleton’s Home-Cooked Software and Barefoot Developers talk. I highly recommend watching it; it was one of my favourite talks at the Local-first Conference 2024. I always wanted to blog about it, but never found the time.

But now to the concrete points David mentions. If anyone has ideas on how to make those things happen with Matadisco, please open issues on the main Matadisco repo.

Take inspiration from existing flexible standards like Data Package, Croissant, and GEO ones for the core fields. Start with the smallest shared lexicon while leaving room for specialized extensions (sidecars?).

I don’t think Matadisco should go into too much detail on specifying what the metadata should look like. In my experience, making one metadata standard to rule them all is destined to fail (ISO 19115/19139 anyone?). Though there might be a lowest common denominator, similar to what Standard.site is doing for long-form publishing. In order to find out what that looks like, I propose that individual communities start by specifying Lexicons for their own needs. This could be done through tags, which I’ve outlined in the Matadisco issue “Introducing tags for filtering and extension point”.

Split datasets from “snapshots”. Say, io.datonic.dataset holds long-term properties like description and points to io.datonic.dataset.release or io.datonic.dataset.snapshot, which point to the actual resources.

Some kind of hierarchical relationship would be useful. FROST, which Matadisco drew a lot of inspiration from, is centred around IceChunk, which also has the concept of snapshots. But I don’t think we should stop at the concept of snapshots. In my original demo, I scrape a STAC catalogue for Sentinel-2 imagery. Every new image is a new record. They are all part of the same STAC collection, so we could use a similar concept in Matadisco as well.

Add an optional DASL-CID field for resources so we “pin” the bytes.

Yes, that’s something @mosh is keen to have. It’s not only useful for pinning things to a specific version, but also to make it possible to verify that the data you received is the one you expected. It sounds trivial, but the problem would be where to put it. Do you only hash the metadata record it points to? Do you hash the data container (if there’s one)? Or each resource a metadata record points to?

Core lexicon should be as agnostic as possible!

As mentioned above, it might be out of scope for Matadisco and for now it’s left to the individual communities.

Bootstrap the catalog. There are many open indexes and organizations. Crawl them!

Indeed! My first two Matadisco producers are sentinel-to-atproto crawling Element 84’s Earth Search STAC catalogue and gdi-de-csw-to-atproto crawling the GeoNetwork instance of the official German geo metadata catalogue.

Integrate with external repositories. E.g., a service that creates JSON-LD files from the datasets it sees appearing on the Atmosphere so Google Datasets picks them up. The same cron job could push data into Hugging Face or any other tool that people are already using in their fields.

At first this would need to happen for each individual type of record, see the tags proposal above.

Convince and work with high quality organizations doing something like this! I’d definitely collaborate with source.coop for example.

That surely is the goal!

Categories: en, Matadisco, ATProto, geo

FOSSGIS 2026

2026-04-01 13:24

This is a short write-up on the FOSSGIS 2026 conference. It’s a German-speaking conference on free and open source geographic information systems and OpenStreetMap, so maybe a blog post in English spreads the word even wider.

While being the biggest edition ever (1000 registrations on-site, 300 online), it was as well run and organized as every year. It didn’t even feel larger than usual. The CCC video team streamed live and published the cut videos the same day, in outstanding quality as always.

I split this post into two sections: one about interesting talks for the geo world in general, and then follow-up discussions on my Matadisco talk and ATProto in general.

Talks

I spent most of my time in the hallway chatting with people, as this is what matters most to me when I’m attending a conference in person. Nonetheless, I still managed to see some excellent talks.

Panel discussion on digital sovereignty in the cloud

The conference started with a high-class panel discussion on digital sovereignty in the cloud. The public discussion on that topic is often centred around where servers are located, though that alone doesn’t actually matter: US companies can be forced by their government to give access to the data regardless of where it is physically stored.

Other topics touched on included best practices for switching from proprietary to open source systems.

Barrier-free travelling thanks to paid mappers

Public transport in Germany must be accessible to disabled individuals (reality is far from that). For routing, you need the underlying data. This talk went into the details of how Baden-Württemberg, a federal state in southern Germany, works on enabling barrier-free travelling. They decided to add that information for all of their 1100 train stations directly to OpenStreetMap. In order to achieve the required high quality, they hired several experienced mappers from the community through a third-party company.

I really like the idea that OpenStreetMap can now be used as source of truth for that data set. I hope other federal states follow this lead.

Routing talks

I saw two talks about routing. The one about the Valhalla routing engine with MapLibre Native was interesting because it covered a special case, where you want to re-route bus lines in case of construction. Although the resulting system is not open source, they contributed upstream to Valhalla to make it work well with MapLibre Native. Such contributions can be more valuable than a one-time source code dump of forked repositories, just to call it open source.

The other was about real-time mobility analytics for disaster relief operations. It was interesting to see how routing is used in such cases, what the limitations are, and how such systems really help on the ground.

Matadisco and ATProto

My talk on Matadisco was about the current status of metadata catalogues, their problems, and how ATProto can make things better. What I should have made clearer is what Matadisco actually is: just a schema/convention people would use to announce their data on ATProto. It could’ve been mistaken for a piece of software or a service. You would use Matadisco in order to implement something for your pipeline.

Nonetheless, people got the idea and I had good conversations afterwards. I talked with Olivia Guyot about possible ways to integrate Matadisco record publishing into GeoNetwork, and with Christian Willmes about creating a portal for combining paleoenvironmental and archaeological data.

While chatting about ATProto at one of the social events, Klaus Stein talked about how he would like a social network to be: users would just put static files somewhere. I agree that having static webspace somewhere without any server component is not only cheap, but also the easiest to get. He is not bothered by other components being operated by other parties, e.g. for indexing. That got me thinking about how far away ATProto is from that. I’d like to build a prototype that is like a static site generator for ATProto records. It wouldn’t be able to act as a full PDS; you would need a WebSocket connection to get the data to a relay. But there could be a minimal service operated by a third party that polls those static PDSs for updates and forwards them to a relay.

Categories: en, Matadisco, ATProto, conference, geo

By Volker Mische

Powered by Kukkaisvoima version 7