vmx

the blllog.

JSON, CBOR and numeric types

2026-04-20 14:05

CBOR makes sense as a canonical representation format, as that’s very hard to do with JSON (search for “canonical JSON” and you’ll find a lot of prior art). But having a JSON representation often makes sense, e.g. at HTTP API boundaries. Sending the data in a human-readable string format that almost every programming language natively supports is a big win. While the JSON doesn’t necessarily need to be canonical, the data must survive a full CBOR -> JSON -> CBOR round-trip. One core problem is numeric types, and that is exactly what this blog post is about.

Numbers in CBOR

We restrict the CBOR to a subset called DAG-CBOR/DRISL. The supported numeric types are integers and floats. Integers must always be encoded in the smallest representation possible (8, 16, 32 and 64-bit widths are supported). Floats are always encoded in the IEEE 754 double-precision binary floating-point format.
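As an illustration of the smallest-representation rule, here is a sketch of how many extra argument bytes an unsigned integer needs (`intWidth` is a made-up helper; real encoders such as cborg handle this internally):

```javascript
// Number of extra bytes needed to encode an unsigned integer in CBOR:
// values below 24 fit into the initial byte itself, larger values use
// the narrowest of uint8, uint16, uint32 or uint64 that fits.
const intWidth = (n) => {
  if (n < 24) return 0
  if (n < 2 ** 8) return 1
  if (n < 2 ** 16) return 2
  if (n < 2 ** 32) return 4
  return 8
}
```

So 500 would always be encoded with two argument bytes (uint16), never padded out to a uint32 or uint64.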

Numbers in JSON

Converting from JSON to CBOR is similar to converting it into the native types of a programming language. There is the same problem of having only a single numeric type in JSON. But most programming languages follow the same convention to distinguish floats from integers: if the number consists of digits only, it’s considered an integer, otherwise it’s a float.

We can use the same convention going from JSON to CBOR as well. Suddenly we have support for integers and floats in JSON.
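Sketched as code, the convention boils down to a check of the number’s source text (`isIntegerToken` is a hypothetical name; the optional leading minus sign covers negative integers):

```javascript
// Digits only (with an optional leading minus sign) means integer,
// anything else (decimal point, exponent) means float.
const isIntegerToken = (token) => /^-?\d+$/.test(token)

isIntegerToken('42')   // integer
isIntegerToken('-7')   // integer
isIntegerToken('2.0')  // float
isIntegerToken('1e3')  // float
```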

Numbers in JavaScript

When converting between JSON and CBOR, there’s usually an intermediate step in your favourite programming language that converts it to its native types. So most of the time it’s really JSON -> programming language native types -> CBOR and vice versa.

In most languages that’s not a problem, as their native types include both integers and floats. The elephant in the room is JavaScript. Historically it has only a single numeric type, which is a float. So a naive JSON -> JSON.parse() -> CBOR approach would fail, as all numbers would end up as floats in CBOR.
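The failure is easy to demonstrate: after JSON.parse(), the integer/float distinction is gone, and integers beyond 2^53 silently lose precision:

```javascript
// '2' and '2.0' parse to the exact same value, so a CBOR encoder
// downstream can no longer tell them apart.
JSON.parse('2') === JSON.parse('2.0')  // true

// Integers beyond Number.MAX_SAFE_INTEGER are rounded to the
// nearest representable double.
JSON.parse('9007199254740993')  // 9007199254740992
```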

There are two ways out of this. One is to use JavaScript’s BigInt type, which was introduced in 2020. You would use BigInt for integers and the native JavaScript number type for floats. For JSON parsing, a custom reviver would need to be used that outputs a BigInt for every number whose source text consists of digits only:

const reviver = (_key, value, context) => {
  // When it's a number and the original text is an integer (digits only,
  // optionally with a leading minus sign).
  if (typeof value === 'number' && /^-?\d+$/.test(context.source)) {
    // Build the BigInt from the source text, so that integers beyond
    // Number.MAX_SAFE_INTEGER keep their full precision.
    return BigInt(context.source)
  }
  return value
}

For encoding it as CBOR again, you would need a CBOR library that treats BigInts as CBOR integers and encodes them into their minimal representation and encodes all native numbers as 64-bit floats. That library would then also be responsible for CBOR -> native JavaScript types conversion using BigInts.

For converting native JavaScript types to JSON, you would need to make sure that floats are always represented with a decimal point. By default a number like 2.0 becomes just 2. This can be done with a custom replacer:

const replacer = (_, value) => {
  if (typeof value === 'bigint') {
    return JSON.rawJSON(value.toString())
  }

  if (typeof value === 'number') {
    // NaN / Infinity are not valid JSON.
    if (!Number.isFinite(value)) {
      return null
    }

    // Force a decimal point even when the value is mathematically integral.
    if (Number.isInteger(value)) {
      return JSON.rawJSON(`${value}.0`)
    }
  }

  return value
}
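To see why the replacer is needed: by default JSON.stringify drops the decimal point of an integral float, so a round-tripped 2.0 would be re-read as an integer under the digit-only convention:

```javascript
// Default behaviour without a replacer: the float 2.0 serializes as "2".
JSON.stringify({ value: 2.0 })  // '{"value":2}'
```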

Another way, which might be simpler, is to skip the native JavaScript object part and use a custom JSON tokenizer that then creates the CBOR directly (and vice versa). cborg can be used for that.

Conclusion

It’s possible to have a well defined, clean round-trip between JSON and CBOR for numeric types. Integers and floats can be distinguished in JSON by treating digit-only numbers as integers and all others as floats. No additional schema information is needed.

Categories: en, ATProto

Atmospheric data portals reply

2026-04-02 23:02

This is a reply to David Gasquez’ blog post Atmospheric Data Portals. As there’s so much in it and much of it overlaps with future plans, I thought it makes sense to write a proper public reply instead of following up in a private conversation.

First of all, read his blog post and follow the many links, there is so much to discover.

One recurring theme in the documents linked from the “issues on the earlier stages of the Open Data pipeline” section is that for most portals a static site should be sufficient. I fully agree with that. When it’s done properly, an automated rebuild of some parts when new data is added should work well. These days even powerful client-side search is possible.

It’s a bit off-topic, but David’s Barefoot Data Platforms page links to Maggie Appleton’s Home Cooked Software and Barefoot Developers talk. I highly recommend watching it; it was one of my favourite talks at the Local-first Conference 2024. I always wanted to blog about it, but never found the time.

But now to the concrete points David mentions. If anyone has ideas on how to make those things happen with Matadisco, please open issues on the main Matadisco repo.

Take inspiration from existing flexible standards like Data Package, Croissant, and GEO ones for the core fields. Start with the smallest shared lexicon while leaving room for specialized extensions (sidecars?).

I don’t think Matadisco should go into too much detail on specifying what the metadata should look like. Making one metadata standard to rule them all is destined to fail from my experience (ISO 19115/19139 anyone?). Though there might be a lowest common denominator, similar to what Standard.site is doing for long-form publishing. In order to find out what that looks like, I propose that individual communities start by specifying Lexicons for their own needs. This could be done through tags, which I’ve outlined in the Matadisco issue “Introducing tags for filtering and extension point”.

Split datasets from “snapshots”. Say, io.datonic.dataset holds long-term properties like description and points to io.datonic.dataset.release or io.datonic.dataset.snapshot, which point to the actual resources.

Some kind of hierarchical relationship would be useful. FROST, which Matadisco drew a lot of inspiration from, is centred around IceChunk, which also has the concept of snapshots. But I don’t think we should stop at the concept of snapshots. In my original demo, I scrape a STAC catalogue for Sentinel-2 imagery. Every new image is a new record. They are all part of the same STAC collection, so we could use a similar concept in Matadisco as well.

Add an optional DASL-CID field for resources so we “pin” the bytes.

Yes, that’s something @mosh is keen to have. It’s not only useful for pinning things to a specific version, but also to make it possible to verify that the data you received is the one you expected. It sounds trivial, but the problem would be where to put it. Do you only hash the metadata record it points to? Do you hash the data container (if there’s one)? Or each resource a metadata record points to?

Core lexicon should be as agnostic as possible!

As mentioned above, it might be out of scope for Matadisco and for now it’s left to the individual communities.

Bootstrap the catalog. There are many open indexes and organizations. Crawl them!

Indeed! My first two Matadisco producers are sentinel-to-atproto crawling Element 84’s Earth Search STAC catalogue and gdi-de-csw-to-atproto crawling the GeoNetwork instance of the official German geo metadata catalogue.

Integrate with external repositories. E.g., a service that creates JSON-LD files from the datasets it sees appearing on the Atmosphere so Google Datasets picks them up. The same cron job could push data into Hugging Face or any other tool that people are already using in their fields.

At first this would need to happen for each individual type of record, see the tags proposal above.

Convince and work with high quality organizations doing something like this! I’d definitely collaborate with source.coop for example.

That surely is the goal!

Categories: en, Matadisco, ATProto, geo

FOSSGIS 2026

2026-04-01 13:24

This is a short write-up on the FOSSGIS 2026 conference. It’s a German-speaking conference on free and open source geographic information systems and OpenStreetMap. So maybe a blog post in English spreads the word even wider.

While being the biggest edition ever (1000 registrations on-site, 300 online), it was as well run and organized as every year. It didn’t even feel larger than usual. The CCC video team streamed live and published the cut videos the same day, in outstanding quality as always.

I split this post into two sections, one about interesting talks for the geo world in general and then follow up discussions on my Matadisco talk and ATProto in general.

Talks

I spent most of my time in the hallway chatting with people, as this is what matters most to me when I’m attending a conference in person. Nonetheless I still managed to see some excellent talks.

Panel discussion on digital sovereignty in the cloud

The conference started with a high-class panel discussion on digital sovereignty in the cloud. The public discussion on that topic is often centred around where servers are located, though that alone doesn’t actually matter: US companies can be forced by their government to give access to the data independent of its physical location.

Other topics touched on were best practices for switching from proprietary to open source systems.

Barrier-free travelling thanks to paid mappers

Public transport in Germany must be accessible to disabled individuals (reality is far from that). For routing, you need the underlying data. This talk went into the details of how Baden-Württemberg, a federal state in southern Germany, works on enabling barrier-free travelling. They decided to add that information for all their 1100 train stations directly to OpenStreetMap. In order to achieve the required high quality, they’ve hired several experienced mappers from the community through a third-party company.

I really like the idea that OpenStreetMap can now be used as source of truth for that data set. I hope other federal states follow this lead.

Routing talks

I saw two talks about routing. The one about the Valhalla routing engine with MapLibre Native was interesting because it covered a special case: re-routing bus lines in case of construction work. Although the resulting system is not open source, they’ve contributed upstream to Valhalla to make it work well with MapLibre Native. Such contributions can be more valuable than a one-time source code dump of a forked repository just to be able to call it open source.

Another one was about Real-time mobility analytics for disaster relief operations. It was interesting to see how routing is used in such cases, what the limitations are, and how such systems really help on the ground.

Matadisco and ATProto

My talk on Matadisco was about the current status of metadata catalogues, their problems, and how ATProto can make things better. What I should have made clearer is what Matadisco actually is: just a schema/convention people would use to announce their data on ATProto. It could’ve been mistaken for a piece of software or a service. You would use Matadisco in order to implement something for your own pipeline.

Nonetheless people got the idea and I had good conversations afterwards. I talked with Olivia Guyot about possible ways to integrate Matadisco record publishing into GeoNetwork, and with Christian Willmes about creating a portal for combining paleoenvironmental and archaeological data.

While chatting about ATProto at one of the social events, Klaus Stein talked about how he would like a social network to be: users would just put static files somewhere. I agree that having static webspace without any server component is not only cheap, but also the easiest to get. He is not bothered by other components being operated by other parties, e.g. for indexing. That got me thinking about how far ATProto is from that. I’d like to build a prototype that is like a static site generator for ATProto records. It won’t be able to act as a full PDS; you would need a WebSocket connection to get the data to a relay. But there could be a minimal service operated by a third party that polls those static PDSs for updates and forwards them to a relay.

Categories: en, Matadisco, ATProto, conference, geo

Matadisco

2026-03-23 16:58

Open data is only as useful as it is discoverable. Finding datasets, whether satellite imagery, scientific research, or cultural archives, involves navigating dozens of siloed portals, each with different interfaces and APIs. Project Matadisco tries to solve this by using ATProto to create an open, decentralized network for data discovery. Anyone can publish metadata about their datasets. You can then pick the records that matter to you and build views for the specific needs of your community. By focusing on metadata rather than the data itself, the system works with any dataset format, keeps records lightweight, and remains agnostic about storage.

It’s early stage and experimental, but the potential is significant. To see it in action, visit the matadisco-viewer demo. It listens to the incoming stream of ATProto events and renders them. At the moment it’s satellite images only, but that will hopefully change soon.

Satellite image

Above is an example of what crossed my screen while developing (metadata, download at full resolution (253MiB)).

Motivation

Metadata records can be very diverse. They might describe geodata, your favourite news site or your favourite podcasts. What they all have in common is that users usually rely on centralized platforms in order to find them. For geodata, this is often a government-run open data or geo portal. These platforms decide which data gets published.

You might generate a derived dataset or clean up an existing one. If you’re not one of the original creators, you probably won’t even be able to get your data linked from there. So how will anyone find out about it? That’s the problem of metadata discovery.

The other side of the problem is that even when metadata is available, it can be hard to find. There are large metadata aggregation portals like the portal for European data, with almost 2 million records. How do you find exactly what you are looking for? What if there were specialized portals tailored to specific communities?

For even more details, see the companion blog post of the IPFS Foundation.

The idea

The idea is to support both: an easy way for anyone to publish discoverable metadata, and a way to make that metadata widely accessible to build both large aggregators and specialized portals tailored to specific communities.

The central building block is ATProto. It allows anyone to publish and subscribe to records. Rather than defining a single metadata schema to rule them all, the approach here is more meta-meta. Each record contains a link to the actual metadata. That’s the absolute minimum. Though it could make sense to go beyond this minimalism and store additional information to make it easier to build custom portals.

One example of such additional information is a preview. It’s nice to get a quick sense of the underlying data that the metadata describes. For satellite imagery, this could be a true color thumbnail of the scene. For long form articles, a summary or excerpt. For podcasts it may be a brief audio snippet or trailer.

The implementation

As part of my work at the IPFS Foundation, I started with geodata. The first prototype focuses on Copernicus Sentinel-2 L2A satellite images.

The metadata is sourced from Element 84’s Earth Search STAC catalogue. It provides free, publicly accessible HTTP links to the images (the official Copernicus STAC does not). A Cloudflare Worker checks the STAC instance every few minutes for updates. When new records appear, a link to the metadata along with a preview is ingested into ATProto. The source code for the worker is available at https://github.com/vmx/sentinel-to-atproto/.

Below is the Lexicon schema for this ATProto meta-metadata record, which I call Matadisco. To improve readability, the MLF syntax is used:

/// A Matadisco record
record matadisco {
    /// The time the original metadata/data was published
    publishedAt!: Datetime,
    /// A URI that links to the resource containing the metadata
    resource!: Uri,
    /// Preview of the data
    preview: {
        /// The media type the preview has
        mimeType!: string,
        /// The URL to the preview
        url: Uri,
    },
}
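For illustration, a record conforming to this schema might look like the following (all values, including the URLs, are invented):

```javascript
// A hypothetical Matadisco record; the $type refers to the
// cx.vmx.matadisco Lexicon, all other values are made up.
const record = {
  $type: 'cx.vmx.matadisco',
  publishedAt: '2026-03-20T10:15:00Z',
  resource: 'https://example.com/stac/items/some-sentinel-2-scene.json',
  preview: {
    mimeType: 'image/jpeg',
    url: 'https://example.com/previews/some-sentinel-2-scene.jpg',
  },
}
```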

Once records are available on ATProto, they can be processed and displayed. I built a simple viewer that renders records conforming to the cx.vmx.matadisco Lexicon schema defined above. A Bluesky Jetstream instance streams newly added records directly into the browser. A demo is available at https://vmx.github.io/matadisco-viewer/.

If you’re interested in seeing the raw records, you can find them on my ATProto dev account.

Prior art

This work builds on ideas by Tom Nicholas, who started a project called FROST. His motivating blog post is an excellent read about data-sharing challenges in a scientific context. His presentation on FROST explains why such a system should remain simple, with the metadata URL as the only required field.

Edward Silverton, who works in the GLAM space, explored a similar idea for publishing IIIF data. We refined his approach to align it more closely with FROST. He published further details on the complete workflow for his use case, which has a broader scope.

There was also a discussion thread on Bluesky about metadata for long-form content to build cross-platform discovery.

What’s next

Possible future steps I want to look into:

As mentioned in the introduction, this is deliberately experimental; things may break or change dramatically. The upside is that no one needs to worry about breakage. Please experiment with these ideas and let us know about them at the Matadisco GitHub repository. Publish records under your own namespace, or even reuse the one I am currently using.

Categories: en, ATProto, geo

By Volker Mische

Powered by Kukkaisvoima version 7