Atmospheric data portals reply
2026-04-02 23:02
This is a reply to David Gasquez’s blog post Atmospheric Data Portals. There is so much in it, and so much of it overlaps with future plans, that I thought it made sense to write a proper public reply instead of following up in a private conversation.
First of all, read his blog post and follow the many links; there is so much to discover.
One recurring theme in the documents linked from the “issues on the earlier stages of the Open Data pipeline” section is that for most portals a static site should be sufficient. I fully agree with that. When it’s done properly, an automated rebuild of some parts when new data is added should work well. These days even powerful client-side search is possible.
It’s a bit off-topic, but David’s Barefoot Data Platforms page links to Maggie Appleton’s talk Home Cooked Software and Barefoot Developers. I highly recommend watching it; it was one of my favourite talks at the Local-first Conference 2024. I always wanted to blog about it, but never found the time.
But now to the concrete points David mentions. If anyone has ideas on how to make those things happen with Matadisco, please open issues on the main Matadisco repo.
Take inspiration from existing flexible standards like Data Package, Croissant, and GEO ones for the core fields. Start with the smallest shared lexicon while leaving room for specialized extensions (sidecars?).
I don’t think Matadisco should go into too much detail on specifying what the metadata should look like. In my experience, making one metadata standard to rule them all is destined to fail (ISO 19115/19139 anyone?). There might be a lowest common denominator though, similar to what Standard.site is doing for long-form publishing. In order to find out what that looks like, I propose that individual communities start by specifying Lexicons for their own needs. This could be done through tags, which I’ve outlined in the Matadisco issue “Introducing tags for filtering and extension point”.
Split datasets from “snapshots”. Say, io.datonic.dataset holds long-term properties like description and points to io.datonic.dataset.release or io.datonic.dataset.snapshot, which point to the actual resources.
Some kind of hierarchical relationship would be useful. FROST, which Matadisco drew a lot of inspiration from, is centred around IceChunk, which also has the concept of snapshots. But I don’t think we should stop at the concept of snapshots. In my original demo, I scrape a STAC catalogue for Sentinel-2 imagery. Every new image is a new record. They are all part of the same STAC collection, so we could use a similar concept in Matadisco as well.
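To make the idea of a hierarchy a bit more concrete, here is a minimal sketch of records that optionally point at a parent, such as a collection or dataset. The `Record` class, its field names and the `example:` identifiers are all made up for illustration; they are not part of Matadisco or of any existing Lexicon.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata record; the field names are illustrative only."""
    uri: str                # identifier of this record
    parent: Optional[str]   # identifier of the collection it belongs to, if any
    title: str


def group_by_parent(records):
    """Map each parent identifier to the records that point at it."""
    children = defaultdict(list)
    for record in records:
        if record.parent is not None:
            children[record.parent].append(record)
    return dict(children)


# One collection record and two image records that reference it.
records = [
    Record("example:collection/sentinel-2", None, "Sentinel-2 imagery"),
    Record("example:item/1", "example:collection/sentinel-2", "Scene 1"),
    Record("example:item/2", "example:collection/sentinel-2", "Scene 2"),
]
tree = group_by_parent(records)
```

With a single optional parent reference, snapshots of a dataset and items of a STAC collection could be expressed the same way. Whether that is the right shape is an open question.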
Add an optional DASL-CID field for resources so we “pin” the bytes.
Yes, that’s something @mosh is keen to have. It’s not only useful for pinning things to a specific version, but also to make it possible to verify that the data you received is the one you expected. It sounds trivial, but the problem would be where to put it. Do you only hash the metadata record it points to? Do you hash the data container (if there’s one)? Or each resource a metadata record points to?
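For the simplest of those cases, hashing the bytes of a single resource, a sketch could look like the following. It assumes that a DASL CID for raw bytes is a CIDv1 with the raw codec and a SHA-256 multihash, encoded as lowercase base32; please check the DASL specification before relying on this. The function names are made up for this example.

```python
import base64
import hashlib


def raw_sha256_cid(data: bytes) -> str:
    """Build a CIDv1 string (raw codec, SHA-256) for the given bytes."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    # "b" is the multibase prefix for lowercase base32 without padding
    return "b" + encoded


def verify(data: bytes, expected_cid: str) -> bool:
    """Check that the received bytes match the CID stored in the record."""
    return raw_sha256_cid(data) == expected_cid


cid = raw_sha256_cid(b"hello")
```

A consumer would download the resource, call `verify` with the CID from the metadata record and reject the data if it returns `False`. This only covers one resource; it doesn’t answer where the CID should live when a record points to many resources or to a container.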
Core lexicon should be as agnostic as possible!
As mentioned above, it might be out of scope for Matadisco and for now it’s left to the individual communities.
Bootstrap the catalog. There are many open indexes and organizations. Crawl them!
Indeed! My first two Matadisco producers are sentinel-to-atproto crawling Element 84’s Earth Search STAC catalogue and gdi-de-csw-to-atproto crawling the GeoNetwork instance of the official German geo metadata catalogue.
Integrate with external repositories. E.g., a service that creates JSON-LD files from the datasets it sees appearing on the Atmosphere so Google Datasets picks them up. The same cron job could push data into Hugging Face or any other tool that people are already using in their fields.
At first this would need to happen for each individual type of record (see the tags proposal above).
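As a rough illustration of what such a per-record-type converter might look like, here is a sketch that turns a record into schema.org `Dataset` JSON-LD. The input record and its field names (`title`, `description`, `url`, `resources`, `mediaType`) are assumptions for this example and not an existing Matadisco schema; only the schema.org terms on the output side are real.

```python
import json


def to_schema_org_dataset(record: dict) -> dict:
    """Convert a (hypothetical) metadata record into schema.org Dataset JSON-LD."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
        "url": record["url"],
        # Each resource becomes a DataDownload entry.
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": resource["url"],
                "encodingFormat": resource["mediaType"],
            }
            for resource in record.get("resources", [])
        ],
    }


# Made-up record, only here to show the shape of the input.
example_record = {
    "title": "Example dataset",
    "description": "A placeholder description.",
    "url": "https://example.org/datasets/example",
    "resources": [
        {"url": "https://example.org/datasets/example/data.csv", "mediaType": "text/csv"},
    ],
}
json_ld = json.dumps(to_schema_org_dataset(example_record), indent=2)
```

The resulting string could be written to a `.jsonld` file or embedded in a page of a static site. Every record type would need its own mapping like this one.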
Convince and work with high quality organizations doing something like this! I’d definitely collaborate with source.coop for example.
That surely is the goal!