Matadisco
2026-03-23 16:58
Open data is only as useful as it is discoverable. Finding datasets, whether it’s satellite imagery, scientific research, or cultural archives involves navigating dozens of siloed portals, each of them with different interfaces and APIs. Project Matadisco tries to solve this by using ATProto to create an open, decentralized network for data discovery. Anyone can publish metadata about their datasets. You can then pick the records that matter to you and build views for the specific needs of your community. By focusing on metadata rather than the data itself, the system works with any dataset format, keeps records lightweight, and remains agnostic about storage.
It’s early stage and experimental, but the potential is significant. To see it in action, visit the matadisco-viewer demo. It listens to the incoming stream of ATProto events and renders them. At the moment it’s satellite images only, but that will hopefully change soon.
Above is an example of what crossed my screen while developing (metadata, download at full resolution (253MiB)).
Motivation
Metadata records can be very diverse. They might describe geodata, your favourite news site or your favourite podcasts. What they all have in common is that users usually rely on centralized platforms in order to find them. For geodata, this is often a government-run open data or geo portal. These platforms decide which data gets published.
You might generate a derived dataset or clean up an existing one. If you operate from outside of the original creators, you probably won’t even be able to get your data linked from there. So how will anyone find out about it? That’s a problem of metadata discovery.
The other side of the problem is that even when metadata is available, it can be hard to find. There are large metadata aggregation portals like the portal for European data, with almost 2 million records. How do you find what exactly you are looking for. What if there were specialized portals tailored to specific communities?
For even more details, see the companion blog post of the IPFS Foundation.
The idea
The idea is to support both: an easy way for anyone to publish discoverable metadata, and a way to make that metadata widely accessible to build both large aggregators and specialized portals tailored to specific communities.
The central building block is ATProto. It allows anyone to publish and subscribe to records. Rather than defining a single metadata schema to rule them all, the approach here is more meta-meta. Each record contains a link to the actual metadata. That’s the absolute minimum. Though it could make sense to go beyond this minimalism and store additional information to make it easier to build custom portals.
One example of such additional information is a preview. It’s nice to get a quick sense of the underlying data that the metadata describes. For satellite imagery, this could be a true color thumbnail of the scene. For long form articles, a summary or excerpt. For podcasts it may be a brief audio snippet or trailer.
The implementation
As part of my work at the IPFS Foundation, I started with geodata. The first prototype focuses on [Copernicus Sentinel-2 L2A satellite images].
The metadata is sourced from Element 84’s Earth Search STAC catalogue. It provides free, publicly accessible HTTP links to the images (the official Copernicus STAC does not). A [Cloudflare Worker] checks the STAC instance every few minutes for updates. When new records appear, a link to the metadata along with a preview is ingested into ATProto. The source code for the worker is available at https://github.com/vmx/sentinel-to-atproto/.
Below is the Lexicon schema for this ATProto meta-metadata record, which I call Matadisco. To improve readability, the MLF syntax is used:
/// A Matadisco record
record matadisco {
/// The time the original metadata/data was published
publishedAt!: Datetime,
/// A URI that links to resource containing the metadata
resource!: Uri,
/// Preview of the data
preview: {
/// The media type the preview has
mimeType!: string,
/// The URL to the preview
url: Uri,
},
}
Once records are available on ATProto, they can be processed and displayed. I built a simple viewer that renders records conforming to the cx.vmx.matadisco Lexicon schema defined above. A Bluesky Jetstream instance streams newly added records directly into the browser. A demo is available at https://vmx.github.io/matadisco-viewer/.
If you’re interested in seeing the raw records, you can find them on my ATProto dev account.
Prior art
This work builds on ideas by Tom Nicholas, who started a project called FROST. His motivating blog post is an excellent read about data-sharing challenges in a scientific context. His presentation on FROST explains why such a system should remain simple, with the metadata URL as the only required field.
Edward Silverton, who works in the GLAM space, explored a similar idea for publishing IIIF data. We refined his approach to align it more closely with FROST. He published further details on the complete workflow for his use case, which has a broader scope.
There was also a discussion thread on Bluesky about metadata for long-form content to build cross-platform discovery.
What’s next
Possible future steps I want to look into:
- Adding another, different geodata source, such as metadata from the German geodata catalogue.
- Including another image-based source, for example GLAM catalogues that use IIIF.
- Integrating a completely different data source, such as metadata from the podcasts of the German public broadcasting.
As mentioned in the introduction, this is deliberately experimental, things may break or change dramatically. The upside is that no one needs to worry about breakage. Please experiment with these ideas and let us now about them at the Matadisco GitHub repository. Publish records under your own namespace, or even reuse the one I am currently using.
