Robin Berjon 0:31 Hi everyone, welcome to a talk about Matadisco, which is very, very disco. 0:40 The first thing you need to know is that you might think of me as Robin Berjon. 0:45 I don't know who has been spreading that rumor. 0:48 My name is actually Volker, and I am giving this talk because I am the creator of Matadisco. 0:54 Are we clear on this? 0:57 Yes? 0:57 OK, thank you. 0:59 So, have you heard of the AT protocol? 1:08 Some of you call it @proto for some reason. One thing that's pretty cool about it is that it does social stuff, but it's not just social media. 1:15 And in fact, science is itself a social process and an entirely democratic one, and so we should also use the social internet for science, and we could disco with data. 1:28 What is the problem that this is looking at? 1:30 Well, it turns out that a lot of what happens around science, apart from the science itself, sucks. 1:39 Data in the science universe is pretty bad. 1:42 There's a ton of really cool data. 1:46 Lots of people are sharing it in the open in all kinds of ways, but it's all siloed. 1:51 It's really hard to discover. 1:53 A lot of the servers are terribly slow. 1:55 The APIs don't work. 1:57 The portals are designed by, I think, grad students, maybe not scientists. 2:02 So, you know, some of those things are pretty bad. 2:05 And one of the big classics is, of course, all these scientists putting all their data on GitHub. 2:11 Because we all know that nothing bad happens on proprietary social media platforms, right? 2:16 So anyway, Tom Nicholas from Frost wrote this really cool blog post about it, and it's important to think about this because you never know when you're going to need it. 2:30 So one of the things I've worked on before: I was a small part of the New York Times team that collected data for the COVID tracker, which you've probably, sadly, seen.
2:42 And what we had to do for that to work was scrape data from all over. 2:49 And the team actually got a Pulitzer Prize for web scraping, which is quite interesting in and of itself. 2:57 But the level at which this data was bad is terrible. 3:01 So Italy, for instance, did a good job. 3:02 They put all their numbers on GitHub. 3:04 It was really easy. 3:05 It was JSON, et cetera. 3:07 In the US, most of the time data was per county, and those counties all had different systems. 3:15 So you had to have different scrapers per county at the US level. 3:20 One of my favorite ones was France, which I worked on the scraper for, where every day someone would get all the numbers from all of France, paste them onto a JPEG with a nice textured background, and put the JPEG on the web. 3:35 So scraping that involved a lot of processing to clean it up enough to OCR it and get the data out. 3:41 And so this is the kind of situation where you do need stuff. 3:45 And more generally, even without COVID, if you imagine the impact of making research better, every time we make it slightly better, scientists can science more. 3:56 And so what does good look like? 3:59 Well, in that blog post I referenced earlier, you can think of a number of properties that would be pretty cool to have in the scientific data space in terms of discovery of data repositories, datasets, etc. 4:13 You want it unified, everyone can find it, of course it's free, it works at any scale. 4:20 I'm not going to read through all of it, but all of these properties are pretty interesting, and Volker has made a tool with the IPFS Foundation called Matadisco, and what Matadisco does is extremely simple: it tracks data that gets published to all kinds of different datasets and data sources. 4:41 Currently, it's limited to just a few because it's still ramping up.
4:44 But the idea is to put all the metadata around everything that happens, everything that's published, directly on atproto, such that you can then write apps that just listen to Jetstream, or whatever's convenient in terms of firehoses for you, and then react to that and do things with that data, pull it down, et cetera. 5:05 Knowing, of course, that all of this is nicely verified. 5:09 And so if you think of the list of properties that I produced earlier, well, it hits quite a number of them, right? 5:18 You have something that's inherently social. 5:21 You can subscribe to it. 5:22 You have a right to exit because it's all on the PDS. 5:24 That's part of the protocol. 5:25 And sustainably fundable, well, it's not very expensive, but we could figure it out. 5:32 And there's a few properties that it doesn't completely hit. 5:34 But in general, simply by pushing all this metadata about datasets to the protocol, it's getting a lot of these things working already. 5:45 And so I made, you know, a very simple demo of this. 5:50 You can go on my GitHub, vmx.github.io/matadisco-viewer. 5:55 It's all on the matadisco.org thing. 5:58 And this is just tracking satellite images as they get published, as they get pushed out. 6:05 They come in directly on the firehose, and you can display them directly. 6:10 Of course, this is not super interesting in and of itself, but we can build stuff on it. 6:16 And we're currently in the process of integrating further sources. 6:21 This is stuff from IIIF. When we talk about scientific datasets, people immediately think satellite images, genomics, whatever. 6:32 But we also have museum pieces, cultural things. IIIF can do sync between video and annotation stuff. 6:43 And so, yep, all of this brought together. 6:48 Go to matadisco.org. 6:49 It's brought to you by the lovely IPFS Foundation. 6:53 And Ted has a question.
Speaker B 6:55 Hi, Volker. 6:56 Thanks for this. 6:56 This is really cool.
6:58 I was wondering if you'd seen the sort of lensing work that Nick Jarachyniws and Blaine Cook and other folks have done. 7:05 Because if you get all this data up on the protocol, being able to quickly translate into other formats would be wild.
Robin Berjon 7:12 So I haven't seen that specific thing, but I knew about the old Cambria stuff, and Blaine has been, what's a nice word for harassing, encouraging me to have a conversation with him, which is always a pleasure, of course, specifically about that. 7:28 So yeah, I really want to look at it, because we pump the source metadata in, and of course those differ. 7:37 So yeah, being able to lens them between formats would be amazing. 7:41 Yep.
Speaker B 7:41 Nick has it currently deployed actually.
Robin Berjon 7:43 So let's do it. 7:45 We can just do things. 7:47 We can just science. 7:48 And that's it. 7:49 Any other questions? 7:53 Yeah. 7:54 I just had a comment. 7:55 Right in the mic.
Speaker C 7:56 Yeah, I'm a climate scientist, and distributing data is a huge deal for us. 8:00 And we can talk more about it. 8:02 But there are customers who want climate data, and it would be really great to do that.
Robin Berjon 8:07 Yeah, bring it our way. 8:10 One of the things that's great: as you know, there's ongoing epistemicide in the US. 8:16 And a lot of people have been trying to save datasets as fast as possible by putting them on BitTorrent. 8:22 What's difficult with that is you can't know if they're publishing the real original thing or if it's been tampered with. 8:30 I mean, there's currently no source to know what it was like. 8:33 With this, because atproto is built on DASL, brought to you by the IPFS Foundation, everything can have a CID, so everything can be verified, so it becomes possible to load data that you have a public description of, and that gives you verifiability built in.
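A minimal sketch of the verifiability point above, not Matadisco's actual code. It assumes the data is addressed as a single block of raw bytes with a CIDv1 using the `raw` codec and a SHA-256 hash, encoded in lowercase base32; large real datasets may be chunked or addressed differently. The function names `raw_cid` and `verify` are hypothetical.

```python
import base64
import hashlib


def raw_cid(data: bytes) -> str:
    """Compute a CIDv1 string (raw codec, sha2-256, base32) for some bytes."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    binary = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase prefix "b" means lowercase base32 without padding
    encoded = base64.b32encode(binary).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def verify(data: bytes, expected_cid: str) -> bool:
    """Return True only if the bytes hash to the CID that was published."""
    return raw_cid(data) == expected_cid


if __name__ == "__main__":
    original = b"temperature,2024-01-01,3.2\n"
    published = raw_cid(original)  # what a metadata record would carry
    print(verify(original, published))
    print(verify(original + b"tampered", published))
```

Anyone who fetches the bytes from a mirror can rerun the hash and compare it with the CID in the public metadata record, so the check does not depend on trusting the mirror.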
8:52 So I think this is a win for the next wave of epistemicidal maniacs.
Speaker B 8:59 Have you thought about a TLOG over it?
Robin Berjon 9:01 And we could probably do a TLOG over it. 9:03 All these things compose, right?
Speaker B 9:07 Hey, how much data have you uploaded to various PDSs?
Robin Berjon 9:12 Enough to knock over Eurosky. 9:14 PDSs. 9:14 We're only pushing the metadata out, right? 9:17 We're not pushing the data itself, because some of these datasets are insanely massive. 9:20 We would knock over everything. 9:22 Yeah, but we've pushed, I mean, I don't know, at least millions of records. 9:28 I think if you look on UFO, Matadisco is regularly in the top 10 lexicons even though no one knows what it is at this point. 9:40 So, yeah.
Speaker B 9:45 Any other takers?
Robin Berjon 9:49 Just go use it. 9:50 And yeah, if you want to onboard stuff, talk to us. 9:52 Okay. 9:53 Thank you, Volker. 9:54 Thank you very much.
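The "listen to Jetstream and react" pattern described earlier in the talk could look roughly like the sketch below. It is not Matadisco's code. The event fields (`kind`, `did`, `commit.operation`, `commit.collection`, `commit.rkey`, `commit.cid`, `commit.record`) follow Jetstream's JSON event format, but the collection name `org.example.matadisco` and the record contents are hypothetical, since the talk does not name the real lexicon. To stay dependency-free, the function reads JSON lines from any iterable instead of opening a live WebSocket.

```python
import json
from typing import Iterable, Iterator

# Hypothetical collection NSID, used for illustration only.
COLLECTION = "org.example.matadisco"


def metadata_records(
    lines: Iterable[str], collection: str = COLLECTION
) -> Iterator[dict]:
    """Yield newly created records of one collection from Jetstream-style events."""
    for line in lines:
        event = json.loads(line)
        if event.get("kind") != "commit":
            continue  # skip identity and account events
        commit = event.get("commit", {})
        if commit.get("operation") != "create":
            continue  # ignore updates and deletes in this sketch
        if commit.get("collection") != collection:
            continue  # some other app's records
        yield {
            "did": event["did"],
            "rkey": commit["rkey"],
            "cid": commit.get("cid"),
            "record": commit.get("record", {}),
        }


if __name__ == "__main__":
    # A made-up event standing in for one message from the stream.
    sample = json.dumps({
        "did": "did:plc:example",
        "time_us": 1,
        "kind": "commit",
        "commit": {
            "rev": "r1",
            "operation": "create",
            "collection": COLLECTION,
            "rkey": "k1",
            "cid": "placeholder-cid",
            "record": {"title": "hypothetical dataset entry"},
        },
    })
    for item in metadata_records([sample]):
        print(item["did"], item["rkey"], item["record"])
```

In a real app the iterable would be the messages from a Jetstream WebSocket subscription, ideally with the collection filter also applied server-side, and each yielded record would trigger whatever the app does next, such as fetching the dataset it describes.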