Robin Berjon 0:01 Hi everyone. 0:03 Welcome to a talk about Matadisco, which is very, very disco. 0:10 The first thing you need to know is that you might think of me as Robin Berjon. 0:15 I don't know who has been spreading that rumor. 0:18 My name is actually Volker, and I am giving this talk because I am the creator of Matadisco. 0:25 Are we clear on this? 0:27 Yes? 0:27 Okay, thank you.

0:29 So, you know, one thing that's pretty cool about the AT protocol (have you heard of it? 0:38 Some of you call it atproto for some reason) is that it does social stuff, but it's not just social media. 0:45 And in fact, science is itself a social process and an entirely democratic one. 0:50 And so we should also use the social internet for science, and we could disco with data.

0:58 What is the problem that this is looking at? 1:00 Well, it turns out that a lot of what happens around science, apart from the science itself, sucks. 1:09 Data in the science universe is pretty bad. 1:12 There's a ton of really cool data. 1:16 Lots of people are sharing it in the open in all kinds of ways, but it's all siloed. 1:21 It's really hard to discover. 1:23 A lot of the servers are terribly slow. 1:25 The APIs don't work. 1:27 The portals are designed by, I think, grad students, maybe not scientists. 1:31 So, you know, some of those things are pretty bad. 1:35 And one of the big classics is, of course, all these scientists putting all their data on GitHub, because we all know that nothing bad happens on proprietary social media platforms, right?

1:46 So anyway, Tom Nicholas from Frost wrote this really cool blog post about it. 1:55 And it's important to think about this because you never know when you're going to need it. 2:00 So one of the things I've worked on before: I was a small part of the New York Times team that collected data for the, you've probably sadly seen this, the COVID tracker.
2:12 And what we had to do for that to work was scrape data from all over. 2:19 And the team actually got a Pulitzer Prize for web scraping, which is quite interesting in and of itself. 2:26 But the level at which this data was bad was terrible. 2:30 So Italy, for instance, did a good job. 2:32 They put all their numbers on GitHub. 2:34 It was really easy. 2:35 It was JSON, etc. 2:37 In the US, most of the time, data was per county and those counties all had different systems. 2:45 So you had to have different scrapers per county at the US level. 2:50 One of my favorite ones was France, which I worked on a scraper for, where every day someone would get all the numbers from all of France, paste them onto a JPEG with a nice textured background, and put the JPEG on the web, such that scraping it involved a lot of processing to clean it up enough to OCR it and get the data out.

3:11 And so this is the kind of situation where you do need this stuff. And more generally, even without COVID, if you imagine the impact of making research better: every time we make it slightly better, scientists can science more.

3:26 And so what does good look like? 3:29 Well, in that blog post I referenced earlier, you can think of a number of properties that would be pretty cool to have in the scientific data space in terms of discovery of data repositories, datasets, etc. 3:43 And you know, you want it unified, everyone can find it, of course it's free, it works at any scale. 3:49 I'm not going to read through all of it, but all of these properties are pretty interesting, and Volker has made a tool with the IPFS Foundation called Matadisco, and what Matadisco does is extremely simple. 4:05 It tracks data that gets published to all kinds of different datasets and data sources. 4:11 Currently it's limited to just a few because it's still ramping up.
4:14 But the idea is to put all the metadata around everything that happens, everything that's published, directly on atproto, such that you can then write apps that just listen to Jetstream, or whatever's convenient in terms of firehoses for you, and then react to that and do things with that data, pull it down, etc., knowing of course that all of this is nicely verified.

4:39 And so if you think of the list of properties that I showed earlier, well, it hits quite a number of them, right? 4:48 You have something that's inherently social. 4:50 You can subscribe to it. 4:52 You have a right to exit because it's all on a PDS as part of the protocol. 4:55 And, you know, sustainably fundable. 4:57 Well, it's not very expensive, but we could figure it out. 5:02 And there are a few properties that it doesn't completely hit, but in general, simply by pushing all this metadata about datasets to the protocol, it's getting a lot of these things working already.

5:15 And so, I made a very simple demo of this. 5:20 You can go on my GitHub, vmx.github.io/matadisco-viewer. 5:25 It's all on the matadisco.org thing, and this is just tracking satellite images as they get published, as they get pushed out, and they come in directly on the firehose and you can display them directly. 5:40 Of course, this is not super interesting in and of itself, but we can build stuff on it.

5:45 And, you know, we're currently in the process of integrating further sources. 5:51 This is stuff from IIIF. 5:54 And so, you know, when you talk about scientific datasets, people immediately think satellite images, genomics, whatever. 6:02 But we also have museum pieces, cultural things. 6:08 IIIF can do sync between video and annotation stuff. 6:13 And so, yep, all of this brought together. 6:18 Go to matadisco.org. 6:20 It's brought to you by the lovely IPFS Foundation. 6:23 And Ted has a question.

Speaker B 6:25 Hi, Volker. 6:26 Thanks for this.
6:26 This is really cool. 6:28 I was wondering if you'd seen the sort of lensing work that Nick Jarachyniws and Blaine Cook and other folks have done. 6:35 Because if you get all this data up on the protocol, being able to quickly translate into other formats would be wild.

Robin Berjon 6:42 So I haven't seen that specific thing, but I knew about the old Cabrio stuff. 6:47 And Blaine has been, what's a nice word for harassing, encouraging me to have a conversation with him, which is always a pleasure, of course, 6:57 specifically about that. 6:58 So yeah, I really want to look at it, because we pump the source metadata in and of course those differ. 7:07 So yeah, being able to lens them between formats would be amazing.

Speaker B 7:11 Yep. 7:11 Nick has it currently deployed, actually.

Robin Berjon 7:13 So let's do it. 7:15 We can just do things. 7:17 We can just science. 7:18 And that's it. 7:19 Any other questions? 7:23 Yeah. 7:24 I just had a comment. 7:25 Wait, in the mic.

Speaker C 7:26 Yeah, I'm a climate scientist, and distributing data is a huge deal for us. We can talk more about it, but there are customers who want climate data, and it'd be really great to do that.

Robin Berjon 7:37 Yeah, bring it our way. 7:40 One of the things that's great: as you know, there's ongoing epistemicide in the US, and a lot of people have been trying to save datasets as fast as possible by putting them on BitTorrent. 7:52 What's difficult with that is you can't know if they're publishing the real original thing or if it's been tampered with. 7:59 I mean, there's currently no source to know what it was like. 8:03 With this, because atproto is built on DASL, brought to you by the IPFS Foundation, everything can have a CID, so everything can be verified. 8:13 So it becomes possible to load data that you know, and to have a public description of it that gives you verifiability built in.
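[Editor's sketch] To make the CID point concrete: a CID is derived from a hash of the content, so anyone holding a trusted CID can recompute it over the bytes they downloaded and compare. The minimal Python sketch below assumes the bytes were addressed as a raw block with SHA-256, encoded as a CIDv1 in base32. The talk does not say which CID form Matadisco records actually use for a given dataset, and the helper names `raw_cid` and `matches_published_cid` are made up for illustration.

```python
import base64
import hashlib


def raw_cid(data: bytes) -> str:
    """Return the CIDv1 string for `data` treated as a raw block hashed with SHA-256."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = "raw" codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase base32: lowercase, no padding, prefixed with "b"
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def matches_published_cid(data: bytes, published_cid: str) -> bool:
    """True if the downloaded bytes hash to the CID that was published (assumed raw/SHA-256)."""
    return raw_cid(data) == published_cid


if __name__ == "__main__":
    original = b"temperature,42\n"
    cid = raw_cid(original)  # what a publisher could put in a public record
    print(cid)
    print(matches_published_cid(original, cid))             # unchanged copy
    print(matches_published_cid(b"temperature,43\n", cid))  # tampered copy
```

A copy fetched from BitTorrent or any other mirror can be checked this way against the CID in the public record, without having to trust the mirror.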
8:22 So I think this is a win for the next wave of epistemicidal maniacs.

Speaker B 8:29 Have you thought about a tlog over it?

Robin Berjon 8:31 And we could probably do a tlog over it. 8:33 All these things compose, right?
8:37 Hey, how much data have you uploaded to various PDSs?
8:41 Enough to knock over the EuroSky PDS. 8:44 We're only pushing the metadata out, right? 8:47 We're not pushing the data itself, because some of these datasets are insanely massive and we would knock over everything. 8:53 But we've pushed, I mean, I don't know, millions of records, I think. 9:00 If you look on UFO, Matadisco is regularly in the top 10 lexicons, even though no one knows what it is at this point. 9:10 So yeah.

Speaker B 9:15 Any other takers?

Robin Berjon 9:19 Just go use it. 9:20 And yeah, if you want to onboard stuff, talk to us. 9:22 Okay. 9:23 Thank you, Volker. 9:24 Thank you very much.
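[Editor's sketch] The "listen to Jetstream and react" pattern described in the talk could look roughly like the Python below. It assumes Jetstream's JSON event shape (a top-level `kind` and `did`, plus a `commit` object carrying `operation`, `collection`, `rkey`, `cid` and `record`). The collection name `org.example.dataset` is a hypothetical placeholder, since the talk does not give the actual Matadisco lexicon NSID. The function `dataset_records` is likewise illustrative, and the WebSocket connection is left out so the sketch runs offline.

```python
import json
from typing import Iterable, Iterator

# Hypothetical placeholder: the real Matadisco lexicon NSID is not stated in the talk.
COLLECTION = "org.example.dataset"


def dataset_records(lines: Iterable[str], collection: str = COLLECTION) -> Iterator[dict]:
    """Yield newly created records of one collection from Jetstream-style JSON lines."""
    for line in lines:
        event = json.loads(line)
        # Skip identity/account events; only commits carry records.
        if event.get("kind") != "commit":
            continue
        commit = event.get("commit", {})
        # Keep only "create" operations in the collection we care about.
        if commit.get("operation") != "create":
            continue
        if commit.get("collection") != collection:
            continue
        yield {
            "did": event["did"],        # repository (account) that published it
            "rkey": commit["rkey"],     # record key within the collection
            "cid": commit.get("cid"),   # content identifier of the record
            "record": commit.get("record", {}),
        }


if __name__ == "__main__":
    sample = json.dumps({
        "did": "did:plc:example",
        "time_us": 1,
        "kind": "commit",
        "commit": {
            "rev": "r1",
            "operation": "create",
            "collection": COLLECTION,
            "rkey": "abc",
            "cid": "bafy-example",
            "record": {"title": "Example satellite scene"},
        },
    })
    for rec in dataset_records([sample]):
        print(rec["did"], rec["rkey"], rec["record"])
```

In a real app the `lines` iterable would be fed by a WebSocket client subscribed to a Jetstream instance, and each yielded record would drive whatever the app does next, such as fetching or displaying the dataset it describes.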