Maxine Levesque 2:46 Hi everyone, my name is Maxine. 2:49 I'm the co-founder and CTO of Forecast.bio, and thank you for the opportunity to speak here. 3:00 It's been really exciting to be part of this community over the last year and a bit, and I'm really excited about the possibilities ATProto has for giving us agency over our social experience, over the way we shape our own science, our own investigation, and our own collective sense-making, as I'm sure was discussed a lot at the @science meeting before this main meeting. 3:33 So I'm going to talk about what we're building at Forecast to facilitate this vision of a future of AI built on ATProto, using the tools built into the protocol's infrastructure to facilitate collective sharing and understanding of large datasets and of model weights. 3:59 First, an introduction to Forecast, because it's probably a little out of left field to have a .bio domain show up in this setting, and not in the sense of a biography but of actual biology. 4:16 Forecast is a drug discovery and development company. 4:22 Our vision is focused on providing people with cognitive agency, recognizing that historically we've done a very bad job of developing central nervous system therapeutics 4:35 in the pharmaceutical space. 4:37 So what we want to say is: there is a vast space of where your mind could be at any given moment, and a vast number of ways your mind can evolve, in its instantaneous state, across the space of all the different things a mind can be. 4:55 You're somewhere in this vast sea of all the potential mental states that are possible, and what we want to do is produce therapeutic interventions.
5:06 We're starting with small-molecule drugs, but we'd really like to see this expand into all types of interventions that let you say: no matter which spot you're at in this space, I want to be able to move to any other spot, giving you control over the way your mind operates over time. 5:28 We think this overall vision provides a really different way of thinking about how drugs are developed in neurology and psychiatry, where historically we've had a really difficult time bringing new tools to market. 5:43 So we're using artificial intelligence models on the language side, the image side, the multimodal side, all the techniques that have been radically transformative in every field of science and computation over the last few years, to build out a map of every cognitive state a person could be in over time, and then to understand the biology by growing mini-brains in the lab and watching videos of their brain activity to map out how pharmaceutical interventions affect these things. 6:20 The difficulty we run into is that right now, in the AI ecosystem, these models are largely produced by a small subset of orgs, as I'm sure a lot of people in the audience are familiar with. 6:36 And that creates a number of practical, and for science as a whole I'd say existential, problems: what are the long-term impacts of the incentive structures present for the individual large orgs producing these models, and of the way that having only a small, sparse subset of models makes people anchor their understanding?
7:01 Even if you have a large number of individual startups building on top of these, say in the biotech space, if you have image models that show you biological imaging but they're all built on the same underlying base, 7:15 then the knowledge you're actually seeing when you do scientific investigation is concentrated around whatever semantics went into that specific training regime, right? 7:26 So there's outsized power in being the one who controls that, 7:32 and an outsized effect of quashing variability across the whole ecosystem of how investigation works. 7:39 This is something Anthropic has even published on: the overall effect of language models on the way we interact with text and the language we produce. 7:50 This concentration of training and infrastructure into only a handful of players is really deleterious for the field long-term. 8:00 But ATProto is an incredible ecosystem that provides infrastructure for solving exactly this problem: data that is distributed, that people have real, genuine ownership of. 8:17 It's not locked into a particular monolithic silo, and it's schematized so as to be mutually understandable and exchangeable, which for scientific data is incredibly important. 8:32 An image is not the same as any other image; the actual experimental detail really matters. 8:38 So being able to bookkeep on that is incredibly important.
8:42 Identity is incredibly important in this context: being able to understand where data is coming from in a way that doesn't lead to the same concentrating effects, but lets people contribute while trust is built out, as is currently done in the Bluesky ecosystem with verifications and its verification-graph architecture. 9:02 And then building in a way that's composable, with open protocols that allow applications to interchange. 9:10 This has already started to flourish amazingly on the social side, and for scientific data, on the knowledge side that people were discussing at the @science meeting, it's really starting to blossom. 9:26 I think there's an incredible opportunity, even outside the social and direct text-knowledge side, in how we share fully interoperable datasets. 9:37 What we're building rests on the idea that being positive-sum is what gives small orgs like our startup, and other biotech startups, a meaningful way to produce AI models that compete with the large labs. 9:53 That happens through coordination across the field: us sharing data and weights, others sharing data and weights, all done interoperably, so that the sum total of all our small startups in the ecosystem is greater than the sum of its parts. 10:10 That's the path forward we see, built on ATProto's infrastructure, for a small ten-person team to outcompete massive organizations, 10:22 because the network, the infrastructure, is the force multiplier for our efforts in training and building robust AI systems.
10:31 So what we've built at Forecast over the last while is our own pass at treating this for datasets. 10:41 We found it really important for our own infrastructure, for how we build the image, video, and multimodal models 10:50 that we use internally for our own discovery and development programs. 10:56 And we decided to put in the work to make it generalizable, so we can share it with the community and get broader buy-in around sharing data. 11:05 AppData is based on a foundational technology called WebDataset, a way that individual large datasets can be sharded. 11:17 It's very simple: 11:17 essentially just tar files 11:19 split into shards. 11:22 But because of the wrapper you put on top, you can schematize things very easily with a standardized layout for how the files are stored, you can stream data easily, and you can keep sidecar manifest files that let you do querying and other operations quickly. 11:39 So we're building an ecosystem in which there's storage infrastructure for the large datasets themselves that's off-protocol. 11:48 As a funny aside, I did build a version that works with PDS blobs; I think I have a few copies of MNIST stored as PDS blobs on my own maxine.science. 12:00 We then use ATProto PDS records as the index: the metadata and the schemas that give the interpretation of each individual sample in a dataset, plus lens transformations that let you move back and forth between schematizations of individual samples. 12:21 So you can aggregate across datasets that were made with different sample schemas but are interconvertible.
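As a rough illustration of the WebDataset-style layout described here, a shard is just a tar file whose members share a per-sample key prefix, with a sidecar JSON manifest for fast querying. This is a minimal stdlib-only sketch; the shard naming, key scheme, and manifest fields are hypothetical, not AppData's actual conventions:

```python
import io
import json
import tarfile

# Each sample is a group of files sharing a key prefix, e.g.
# "000001.json" (metadata) and "000001.bin" (payload) for sample 000001.
samples = [
    ("000000", {"label": 3}, b"payload-bytes-0"),
    ("000001", {"label": 7}, b"payload-bytes-1"),
]

def write_shard(path, samples):
    """Write samples into a WebDataset-style tar shard."""
    with tarfile.open(path, "w") as tar:
        for key, meta, payload in samples:
            for suffix, data in [
                (".json", json.dumps(meta).encode()),
                (".bin", payload),
            ]:
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

def write_manifest(path, samples):
    """Sidecar manifest: lets consumers query without scanning the tar."""
    manifest = {"num_samples": len(samples), "keys": [k for k, _, _ in samples]}
    with open(path, "w") as f:
        json.dump(manifest, f)

def read_shard(path):
    """Stream samples back out, regrouping tar members by key prefix."""
    out = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            key, _, suffix = member.name.partition(".")
            out.setdefault(key, {})[suffix] = tar.extractfile(member).read()
    return out
```

The point of the wrapper is exactly what the talk describes: the tar itself stays dumb and streamable, while the standardized key-prefix grouping and the sidecar manifest carry the structure.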
12:28 We have a version of this up on PyPI, 12:33 and I'm pushing the Rust SDK for it today. 12:37 Thank you, Claude. 12:39 We're really excited to get community feedback and iterate on the design. 12:45 As I was mentioning, the design of this overall is centrally done through code. 12:53 That's different from a lot of what we see in the ATProto ecosystem; over time it'll be nice to have a web front end to let people browse datasets and things like that, 13:06 but we've really envisioned this as something you reference in your own Jupyter notebooks when doing data science, or something you have your autonomous agents in Claude Code, Codex, or what have you build against, querying our AppView that aggregates all the datasets posted across the different fields. 13:29 Genomics, neural recordings of voltage dynamics in neurons, medical and biological imaging datasets: all of these can be posted as PDS records carrying the dataset's metadata and the schema it uses, 13:49 and all of that can be aggregated and filtered in real time to give people working on different kinds of data real-time streams of newly posted data.
14:01 We're also building out a trust-network architecture, inspired by the Bluesky verification system, so that within a given field there can be trust labels attached to individual data providers, letting people hone in on data provided by really reliable sources for particular fields. 14:25 Then, when individual clients look out at what's present in the atmosphere, the records that index all the data out in the world, the client has all the information it needs to go to the storage mechanism and pull the streaming samples for that data. 14:43 It's not depicted here, but we can also proxy that connection through the AppView. 14:48 There are a lot of ways to plug these pieces together, but the overall vision is that the client can look out into the entire atmosphere and ask: what has science done in mouse-brain MRI, 15:04 or whatever it is? 15:06 Any dataset with a lens that lets you convert it to something I can interpret as mouse-brain MRI shows up as index records from the atmosphere telling me where to look for all of those datasets, 15:21 and that lets me code-gen transformations for my own data-science pipeline that ingest all of it, streaming, from the larger web. 15:31 We built all of this knowing that Lexicon is really specific to the way ATProto is set up, so we wanted our own schema system: 15:41 a lightweight way of doing this across platforms. 15:46 Right now it uses JSON Schema as an intermediary, but we want to support other formats as well.
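To make the JSON-Schema-as-intermediary idea concrete, here is a toy sample schema and a minimal structural check. The `$id` NSID and all field names are hypothetical illustrations, not a published AppData schema, and a real system would use a full JSON Schema validator rather than this hand-rolled check:

```python
# Hypothetical sample schema for a neural-recording dataset, expressed in
# JSON Schema as the cross-platform intermediary described above.
NEURAL_RECORDING_SCHEMA = {
    "$id": "science.alt.dataset.sample.neuralRecording",  # hypothetical NSID
    "type": "object",
    "required": ["subject_id", "sample_rate_hz", "voltages"],
    "properties": {
        "subject_id": {"type": "string"},
        "sample_rate_hz": {"type": "number"},
        "voltages": {"type": "array", "items": {"type": "number"}},
    },
}

def conforms(sample, schema):
    """Minimal structural check: required fields present, types match."""
    if not isinstance(sample, dict):
        return False
    if any(key not in sample for key in schema["required"]):
        return False
    type_map = {"string": str, "number": (int, float), "array": list}
    return all(
        isinstance(sample[k], type_map[p["type"]])
        for k, p in schema["properties"].items()
        if k in sample
    )
```

Publishing schemas like this as records, the way lexicons are published on-protocol, is what lets any consumer interpret a stream of samples it has never seen before.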
15:51 Essentially, when you're loading a dataset from out in the world, you want to be able to interpret, as each sample comes in, what its fields actually mean. 16:01 So it was very important for us to publish abstractions of individual sample schemas as records, in the same way individual lexicons are published on-protocol, so that everybody can have a consensus understanding of what the data actually means semantically. 16:18 We wrap that thinly over the existing platform infrastructure for whatever language you're in. 16:26 Then we built a really simple interface for what developing against this is actually like, 16:35 based on the Hugging Face datasets API. 16:39 There's some magic underneath that resolves things much like Hugging Face dataset labels do, but keyed off the ATProto handle associated with where the index entries live. 16:55 It then automatically does everything under the hood to build a PyTorch DataLoader that streams the data and produces batched tensors in real time, or whatever data form you want to see it in. 17:15 We're very excited about making it as simple as possible to go out and load an individual dataset. 17:21 We also have similar interfaces for queries based on type conversion using our lens system: 17:28 find all the things of a particular schema, find all the things convertible to a particular schema, and so on. 17:37 Under the hood, as I mentioned, this is built foundationally on WebDataset, which is very, very simple.
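As a toy sketch of the Hugging Face-style loading flow described above: resolve a handle to its index records, stream samples from the referenced storage, and batch them DataLoader-style. Everything here is hypothetical and radically simplified (the `load_dataset` name, the record shapes, the in-memory stand-ins for the atmosphere index and storage backend); AppData's actual SDK may look quite different:

```python
# Toy stand-ins for the atmosphere index and the off-protocol storage backend.
INDEX = {  # handle -> list of dataset index records (hypothetical shape)
    "maxine.science": [{"name": "mnist-mini", "storage_key": "s1"}],
}
STORAGE = {  # storage_key -> iterable of raw samples
    "s1": [{"label": i % 10, "pixels": [i] * 4} for i in range(10)],
}

def load_dataset(handle, name):
    """Resolve an ATProto-style handle to an index record, then stream samples."""
    record = next(r for r in INDEX[handle] if r["name"] == name)
    yield from STORAGE[record["storage_key"]]

def batched(samples, batch_size):
    """Group a sample stream into fixed-size batches, DataLoader-style."""
    batch = []
    for s in samples:
        batch.append(s)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

The design choice being illustrated is that the handle, not a centralized hub, is the root of resolution: everything downstream (index lookup, streaming, batching) follows mechanically from records the data owner published.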
17:45 It's just files: essentially MessagePack in a tar file. 17:52 It's very, very simple, but very powerful in providing streaming, and you can build a lot on top of it pretty easily with really little overhead. 18:06 We use really simple decorators or macros to make the developer experience straightforward for setting up schema types for whatever data you're working with, so you can publish the index entry to your PDS with one function call, and then publish the actual large files to FTP, S3-compatible storage, whatever you're using on your particular backend, or PDS blobs if you're nuts like me, trying to make that as interoperable as possible. 18:43 We also want to support a number of binary serialization formats to make things really convenient on that end. 18:50 Right now I'm using a NumPy byte serialization because it works on the backend, but we're building that out to be really generalizable, to support interop for the overall community. 19:04 I hinted at this earlier, but I really want to dig into it, because it's a larger point that was apparent to me at last year's conference, and then I got ADHD-distracted doing other stuff, the stuff we're building at Forecast. 19:19 But I do think this is a very important point for the larger community, and particularly for the lexicon.community efforts of building out standards.
19:29 What's really powerful about schematized data is not just that you can look at everything in your particular schema. For the nerds in the audience: I'm really deep into applied category theory, so there's a lot of abstract nonsense about why this is hella cool. 19:49 But essentially, the thing that matters even more than the schema is the interconversions between schemas. 19:56 If you really keep track of view and update operations between two different types, you can do more than just understand data that's handed to you in a given lexicon, say for ATProto. 20:13 You can query out and build tooling that automatically ingests things from any lexicon that has coherent view and update operations to the specific lexicon you care about. 20:28 In AppData, we're building this explicitly. 20:31 We have PDS records in our lexicon for lens code that does interconversions between our AppData schemas, 20:43 and these are also subject to the verification systems, so we can have trusted lenses that aren't doing arbitrary code injection and all the crazy nonsense you'd probably get if people just referenced arbitrary code that their Claude agent or OpenClaw pulls, or whatever. 21:00 But beyond the application to AppData, which is really cool: say my lab publishes a schema that's totally insane because it's got weird metadata about whether the lab technician wore Axe body spray that day, because that influences mouse behavior in crazy ways only our lab cares about. 21:19 You can just say, okay, it's neural recordings, I don't care,
21:23 and project it onto the schema you care about. 21:26 I think this is similarly important for the larger ATProto ecosystem, because we've seen a lot of debate about the right move: 21:34 is it to have every application make its own lexicon namespace, so you get app-to-app separation, 21:43 or to come together around a lexicon.community approach where everything of a given type centralizes on a single lexicon, with some other way to specify what the individual apps are putting into it? 21:53 I think it's a why-not-both situation. 21:56 Individual apps can make their own lexicons for the types they work with, 22:03 but if there's a centralized lens lexicon, just the abstraction for what it is to interconvert between two ATProto lexicons, that lets a developer say: I want to make a blog app with a post data type whose content is specific to me, 22:28 but I also want to be able to reason about Leaflet, and about WhiteWind. 22:32 If you can just define the lens transformations to each of those, you can automatically pull in and aggregate data across many different, sometimes even divergent, lexicons, as long as the lenses giving you the view and update operations to each of them are coherent enough for what you want to do. 22:56 We're building our own demo version of this for the way AppData works with its internal schematized sample types.
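The view/update pairing described above can be made concrete with a toy lens. The `Lens` shape and both schemas here are hypothetical illustrations (including the Axe-body-spray field from the talk's joke), not AppData's actual lens records:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Lens:
    """A lens between two schemas: `view` projects a source record into the
    target schema; `update` pushes an edited target record back into the
    source without losing source-only fields."""
    view: Callable[[dict], dict]
    update: Callable[[dict, dict], dict]

# Hypothetical lab-specific sample vs. a generic neural-recording schema.
lab_sample = {
    "subject": "m-042",
    "voltages": [0.1, -0.2, 0.3],
    "technician_wore_axe": True,  # lab-only metadata, irrelevant downstream
}

lab_to_generic = Lens(
    view=lambda s: {"subject_id": s["subject"], "voltages": s["voltages"]},
    update=lambda s, t: {**s, "subject": t["subject_id"], "voltages": t["voltages"]},
)
```

A consumer who only understands the generic schema calls `view` and never sees the lab-only field; an edit made in the generic schema flows back through `update` with the lab-only field preserved, which is the coherence property the talk is leaning on.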
23:04 But I think this is a larger community point that could be really, really cool, and I wanted to spend a little time hammering it home because I'm a big lens evangelist. 23:15 Putting all of this together: we're in the process of deploying our own AppView, a canonical AppView implementation for the AppData ecosystem, and we're publishing all of our lexicons under science.alt.dataset as the reverse-DNS NSID. 23:38 That was a cool move by four-years-ago me, registering alt.science. 23:45 The overall vision is something that lets people control filtered feeds of the datasets they care about, while cohering weakly enough to facilitate interchange and collective sensemaking for large scientific datasets. 24:08 And the next step, once we've built the AppData system we're prototyping in-house at Forecast right now, is to move from social data, streaming in datasets from all over the world across the entire atmosphere, to social training: the other side of the Hugging Face API, model weights. 24:35 We're very excited about the possibility of lexicons defining the actual training phylogeny of model weights. 24:47 You can say: I want this associated with the hyperparameters it was trained with, the datasets it used as input, the weights it started from when I did this fine-tune, the code I actually used for doing that, and the evaluation metrics.
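A training-phylogeny record like the one just described might bundle that lineage as a single publishable structure. The `$type` NSID, the `at://` URIs, and every field name below are hypothetical sketches, not a published AppData lexicon:

```python
import json

def make_training_record(parent_weights, datasets, hyperparams, code_uri, metrics):
    """Bundle the lineage of one fine-tune as a single publishable record."""
    return {
        "$type": "science.alt.dataset.training.run",  # hypothetical NSID
        "parentWeights": parent_weights,  # weights the fine-tune started from
        "datasets": datasets,             # index records of the training data
        "hyperparameters": hyperparams,
        "code": code_uri,                 # the exact training code used
        "evaluation": metrics,
    }

# Hypothetical example record (URIs are illustrative placeholders).
record = make_training_record(
    parent_weights="at://maxine.science/weights/base-v1",
    datasets=["at://maxine.science/dataset/mnist-mini"],
    hyperparams={"lr": 3e-4, "epochs": 10},
    code_uri="at://maxine.science/code/finetune-v2",
    metrics={"val_accuracy": 0.97},
)
```

Because each record points at its parent weights, a collection of these records forms exactly the phylogeny the talk describes: a traversable tree of training trajectories across independent teams.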
25:02 And when lots of people are publishing that metadata, and in a sidecar service the weights themselves, we start building a situation where we can collectively, even autonomously, map out the space of all the different training trajectories possible for AI models, right? 25:22 It allows a weak cohesion across all the different biotech startups, language-model startups, and people working on small teams, 25:31 amplifying how much of these sometimes astronomical search spaces we can cover as a collective. 25:42 And I think it can have a transformative impact on the way distributed AI work by smaller independent teams is able to catch up with, compete with, and even outpace some of what's possible in larger labs. 25:57 The summary is this: given the semantic concentration present in the AI field, and the power consolidation happening around the large labs, there is an alternative thesis for what the future of AI model training looks like. It's not one lab with a giant cluster; it's the community, with its own independent hardware and its own independent resources, and enough coherence, because of the infrastructure, because of the protocol, to actually build synergistically off of each other's work. 26:43 That's the next phase of what we're building after AppData at Forecast. 26:47 So stay tuned.
26:49 And that gives the full picture of an ecosystem built at every level, from the datasets, to the way those datasets interoperate, to the training of the models, to the actual publication of the weights: a larger ecosystem where we can trace where things originated, what code was used to generate them, and share all of those details to build off of one another. 27:15 So, oh no, the AI for my slides made a goof. 27:19 Oh well, that's how it goes sometimes. 27:22 Just to wrap up on the final thing: as I mentioned at the very beginning, Forecast is centrally built around our vision of providing people with cognitive agency. 27:35 Our business is focused on drug development and discovery as a means to provide that, because of our research experience. 27:48 I'm a neuroscientist by training. 27:50 I did my PhD down the street from here at UCSF in neuroscience, along with some medical training there. 27:57 I care deeply about neurology and psychiatry, and that is our focus. 28:01 But I think that overall thesis has a direct link to some of the trends we're seeing in how AI models, and the overall shape of the AI ecosystem, are moving us toward a possible future with a lot of homogenization: a future that's very unhealthy for individuals' mental health and for the health of our scientific discourse, which benefits enormously from noise, 28:32 from the divergence of opinions that comes when people don't just see the same semantics coming out of one model, but their own situated, individual versions of it, distributed across individuals' experiences and the particular things they bring to the table.
28:48 And so all of this is shaped fundamentally by the tools and the infrastructure we have for how the systems we're building are created and how they interoperate. 29:03 I think ATProto overall has incredible possibilities, not just for a decentralized approach to our social media data, but for a similarly decentralized approach that is incredibly empowering to groups producing artificial intelligence, 29:25 and I think that has outsized potential to really transform the future of what that industry looks like. 29:31 I believe fully in the promise of the protocol work that's here, and I believe in protocols, not labs, for the future of AI development. 29:48 That wraps up everything I have about AppData and the distributed AI work at Forecast. 29:55 And also, we're hiring, so if you like solving problems, let me know. Speaker B 30:00 Thank you so much, Maxine. 30:02 Can we get a round of applause for Maxine? 30:09 Thanks so much. 30:10 I've got two mics on right now, one for the stream, one for Maxine. 30:13 Maxine, can you hear me all right? Maxine Levesque 30:15 Yeah, perfectly. Speaker B 30:15 Fantastic. 30:16 We've got about five minutes for questions, and I'm going to take the moderator's prerogative to first say: Maxine, you should check out what Nick Diakinas is working on right now along with Blaine Cook, because Nick has actually just implemented lensing in Lexicon Garden. 30:30 So you can lens between literally any arbitrary lexicon. 30:34 You can just skip right to the next step if you'd like. 30:38 They talked about this yesterday in an unconference session, and Blaine and Aaron Steven White have a much crazier arbitrary data-lensing system that they're also building.
30:50 So I'll show you. Maxine Levesque 30:51 Oh, I'm so excited. 30:52 Yeah, we talked about that last year. 30:54 That's so cool that that's happening. 30:55 Amazing. Speaker B 30:57 Yeah, it's very cool. 30:58 So yes, any other questions that folks have? 31:05 Everybody's minds are blown. 31:07 Okay. 31:08 Yeah, come on over. 31:13 I'm interested to hear how you've progressed on documentation for how the public could get involved with the things you're working on. Maxine Levesque 31:25 Oh yeah, we have so many things going on. 31:29 We're a very early-stage startup; 31:30 we're running around with only a few of us to do a ton of tasks. 31:35 I'm very, very excited about what's happened with Claude over the last few months as far as providing ways to get documentation out there. 31:44 Right now I've been doing a lot of build-out on our AI tooling, which is called CrossLink, which some people might have seen in the community, 31:54 so that's been a bit of the focus, but I'm trying to move back over to building out documentation on how to get started with AppData. 32:04 And particularly for people who are using AI agents for coding: one of the cool things we've tried to do in CrossLink is have a knowledge repository per Git repository, specifically for coding agents, that they can pull up repo to repo. 32:23 So check out our CrossLink knowledge tool for that; my hope is that I can build out the knowledge specific to AppData in its own separate orphan branch on Git, so that your agent can pull from it and know how to do all of this, basically.
32:42 But it's a very good point that the docs are extremely important. Speaker B 32:46 Yeah, cool. 32:48 Any other questions? 32:52 Okay, well then we'll just say thank you again, Maxine. 32:55 We really appreciate you being here remotely with us. Maxine Levesque 32:58 Amazing. 32:58 Thank you so much for having me. 33:00 Have a fantastic conference. Speaker B 33:06 Okay.