Emily Gorcenski 2:04 Excellent. 2:08 Thank you very much. 2:10 I know that we're running weird on some time somewhere, so people can— Perfect. 2:16 I won't need it all. 2:17 I'll go through this quickly, hopefully. 2:19 But yeah, thank you for being here. 2:21 Thank you for joining this talk. 2:23 Enterprise data is maybe not the most exciting topic, but I hope that I will make this interesting for you. 2:32 There we go. 2:32 All right. 2:33 So the title of this talk, it's the thing where you— when the headline is a question and it can be answered no, maybe it's not worth your time. 2:40 So I'm going to tell you a little bit in advance that the answer of, did Lexicon accidentally solve the enterprise data problem? 2:46 The answer is probably no. 2:47 But I want to give this talk because I really want it to be yes. 2:50 And I'm going to explain why over the next, say, 29 minutes and 16 seconds. 2:58 So a little bit about me. 2:59 My name's Emily Gorcenski. 3:01 I'm a data scientist by background. 3:04 I was actually— R&D is— is sort of where I started my career. 3:10 I realized at some point that all of the algorithms I was doing for biotechnology and aerospace and all that good stuff was in this new and upcoming field that they were calling data science and that those people get paid a lot more money than I was making. 3:22 So I just said, I'm a data scientist now, and it stuck. 3:25 But you can find me on the internet here. 3:29 And how did I get into AT Proto stuff? 3:31 Well, it started because— I think that the internet is horrible, and I really wanted to curate my own presence on the internet. 3:41 So if you've ever had a tweet or a skeet go viral, you know it's like the worst thing in the world. 3:46 Like, it's not actually something you want to happen. 3:50 So I created a little script called Skeeter Deleter, which I use to curate my own feed. 
3:54 So I get to pick the things that I want to keep, so that if anyone visits me, they can't go back in my timeline and find things to cancel me over. 4:02 But I will tell you, I will say some things that you'll probably cancel me over today. 4:06 Data things. 4:07 I also run a labeler called BrandBlock Online because the internet is horrible. 4:11 The worst part about it is the evil brands who are coming in trying to be funny. 4:15 So I built a labeler to block the brands. 4:19 So if you don't want to see on your timeline Arby's trying to joke with Taco Bell, this is your tool. 4:27 And this is something new that I just recently built, because I finally figured out how to do OAuth, kind of, for AT Proto. 4:35 This is just like a little wishlist/register/mutual aid compilation tool. 4:42 I'm gonna be opening this up for beta, so if you want to test it out, I'm gonna try to launch this by the end of the conference. 4:50 But you do need to follow me on Bluesky. 4:52 Not intentionally manipulative. 4:54 It's just the easiest way that I have to limit the number of people using it right now. 4:59 I don't feel like doing an invite system. 5:03 And then my professional world. 5:05 I'm a CTO of a startup. 5:07 The title is fancier than the actual, you know, what we're like, what I'm doing day to day. 5:13 What I'm doing day to day is actually building just sort of like accounting software. 5:17 So if you have like QuickBooks data and you want dashboards, I can do that. 5:25 But before I joined the startup world, I was doing consulting for 8 years, which is why I talk about enterprise data. 5:33 So the thing that's interesting about the enterprise data space— and I told you that I'm a data scientist. 5:38 I came from an R&D background. 5:39 I peaked in Fortran. 5:41 I'm not one of these web people. 5:44 And if you look at the AT Proto— or I'm sorry, @proto— I learned that it's @proto. 5:48 I thought it was actually Austria Proto for a while. 5:53 If you look— woo! 
5:56 If you look at the tooling, it's in TypeScript, it's in Go, and it's in— if you're really cool, it's in Rust. 6:02 And those are like the hip languages, right? 6:05 And those aren't the languages that data people use. 6:08 But it's also reflective of the mindset that @proto developers have. 6:14 Which is that when you're building web applications, you're treating data like a hot potato. 6:18 Like, data comes in, you want to, like, get rid of it as quickly as you can, right? 6:21 Because the longer that you're holding on to data and processing it and doing things to it, the harder your system becomes to maintain and operate at scale, right? 6:30 And so web developers have built all of these really cool tools and languages, and they're, like, really hip. 6:35 We've got these really cool conferences in places like the University of British Columbia, and they wear jeans to work, and they're really, like, awesome, right? 6:43 And data people are kind of different than that. 6:48 Like, they kind of emerge from the world of database administrators, and database administrators are famous for being grumpy, and they don't like change, and they don't like you, and they don't like your fancy languages. 6:58 Like, we're all toiling in the SQL mines. 7:00 We might be writing some Python. 7:03 If you're really, like, fancy, you might do a little Scala or Spark in the data space. 7:08 And a lot of these folks are sitting there maintaining systems that have been around for 25 years, and they're, like, not wearing jeans and hoodies to work. 7:16 They're wearing, like, khakis and polos, because they have serious jobs, and they're holding serious data that all of these, like, multi-billion-dollar companies will stop operating if anything happens to it. 7:26 So they don't want to do change. 7:27 They don't want to do a lot of fancy technology development. 7:31 And this is a little bit of hyperbole. 7:32 I have known database administrators who, from time to time, do wear jeans.
7:38 The thing about enterprise data is that it's not really technology development. 7:42 It's anthropology. 7:44 Because when you're dealing with a company that has been around for 20, 30, 50, 100, 150, sometimes longer years, the information that you have is all historical. 7:56 And it reflects the relationships of a business, its departments, its entities, its customers, and its suppliers. 8:05 You all are probably familiar with Conway's Law, that the architecture of a system reflects the communication patterns of a company. 8:14 Well, that all comes through in enterprise data. 8:17 I've actually worked with data systems that you can pinpoint the exact day that two teams stopped talking to each other. 8:24 And you can do that because the fields change, and then you have to carry around this conditional with a magically coded date because on June 16th, 2007, these two teams went through a reorg, and then all of your logic has to carry that through for all time. 8:44 Right? 8:44 And so this is like an anthropological problem, which makes sharing and using this data and doing anything meaningful with it very, very difficult. 8:55 So what is the problem with enterprise data? 8:57 The problem is that data access is slow, it's difficult, it's expensive. 9:01 So is the analysis. 9:03 So is the data engineering. 9:05 I've gone to data engineering teams as a consultant. 9:07 I said, tell me about your problems. 9:09 They said, well, here's our backlog. 9:10 And I said, you know, how many tickets do you do each week? 9:14 And they said, week? 9:15 That's ambitious. 9:17 We do maybe 20 tickets per quarter. 9:20 And I said, well, how many new tickets per quarter do you get? 9:22 They say 40. 9:25 So that's not a solution. 9:26 That's not like a situation that's getting better, right? 9:29 And then you also see things like data science teams, like mid, early, mid-2000s. 9:35 Everyone's like, oh, you got to have data scientists because AI is coming. 
9:39 So they went out and they pulled all of these people out of academia, a lot of neuropsych people, a lot of astronomy, physics, all of that, people who know good mathy stuff, and they put them in a team. 9:51 They didn't really explain to them how to work in an enterprise environment. 9:55 And so they built a lot of stuff that was really cool, but it wasn't in phase with product development. 9:59 And so then you have these really cool algorithms that nobody knows how to deploy. 10:03 Nobody knows how to run the code for it. 10:06 And by the time you actually get it, they're like 6 months out of date anyways, so it doesn't actually add value. 10:13 And then data folks don't really do agile. 10:18 Version control? They're still clicking around in GUI tools. 10:21 Building ETL pipelines with drag and drop, right? 10:24 Continuous delivery isn't a thing. 10:25 CD, it's like a thing that you put music on. 10:28 And DevOps is a fancy term to data folks that nobody can explain, right? 10:33 The data folks, they don't really understand how to do rapid, fast, agile software development. 10:38 Again, hyperbole, because a lot of things are getting better. 10:41 But if you go into an enterprise, there's still a lot of stuff where they don't even know how to use Git. 10:46 I've actually gone into clients at big companies and had, like, day 1, I'm like, all right, let's talk about your architecture. 10:52 And by day 3, I'm like, OK, here's how to do git status. 10:58 And the problem is that within software development, we favored software developers. 11:03 We've given them a lot of tools to build software really quick. 11:07 And a lot of what we've done is we've absolved them of the duty and the responsibility to care about things like data quality, data semantics, the relevance of the data. 11:15 We're just like, here, keep— Get the data quick, throw it somewhere. 11:18 And then they're pushing it to Kafka streams and stuff like that. 11:21 And it's all great. 11:22 It's great for them.
11:23 And then some data engineer has to sort out that mess. 11:29 And we've tried solving this with, like, lots of different architectures. 11:32 If you talk to data people, they love talking about architecture. 11:35 We've gone from data warehouses to data lakes. 11:37 Data lakes became data swamps. 11:40 We've built things like data vaults, which is, I guess, if you really want to make your data model complicated, data vault is great. 11:47 Added Greek letters to it, a Lambda architecture or a Kappa architecture. 11:52 And this has all gone sort of in a cyclical pattern, right? 11:56 This shift between centralization and decentralization. 12:01 And so what happens is most enterprises are like, we need a big database. 12:05 So they build a big database and they're like, here's our data warehouse. 12:08 It's OLAP, it's all this stuff. 12:10 And then there's a bottleneck and there's a backlog. 12:13 And so they say, this isn't working. 12:14 And then finally, somebody with enough political clout in the organization is like, screw you all. 12:18 I'm building my own database. 12:20 And then that happens once. 12:21 And then the next team's like, well, they did it, so I'm going to do it. 12:24 And so the next thing you know, you're 3, 4, or 5 years later, and now you've got the shadow IT situation going on. 12:29 There's all of these different data systems. 12:31 And somebody goes, why are we spending all this money on data systems? 12:34 Let's do another big consolidation. 12:36 I've come in on the back end of $100 million failed data architecture consolidation projects, right? 12:42 Tons of money go into this. 12:46 It's all been very expensive and nothing has worked. 12:49 Nothing has worked. 12:51 Sometimes things get a little bit better, but every organization I talk to has the same exact problems, right? 12:58 And so when I was at ThoughtWorks, we came up with this solution that I'm gonna talk about, but a little bit about why it doesn't work. 
13:06 Sorry, I almost skipped a slide here. 13:09 You just have this situation of data governance is a mess, data catalogs are all kind of terrible, schemas are discombobulated, semantics are even worse. 13:18 Systems don't talk to each other. 13:19 People don't talk to each other. 13:22 It's just a bad situation for most of the time. 13:25 And what we actually need is a scalable and consistent way to define the exchange of data between organizations, systems, parties, companies, whatever. 13:36 We need platform independence, right? 13:37 So we have systems, lots of companies are like, we're gonna be on Azure and AWS and Google because we don't wanna put all our eggs in one basket. 13:47 And now you have 3 different types of ecosystems. 13:49 It's like Tower of Babel. 13:50 Nobody's speaking the same data language. 13:53 And we need lightweight, domain-oriented models, right? 13:57 So when I was at ThoughtWorks, we came up with this idea called data mesh. 14:00 Data mesh is a bit like communism. 14:02 It came out of conditions. 14:05 The conditions that we talked about were all of this fragmented ecosystem of not having this common way of defining what it is that you mean when you're talking about data. 14:17 If you don't know what you're talking about when it comes to data, you don't know how to even describe a way to share and exchange it. 14:26 Data mesh came up with— it's a decentralized model of working with data that focuses on data as a product. 14:33 So we actually wanted to think about, what if you worked with big data in the same way that you work with microservices? 14:39 Which is kind of a nonsensical way of thinking about it, because microservices are designed to be really, really small, and big data is, by definition, very, very big. 14:49 So we came up with these ideas of, OK, let's make data a product. 14:54 Well, when you're doing microservices development, you're usually domain-oriented.
14:57 You usually have a platform like Kubernetes or something to deploy your service on. 15:02 And then, of course, everyone said, well, that doesn't work. 15:04 We need to govern it. 15:05 And so then we stapled on federated computational governance to make them happy. 15:10 And then tried to figure out what that was going to be. 15:15 And this is a really great theory. 15:16 Again, data mesh is like communism. 15:18 It's really great in theory. 15:20 The practice has been a little bit less stellar in some cases. 15:25 Because domain-oriented data products— they're a good idea, but we haven't given anyone the tools of how to actually do it and implement it correctly. 15:37 So what if we just pretended like those challenges didn't exist? 15:41 Like, what if we just decided that we're gonna start from scratch, we're not gonna talk about all of your semantic layer, we're not gonna talk about all of your legacy systems, we're just gonna build a microservice around data? 15:52 What would that look like? 15:54 It's actually not that bad of an idea. 15:57 The problem is, once you start to try to implement that, everything falls apart. 16:02 The tooling sucks. 16:04 The data catalogs suck. 16:06 It requires a ton of platform engineering. 16:08 Nothing is really set up for this, right? 16:10 There's no standard language for defining what a data product is or how it should interact. 16:16 There's no standard way of joining the semantics from two different domains together. 16:22 And there's a lot of languages that have come up, like OWL and RDF and all this other stuff, to try to define how data looks and how data should be shaped. 16:32 But they have all of these gaps about defining how we're supposed to work with data. 16:39 So when you actually dive into— and this is where we get to lexicon. 16:42 There's a little bit of Chekhov's gun that I'm setting up here. 16:46 If you look at the way that we model data, we care about really two things.
16:51 We look at the nouns and the verbs. 16:54 The nouns are what the data means, how it's structured, who owns it, who can access it, what are the permissions, et cetera. 17:02 And sometimes those have adjectives like the type of the data or the frequency that it's updated, the freshness of it, things like that. 17:11 And then we also have the verbs— I'll jump into those in a second, but this is a great example of talking about the nouns. 17:18 So I pulled this random OWL 2 definition off of, I don't know, the internet somewhere, which is great. 17:27 It's talking about the physical quality of the thermal energy of a system, and it defines all of this really specific stuff. 17:34 Half of the characters— solidly more than half of the characters in here are metadata, like meta-meta-data. 17:39 About the links to where to find the actual thing that you're seeing. 17:44 And so this is really cool if you want to know that this number is about the physical quality of the thermal energy of a system, but it doesn't tell you what that means in any sort of context. 17:55 It doesn't tell you what you should do about it or why you should care about it. 18:02 The problem with that type of approach, and the problem with all of those types of data specification languages that we've seen, is that they tend to be very static. 18:11 It assumes that data is fully self-contained, that you can just kind of like look at a thing and describe its properties and be like, aha, I now have a perfect Platonic ideal of a chair. 18:20 I've defined exactly what a chair is. 18:23 But you don't talk— like you don't do anything about like how a chair should be used, right? 18:29 And then on top of that, it becomes really difficult to start to contextualize it because, in reality, you need to start to think about how these nouns become composed with each other, how they interplay. 18:42 Right?
18:42 And so there's some stuff out there working on how to overlay different data definition languages in a practical systems context. 18:51 But we're still missing the whole, like, action element of it. 18:56 The thing is, with, like, building a microservice-style architecture, you also need to define those verbs. 19:02 You need to define how these services interact. 19:05 That's actually the critical thing about it. 19:07 If you look at something like the OpenAPI spec, it's really great. 19:12 It talks about how you access the data, what to do with the data, what are the things that you can— like, how do you write it, update it, delete it, whatever. 19:21 You define your API spec. 19:23 It tells you exactly, like, this is going to be a PUT. 19:25 This is going to be a GET. 19:26 This is going to be a POST, whatever. 19:29 How to request permission for the data is often part of that broader spec and how systems should handle the data. 19:38 And so you can do that with those tools. 19:43 But if you start looking at things like OpenAPI, then they start to sort of fall away on the definitions of the nouns. 19:51 So OpenAPI is great because it gives you a little bit of a shape of the data, but beyond sort of the basic sort of JSON types— and it's really limited in that case— it doesn't really tell you about what the data means in a broader context, right? 20:09 So all of this sort of ecosystem of defining data through these sort of specification languages, all of this also omits this concept of domain ownership whatsoever. 20:24 You're basically just saying data exists, here it is, or systems exist, here they are. 20:29 We're not really talking about how to access them, who owns them, what they should do about them. 20:35 So if you're gonna build a good data product language, you need to have a concrete encoding of your domain ownership. 20:40 So who owns this? 20:42 What do they own it for?
20:44 You need to have a clear, concise, and extensible definition of the nouns, and a flexible, versionable definition of the verbs. 20:52 So you need to be able to encode change over time. 20:55 You need to be able to encode what it is that you're describing and who owns it. 20:59 So those three things are key. 21:02 As I mentioned, data mesh was a great idea for trying to put together these data products, but it really suffered from four key flaws. 21:14 The first of them is, if you know ThoughtWorks at all, ThoughtWorks tended to be a little bit dogmatic. 21:18 It was a criticism of the company as long as I worked there. 21:22 That existed even before I worked there. 21:24 And data mesh was like, we were trying to do everything with microservices. 21:29 And so we just sort of almost naively applied that to a really big data architecture project. 21:36 And then the second flaw: 21:37 there's no tooling for it. 21:38 There's no equivalent to Kubernetes for standing up a massive, large-scale data processing system aside from Kubernetes, which is really difficult if you've ever actually tried to do big data processing on Kubernetes. 21:51 Running Spark on Kubernetes is really difficult. 21:55 A lot of change, like, again, data mesh, it requires a lot of reeducation, right? 22:01 The way that you work with data, the way that your data people are actually doing things, you have to teach them DevOps. 22:06 You have to teach them Agile. 22:08 You have to teach them domain ownership. 22:11 Really difficult. 22:12 And then we also spent the last decade decoupling software devs from any downstream responsibility for their data. 22:18 And now we're turning around saying, oh wait, no, no, no. 22:21 Now you have to be responsible for that again. 22:23 And they didn't really like that. 22:27 So I mentioned that we need, you know, all of these definitions around nouns and verbs and domain ownership and stuff like that.
22:37 And yes, this is now Chekhov's gun going off, because if you look at Lexicon, you look at the spec, it actually kind of elegantly solved all three of those problems, and I don't even think that it was trying to. 22:49 The first thing that it does: this reverse DNS addressing is brilliant. 22:56 This solves a huge problem for the enterprise data space, because now you can actually encode your domain ownership literally in a domain, like a domain name solution. 23:11 And you can do that across an entire organization. 23:14 So you can actually set up an organization, give every department, every team, whatever, a path in that domain name, reverse it, and then they can host their own lexicons. 23:27 It does come up with the definition of the nouns. 23:29 This is actually— I don't know if you noticed when I started it. 23:33 I was really bored. 23:34 So I decided yesterday, instead of doing it like Google Slides, I would build my own slideshow system on @proto. 23:42 So I vibe coded a slideshow. 23:46 Slideshow system. 23:47 This is the lexicon for the slideshow. 23:50 And so you can see that there's— the nouns are defined here. 23:54 I need a title. 23:55 I need a short description. 23:56 I need maybe a timestamp when it's created. 23:59 All of that is there, and it's contextual. 24:01 It's not just the data type, like the JSON data type. 24:05 Of course, that is in there. 24:07 But it tells me what the meaning of that data is and why it's relevant and why I should care about it. 24:12 And if I had gone a little bit farther, I would have been able to put the verbs in here as well for how to author a slideshow, how to assign permissions to a slideshow, how to revoke permissions to a slideshow. 24:24 So I could turn this into essentially a Google Docs or the Google Slides type of situation or PowerPoint type of situation by encoding what the actions are directly in the lexicon.
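[Editor's note] The nouns-and-verbs shape being described can be sketched concretely. Below is a hedged illustration of what such lexicons could look like, written as plain Python dicts; the NSIDs, field names, and descriptions are illustrative assumptions, not the speaker's actual lexicon.

```python
# The nouns: a record type whose properties carry human-readable context,
# not just bare JSON types. The reverse-DNS id encodes domain ownership.
slideshow_record = {
    "lexicon": 1,
    "id": "com.example.slides.presentation",  # hypothetical NSID
    "defs": {
        "main": {
            "type": "record",
            "description": "A single slideshow presentation.",
            "key": "tid",
            "record": {
                "type": "object",
                "required": ["title", "createdAt"],
                "properties": {
                    "title": {
                        "type": "string",
                        "description": "Display title shown on the first slide.",
                    },
                    "description": {
                        "type": "string",
                        "description": "Short summary shown in listings.",
                    },
                    "createdAt": {
                        "type": "string",
                        "format": "datetime",
                        "description": "When the slideshow was created.",
                    },
                },
            },
        }
    },
}

# A verb: a companion lexicon of type "procedure" could encode an action,
# like granting another account permission to view a slideshow.
share_procedure = {
    "lexicon": 1,
    "id": "com.example.slides.sharePresentation",  # hypothetical NSID
    "defs": {
        "main": {
            "type": "procedure",
            "description": "Grant another account permission to view a slideshow.",
            "input": {
                "encoding": "application/json",
                "schema": {
                    "type": "object",
                    "required": ["presentation", "recipient"],
                    "properties": {
                        "presentation": {"type": "string", "format": "at-uri"},
                        "recipient": {"type": "string", "format": "did"},
                    },
                },
            },
        }
    },
}
```

Note how the `com.example.slides` prefix plays the domain-ownership role described above: a department that controls that DNS path can publish and version these lexicons itself.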
24:40 So again, I mentioned this kind of casually solved all three of those problems, which would be a really big benefit for an enterprise. 24:48 Because not only does that mean that if you can define a lexicon for your data products, you can start sharing that data across your organization, but also outside of your organization if you want to publish those lexicons openly. 25:01 And this is, by the way, not even talking about using the PDS as your platform. 25:06 Like, I'm not necessarily talking about like, oh, replace your data platform with a PDS and everything will be solved. 25:12 I'm actually just talking about use a flexible data definition language that gets away from all of the sort of dogmatism of ontologies and all of that stuff and focus on actually shareability rather than like fully fleshing out the definition. 25:31 This is a standard way of doing it. 25:32 I think Emily Hunt spoke about this yesterday, like sharing exploding stars data through a globally accessible standardized feed that every astronomy lab in the world can subscribe to and then see the exact data format for that data without having to go through and set up their own Kafka consumers, without having to run a Kafka cluster, all of that stuff. 25:56 That's a really compelling idea. 25:58 It's a really, really, really brilliant way of being able to make that data available in a way that people can not just access it, but understand what it means. 26:11 So then this— like, when I look at Lexicon, actually the first time I saw Lexicon, I said, I've been trying to develop this as a data product specification language for the last 3 years, and here it is. 26:22 So it's worth exploring. 26:25 I think that this is actually relevant, because if you know me from any of my other work, you know that I don't really like fascists. 26:32 And one of the problems in the data space is that I've mentioned this sort of cyclical behavior. 
26:40 We've gone from data warehouses to data lakes to streaming architectures and back again. 26:45 And the trend right now is towards consolidation. 26:48 And so you see companies like Palantir who go up to these big organizations and government agencies and they say, "We can solve your data consolidation problems." And I've never known anyone who's used Palantir who likes it. 27:02 Right? 27:03 But they sell it and they make a fuckload of money selling it because they're promising companies to solve exactly this problem. 27:10 That we can make all of your data accessible in one place really easily. 27:16 And so my argument is, if we want to be able to break the stranglehold that these rent-seeking platform organizations like Palantir, like Oracle, like all of these other companies have, we need to have a decentralized way of sharing data, and it needs to be consistent, and we need to get away from trying to overly define the languages and the specs and all of that stuff, the ontologies, and we need to focus on the ability for sharing and decentralizing that data. 27:46 That is necessary to fight fascism. 27:48 It's not just good business. 27:51 So what do we need to make that work? 27:55 The problem with— I mentioned going back to the beginning is that software developers love working with cool technologies, TypeScript and Go and Rust and all of that stuff. 28:03 And us data folks, we don't know any of that. 28:06 We don't work in those languages. 28:08 We need better Python tooling. 28:09 The Marshalllex library is great. 28:11 But I couldn't use it to build my own lexicons. 28:13 Maybe I'm not smart enough, but I just couldn't get it to work to build my own stuff that wasn't built around Bluesky. 28:20 So I had to write my own lexicon data class generator, which was hard because I don't know— I'm not a computer science person. 28:29 I don't know what an abstract syntax tree is. 28:31 I think that's like the thing that the spotted lanternflies eat. 
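[Editor's note] A lexicon-to-dataclass generator of the kind mentioned above could be sketched roughly as follows. This is a toy, hypothetical version, far simpler than a real generator (no refs, unions, nested objects, or string formats), and none of these names come from an actual library.

```python
from dataclasses import field, make_dataclass
from typing import Optional

# Minimal mapping from lexicon/JSON primitive types to Python types.
_TYPE_MAP = {"string": str, "integer": int, "boolean": bool}

def dataclass_from_lexicon(name: str, obj_schema: dict):
    """Build a Python dataclass from a lexicon-style object schema."""
    required = set(obj_schema.get("required", []))
    fields = []
    # Required properties come first: dataclass fields with defaults
    # must follow fields without them.
    for prop, spec in obj_schema["properties"].items():
        if prop in required:
            fields.append((prop, _TYPE_MAP.get(spec.get("type"), str)))
    for prop, spec in obj_schema["properties"].items():
        if prop not in required:
            py_type = _TYPE_MAP.get(spec.get("type"), str)
            fields.append((prop, Optional[py_type], field(default=None)))
    return make_dataclass(name, fields)

# Usage with a minimal, made-up schema:
Slide = dataclass_from_lexicon(
    "Slide",
    {
        "required": ["title"],
        "properties": {"title": {"type": "string"}, "index": {"type": "integer"}},
    },
)
```

After this runs, `Slide(title="Intro")` gives a typed object whose optional `index` defaults to `None`; a real generator would also need to resolve `ref` and `union` types across lexicons.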
28:36 But if you actually look at some of the stuff, like lexicon.garden, is an amazing— like, it's a global data catalog. 28:42 Like, it solved a problem that I've watched companies pour millions of dollars into, and it just solved it. 28:49 Like, oh, that's cool. 28:51 You can register your lexicons, and then we can make a data catalog for them. 28:55 And it just did it, right? 28:57 So applying that tooling and making that tooling available at the enterprise level or the business level would be really great. 29:05 Then we also need a mindset shift. 29:06 A lot of the conversation I have been having here is really wrapped around social internet, which is awesome because I really believe in the social internet, and Blake was moved to tears by Aaron's talk today. 29:17 But this is so much bigger than just social internet. 29:20 This is data exchange at scale, which is more than just people talking to people. 29:26 It's also about exchanging information. 29:29 We need better permissions management and private data. I know all of that is in progress, but right now you can't have somebody publishing their internal data to a PDS and then putting it out there. 29:41 And then we also need to have the tooling to be able to run that stuff inside closed ecosystems. 29:48 And then also brave data people to start building radical shit because, again, there's too much conservatism in the data space. 29:57 Unwillingness to try new approaches, try new technologies. 30:01 If more data people can start doing things like, hey, look, I just solved a globally distributed data problem with this cool protocol and some Python and some lexicons, then I think that we'll be able to do some really great stuff. 30:15 So I'll wrap it up there, just about on time. 30:17 Thank you so much. 30:18 Let's chat about data.