Jasper Rädisch 24:02 Hello everybody, thanks for sticking around. 24:05 I'm Jasper. 24:07 Today I'm going to talk about keywords versus embeddings. 24:10 I can tell you already, of course, it's keywords and embeddings. 24:15 Yeah, let's get into it. 24:19 Of course, I'm going to start with a couple of disclaimers. 24:24 My level is ELI12. 24:26 So if I talk to Claude, that's usually how I let it explain math problems to me. 24:33 So expect ELI12, maybe ELI15, not ELI5. 24:39 And since I'm scraping a lot of the Atmosphere, I felt like it's important to say I try to be respectful. 24:47 The way I'm doing it right now is: if you have the "No unauthenticated" label, which means you do not wish to be seen by unauthenticated users, then I will not show you on any of my pages. 25:01 Also, you can always address me directly and tell me, "Please, I don't want to be part of this." I'm doing no LLM training yet, although it's kind of interesting. 25:14 Maybe in the future I might, but I would communicate that. 25:20 So I do not consider myself a thought leader, at least in this realm. 25:27 And so I was thinking about what I could actually contribute. 25:31 I assume there are some developers around here. 25:34 So I thought maybe the thing that's most interesting is my stack, because it's kind of low budget, as low budget as can be in this kind of space. 25:48 And you see here my small, old Mac Studios, and they are doing the heavy lifting. 25:55 Never mind, I will explain later. 26:00 One wall of text is allowed, right? 26:04 This is my stack. 26:07 These are the important bits. 26:09 So if you later want to recreate this, the yellow ones are probably the things you might want to get or emulate. 26:24 Yeah. 26:26 So let's start with a little explanation. 26:29 What I like about Bluesky, or what I liked about Twitter.
26:34 I think the coolest thing about Bluesky, and the coolest thing about Twitter, was that you can pretty quickly see whether you match with someone, right? 26:46 Platonically. 26:47 You go to the profile, you see like 10 tweets, and it's like bam. 26:51 'That's a cool person,' or not, right? 26:56 I like the quantity over quality approach, right? 26:58 Because each message is like this high without images. 27:02 You can go through thousands of them in minutes. 27:06 Okay, it's a little... maybe hundreds. 27:09 And by that you can kind of get this long tail discovery effect going, right? 27:16 Because you see so much, you can also discover the small accounts. 27:20 So it's not all Taylor Swift, but you might rather want to discover approachable people in your area or in your area of expertise or something. 27:30 Also for content like movies, books, whatever. 27:34 And then by that you have this kind of lateral navigation. 27:38 So shout out to all the people who do this threads kind of navigating style where you can jump between threads by topic. 27:46 But that's kind of embedded in Twitter or Bluesky already, right? 27:49 And the coolest thing about Bluesky, which was also kind of cool about Twitter, is that you have data access, and that's where the tech comes in. 27:59 So I was thinking about how to boost these functions, and of course the first idea I had was to do some fancy follower graph visualization. You know Theo's work, this beautiful green shiny galaxy, and you can dive in and it's pretty cool. 28:20 And this is like my humble beginnings a couple of years back. 28:23 This is actually the Nostr user space and I did some clustering. 28:28 I don't even know anymore what the coloring was, but this is a force-directed graph around the center so you could hover over it. 28:36 But I always thought it's very beautiful and it gives you kind of a feeling for the shape of the sphere. 28:45 But it's hard to navigate, right?
28:48 So how do I use this now to find interesting stuff? 28:51 I mean, I can zoom in somewhere and then hover, but I always found that a little bit cumbersome. 28:57 So what I did is, instead of the follow graph, I got into the content graph a little more. 29:05 And so there's this really cool technology from the '70s or '80s maybe, and it's called TF-IDF. 29:13 Does anybody know when it's from? 29:15 Yeah, when is it from? 29:17 Yeah. 29:20 Early '90s even. 29:20 Okay, but it's old and battle-tested, and it's really cool technology to basically extract keywords from a set of documents. 29:30 So in Bluesky, it's easy to just say, okay, every user account is basically a document. 29:38 So the collection of all your posts, that's your document. 29:41 And then you can do these relative term counts and compare, and then you know, okay, this user is mentioning a certain topic, like Star Trek, much more often than all the other users. 29:52 So Star Trek must be important for them. 29:57 One question I wanted to ask the audience, you don't have to answer it now, but maybe after the talk you can come to me and explain it to me. 30:05 I am still building a highly individual tokenizing pipeline. 30:10 So tokenizing is the process of cutting the text into the smaller parts which you are actually counting, and you have to do lots of cleanup and other stuff like stemming, that is, bringing words down to a common word stem so you can compare them. 30:25 And I just wanted to ask, is this normal? 30:27 Is everybody doing that, or is there kind of a black box approach where you just put in data and it gives you the tokens? 30:33 I haven't found that yet. 30:35 Also, this is interesting. 30:37 I found I'm doing my own language detection on top of the language declared by posters, because there are many multilingual posters like me from Germany. 30:48 I switch between English and German all the time.
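[Editor's sketch] The "every user account is a document" idea described above can be illustrated with a few lines of plain Python. This is only a minimal sketch under assumptions, not the speaker's pipeline: the `tokenize` helper, the sample posts, and the user names are all hypothetical, there is no stemming or language detection, and the scoring is the textbook term-frequency times log inverse-document-frequency form.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Hypothetical, very rough tokenizer: lowercase, letters only, drop short tokens."""
    return [t for t in re.findall(r"[a-zäöüß]+", text.lower()) if len(t) > 2]

def top_terms(posts_by_user, k=3):
    """Return the k highest-scoring TF-IDF terms per user."""
    # One "document" per user: all of that user's posts joined together.
    docs = {
        user: Counter(tokenize(" ".join(posts)))
        for user, posts in posts_by_user.items()
    }
    n_docs = len(docs)
    # Document frequency: in how many user documents each term appears.
    df = Counter(term for counts in docs.values() for term in counts)
    result = {}
    for user, counts in docs.items():
        total = sum(counts.values())
        # Relative term count, weighted down for terms that many users share.
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by score, breaking ties alphabetically for stable output.
        result[user] = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return result

# Made-up sample data, purely for illustration.
posts_by_user = {
    "alice": ["Rewatching Star Trek tonight", "Star Trek has the best captains", "coffee first"],
    "bob": ["coffee and code", "shipping code tonight", "more coffee"],
    "carol": ["garden update", "coffee in the garden tonight"],
}

print(top_terms(posts_by_user, k=2))
```

Terms that every user mentions ("coffee", "tonight") get an IDF of zero and drop out, while terms one user repeats ("star", "trek") rise to the top, which is the "Star Trek must be important for them" effect from the talk.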
30:51 And a couple of times I will miss it, and then I will post a German post with an English declaration or the other way around. 30:58 So I wanted to show you this beautiful chart, which got zero likes when I posted it. 31:04 So here it is again, which illustrates that a little bit. 31:09 So the blue one, it's kind of lateral, but still. 31:14 The blue line is the ratio of non-English to English text. 31:20 So for example, you see this nice spike here. 31:22 That's when all the Brazilians, or rather Portuguese speakers, came in because Brazil shut down X for a while. 31:32 You have this nice slope or whatever you call it. 31:36 This was actually when Bluesky introduced their language detection feature in the front end, right? 31:41 They started asking you, are you actually writing in English? 31:44 So you can pretty nicely see that. 31:47 And then here, this is actually a bug. 31:49 So they had a bug. 31:50 So that's why this is not going up, but this is going up. 31:53 And what's also interesting is this is where the US election happened. 32:00 And you don't see anything because it affected the whole world. 32:04 And everybody was talking about it. 32:05 And the mix of non-English to English speakers did not change in this time. 32:11 I thought this was quite interesting. 32:12 But still, you have like 3% constant errors and I didn't want to have that in my terms, basically. 32:19 I wanted to have pristine English terms or pristine German terms. 32:23 So, okay, oh, I've got one minute left. 32:27 This is where Claude took over. 32:30 That is, my development, not this presentation; I wrote most of this really by hand. 32:36 Embeddings. 32:36 Embeddings are magic. 32:37 It's like a black box. 32:38 You throw a text in, you get a number out. 32:42 It gives you a test. 32:42 This text is here, and this space, this is here. 32:46 So I embedded all the text. 32:48 Still doing that in real time, which is why I need the Macs. 32:51 They do that.
32:52 If you pay for that, like get a cloud solution, you will be poor in one week. 32:59 The naive approach: you are the average of all your posts. 33:03 Had some nice results already, but of course everyone contains multitudes. 33:07 So this is what I use now. 33:10 There is a Divepool discovery feed. 33:12 You can try it out in the Bluesky app. 33:15 And this means I have multiple clusters for you and I compare them to multiple clusters of other users. 33:21 So if you connect to the Divepool feed in Bluesky, I will know you. 33:26 I will know what you posted. 33:28 I can cluster that. 33:29 And I can compare it to the clusters of everybody else. 33:33 So of course, I will skip this. 33:37 The thing missing is the labels. 33:39 I still need labels. 33:40 So with embeddings, I only get the clusters, but when I want to show them to you, I want to say this cluster is about these topics. 33:48 So this is why I still need TF-IDF, but those labels are shit. 33:54 So to get slightly better labels, I embed all the terms, basically all the possible keywords, and I use that information to find the terms that are closest to the clusters. 34:07 And so I guess, thank you for your time. 34:10 Check it out, diveboo.social. 34:12 Just go to the page. 34:13 It's not self-explanatory because I designed it, but you will get a feeling for it. 34:21 Outlook: I might train Qwen at some point. 34:25 And that's it. 34:28 That's my handle. 34:28 Please follow me, ask me everything, anything. 34:32 I'm here on Monday too. 34:33 So for those of you who are still here on Monday, please contact me and reach out.
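[Editor's sketch] The last steps of the talk (one averaged vector per user versus several clusters per user, then labeling a cluster with the keywords whose embeddings lie closest to it) can be sketched as follows. Everything here is an assumption for illustration: the 3-dimensional vectors are made up by hand rather than produced by an embedding model, the function names are hypothetical, and cosine similarity against cluster centroids is one plausible reading of "closest", not a confirmed detail of the speaker's system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def label_cluster(cluster_centroid, term_vectors, k=2):
    """Pick the k terms whose embeddings are most similar to the centroid."""
    ranked = sorted(
        term_vectors,
        key=lambda t: (-cosine(cluster_centroid, term_vectors[t]), t),
    )
    return ranked[:k]

def best_cluster_match(clusters_a, clusters_b):
    """Multi-cluster comparison: best similarity over all cluster pairs."""
    return max(cosine(ca, cb) for ca in clusters_a for cb in clusters_b)

# Hand-made toy "embeddings" (assumption: real ones come from a model).
term_vectors = {
    "star trek": [0.9, 0.1, 0.0],
    "spaceship": [0.8, 0.2, 0.1],
    "sourdough": [0.0, 0.1, 0.9],
    "baking": [0.1, 0.0, 0.8],
}
scifi_posts = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]
baking_posts = [[0.0, 0.0, 1.0], [0.1, 0.1, 0.9]]

# User A has two interest clusters; user B shares only the sci-fi one.
user_a = [centroid(scifi_posts), centroid(baking_posts)]
user_b = [[0.0, 1.0, 0.0], [0.85, 0.15, 0.05]]

# Naive approach: user A as the single average of all posts.
naive_a = centroid(scifi_posts + baking_posts)
naive_score = max(cosine(naive_a, cb) for cb in user_b)
multi_score = best_cluster_match(user_a, user_b)

print(label_cluster(user_a[0], term_vectors))
print(label_cluster(user_a[1], term_vectors))
print(round(naive_score, 3), round(multi_score, 3))
```

In this toy data the averaged vector sits between sci-fi and baking and matches user B less well than user A's sci-fi cluster does on its own, which is the "everyone contains multitudes" point in miniature.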