Jim Calabro 0:37 All right. I'm going to go real fast because we've got a lot to get through. 0:39 But I'm Jim. 0:40 I run the platform team at Bluesky. 0:43 The platform team is the team that runs our infrastructure, our data centers, our cloud stuff, and we write a lot of backend code as well. 0:50 And yeah, we have a lot to get through, so I'm going to go super fast, but hit me up after if you have any questions or want to talk more.

0:55 Who am I? I'm Jim. 0:56 I live in Boston. I run the platform team, as I said. 0:58 I've been at Bluesky for about a year. 0:59 And I'm here today to share some information with you on how we do stuff: our ATmosphere systems, our AppView. 1:06 I want to talk about what's going well, what could be improved, and maybe give some recommendations, or at least food for thought, for you as well.

1:13 All right, let's get into it. 1:14 Who's this for? There are a lot of people here who are in the weeds of atproto, and a lot of people have different needs and wants out of atproto. 1:20 So this is really a talk for people who want to achieve high scale, such as running a big whole-world AppView like the Bluesky AppView. 1:28 Another persona might be running a large fleet of PDSs. 1:32 EuroSky is coming online. 1:33 That's awesome. 1:34 Blacksky. 1:35 There's so much movement on this, and it's really cool. 1:38 Not all projects are in this category, and that's totally rad. It's super awesome. 1:45 And yeah, I'll tell you a little bit about what we do. 1:46 This is just a fact dump, so we're going to get through it real quick.

1:49 PDS fleet. 1:50 We run about 110. 1:51 Wow, that's small. 1:52 About 110 rented bare-metal cloud hosts in US East and US West. 1:57 They range from 16 to 64 cores, depending on what year we spun them up. 2:02 They have 256 gigs of RAM, about a gig up and a gig down, and various disk configs.
2:08 This is kind of fun and goofy. 2:09 I learn about new disk configs every now and then. 2:12 Most of them are XFS set up in RAID 1. 2:15 One of them, I learned, was running ZFS. 2:18 That's a nifty little experiment. 2:19 It's been live for a long time. 2:20 That's cool. 2:21 We also run one, against our will, on Ceph, so that's kind of cool. 2:27 We had an emergency situation where we had to lift and shift a PDS, and all we had was a Ceph array in our data centers. 2:33 That was right after Christmas. 2:35 They cost about $600 a month each; OVH and i3D are the two suppliers that we use there.

2:41 One thing on this: if you are looking to set up a big fleet of PDSs, we are way over-provisioned on these things. 2:48 So here's a btop, and that is not showing up at all, but you can see we're using somewhere around 5% of the CPU and 5% of the RAM. 2:56 So that's like 12.5 gigs. 2:58 It's doing relatively little network I/O as well, even though you do kind of want to have at least a little bit of a beefy setup there. 3:07 Most of it actually is disk storage. 3:09 We're coming up on actually starting to fill up disks, so we're going to have to expand and stuff. 3:13 So you can do a lot with a little with the PDS.

3:16 I'll also say we have about 500,000 users per host. 3:20 And that was my production PDS, by the way; that's dapperling. 3:23 There are 500,000 users running on that box. 3:25 There's no special sauce. 3:26 We're just running the open source code. 3:28 There's nothing behind the scenes, no rate-limit bypass tokens or anything. 3:31 It's just the code. 3:33 I'll put a note on that: we do have our own auth server. 3:36 But our config looks like this. 3:39 We run an HAProxy on each one. 3:41 There are 16 PDS containers. 3:43 Each user has a SQLite database; we back those SQLites up with Rclone once a day, and we use Litestream for live replication.
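To put the fleet numbers above in one place, here's my own napkin math from the figures in the talk (about 110 hosts at roughly $600/month, around 500,000 users per host, 256 GB RAM at roughly 5% utilization); these are my arithmetic, not official Bluesky numbers:

```python
# Napkin math from the figures in the talk (my arithmetic, not official numbers).
hosts = 110
cost_per_host_monthly = 600        # USD, per the talk
users_per_host = 500_000
ram_gb = 256
ram_utilization = 0.05             # rough utilization seen in btop

fleet_cost_yearly = hosts * cost_per_host_monthly * 12
total_users = hosts * users_per_host
ram_in_use_gb = ram_gb * ram_utilization
cost_per_user_yearly = fleet_cost_yearly / total_users

print(f"fleet cost/year:  ${fleet_cost_yearly:,}")        # $792,000
print(f"accounts hosted:  {total_users:,}")               # 55,000,000
print(f"RAM in use/host:  {ram_in_use_gb:.1f} GB")        # 12.8 GB
print(f"cost per user/yr: ${cost_per_user_yearly:.4f}")   # ~$0.0144
```

That last line is the striking one: at this density, hosting an account costs on the order of a penny and a half per year.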
3:53 We have some Redis, some Datadog for monitoring, and Tailscale for network auth. 3:57 It's all fully automated, zero-touch provisioning. 3:59 You just run one Ansible command and, boom, you've got a new one. 4:02 It's really fast to stand up new ones.

4:04 Second major topic is our PoPs. 4:07 A PoP is a point of presence; it's basically a colocation center, a small data center. 4:12 We run our own hardware that we own. 4:13 We bought it. 4:14 We run the relay in our data centers. 4:17 We run the AppView, primarily powered by ScyllaDB, in the data centers. 4:21 We run Discover in the data centers. 4:23 There are a few other things; our search cluster's in there. 4:26 There are two of them, one in California and one in Ashburn, Virginia. 4:30 And there are about 80 very large servers in each that we own and operate. 4:35 We have super duper fast networks, and it's easy to add more. 4:40 Super fast disks, and a lot of them. 4:42 It's really high bandwidth, and we have active-active everything. 4:45 So there are two ISPs and two copies of pretty much everything. 4:48 We want that high degree of redundancy so you can provide really solid service.

4:53 Here are some pictures. 4:54 This is the posting factory. 4:55 You can see me and Austin. 4:56 Austin's over there down in the back. 4:58 My friend Patrick's actually behind Austin. 5:00 Sorry, Patrick. 5:02 But yeah, this is one of them. 5:03 This is what a data center looks like. 5:05 You get a cage that's like your full suite, and in it you put a bunch of racks. 5:09 Here's one of the racks. 5:10 Within the rack you have a couple of switches, then a bunch of compute servers, and then you have two of those, right? 5:15 So two of everything. 5:16 That's kind of what it looks like. 5:19 It's pretty bog standard.

5:21 Next is our AWS account. 5:22 I'm not going to talk too much about AWS, but it's where we run a bunch of singleton stuff.
5:26 Lots of stuff in there, super important. 5:28 Some of it's kind of chill. 5:30 PLC is obviously really important. 5:31 The main bsky.app website (the HTML, the CSS, the JavaScript, the assets) is served out of there, plus a few other things. 5:37 And we have a ton of Postgres in there. 5:39 Postgres is really hard to run. 5:40 It's really annoying. 5:41 RDS is great. 5:42 It's very expensive.

5:44 What's going well? 5:46 PoPs are goated. 5:47 PoPs are the GOAT. 5:48 They're super duper cheap and they're extremely high performance. 5:51 The PDS fleet is working quite well, actually. 5:54 It's really easy to add more servers, so as we're growing, we can just chuck new servers up. 5:59 They're very reasonably priced. 6:01 AWS is AWS. 6:02 It's fine. 6:04 Yeah, RDS is real. 6:05 It's so good. 6:06 It's worth every penny. 6:07 Renting GPUs is quite convenient; we do run a bunch of GPUs up there for various things. 6:11 Besides that, it is very, very expensive.

6:14 Do some very rough napkin math, and I'm really not going to get into this too, too much, but it costs us probably about $800K a year to run the PoPs, amortizing depreciation of the assets over 4 years. 6:26 The roughly equivalent AWS install is literally impossible, because our total switching capacity in the PoPs is just shocking. 6:32 It's crazy. 6:34 You couldn't do our PoP setup in AWS, but if you could, it'd probably be about 10x that, so about $8 million a year. 6:42 And that's with heavily negotiated long-term reservations; probably about $14 million if you were doing on-demand. 6:47 Heavy asterisk: I vibe-coded all that. 6:51 Yeah, vibe finance.

6:53 That being said, the PoPs are an absolute shitload of work. 6:55 It requires deep expertise. 6:57 Austin has gone absolutely crazy trying to redo our provisioning of all this stuff and make it sane and easy to work with. 7:05 It has really slow iteration cycles.
7:07 Once you're getting new hardware, it takes a while to get it online unless you have excellent operational practices. 7:11 Again, kudos, Austin. 7:12 RAM and storage pricing is also up and to the right, unfortunately. 7:16 And so, thanks to Jazz, we bought a bunch of stuff, like, here-ish. 7:22 Yeah, about here. 7:23 So Jazz is the GOAT.

7:28 So what's next? 7:31 I'm running out of time, so I'm going to go fast. 7:33 PoPs: make them easier to operate. 7:35 As I said, Austin's been doing yeoman's work on this. 7:38 Increase our compute density as well. 7:40 Previously, we were basically assigning one service to a box, and oftentimes the service would need less than 1% of one of the CPUs, and each box has like 256 CPUs. 7:50 So I'm going to say the evil Kubernetes word. 7:54 Yeah, trying to improve our density there. 7:57 We're improving our provisioning, making it faster to get new stuff online. 8:00 We migrated our network architecture from a single switch to a spine-and-leaf Clos network topology. 8:05 It's really fun. 8:06 It's essentially a network of networks. 8:08 This is what everybody does as well, or a lot of people do, at least. 8:12 Kubernetes for compute density. 8:14 The net result is higher engineering velocity, robust high-availability systems, and we can actually reclaim a lot of cloud spend and bring that back to our PoPs.

8:22 We're also going to improve the PDS hosting in some way. 8:25 We're still talking about this, but when you have 110 servers that you rent, those servers are bare-metal servers. 8:31 They're ours. 8:32 They're not virtual. 8:33 I literally have an IPMI login on all of those things. 8:37 They all fail independently, and OVH is not sending their best. 8:41 And so as you add more servers, your fleet's mean time between failures drops, meaning your on-call burden goes up a lot. 8:47 You're at the mercy of your hosting providers.
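Why on-call burden scales with fleet size: with independent failures, a fleet's mean time between failures shrinks in proportion to the number of hosts. A quick sketch, where the per-host MTBF (one failure every two years) is an assumed illustrative figure, not a number from the talk:

```python
# Fleet MTBF shrinks linearly with host count under independent failures.
# The 2-year per-host MTBF is an assumption for illustration only.
host_mtbf_hours = 2 * 365 * 24   # assume one failure per host every ~2 years

for n in (10, 110, 500):
    fleet_mtbf_hours = host_mtbf_hours / n          # n independent hosts
    failures_per_month = (30 * 24) / fleet_mtbf_hours
    print(f"{n:>4} hosts: a failure every {fleet_mtbf_hours:7.1f} h "
          f"(~{failures_per_month:.1f} incidents/month)")
```

At 110 hosts under this assumption, something in the fleet fails every week or so, which is why the talk treats it as a real operational cost even when each individual host is very reliable.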
8:49 When a server goes down, we're waiting for like 6 hours to get notice from OVH. 8:53 And in the meantime, it's like, OK, we can restore to a different server, or we're just going to eat it. 8:58 And so that sucks. 9:00 Shared storage also is a big thing that we're talking about. 9:02 SQLite on the server is rough. 9:04 The PDS is like the best case of this, but I am kind of a SQLite hater, so I'm just going to leave it at that, and we will chat. 9:10 Yeah, boo you. 9:12 So I'm thinking about a virtual PDS with shared storage, whatever that looks like. 9:16 I'm just kind of hand-waving. 9:18 And then more interesting PDS implementations are coming online. 9:21 I'd love to talk about it if you have weird PDS ideas.

9:25 Advice or lessons learned. 9:27 I'm going to start with the do-nots. 9:29 You must have two of everything. 9:31 Single points of failure will die, and you will be sad, and your users will be sad. 9:35 Reputation is hard-earned and quickly lost. 9:38 Skip the SQLite layer in your AppView. 9:41 Go with my personal favorite, MySQL, or Postgres if you don't like good things. 9:47 And then only move past it when you're sure you need to. 9:50 Start simple, basically. 9:52 Don't accept local maxima. 9:53 You can do hard stuff. 9:54 You can do great things. 9:55 That's a do-not.

9:56 Now the dos. 9:58 First, think really hard about your data access patterns. 10:00 We've really optimized the absolute daylights out of the Bluesky data plane. 10:05 You want to have this notion of mechanical sympathy: be in tune with your hardware and try to optimize the shit out of it, because you'll pay for it otherwise in dollars and also your sanity. 10:16 Bloom filters are your friend. 10:17 Memcache is your friend. 10:18 Redis is not your friend. 10:21 You should have elastic compute and storage, even on-prem. 10:23 You should use cattle, not pets. 10:24 That kind of comes back to two of everything, right?
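On "Bloom filters are your friend": a Bloom filter is a compact probabilistic set that can answer "definitely not present" or "probably present", which lets a data plane skip most storage lookups for keys that aren't there. This is a minimal illustrative sketch, not Bluesky's actual implementation; the class name, sizes, and DID-style keys are all made up for the example:

```python
# Minimal Bloom filter sketch: no false negatives, rare false positives.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 4096, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as an arbitrary-width bit array

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means present or a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

bf = BloomFilter()
for did in ("did:plc:alice", "did:plc:bob"):
    bf.add(did)

print(bf.might_contain("did:plc:alice"))  # True (inserted items are never missed)
```

In practice you'd size `num_bits` and `num_hashes` from your expected item count and target false-positive rate; the payoff is that a few kilobytes of memory can short-circuit the vast majority of lookups for absent keys before they ever hit disk or the network.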
10:26 You should have very, very thorough observability: be able to answer any question about your systems cold, and be able to do it before you have an outage. 10:34 So here are some recs on that. 10:35 And then finally, one more quick do. 10:39 Build a team that's very strong operationally. 10:40 You can't do it alone. 10:42 And then come talk to us. 10:43 Let's organize. 10:44 Brian posted a while ago about what a NANOG of atproto would look like. 10:47 Let's talk about it. 10:48 I posted a minute ago: L7 BGP. 10:51 Let's go talk to each other and figure out the right way to do this. 10:54 There's no silver bullet. 10:55 It's really hard work. 10:58 That's it.