Webinar Replay: The 4 Biggest Challenges of Scaling Cloud-Native AI Workloads

When working with AI in cloud environments, traditional data provisioning and software testing methods don’t work because of the behavior of AI and LLM APIs. In this Cloud Native Computing Foundation (CNCF) webinar recording, we discuss the top 4 challenges of scaling cloud-native AI workloads, and the solutions developers are turning to instead.

Watch the full webinar:

https://www.youtube.com/watch?v=ANa9CpT4os4

Read the transcription

[00:00:00] Ken: Hi, my name is Ken. And in this webinar, we’re going to talk about one of the most popular topics right now is working with AI, AI models, working with GPTs. And I’m going to focus on the second part of working with these, which is AI. How do we get them up and running part of our applications and rolled out into production? [00:00:24] And there’s a lot of challenges in getting through this process and i’m going to focus on Primarily the the end part of that process. I’ve got some models. I want to get them running in production How do I do that? And how can I figure out if things are working? Well, so You may have noticed in March, 2024, a great paper came out from CNCF about cloud native AI and all of the capabilities that are part of the CNCF ecosystem. [00:00:53] I highly recommend that you take a look at this paper. There’s a ton of great resources in there, and I kind of want to riff [00:01:00] off of some of the things that are in there. And especially like I said at the end part, now I’ve got some ideas and I want to get them up and running in production and I’ll share an open source example of how I do this as part of the process. [00:01:14] So for a little bit about me, I’m one of the co founders and CEO of SpeedScale. We’re a CNCF member. And I’ve presented at KubeCon and we’ve attended KubeCon for a number of years, and we primarily work with customers who have workloads that are in Kubernetes environments and trying to figure out how to how to improve the performance of their own code. [00:01:37] And you might say, Hey, Ken, what are your own personal qualifications? Well, I consider myself a prompt engineer, kind of like everyone nowadays. I’ve familiarized myself with working with a number of the different AI tools. You can see here from Perplexity, where they actually have a lab, by the way, you can go and use and test out a bunch of the different models. [00:01:57] Chat GPT. [00:02:00] It’s pretty well known from open AI as one of the first really easy to use models, you can go and start to ask it questions and that kind of thing, as well as tools like anthropic and you know, Google’s models, there’s a ton of different things and a lot of these offer a free tier you can sign up and easily start using and working with them, but in all seriousness, I’ve spent most of my career. [00:02:25] In the application performance area, helping companies figure out why is this running slow? Where are the slow parts? How can we improve it? How can we load test and validate that things are running better? So one of the things I definitely recommend you drill into and do some research on is hugging face. [00:02:44] They have a free account, so you can go and sign up and you’ll find really quickly, there’s hundreds of thousands of models that are already out there. This is a screenshot from the this, the leaderboard where they’re saying, Hey, these are some of the best models for a variety of different tasks. [00:03:03] And you can see how fast this space is moving. So just in the past six to eight months, things have improved from 50 percent to over 80 percent and getting close to human levels of intelligence. Obviously you want to take advantage of this and say, What gen AI capabilities can I take and add to my own application? [00:03:24] And I recommend take a look at some of these models. These were the trending models at this time in the kind of April 2024 when I’m doing this webinar, but it’s always changing over time. So you want to have flexibility in how you work with these. So when you start working with AI models, And you get some feedback and you go, okay, I got some ideas. [00:03:46] I want to add a recommendation system. I want to add some generative capabilities to my product. You’re going to find there’s a, there’s actually a pretty long process for doing this kind of thing from making sure the data you give the models is [00:04:00] really clean and doesn’t have the wrong information in it. [00:04:03] Training the models is where almost all of the information you’ll find about AI is about, is about training. And I’m going to cover that real briefly. There’s not as much about model serving, or you might hear it called inference, which just means running it, making sure that it’s, it stays up and running and that kind of thing. [00:04:22] I’m going to focus a lot in this webinar about model serving. And of course, once you’ve got something it’s up and running. Is it running properly? Is it crashing? Is it erroring for people? And there’s new ways to implement monitoring observability for these kinds of AI models. So drilling into each one of these data preparation is super critical. [00:04:45] If you feed garbage data into your models, they’re going to give you garbage results. This is in a couple of different areas. So one is the, the data you use to train the model, and you obviously don’t want to train it on bad data, wrong responses, [00:05:00] that kind of thing, but also the way you prompt the model. [00:05:02] So take time to test out a bunch of different prompts, try different variations on how you send data to it so that you can get the best kind of results. I highly recommend researching a capability called RAG. Okay. Where you take an off the shelf model that already is good at a variety of different tasks and you augment it with your own proprietary data. [00:05:26] So this is kind of a sweet spot where you don’t have to take the huge expense of training the model yourself. But you can take something that already works really well and add your own responses to it. This can help you cut down on hallucination and weird responses that come out of the middle of nowhere. [00:05:43] So model training, this is obviously really well known as the big challenge nowadays. Everyone in the world is fighting over the GPUs. Everyone’s trying to get access to these kinds of things. Personally, I skipped some of this. If there’s 500, 000 models on [00:06:00] Hugging Face, maybe I don’t need to train my own model from scratch. [00:06:03] Maybe I can take someone else’s model and just rag my responses into it. So this lets you get something up and running. Faster. I’m all about moving fast, getting an MVP going. And so I’ll be honest, I skipped this whole stage. And for most people, you don’t need to train your own model from scratch, unless that’s your business model serving and getting this thing up and running. [00:06:29] You’ve got to figure out how do I package it? How do I build a container? What kind of infrastructure do I need to use? So it turns out running the models might require a GPU, but way less than training it, and you’ll quickly find one of the challenges is you have a lot of micro services that are calling your model and everyone has this new dependency. [00:06:50] So you can use a technique called service mocking where you record the responses that come across this API and you create [00:07:00] a mock, Which will repeatedly send those same kind of responses back. This is a way you can provide these mocks to the development teams so that they get what looks like realistic responses without necessarily having to have everyone run a giant model on their own machine. [00:07:17] That’s not always feasible. So. I highly recommend that you take a look at how to do service mocking. I’ll try to show some examples as part of this as well. And then monitoring and observability you spent all this time to get the model into production. Is it crashing? Is it, is it running really slow? [00:07:35] So you may be familiar with SRE golden signals that came out of Google’s SRE handbook. They are, what is the latency of this service? What is the throughput that it can handle that the saturation, how much infrastructure is required to run it? And the fourth one being the errors. Is it even responding by the way, errors sometimes have really good performance that responds really fast with an error message is not that [00:08:00] helpful. [00:08:01] In addition for AI models, there’s a couple of different things that you’re going to want to include. Are the answers accurate? So you may have it returning really inaccurate answers. This is actually the most common thing you see coming from the AI folks. When they talk about performance, they’re actually talking about prediction accuracy. [00:08:19] That’s great. And another one you want to include is how many tokens are used. There’s a good correlation between the more tokens, the higher the latency. So you’re gonna want to work on that and tinker with it. What’s the smaller number of tokens that you can use for your query so that you can get good latency you know, without breaking the bank. [00:08:40] So I’m all about applied engineering, actually building and running these things. So what I did for myself is I designed an experiment. Let me go and get a Kubernetes cluster, stand it up and start to deploy some of these things. I selected from Hugging Face, a container called [00:09:00] TGI, which I’ll show you. [00:09:01] It is a way that you can run these models and it’s just wrapped in a Docker container. So I put that in a Kubernetes manifest. You need a cluster that you can run these things in that has a GPU, setting up your node groups takes a lot of time for starters. I used an autopilot cluster that way. It says, Hey, this, this workload needs a GPU. [00:09:23] It spins up a GPU. No, the other workloads may not. And it spins up a regular arm or X 86, whatever kind of node that’s required again, this helps you get things up and running experiment, testing out the model so you can get feedback. So I have an open source project that I’ll share here. And it’s got a couple of different containers. [00:09:46] There is a react user interface that’s being served off of Nginx. There’s an ingress into the cluster. So you can see it from the browser. The backend API is written in node, node JS. [00:10:00] A lot of the examples are either in Python or node JS. I like working with a node. Now I’ve got kind of the same node code for my front end and for my backend. [00:10:09] And then, like I said, the TGI is coming from hugging face. So there’s an existing container that we can take advantage of, and I’ll show you how you flag the manifest to say, Hey, this needs an NVIDIA GPU. So let’s jump in and take a look at this and I’ll show you how you can run these in your own cluster. [00:10:33] Okay. So this is the documentation page for the TGI project from hugging face. It’s a pretty active project. You can see here, thousands of GitHub stars from HuggingFace. And it’s a great way that you can come and run these open source LLMs. I highly recommend it as a way to get started and get something running. [00:10:57] There’s a in addition to this, by the way, it, it [00:11:00] already has a capabilities like we talked about monitoring and observability. And it also is set up for servers, server side events. That’s where you see the responses get streamed back to the user, kind of one token at a time. And it can enable you to have instant data. [00:11:18] The first part of the response comes back fast, and then the entire response is sent over time. So you can see by the way the details on the different models that are available from this. So for my own testing, I was, I’ve been using Mixedrel. There are so many models available here that There’s, there’s plenty that you’re going to be able to work with and figure out and get working for your application. [00:11:44] The API on how you call it is in the current versions, which is the one I’m using, it looks like the V1 chat completions and point that has the same shape as you see from open AI. So this makes it [00:12:00] easy. So that you can switch the different backends out. And I highly recommend that you test different backends. [00:12:06] What does it happen if I use chat GPT? What happens if I use Anthropic? What if I want to call a TGI? All of these can be done without having to rewrite your code. So speaking of the code, let me show you my GitHub repo here. So this is a really simple project just to show you how to get things up and running here are the components that are part of it. [00:12:31] So the UI code in the, in the UI directory is a little react GUI. Okay. And it’ll, it’ll get built into a container. The API tier is a express Node. js app. When it gets an API request, it goes and hits the backend TGI system. And because I want to keep this information around, I hooked a little database up to it [00:13:00] right now. [00:13:00] The database is sort of self contained. Inside the container in a future version, I might break this out, but I can hold all of the responses and details like the the latency of the response and how many tokens were used. And then to deploy this, I’ve got a couple of Kubernetes manifests. You can see, this is pretty simple. [00:13:20] There’s I’m using customization. And so there’s just three. This is very vanilla stuff service and a deployment, but this is the one that’s probably the more interesting. How do you run the TGI? Well, this is, this is pretty simple here. You put in the image details. I’ve, because I’m using an autopilot cluster, I need to define everything. [00:13:43] The CPU memory storage required. And I just say, give me one GPU. And then you can pass in the model that you want to. Use. So here is the specific model that I’m running. You might need to put in your hugging face [00:14:00] token. So that’s, you’ve got to figure that out for yourself. Set up a token, by the way, you can create a free account and then add this node selector down here if you’re running it in GKE, which I am so that you make sure that you get a GPU that’s created. [00:14:16] As part of this and that’s it. And when you go and deploy this, it will spin up in in your cluster. And one thing to note is it does take a little bit of time for it to download the model. I recommend. When you productize this, you’re going to want to put you know storage associated with it because you can store the model so it doesn’t have to download it. [00:14:39] That kind of thing. Again, this is the quick and dirty fastest way to get something up and running for proof of concept. And for the other components, you’ll see that I’m, I’m building containers for the AI sorry, for the API and for the UI, then these are I’m also building them with GitHub actions. [00:14:59] So [00:15:00] again, it’s a, you can do this on your free account and then put a service in front of it and then I’m hosting this right now on one of the domains I have traffic replay. com. And so this, this part setting up the ingress that’s on you, if how do you want to get the interface deployed? [00:15:20] So here’s what it looks like. My little. demo app. If you hit my URL, by the way, you need to log in as John Doe. The password is in my repo. If you can’t find it, just drop me a note or ask me a question on the GitHub repo so you can come in and say, Hey, generate a brief poem. About you know, kubernetes, okay, and give it a number of tokens, like a hint on how, how many tokens to use a small number of tokens. [00:15:52] Like I was saying earlier, should respond a little faster. So okay, it went and created my poem for me. [00:16:00] And you can actually see here, the poems getting cut off. So 50 tokens is not enough. It, I told it how many tokens to use my prompts required 16. And this took over three seconds. So let’s try another poem. [00:16:16] And you know, we’ll do this again, write a poem about Kubernetes, but we’ll give it more tokens and see how this one works. So. Obviously it’s going to take a little bit longer because I’ve given him more tokens. And by the way three seconds is a pretty long response time here. This one is not you know, it’s synchronous, so it is going to wait for the entire response to come back. [00:16:38] That took 12 seconds and it actually still, it still ran out the quality of the poem, I will leave up to you to decide, but you can see a really direct correlation here between. The number of tokens that it uses and you know, the response time. This has not been tuned for performance or anything.[00:17:00] [00:17:00] That is for a later stage. We’ve got to start by at least getting visibility into it. So let me try, you know, one more poem about Kubernetes. And we’re going to say like, go crazy, give this a ton of tokens. So 25, so now we can see, we just got some kind of error. And this is the most common thing in the world, right? [00:17:22] What something happened in my. Environment here, and I don’t know what it is, right? So this is where API observability and understanding what’s going on comes in. So I did go ahead and hook this up to SpeedScale. So I can see the different calls that are happening in the environment here. So I can see, You know getting the list of poems I got a, a 304 response means it hasn’t changed. [00:17:50] Here is that specific poem that was sent. This is exactly how my the, the TGI model [00:18:00] responded and how long it took and again, the, the tokens and that kind of thing. So this is the, this is the inbound call. That went to the API here. I can see the outbound call that was where how TGI responded in the specific details. [00:18:19] And I can actually see also the prompt that was used very basic you know, for, for this example, and then you know, moving up, we can see our one that took 12 seconds and the specific details about that. And then you know, interestingly, this. This other one where it was trying to trying to post a that, that other one we don’t know what the error was. [00:18:45] So we can see here actually the backend replied with a 422. So this is that, this is coming from that hugging face TGI server. Input validation error. You’re [00:19:00] not allowed to send a 2, 500 tokens. That’s too many. So this is the kind of thing that is not always obvious to a developer, to someone who’s working with this system, what’s going on in the environment. [00:19:12] So tracking these kinds of details is really important. And if as a developer, if I want to try to work with these APIs, again, it’s kind of hard to get this up and running. in my environment. So using a tool like speed scale or another capability for service mocking, you can come and save this data. And simply easily build a mock that will allow you to run this locally on your own machine. [00:19:43] And you can see the mock responses. Include the successful responses of what, you know, when when, when it was properly returned as well as the error conditions and this lets developers. Easily [00:20:00] work locally without having to deploy giant GPUs and spend a ton of money on infrastructure because these can be run just with a single command line let me show you you know, as, as simple as this can be set up, it’s just running it on your desktop like this. [00:20:16] With with a single command, go and run this locally on my own machine. Let me hook my no JS code up to that and develop and iterate so that I don’t get that server communication error or whatever kind of thing. But I handle it in a cleaner way and show that response back to the user. So this was just a quick overview of some of the capabilities in the cloud native space. [00:20:42] And please take a look at my GitHub repo. Open you know, comments, send me some issues and this kind of thing. So we can work together. I would love a chance to help out. I’m always interested in what folks are doing around cloud native AI models nowadays and getting them up and running in [00:21:00] Kubernetes as easy and fast. [00:21:02] And this is definitely something you can check out. So if you have any questions, feel free to reach out. Thank you very much.

Webinar Replay: The 4 Biggest Challenges of Scaling Cloud-Native AI Workloads

Watch the full webinar:

Read the transcription

Table of Contents

Ship AI-generated code with confidence