As we approach the end of the year, much has changed and evolved in the world of AI infrastructure since we published our last report on the state of serverless GPUs six months ago. That guide drew attention from across the globe, from helping developers make an informed decision before choosing a serverless provider to gaining significant traction on Hacker News. We are publishing a new version because the space remains highly dynamic, with every provider racing to build a better product, so it's worth sharing and collating what is working well and what isn't.
Without spoiling the results, the improvements made in the space are exciting. Our analysis also covers what we have learnt so far from interviewing hundreds of engineers deploying machine learning models in production, along with insights captured from users who deploy to production with us.
How have user needs evolved when looking for serverless GPUs?
In our first article, we covered what users look for when it comes to serverless GPUs. Six months later, with a more mature market and deeper insights on our end, here is a refresher on those user needs:
Reliability: Even though this is a new category with only a few innovative startups in the domain, let's not forget that users will not run production workloads on a broken solution. Production workloads demand stable platforms, and inconsistencies can be a deal-breaker for everyone from startups to large enterprises.
Cold Start Performance: Real-time ML Infrastructure demands cold starts within a 3-5 second bracket. For other dynamic applications, a range of 5-10 seconds is deemed acceptable. It's not just about speed; stability, especially at scale, is crucial.
Developer Experience: The deployment of ML models is continuous, and developers want seamless experiences. Features in demand include CLI-based deployments, easy integration with model repositories, and transparent logging and monitoring mechanisms.
Apart from these, other expectations remain the same: effortlessly scalable infrastructure (0→1→n and n→0), multiple-model support, cost efficiency, security, and so on.
What is "True Serverless"?
"True Serverless" captures the essence of on-demand computing without the burden of infrastructure management. Unlike traditional always-on servers, serverless platforms like AWS Lambda provision resources only when a request arrives, handle the task, and then spin down. This efficient, pay-as-you-go model can, however, lead to "cold starts": latency incurred while machines spin up for sporadic workloads. The effect is particularly noticeable for GPU workloads, which large providers like AWS Lambda currently don't support, presenting a challenge for high-performance needs.
Methodology For Testing Performance
We tested Runpod, Replicate, Inferless, and Hugging Face Inference Endpoints, the platforms we have seen gaining popularity in the "serverless GPU" domain. We took two custom models:
For each model, we chose machines based on what was available on each platform.
We tested these models on three parameters:
Cold-start
Testing Variability
Autoscaling
We collected cold-start and variability data with Postman, and autoscaling data with Hey.
Cold-Start
First and foremost, we tested cold starts across all the platforms. Cold-start time, calculated as latency minus inference time, represents the delay incurred while initializing a dormant serverless function. To learn more about it, check out this tweet.
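For reference, here is a minimal sketch of that calculation, assuming the endpoint returns (or you otherwise log) the pure inference time alongside the end-to-end latency. The endpoint URL and the "inference_time" response field are placeholders, not any specific provider's API.

```python
import time

import requests


def measure_cold_start(endpoint_url: str, payload: dict) -> float:
    """Estimate cold-start time as end-to-end latency minus reported inference time.

    Assumes the endpoint returns its pure inference duration in the JSON
    response under an "inference_time" key; adjust to your platform's schema.
    """
    start = time.perf_counter()
    response = requests.post(endpoint_url, json=payload, timeout=600)
    latency = time.perf_counter() - start  # total request latency, in seconds

    inference_time = float(response.json().get("inference_time", 0.0))
    return latency - inference_time  # the remainder approximates cold-start overhead


# Hypothetical usage against an endpoint that has scaled to zero:
# cold_start = measure_cold_start("https://example.com/v1/infer", {"prompt": "hello"})
# print(f"Estimated cold start: {cold_start:.2f}s")
```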
Variability
We then measured cold starts across 5 different days at different times to check whether the numbers hold up:
This graph displays the performance variability for two models across different providers. The blue bars show the range of performance fluctuations for the "Llama 2 - 7Bn" model, while the red bars represent the "Stable Diffusion" model. The annotated numbers on each bar indicate the exact difference in seconds between the highest and lowest performances.
Autoscaling
Now let's talk about another very important part of serverless offerings: linear autoscaling. Let me explain with an example. Say you are running a customer chatbot for a food-ordering app and you get bursts of traffic around meal times. To simulate this, we looked at what happens when an endpoint receives 200 requests with a concurrency of 5.
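If you want to reproduce the test, here is a minimal Python sketch equivalent to the Hey run we used (roughly `hey -n 200 -c 5`); the endpoint URL and payload are placeholders.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "https://example.com/v1/infer"            # placeholder endpoint
PAYLOAD = {"prompt": "I want to order a pizza"}      # placeholder payload
TOTAL_REQUESTS = 200
CONCURRENCY = 5


def timed_request(_: int) -> float:
    """Send one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    requests.post(ENDPOINT, json=PAYLOAD, timeout=600)
    return time.perf_counter() - start


# 200 requests with at most 5 in flight at any time, mirroring `hey -n 200 -c 5`.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, range(TOTAL_REQUESTS)))

print(f"median: {statistics.median(latencies):.2f}s")
print(f"p95:    {latencies[int(0.95 * len(latencies)) - 1]:.2f}s")
```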
The boxplot displays latency distributions across platforms. The central line in each box indicates the median latency, while the box itself spans the interquartile range (IQR). Whiskers extend to show the data's range, with individual points marking potential outliers. This helps gauge the consistency and range of each platform's performance under varying loads.
You can check all the timestamps for cold starts and autoscaling in the file here.
Technical Benchmarking: A Comparative Performance Review
When evaluating the performance of serverless platforms based on data points related to latency and request handling, here's a comprehensive breakdown of the findings:
Hugging Face:
- Integration: Exemplary for models hosted on Hugging Face. However, the lack of support for external models is noticeable.
- Cold Start & Variability: Hugging Face has a 15-minute scale-down time, and initial requests often fail.
- Auto-scaling: Performance isn't at its peak. Requests sometimes face throttling, which likely explains why latency rises roughly linearly with the number of requests in our data set.
Replicate:
- Integration: Slightly more intricate than the other platforms.
- Cold Start & Variability: There's noticeable inconsistency in cold starts, and it lacks features to keep endpoints warm.
- Auto-scaling: The data shows Replicate starting with higher latency than some competitors, and as request volume grows, latency drifts further from the initial cold-start numbers.
Runpod:
- Integration: The platform shines with its user-friendly queue system, allowing seamless container integration and endpoint creation.
- Cold Start & Variability: Since models are loaded into workers beforehand, you get consistent cold starts initially, but conceptually this approach can become a bottleneck if your workloads are highly unpredictable.
- Auto-scaling: Scaling is not strictly linear, but scale-down configurations are managed effectively. Runpod's data is revealing: it starts with some of the lowest latencies in the comparison, but as request numbers mount, latency fluctuates.
Inferless:
- Integration: Integration with Hugging Face models is straightforward. However, for other model types, setting up the interfacing code takes more work. The setup relies heavily on the UI, but once linked with GitHub, CI/CD becomes available. A change in input/output signatures requires a UI-based re-deployment.
- Cold Start & Variability: Inferless achieves the most consistent cold starts of the platforms we tested, offering predictable startup.
- Auto-scaling: Inference is smooth, without significant hitches. The data is interesting: latency stays at a consistent level even under a high volume of requests, with only minor variations as request counts change.
Reflecting on the advancements in the space over the past half-year is invigorating. We've witnessed sweeping improvements across all metrics. The most notable change has been in cold-start durations, which have become significantly more efficient and cost-effective.
Decoding Serverless Pricing: A Quick Guide
Serverless architectures let you pay only for what you use, with costs typically based on actual function runtime. However, pricing can vary between providers. To help you understand, let's break it down using hypothetical scenarios.
Scenario 1 - Llama 2-7Bn
You are looking to deploy a Llama 2-7Bn model for a document processing use-case with spiky workloads.
Inference Time: All models are hypothetically deployed on A100 80GB, taking 12 seconds per document across platforms.
Scale Down Timeout: Uniformly 60 seconds across all platforms, except Hugging Face, which requires a minimum of 15 minutes. This is assumed to happen 100 times a day.
Key Computations:
Inference Duration: On an A100 (80GB), processing happens at 100 tokens/second, translating to 12 seconds/document. Daily requirement: 12 seconds x 1,000 docs = 12,000 seconds (~3.33 hours)
Idle Timeout Duration: Post-processing idle time before scaling down: (60 seconds - 12 seconds) x 100 = 4,800 seconds (~1.33 hours)
Cold Start Time: Specific to each platform. Calculated as: Cold start time x Number of scale-down events (100).
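To make the arithmetic above concrete, here is a small sketch of how a daily bill can be estimated from those three components. It assumes inference, idle, and cold-start time are all billed at the same per-second rate, which is a simplification; each provider's actual billing rules may differ. The rate and cold-start value in the usage example are illustrative only.

```python
def daily_bill(
    rate_per_sec: float,       # platform's per-second GPU rate in USD
    inference_sec: float,      # total inference time per day (e.g. 12 s x 1,000 docs)
    idle_sec: float,           # total idle time before scale-down (e.g. 48 s x 100 events)
    cold_start_sec: float,     # cold-start time for this platform, per scale-down event
    scale_down_events: int = 100,
) -> float:
    """Approximate daily cost: billable seconds multiplied by the per-second rate.

    Assumes inference, idle timeout, and cold starts are billed at the same
    per-second rate; real billing rules vary by provider.
    """
    billable_seconds = inference_sec + idle_sec + cold_start_sec * scale_down_events
    return billable_seconds * rate_per_sec


# Hypothetical example with the Scenario 1 totals and an assumed 10 s cold start:
# print(daily_bill(rate_per_sec=0.000552777, inference_sec=12_000,
#                  idle_sec=4_800, cold_start_sec=10.0))
```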
TL;DR Pricing Comparison for Scenario 1
Let’s now dive into the pricing for various platforms:
Hugging Face: A100 per-minute rate: $0.108333333 (billed by the minute). Note that their minimum scale-down delay is 15 minutes.
Total Bill/Day: $39.957
Replicate: A100 pricing: $0.001400/sec. Note that we evaluated Replicate on their new plan, which now charges for cold boots/cold starts and timeout time as well.
Total Bill/Day: $41.44
Runpod: A100 pricing: $0.000552777/sec (availability may be limited)
Total Bill/Day: $11.4
Inferless V1: Shared A100 pricing: $0.000745/sec. Note that Inferless also provides fractional GPUs, which worked fine for both of the models we deployed.
Total Bill/Day: $13.964
Inferless V2: Shared A100 pricing: $0.000745/sec. As with V1, fractional GPUs are available and worked fine for both of the models we deployed.
Total Bill/Day: $13.919
Scenario 2 - Stable Diffusion
For the image processing (stable diffusion) use case, while the computational demand remains unchanged, only the number of processed items and cold start times differ. Instead of 1,000 documents, we're considering 1,000 images daily.
TL;DR Pricing Comparison for Scenario 2
Key Computations:
Inference Duration: Unchanged from the document scenario.
Idle Timeout Duration: Remains consistent with the document scenario.
Cold Start Time for Images: Calculated as: Cold start time x Number of scale-down events for each platform. Find the calculation below:
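The same hypothetical daily_bill helper sketched under Scenario 1 applies here; only the per-platform cold-start input changes. The rate and 15-second cold start below are placeholders, not measured values.

```python
# Re-using the hypothetical daily_bill() helper from Scenario 1 with
# Scenario 2 inputs: same inference and idle totals, different cold start.
print(daily_bill(rate_per_sec=0.000745, inference_sec=12_000,
                 idle_sec=4_800, cold_start_sec=15.0))
```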
Summary:
Our foray into the "True Serverless" domain has underscored the possibilities of on-demand computation that phases out the complexities of infrastructure management. Platforms like AWS Lambda pioneered this field, yet their lack of GPU support constrains more resource-intensive operations. The "cold starts" challenge is particularly pronounced, bringing latency concerns, especially amid erratic workloads. It's noteworthy that many serverless solutions still don't cater to GPU-intensive tasks, limiting their utility for demanding applications. It's also essential to realize that these evaluations are grounded in a purely serverless context: directly comparing performance, cost, and so on between a serverless environment and a containerized approach isn't apples-to-apples. Embracing a serverless paradigm isn't just about performance; it's also about realizing substantial cost savings. As we delve into specific platforms and metrics, it becomes clear that the serverless horizon is bright, but choosing the right path is imperative for achieving the best outcomes.
Note: We're always open to feedback and updates to enhance the accuracy of this analysis. Our goal is to provide a clear picture of the serverless GPU landscape, not to favor or discredit any provider. Your insights are crucial, and we appreciate your contributions in advance.