The State of Serverless GPUs

Delving into latency, variability, billing strategies, user experience, and advanced capabilities - comprehensive findings and guiding principles for the ideal serverless GPU option.

Aishwarya Goel & Nilesh Agarwal

April 10, 2023

17 mins

Executive Summary

The demand for high-performance computing resources is soaring due to rapid advancements in artificial intelligence and generative models, particularly for efficient GPU-based solutions that accelerate complex AI tasks. As serverless computing gains traction for CPU utilization, we predict a similar trend for GPUs.

This benchmarking report evaluates the current landscape of serverless GPU inference platforms for AI-driven organizations. We assessed major providers against user needs and highlight critical factors for selecting a serverless GPU platform. Our analysis includes a thorough comparison of each provider's advantages and disadvantages, enabling businesses to make well-informed decisions that align with their unique requirements.

This report serves as a valuable resource for generative AI companies seeking to leverage the potential of serverless GPU inference platforms, aiding them in navigating the ever-evolving technology landscape with confidence.

Update: We have published Part 2 of The State of Serverless GPUs. You can read it here

What do users look for in an ideal serverless GPU offering?

So far, we have interviewed hundreds of ML engineers and data scientists. Beyond trust, system reliability, and integrations, the specific features users look for in an ideal serverless GPU offering are:

  1. Cost Efficiency: Many organizations run at less than 50% GPU utilization yet pay expensive hourly or contractual rental fees. Serverless platforms scale GPU resources dynamically, so users pay only for what they use, significantly reducing average monthly spend (see the cost sketch after this list).
  2. Model Support (Multiple Frameworks): Users require support for various model frameworks, such as ONNX or PyTorch, depending on their organization's needs. An ideal platform should support all major frameworks, avoiding user friction caused by forced conversions or limitations.
  3. Minimal Cold Start Latency & Inference Time: Low cold start latency and low inference time are critical aspects for optimal user experiences, except in batch processing or non-production environments. An ideal platform should offer consistently low cold start latency across all calls or loads.
  4. Effortless Scalable Infrastructure (0→1→n) and (n→0): Configuring and scaling GPU infrastructure can be a complex and time-consuming process. An ideal platform should be able to automate scaling, requiring minimal user input beyond setting limits or billing parameters.
  5. Comprehensive Logging & Visible Metrics: Users need detailed logs of API calls for analyzing loads, scaling, success vs. failure rates, and general analytics. An ideal platform should offer options for exporting or connecting users' observability stacks.
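To make the cost argument in point 1 concrete, here is a back-of-the-envelope comparison of an always-on GPU rental versus pay-per-use billing. The hourly and per-second rates and the utilization figure are illustrative assumptions, not quotes from any provider reviewed below.

```python
# Rough comparison: always-on GPU rental vs. pay-per-use serverless billing.
# All rates and the utilization figure are illustrative assumptions.
ALWAYS_ON_RATE_PER_HOUR = 3.00       # e.g. a dedicated A100 40GB instance
SERVERLESS_RATE_PER_SECOND = 0.0005  # a typical per-second serverless GPU rate

hours_in_month = 730
utilization = 0.40                   # GPU busy 40% of the time

always_on_cost = ALWAYS_ON_RATE_PER_HOUR * hours_in_month
busy_seconds = hours_in_month * 3600 * utilization
serverless_cost = SERVERLESS_RATE_PER_SECOND * busy_seconds

print(f"Always-on:  ${always_on_cost:,.2f}/month")
print(f"Serverless: ${serverless_cost:,.2f}/month at {utilization:.0%} utilization")
```

Under these assumptions, the always-on instance costs roughly $2,190 per month while paying only for busy seconds comes to roughly $526, which is where the cost-efficiency argument comes from.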

By addressing these key considerations, serverless GPU platforms can deliver high-performance, cost-effective solutions that cater to the diverse needs of organizations relying on AI and generative models.

Key Findings on Serverless GPU Options in 2023

Several startups have entered the serverless GPU market, each attempting to address specific pain points for both training and inferencing. Despite the growing demand for GPU resources, major big tech players have yet to offer serverless GPU solutions. Given the high costs associated with underutilized GPUs, users and companies are eagerly seeking more economical alternatives.

This study examines five companies pioneering serverless GPU offerings, revealing key findings after thoroughly stress-testing their products:

| Provider | Founded | Website |
| --- | --- | --- |
| Beam.cloud (prev. Slai.io) | 2022 | https://www.beam.cloud |
| Banana.dev | 2021 | https://www.banana.dev/ |
| Replicate | 2019 | https://www.replicate.com/ |
| Pipeline.ai (Mystic) | 2019 | https://www.pipeline.ai/ |
| Runpod | 2020 | https://www.runpod.io |
  1. True serverless computing products remain scarce in the current market
    Some providers only partially manage infrastructure and scaling, resulting in unclear scaling behavior and unpredictable costs. Others queue requests instead of offering immediate processing. Developing a truly serverless solution tailored to customer needs remains a challenge.
  2. No product currently delivers an exceptional user experience
    Crafting an outstanding user experience for serverless computing has proven difficult. While startups are striving to create serverless GPU platforms, aspects like cold start times, latency, autoscaling, and reliability still require refinement. Product enhancements have not been the primary focus, and it may take 6-12 months before more advanced features become available.
  3. Seamless infrastructure scaling is the most significant challenge for market players
    Platforms encounter numerous technical issues when attempting to scale up or down to accommodate increased loads. Such issues become more prevalent during peak times, causing users to be reluctant to adopt these solutions for production workloads.
  4. Emphasis on technical metrics, specifically lowest cold start time and lowest inference time, is crucial
    There is a considerable gap between simply offering inferencing or serverless capabilities and establishing a technical edge. Many companies have yet to develop reliable metrics for essential factors such as latency, cold start times, and scalability in large-scale deployments. As a result, extensive experimentation and exploration of various architectures are underway to identify the most effective solutions.
  5. Cost transparency and accountability are insufficient among current serverless providers
    A notable number of serverless providers lack transparent cost calculators, leading to uncertainty and concealed fees. Users may also be billed for inefficiencies or scaling delays and for keeping machines active. To enhance transparency and accountability, a standardized and transparent billing system is indispensable.

These findings underscore the challenges and opportunities within the serverless GPU market. As companies continue to innovate and refine their offerings, we anticipate more robust, user-friendly, and efficient solutions to emerge, enabling users to harness the full potential of serverless GPU platforms confidently.

TL;DR: Analysis and review of all the current players.

If you wish to go through the summary sheet, you can check out the analysis below:


Product 1: banana.dev

A brief overview of Banana.dev

Banana.dev is a platform that simplifies deployment of machine learning models with robust inference endpoints, scalable infrastructure, and a cost-effective pay-per-second billing model. It offers templates for popular models and one-touch deployment for open-source ones. Users are incentivized to share their models, promoting a collaborative environment. To import a model, users create a file using a template and import it from their GitHub repository, making deployment and management more accessible and user-friendly.
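As a rough illustration of the template-based workflow described above, a handler-style app file generally looks something like the sketch below. The load-once/serve-many split is the common pattern, but the function names and signatures here are illustrative assumptions, not Banana.dev's exact template.

```python
# app.py — illustrative handler-style file for a template-based deployment.
# The init()/inference() names are assumptions; check the provider's template.
from transformers import pipeline

model = None

def init():
    """Load the model once when the container starts (this is the cold-start work)."""
    global model
    model = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")

def inference(model_inputs: dict) -> dict:
    """Serve a single request with the already-loaded model."""
    prompt = model_inputs.get("prompt", "")
    result = model(prompt, max_new_tokens=50)
    return {"output": result[0]["generated_text"]}
```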

What did we like about Banana.dev?

  1. Serverless, pay-per-second billing with an hour of free credit.
  2. Developer-friendly: GitHub integration, templates, and simplified process.
  3. Quick setup: takes less than 3-4 hours.
  4. Community features for model-sharing and collaboration.
  5. Transparent and engaged: shares roadmap, feature requests, and bug list, and maintains active social media presence.

What gaps did we find in Banana.dev?

  1. Billing for platform-induced delays/issues:
    - Users are charged for cold start time, even when it exceeds 200 seconds.
    - Billing data lacks granularity, making it difficult to understand costs.
    - Users incur additional costs due to server fluctuations or system issues.
  2. Significant variability in cold start and inference times:
    - Banana.dev is best suited for batch processing or for users who can tolerate longer cold start times and potential downtimes. 
    - Inference and cold start times are unpredictable, with a minimum of 5 seconds for models under 100MB.
  3. Auto-scaling challenges:
    - Autoscaling is not optimized, and machine provisioning lacks clarity.
    - The platform struggles with large models; for instance, a 10GB model failed to deploy despite a claimed 16GB support.
  4. Limited logging and monitoring capabilities. The platform does not provide integration or export options for metrics and logs, nor does it support integration with observability tools.
  5. Restricted to GitHub uploads in a specific format. Users with different formats or repositories must undertake additional preparation before uploading their models, which can be cumbersome and time-consuming.

Pricing

GPU usage is billed at $0.00051992 per second, which is equivalent to roughly $1.87 per hour. This is significantly cheaper than the ~$3 per hour that cloud providers typically charge for an A100 40GB machine.

Technical benchmarking

We tested the platform with three widely used models, listed below:

| Name | Model | With Cold Start | Inference | Variability | Model Link |
| --- | --- | --- | --- | --- | --- |
| Banana.dev | GPT-Neo (1.3B) | ~64 sec | ~3 sec | Slightly Variable | https://huggingface.co/EleutherAI/gpt-neo-1.3B |
| | GPT-Neo (125M) | ~38 sec | ~3 sec | Slightly Variable | https://huggingface.co/EleutherAI/gpt-neo-125m |
| | Roberta Large | ~31 sec | ~1 sec | Slightly Variable | https://huggingface.co/roberta-large |

Our Comments: The cold start and inference times can vary greatly. Additionally, the availability of GPUs is not always 100%. The platform may experience "degraded performance" at times, as mentioned by the provider.
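For reference, the sketch below shows one simple way numbers like these can be gathered: time the first request to an idle endpoint (cold start plus inference), then time subsequent requests (warm inference only). The endpoint URL, API key, and payload are placeholders, not any provider's actual API.

```python
import time
import requests

# Placeholders — substitute the provider's real endpoint and credentials.
ENDPOINT = "https://example.com/v1/inference"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
PAYLOAD = {"prompt": "Hello, world"}

def timed_call() -> float:
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=PAYLOAD, headers=HEADERS, timeout=600)
    resp.raise_for_status()
    return time.perf_counter() - start

cold = timed_call()                      # first call to an idle endpoint: includes cold start
warm = [timed_call() for _ in range(5)]  # follow-up calls: warm inference only

print(f"Cold start + inference: {cold:.1f}s")
print(f"Warm inference (avg):   {sum(warm) / len(warm):.1f}s")
```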

Product 2: Beam.cloud

Brief Overview of Beam.cloud

Beam.cloud, formerly known as Slai.io, has evolved from providing an end-to-end solution encompassing both training and inference to focusing exclusively on inference. This pivot has allowed them to better cater to developers by offering a user-friendly, command-line/terminal-based approach. To streamline the integration process, Beam.cloud also offers an optional SDK. Additionally, it can be installed on existing Kubernetes clusters with automatic log export to S3.

What did we like about Beam.cloud?

  1. Supports API endpoints, webhooks, and cron jobs for inference.
  2. Seamless onboarding with a strong focus on the user journey.
  3. Developer-friendly, terminal-based approach, catering to developers and DevOps professionals.
  4. SDK integration option for real-time logs and simplified development.
  5. Compatible with existing Kubernetes clusters and supports automatic log export to S3.
  6. Flexible billing structure for CPU, GPU, and RAM requirements.

What gaps did we find in Beam.cloud?

  1. This tool is exclusively command-line/terminal-based. Users without experience using a terminal or CMD prompt may find it difficult to use.
  2. Model loading occasionally runs into issues that require trial and error to resolve, and memory limits can also be hit.
  3. High variability in the cold start and inference times:
    - Best suited for batch processing or users tolerating cold starts and downtimes.
    - Not ideal for B2C products requiring real-time high inference loads.
    - Unclear if variability is due to free tier usage.
  4. Auto-scaling for the REST API is enabled only on request; such restrictions should be communicated more clearly to users.

Pricing

The pricing structure is interesting; CPU, RAM, and GPU are all charged separately. They offer a convenient pricing calculator for this purpose.

  1. GPU usage is billed at $0.00056944 per second, which equates to roughly $2.05 per hour. On average, cloud providers charge approximately $3 per hour for an A100 40GB machine (see the cost sketch after this list).
  2. There's an added bonus of 10 hours of free compute, which is significant.
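Because compute is billed per component, a rough way to estimate a workload's cost is to sum the per-second rates for each resource. In the sketch below the GPU rate is the one quoted above, while the CPU and RAM rates are illustrative assumptions; Beam's own pricing calculator has the authoritative numbers.

```python
# Component-wise, per-second cost estimate. Only the GPU rate is quoted above;
# the CPU and RAM rates are illustrative assumptions, not Beam's published prices.
GPU_PER_SEC = 0.00056944    # quoted above (~$2.05/hour)
CPU_PER_CORE_SEC = 0.00001  # assumption
RAM_PER_GB_SEC = 0.000005   # assumption

def estimated_cost(compute_seconds: float, gpus: int = 1,
                   cpu_cores: int = 4, ram_gb: int = 16) -> float:
    per_second = (gpus * GPU_PER_SEC
                  + cpu_cores * CPU_PER_CORE_SEC
                  + ram_gb * RAM_PER_GB_SEC)
    return compute_seconds * per_second

# e.g. 10,000 one-second inferences on 1 GPU, 4 cores, 16 GB RAM
print(f"${estimated_cost(10_000):.2f}")  # about $6.89 under these assumptions
```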

Additionally, they charge a subscription fee and have a tier-based structure for using the platform:

  1. Developer - $0 (usage only)
  2. Team - $25 per seat + usage
  3. Professional - monthly fee + usage

Technical benchmarking

We tested the platform with three widely used models, listed below:

| Name | Model | With Cold Start | Inference | Variability | Model Link |
| --- | --- | --- | --- | --- | --- |
| Beam.cloud | GPT-Neo (1.3B) | ~100 sec | ~2.2 sec | Highly Variable | https://huggingface.co/EleutherAI/gpt-neo-1.3B |
| | GPT-Neo (125M) | ~34 sec | ~1 sec | Highly Variable | https://huggingface.co/EleutherAI/gpt-neo-125m |
| | Roberta Large | ~35 sec | ~1 sec | Highly Variable | https://huggingface.co/roberta-large |

*These results reflect what we observed during our tests; results may vary.

Our Comments: The cold start and inference times are reasonable. The results above are the best we have achieved on the platform. However, there were multiple cases in which we encountered memory issues or technical errors that prevented us from obtaining satisfactory results.

Product 3: Replicate

A brief overview of Replicate

Replicate is a platform that supports custom and pre-trained machine learning models. It offers standard models and emphasizes pre-trained models for user convenience. Users can share their models and collaborate with others. Replicate has a diverse collection of models, many of which have been executed numerous times. It also offers options for depth and flexibility to customize models. Replicate aims to serve as a comprehensive solution in the machine learning model deployment landscape.

What did we like about Replicate?

1. Emphasis on open usage of popular pre-trained models for free, allowing users to explore before moving to custom models.

2. Encourages open-sourcing models through a waitlist concept, fostering more consumer use cases.

3. User-friendly platform with an intuitive interface.

4. Offers a choice between Nvidia T4 and A100, catering to users with lower budgets.

5. Boasts a large community with some models receiving over 47 million calls from public users.

6. Provides an open-source library called Cog for deploying models on the platform (see the sketch after this list).
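Here is a minimal sketch of the kind of predictor file Cog expects, based on its public documentation; the model and input choices are purely illustrative.

```python
# predict.py — minimal Cog predictor sketch (the model choice is illustrative).
from cog import BasePredictor, Input
from transformers import pipeline

class Predictor(BasePredictor):
    def setup(self):
        # Runs once when the container boots; load weights here.
        self.generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125m")

    def predict(self, prompt: str = Input(description="Text prompt")) -> str:
        # Runs on every request.
        result = self.generator(prompt, max_new_tokens=50)
        return result[0]["generated_text"]
```

A matching cog.yaml declares the Python version and dependencies, after which the model can be pushed to the platform.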

What gaps did we find in Replicate?

  1. Replicate has limited post-deployment offerings, such as monitoring, logging, or data streaming.
  2. Support is limited to email or Discord, with no instant support available.
  3. Waitlist requirements for deploying custom models may deter users.
  4. Output from inference APIs is stored in unique links, which may not be user-friendly, as users must individually delete past outputs.
  5. Replicate doesn't have an option to import models from multiple sources like GitHub or SageMaker, relying solely on its Cog library for uploads.

Pricing

Below is the pricing as mentioned by them on the site:

  - CPU: $0.0002/sec
  - NVIDIA T4: $0.00055/sec
  - NVIDIA A100: $0.0023/sec

Replicate's NVIDIA A100 works out to $8.28 per hour. On average, cloud providers charge approximately $3 per hour for an A100 40GB machine.

Free tier users are provided with T4 machines. Although there is a free tier, its limitations are not specified anywhere on the website.

Technical benchmarking.

We tested the platform with three widely used models, listed below:

| Name | Model | With Cold Start | Inference | Variability | Model Link |
| --- | --- | --- | --- | --- | --- |
| Replicate | GPT-Neo (1.3B) | ~213 sec | ~2.4 sec | Stable | https://huggingface.co/EleutherAI/gpt-neo-1.3B |
| | GPT-Neo (125M) | ~178 sec | ~1.1 sec | Stable | https://huggingface.co/EleutherAI/gpt-neo-125m |
| | Roberta Large | ~160 sec | ~1.2 sec | Stable | https://huggingface.co/roberta-large |

*These results reflect what we observed during our tests; results may vary.

Our Comments: This product is very simple, yet boasts one of the best technical benchmarks. It offers a wide range of open source models, enabling quick deployment, while also providing a high degree of technical flexibility to tweak the models.

Product 4: Runpod

A brief overview of Runpod

Runpod is a platform that lets users choose between machines and serverless endpoints. It uses a Bring Your Own Container (BYOC) approach and has features such as GPU instances, serverless GPUs, and AI endpoints. The platform allows deploying container-based GPU instances from public and private repositories and accessing the SSH terminal through a web portal. Runpod offers fully managed and scalable AI endpoints for diverse workloads and applications. While it aims to address various user needs in machine learning model deployment, the real-world effectiveness and reliability of its features need further evaluation.

What did we like About Runpod?

  1. This platform offers servers for all types of users.
  2. The loading process is simple and only requires dropping a container link to pull a pod.
  3. Payment and billing are based on credits and not directly billed to a card.
  4. Although the number of models is limited, the platform has a community feature where users can fork models.
  5. Users can access the SSH terminal through the web portal.

What gaps did we find in Runpod?

  1. The current logging and tracking metrics are limited and do not add much value.
  2. Post-deployment, it can be confusing to understand how the platform works, which may result in users receiving an unexpected bill if they are not careful.
  3. The product's scope is limited, as it is only based on BYOC (Bring Your Own Container).
  4. Asynchronous inferencing (a generic sketch of this run-then-poll pattern follows this list):
    a. The entire model is based on batch processing, and it is unclear whether there is a constant GPU allocation for the user.
    b. Upon receiving a request, it is added to a queue.
    c. Requests are processed from the queue.
    d. Output needs to be checked with another API, not the one used to run the model. There are technically two APIs, one for running and one for status/output.
  5. If the number of machines is set to 0, the API will not work. If it is set to 1, the user will be billed even when there is no usage.
  6. There is no bot or instant support mechanism available.
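Below is a generic sketch of the run-then-poll workflow described in point 4: one API call submits the job, and a second API is polled for status and output. The URLs, field names, and status values are placeholders, not Runpod's actual API.

```python
import time
import requests

# Placeholders — not Runpod's actual endpoints or response schema.
RUN_URL = "https://example.com/v1/run"
STATUS_URL = "https://example.com/v1/status/{job_id}"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1. Submit the job; the response carries only a job ID, not the result.
job = requests.post(RUN_URL, json={"prompt": "Hello"}, headers=HEADERS).json()
job_id = job["id"]

# 2. Poll the second API until the job leaves the queue and finishes.
while True:
    status = requests.get(STATUS_URL.format(job_id=job_id), headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(1)

# 3. The output comes from the status API, not from the original run call.
print(status.get("output"))
```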

Pricing

How does Pod billing work?

Each pod has an hourly cost based on its GPU type. You will be charged for the compute every minute that the pod is running. The charge is deducted from your RunPod credits. If you run out of credits, the pods will be automatically stopped and you will receive an email notification. If you don't refill your credits, the pods will eventually be terminated.

Overall Pricing for machines

The A100 (80GB) starts at $2.09/hour.

Pricing for serverless APIs only

The pricing ranges from $0.0002/second (16GB VRAM) to $0.001/second for an 80GB VRAM GPU.

Note: There is no trial period, so you have to pay to use the service.

Technical benchmarking.

We tested the platform with three widely used models, listed below:

| Name | Model | With Cold Start | Inference | Variability | Model Link |
| --- | --- | --- | --- | --- | --- |
| Runpod | GPT-Neo (1.3B) | ~164 sec | ~1.1 sec | Lightly Stable | https://huggingface.co/EleutherAI/gpt-neo-1.3B |
| | GPT-Neo (125M) | ~2 sec | ~0.8 sec | Lightly Stable | https://huggingface.co/EleutherAI/gpt-neo-125m |
| | Roberta Large | ~37 sec | ~0.8 sec | Lightly Stable | https://huggingface.co/roberta-large |

*These results reflect what we observed during our tests; results may vary. As of April 2023.

Product 5: Pipeline

A brief overview of Pipeline

Pipeline is a serverless platform that hosts machine learning models via an inference API. It offers both custom and pre-trained models, including standard open-source models for immediate use. With per-millisecond billing, the platform is cost-effective. Users can seek assistance and exchange insights in its active Discord community. Pipeline provides a user-friendly experience backed by a supportive community.

What did we like about pipeline.ai?

  1. They offer over 15 pre-trained models, including Stable Diffusion, GPT, and Whisper, which can be easily deployed with one touch, and APIs can be used instantly.
  2. The inference time for an ONNX model is very low, the lowest among competitors.
  3. They have structured their model uploading process around one method, ONNX. The onboarding process is easy to understand, and if ONNX is not being used, documentation is provided for other methods (see the export sketch after this list).
  4. The community and support are excellent, with a super active Discord channel where queries are answered quickly.
  5. Pricing is at a millisecond level, an industry-first.
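Since the upload flow centers on ONNX, here is a short sketch of exporting a PyTorch model to the ONNX format with torch.onnx.export; the toy model stands in for whatever the user actually wants to deploy.

```python
import torch
import torch.nn as nn

# Toy model standing in for the model to be deployed.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model.eval()

dummy_input = torch.randn(1, 16)

# Export to ONNX so it can be uploaded through the ONNX-first flow.
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow variable batch size
)
```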

What gaps did we find in pipeline.ai?

  1. No chat/helpbot/ticket-based support. Support is only Discord-based. While this is a good option for collaborating, it can impact response time.
  2. Documentation could be clearer about what a user can achieve with a pipeline. They have built Pipeline cloud on top of the Pipeline library, but more use-case examples of the library are needed.
  3. User journey-based upload is only available for ONNX.
  4. The website is a little broken when it comes to uploading methods. The ONNX quickstart option only appears once. (This issue has been raised with them.)
  5. The payment integration is a little buggy and lacks features.
  6. The metrics (logging and tracking) are very limited and don't add much value to the user. There are no ways to go deeper into the metrics, and there is no option to export data from the platform.
  7. In our view, the added platform usage fee offers little ROI; a platform fee would make more sense once there are more features to justify it.

Pricing

They have platform plus per-use billing:

  1. $0.00055/sec is the compute cost, billed at millisecond granularity.
  2. There is a platform fee of $12.99/month.
  3. There is also a custom enterprise plan with added support.
  4. They offer $20 of free credits to try the platform.

Technical benchmarking.

We tested the platform with three widely used models, listed below:

| Name | Model | With Cold Start | Inference | Variability | Model Link |
| --- | --- | --- | --- | --- | --- |
| Pipeline.ai | GPT-Neo (1.3B) | <Updating> | <Updating> | <Updating> | https://huggingface.co/EleutherAI/gpt-neo-1.3B |
| | GPT-Neo (125M) | ~32 sec | ~6.5 sec | <Updating> | https://huggingface.co/EleutherAI/gpt-neo-125m |
| | Roberta Large | ~13 sec | ~3 sec | <Updating> | https://huggingface.co/roberta-large |

*These results reflect what we observed during our tests; results may vary. As of February 2022.

Our Comments: The benchmarking results are satisfactory. The platform works well and is designed for vertical deployment options.

Summary

This comprehensive guide offers an in-depth analysis of the leading serverless platforms available in the market today, namely Banana.dev, Beam.cloud, Replicate, Runpod, and Pipeline.ai. The goal of this document is to equip readers with a thorough understanding of each platform's distinct strengths and weaknesses, enabling them to make informed decisions when choosing a serverless solution.

In our evaluations, we meticulously examined each platform based on critical factors such as pricing, features, user experience, and technical feasibility. By doing so, we present a clear and concise comparison, showcasing how these platforms measure up against one another.

Furthermore, this guide delves into the broader challenges and limitations faced by serverless platforms in general. For instance, we discuss the importance of cost management, as serverless solutions can become prohibitively expensive if not managed appropriately. Another critical consideration is ensuring that the chosen platform aligns with specific customer requirements, as not all features and capabilities may be an ideal fit for every use case.

In conclusion, this document serves as an invaluable resource for developers, DevOps engineers, and other stakeholders involved in the development and deployment of serverless applications. By providing an extensive examination of various serverless platforms and addressing the challenges and limitations inherent to serverless solutions, we empower readers to make well-informed decisions that align with their unique needs and requirements.

The culmination of the article is this huge summary matrix:

Reminder: We welcome any feedback or updates to refine this comparison and ensure its accuracy. Our aim is to foster awareness of the current market landscape in the serverless GPU space, not to diminish any particular provider. Your input is invaluable – thank you in advance.