
Optimized GPU Inference: How Inferless Complements Your Hugging Face Workflows

October 3, 2023 • 10 mins read
Aishwarya Goel
CoFounder & CEO

If there’s one consistent rule in the world of machine learning, it's that the cutting edge is sharpened very quickly. The best model on a given Friday might be outperformed by several percent by the next Monday, and this makes staying current a tricky prospect — especially as model architectures grow in size and complexity.

Staying current on the latest models doesn’t have to be super difficult, though! Most breakthroughs today are happening in an open-source fashion and are available from the free library on Hugging Face. Because the models are already easy to find and obtain, the main bottleneck in staying up to date is maintenance of the infrastructure required to support new models.

There are many potential pitfalls when it comes to model deployment. You may find that the costs of a dedicated GPU cluster are prohibitive, but that when you try to shift to an “on demand” computation system, cold-start times saddle your users with massive latency. Or perhaps provisioning and deploying autoscaling clusters is simply too messy and slows your development process.

In this article we will explain how to leverage Hugging Face to build the right model for your purpose, and then describe how you can instantly launch that model to production with Inferless.

Selecting a model

With models changing by the day, there is a growing advantage for those who know how to find the latest open-source models and deploy them quickly. Although the most famous recent success in machine learning, ChatGPT, demonstrated the power of a closed model, the standards of access are changing almost as fast as the models themselves.

Open-source models offer the advantage of quick breakthroughs and customizable patches: as open-source developers contribute improvements, a whole marketplace of optimized models becomes available to you on Hugging Face.

Open-source models also offer the benefit that you can freely improve and specialize them for your own use case. You may want to fine-tune models on your own proprietary data, training parameters, or engineered prompts.

Whether your use case is text-based, an image-processing tool, or some other deep learning application, it’s very likely that the best in class is already available in Hugging Face’s model repository.

Searching Hugging Face

When searching for models on Hugging Face, the first thing to understand is your general use case. You probably already understand your business application, but the model categories displayed on the models homepage can help you refine exactly what kind of model you need.

A screenshot of all the possible task types available on Hugging Face.

Because there are so many models in each category, it might not be useful to scroll through the whole list. Once viewing a category, you can use keyword search to refine your target model, like so:

A closeup of the options available when selecting the “chatbot” task type in Hugging Face.

Even with advanced search features like full-text search, it can sometimes be difficult to find the exact model for your purposes. If you find yourself in this situation, there’s nothing wrong with searching Google for “[use case] + Hugging Face model”. Sometimes the old tricks are best.
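If you prefer to search programmatically, the huggingface_hub library exposes the same catalog from Python. Here is a minimal sketch, assuming the package is installed; the task and keyword below are only examples:

```python
# Minimal sketch: search the Hugging Face Hub programmatically.
# Requires `pip install huggingface_hub`; the task and keyword are examples.
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    filter="text-classification",  # task / pipeline tag to browse
    search="sentiment",            # keyword, as you would type in the search box
    sort="downloads",              # most-downloaded first
    direction=-1,
    limit=5,
)
for model in models:
    print(model.modelId)
```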

Test-drive your model

Even if you find a model that says it meets your requirements, it's best to make sure the content matches the label before you spend any money on a deployment. Hugging Face offers an Inference API that allows you to quickly test a model on Hugging Face’s servers, so that you can ensure that the basic model delivers what you are looking for. Using the Inference API is as easy as creating an authentication token — after that, you can freely submit input data to an API endpoint and receive responses from the Hugging Face servers.
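For example, a quick smoke test from Python might look like the sketch below. The model name and input text are placeholders, and HF_TOKEN is assumed to hold a read-access token from your account settings:

```python
# Quick smoke test against the free, rate-limited Inference API.
# MODEL_ID and the input text are placeholders; HF_TOKEN is assumed to be set.
import os
import requests

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"  # example model
API_URL = f"https://api-inference.huggingface.co/models/{MODEL_ID}"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

response = requests.post(API_URL, headers=headers, json={"inputs": "I love this product!"})
response.raise_for_status()
print(response.json())
```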

The Inference API is designed for free testing and not for a production or even development build. Total queries are rate-limited per authentication token, so although the Inference API can be helpful for picking out a model, you’ll need something else to deploy it to production.  

Upload your own model to Hugging Face

An alternative approach to getting the model you want from Hugging Face is to train it yourself! You may already have a well-labeled dataset with strong evaluation results, but no easy way to bring the model online. Rather than wrestling with a dozen containers, ports, software integrations, and package managers, you can use Hugging Face as a repository for your models and Inferless to deploy them.

After you’ve trained your model, the next step is sharing your model on Hugging Face.

Hugging Face’s repository is based on Git, so you can easily find a model you want to use, train it further on your own machines, Colab instances, or other GPU solutions, and then push updates to an existing model on a new branch. It’s also easy to convert Hugging Face models between common frameworks, so you can continue developing the model in whichever framework you are most comfortable with. Once you’re done retraining the model, push it back to your Hugging Face repository, and it’s ready to work with.
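As a rough sketch of that last step, here is how a fine-tuned transformers checkpoint might be pushed back to the Hub. The local checkpoint path and repo name are placeholders, and you are assumed to have logged in with huggingface-cli login:

```python
# Sketch: push a locally fine-tuned transformers model back to the Hub.
# The checkpoint path and repo id are placeholders; assumes `huggingface-cli login`.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "./my-finetuned-checkpoint"   # local training output (placeholder)
repo_id = "your-username/your-model-name"  # Hub repo you own (placeholder)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```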

Deploy your model using Inferless

Once you have a good candidate model for your use case, you’ll want to deploy a model API endpoint for your application. The challenge here is to keep two competing variables, latency and cost, as low as possible. In general, the tradeoff is that the more uptime an online model has, the more responsive it is, but the higher your computation costs.

So you’ll be looking for a tool that minimizes both variables. You’ll also want something simple and easy to deploy. While it’s possible to build up your own deployment pipeline, the engineering alone can be expensive, and the result is often not optimized to the best possible level. It also means a lot of time spent on DevOps at the expense of working on your models.

A great way to keep costs low without impacting latency is to host your model on serverless machines that shut down when not in use and scale up automatically to meet increased demand. This is what Inferless offers. Inferless ensures low cold-start times even for spiky usage, fixed costs for runtime, and private, secure servers. 

Once the work of finding, training, and testing your model is completed, Inferless can bring your Hugging Face model online.

How to build a machine learning endpoint with Inferless

We’ve prepared a tutorial to help demonstrate the Inferless deployment process. All you will need to begin is an Inferless account and a Hugging Face model page (as discussed earlier in this article). You can access the tutorial in our docs or follow along below.

Record the model specifications

A closeup of the relevant components of the Hugging Face page, including labels for the canonical Hugging Face name, task type, framework, and model type.

To begin, you will need certain details from the model page. Note down the canonical Hugging Face model name, task type, framework, and model type. If this is a model you uploaded yourself, you may have already entered these.

Add a model in your Inferless dashboard

In the workspace you want to add a model to, select Add model, as demonstrated here.

A screenshot of the Inferless console highlighting the Add Model button.

Match the model framework

The model you selected earlier was trained on a specific framework, as detailed on the Hugging Face model page. Most models are natively supported by Inferless, but if your model was developed with an unusual framework, it might be necessary to convert the format before continuing.
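As one example of such a conversion, the optimum library can export a PyTorch transformers checkpoint to ONNX. This is only one possible path, sketched under the assumption that your model is a sequence-classification checkpoint; the model name below is a placeholder:

```python
# One possible conversion path: export a PyTorch transformers checkpoint to ONNX.
# Requires `pip install optimum[onnxruntime]`; the model name is a placeholder.
from optimum.onnxruntime import ORTModelForSequenceClassification

ort_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
    export=True,  # convert from the original PyTorch weights
)
ort_model.save_pretrained("./onnx-model")  # writes the ONNX graph plus config files
```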

A screenshot of the Inferless console showing the model training framework selection.

After the framework is selected, it’s time to add details about how you will upload your model.

Select Hugging Face as the source of your model

Inferless needs to know whether the model will be loaded from a file, a GitHub repository, or, in this case, a Hugging Face repo. At this stage your Hugging Face model also needs to be copied into your own GitHub repo; Inferless performs this copy automatically during the following steps.

The first time you upload a model for Inferless, you will need to enable some simple integrations for the console to work with your model.

First-time use only: connect your GitHub and Hugging Face

To deploy a model, Inferless needs to be able to read it from Hugging Face and needs a location to store a model copy in your GitHub. Click Add Provider, then Connect Account for your Hugging Face account.

A view of the button to add a Hugging Face provider.

You will be asked to provide your Hugging Face API key to connect Hugging Face and Inferless. The integration page will offer an easy visual for finding your Hugging Face read access key.

Go to your account settings in the Hugging Face dashboard by clicking on your profile picture in the top right corner and selecting “Account settings.”

A view to find access tokens on your Hugging Face dashboard

Your access token will now be displayed on the Access Tokens page; this is the token you will use to access the Hugging Face API and its resources.

Adding a Hugging Face access key to the Inferless console.

Once your integrations are enabled, select your accounts as they show up here. Follow the same process for the GitHub integration, then continue to the next step.

A view of the screen once integration is done

Enter the model details

Next, enter the details of the model, which you will have found on the Hugging Face page. Enter your own name for the model, the type of model as listed on Hugging Face, the task type the model is designed for, and, for the Hugging Face model name, the canonical name you noted earlier.

Screenshot showing where to input the model details.

This screen will also request a sample input and output of the model. This is a necessary step to give the server an idea of the shape of data it will expect as input and output. For more information about these descriptions, check out our documentation relating to them.

Screenshot showing where to add input and output

There are two possibilities for entering sample data. You can use our convenient builder tool, or provide a JSON-formatted description of the input and output data shape, as described in our documentation. 
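For illustration only, a sample input/output pair for a text-generation model might be assembled like the sketch below. The field names and layout here are our assumption, so follow the exact schema described in the Inferless documentation:

```python
# Illustrative sketch of a sample input/output description for a text model.
# The field names and layout are an assumption; consult the Inferless docs
# for the exact schema the console expects.
import json

sample_input = {
    "inputs": [
        {"name": "prompt", "shape": [1], "datatype": "BYTES", "data": ["Once upon a time"]}
    ]
}
sample_output = {
    "outputs": [
        {"name": "generated_text", "shape": [1], "datatype": "BYTES",
         "data": ["Once upon a time, a model was deployed..."]}
    ]
}

print(json.dumps(sample_input, indent=2))
print(json.dumps(sample_output, indent=2))
```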

Once the model has been validated (this may take a minute), the next step is to configure the model by setting the parameters of the inference workers that will be executing your requests.

Configure the runtime details

The fields Min Scale and Max Scale indicate the number of parallel replicas working on your inference. Setting Min Scale to 0 indicates that you would like workers spun up only on demand; that is, a serverless deployment.

Screenshot showing how to configure the runtime details.

By keeping Setup automatic build checked, you can ensure automatic CI/CD. The URL provided can be registered with Hugging Face via its webhooks interface so that updates to the model trigger an automatic rebuild in Inferless.

A view of the “Setup automatic build” option on the “Configure model” screen.

Review and approve

After the model and server parameters have been set, you will be taken to a screen that shows all the information you have previously entered. Double check it now, and if everything is correct, click Submit, and then Import.

You will now be able to see all your models that are currently being deployed (In-Progress) and those that have failed to be deployed. You should see your model with an In-Progress status. 

A view of all failed and in-progress models in the Inferless console.

Similarly, your successfully uploaded models will be visible under My Models.

Selecting a particular successful build under My Models will take you to its model page in Inferless. There you will be able to access details such as its API endpoint, as displayed below. You can now call this endpoint from your code.

Screenshot showing the API endpoint for a model
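As a rough sketch, a call to that endpoint from Python might look like the following. The URL, API key header, and payload shape are placeholders; copy the real values from your model page and our documentation:

```python
# Sketch of calling a deployed Inferless endpoint from application code.
# The URL, API key, and payload shape are placeholders; copy the real values
# from the model page in the Inferless console.
import requests

ENDPOINT_URL = "https://<endpoint-from-your-model-page>"  # placeholder
headers = {
    "Authorization": "Bearer <your-inferless-api-key>",   # placeholder
    "Content-Type": "application/json",
}
payload = {
    "inputs": [
        {"name": "prompt", "shape": [1], "datatype": "BYTES", "data": ["Hello, world"]}
    ]
}

response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
```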

Your time is valuable — don’t waste it tinkering with a solved problem

Using Inferless, you can stay up to date with the latest in machine learning while avoiding most of the traditional downsides. The cost of autoscaling, the product impact of long cold starts, and the production effort of infrastructure development are all solved with Inferless.

Regardless of the size or complexity of your project, Inferless can scale to support it. Whether you are looking for an affordable serverless inference tool that can be deployed to production or just want a quick solution for a pet project, Inferless can work for you.

It may be hard to find a similar product on the market. Inferless is a one-of-its-kind tool that makes the problem of serverless GPU inference a thing of the past. So, if you are interested in surfing this wave of the future, join our waitlist to get started.
