Where to Host Your AI: Comparing ML Model Deployment Services


When it comes to hosting machine learning models, whether for private or public use, finding the right service for the job is not a simple task. Many articles online, and responses from AI tools, tend to list a wide range of tools, platforms and providers that have only one thing in common: they are related to machine learning in some way.

In this post, we aim to help by providing a hand-curated list of services that actually make hosting ML models possible.

Modal

Modal is an ML model hosting and training platform that offers direct code integration for runtime configuration, as well as a CLI tool for initiating deployments.

The main features offered include cron jobs for task scheduling, log collection and retention, monitoring, webhook endpoints, secret management, and support for custom-built images and custom domains. Modal’s infrastructure also supports distributed queues, distributed dictionaries, and CPU and GPU concurrency.

Everything related to the runtime environment is configured in Python; no separate containerisation technology is necessary, although working knowledge of Docker can be useful, seeing how similar the configuration is to the structure of a Dockerfile. Having the environment configuration as part of the code has a downside as well, however: the image has to be built at runtime, resulting in potentially longer and more expensive runs. Once built, the image is stored by Modal, so it can be reused without a rebuild.
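To give a feel for this, here is a minimal sketch of defining an image and a GPU-backed function with Modal's Python API (the app name, packages and model pipeline are illustrative choices, not from Modal's docs, and exact API details may differ between Modal versions):

```python
import modal

# Build the runtime environment in Python, much like a Dockerfile:
# pick a base image, then layer apt and pip installs on top of it.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install("torch", "transformers")
)

app = modal.App("sentiment-demo")  # hypothetical app name

@app.function(image=image, gpu="T4")  # request a T4 GPU for this function
def predict(text: str) -> str:
    # Heavy imports happen inside the container, not on your machine.
    from transformers import pipeline
    classifier = pipeline("sentiment-analysis")
    return classifier(text)[0]["label"]
```

Deploying would then be a single `modal deploy` invocation of the file, which is where the CLI tool comes in.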

Modal offers three pricing tiers, from a free tier to team and enterprise subscriptions. The free and team tiers both include $30 of compute credit a month. The team subscription is $100 per month and comes with 10 seats, with the option to pay for additional seats at $10 per seat. Enterprise subscription details are individually negotiated, but all of them appear to have no limit on seats. You can find the currently listed compute costs in the following table:

| Hardware type | Cost |
| --- | --- |
| CPU | $0.192 / core / h |
| Nvidia A100, 40 GB VRAM | $3.73 / h |
| Nvidia A100, 80 GB VRAM | $5.59 / h |
| Nvidia A10G | $1.10 / h |
| Nvidia L4 | $1.05 / h |
| Nvidia T4 | $0.59 / h |
| Memory | $0.024 / GiB / h |
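To make the table concrete, here is a quick back-of-the-envelope calculation in plain Python, with rates copied from the table above and assuming CPU and memory are billed alongside the GPU (the 2-hour A10G job is a made-up example):

```python
# Per-hour rates taken from Modal's pricing table
GPU_A10G = 1.10     # $ / h
CPU_CORE = 0.192    # $ / core / h
MEMORY_GIB = 0.024  # $ / GiB / h

def run_cost(hours, gpus=0, gpu_rate=0.0, cpu_cores=0, mem_gib=0):
    """Estimate the cost of a single run from Modal's per-hour rates."""
    return hours * (gpus * gpu_rate + cpu_cores * CPU_CORE + mem_gib * MEMORY_GIB)

# e.g. a 2-hour run on one A10G with 4 CPU cores and 16 GiB of memory:
cost = run_cost(hours=2, gpus=1, gpu_rate=GPU_A10G, cpu_cores=4, mem_gib=16)
print(f"${cost:.2f}")  # → $4.50
```

The free tier's $30 monthly credit would cover roughly six to seven such runs.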


Paperspace

Paperspace offers access to many GPU types, with deployments configurable from their web UI requiring minimal setup. While it uses Docker images, it can pull any public image by URL, and can also be set up to use custom images from private registries. Models can be pulled from S3 buckets or from Huggingface. For high availability, it is possible to create multiple replicas when setting up a deployment, and autoscaling can also be configured there.

Aside from model deployments, it is also possible to create Jupyter notebooks, set up model training workflows and manage secrets on the web UI, but Paperspace has an open source CLI tool with full access to its features if you'd rather work from the command line.

As far as subscriptions go, Paperspace offers four tiers, with more powerful instance types becoming available at higher prices, as well as private projects on all paid tiers. A free tier suitable for trying out the platform is available, with a limit of 5 projects, 5 GB of free storage and no concurrent job runs.

Paid tiers start at $8 per month for a single-seat Pro tier, or $12 for a team of two, including a cap of 10 projects, 15 GB of free storage and 3 concurrent jobs. At $39 per seat, the Growth tier has a cap of 5 seats, a project limit of 25, 50 GB of free storage and 10 concurrent jobs. An enterprise tier is also available, with costs and limits subject to an individual contract. While it is possible to exceed the free storage limit, overages are billed at $0.29/GB.

We summarised the current prices of compute instances available in the following table:

| Instance type | Hardware | Cost | Available in free tier |
| --- | --- | --- | --- |
| C4 CPU | 2 CPU, 4 GB RAM | $0.04 / h | yes |
| C5 CPU | 4 CPU, 8 GB RAM | $0.08 / h | yes |
| C7 CPU | 12 CPU, 30 GB RAM | $0.30 / h | yes |
| P4000 GPU | 8 CPU, 30 GB RAM, 8 GB VRAM | $0.51 / h | yes |
| RTX4000 GPU | 8 CPU, 30 GB RAM, 8 GB VRAM | $0.56 / h | yes |
| A4000 GPU | 8 CPU, 45 GB RAM, 16 GB VRAM | $0.76 / h | yes |
| P5000 GPU | 8 CPU, 30 GB RAM, 16 GB VRAM | $0.78 / h | yes |
| P6000 GPU | 8 CPU, 30 GB RAM, 24 GB VRAM | $1.10 / h | yes |
| A5000 GPU | 8 CPU, 45 GB RAM, 24 GB VRAM | $1.38 / h | no |
| A4000 GPU x2 | 16 CPU, 90 GB RAM, 16 GB VRAM | $1.52 / h | yes |
| A6000 GPU | 8 CPU, 45 GB RAM, 48 GB VRAM | $1.89 / h | yes |
| V100 GPU | 8 CPU, 30 GB RAM, 16 GB VRAM | $2.30 / h | no |
| V100-32G GPU | 8 CPU, 30 GB RAM, 32 GB VRAM | $2.30 / h | yes |
| A5000 GPU x2 | 16 CPU, 90 GB RAM, 24 GB VRAM | $2.76 / h | yes |
| A100 GPU | 12 CPU, 90 GB RAM, 40 GB VRAM | $3.09 / h | no |
| A100-80G GPU | 12 CPU, 90 GB RAM, 80 GB VRAM | $3.18 / h | yes |
| A6000 GPU x2 | 16 CPU, 90 GB RAM, 48 GB VRAM | $3.78 / h | yes |
| V100-32G GPU x2 | 16 CPU, 60 GB RAM, 32 GB VRAM | $4.60 / h | no |
| A100 GPU x2 | 24 CPU, 180 GB RAM, 40 GB VRAM | $6.18 / h | no |
| A6000 GPU x4 | 32 CPU, 180 GB RAM, 48 GB VRAM | $7.56 / h | no |
| V100-32G GPU x4 | 32 CPU, 120 GB RAM, 32 GB VRAM | $9.20 / h | no |

These prices are in addition to the monthly subscription, with free credit offered on a case-by-case basis.

Self-managed Ray

Ray is an open source framework encompassing many tools, ranging from libraries for common machine learning tasks to distributed computing and parallelisation, covering the deployment of ML models as well as training them and running workloads on them. Ray's developer tooling is Python-based, but deploying and running models should work pretty much anywhere, even locally on a laptop – although the machine should still have a GPU if the model requires one.

Being an open source project, Ray can be hosted on many cloud platforms, with official Ray cluster integrations available for AWS and GCP, and community-maintained integrations for Azure, Aliyun and vSphere. Ray also offers configuration files for running inside a Kubernetes cluster via KubeRay, enabling it to be hosted with any cloud provider that supports Kubernetes.
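For a sense of what serving a model with Ray looks like, here is a minimal Ray Serve sketch (based on the `ray[serve]` package's deployment API; the `Translator` class and its trivial "model" are stand-ins, not real Ray examples):

```python
from ray import serve

@serve.deployment(num_replicas=2)  # two replicas for availability
class Translator:
    def __init__(self):
        # A real deployment would load an actual model here.
        self.prefix = "translated: "

    async def __call__(self, request):
        # Ray Serve passes in an HTTP request; echo a transformed body.
        text = await request.body()
        return self.prefix + text.decode()

# Bind the deployment and run it; this starts a local Ray instance
# if one isn't already running, and exposes an HTTP route.
serve.run(Translator.bind(), route_prefix="/translate")
```

The same code runs unchanged on a laptop or against a remote Ray cluster, which is much of the framework's appeal.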

The cost of a self-managed Ray cluster mainly comes down to the pricing of the chosen cloud platform and the work required to set up and maintain the cluster and related infrastructure, but this option affords the most flexibility and customisability.

Anyscale – managed Ray

Anyscale offers a managed Ray solution on top of the biggest cloud providers, and is operated by the core team behind the development of Ray itself. Even though Ray supports Kubernetes and Docker images, Anyscale uses plain VMs, while still providing logs and Grafana-based monitoring from its own UI. The UI also allows launching workloads and configuring additional environments through workspaces. Integrations are available for VS Code and Jupyter notebooks, letting developers launch workloads right from their development tools.

Unfortunately, Anyscale has not published any pricing information for their managed Ray offering; you'd need to contact sales to get a quote. Costs charged by the chosen cloud platform should also be considered – AWS and GCP are supported, while Azure at the moment is not.

Amazon SageMaker

SageMaker allows for building, training and deploying machine learning models using Amazon's existing infrastructure alongside a set of dedicated ML tools. Among many other things, it features SageMaker Studio, an IDE for development and deployment, and SageMaker MLOps, a set of model management tools. SageMaker Serverless Inference is a serverless option for serving models that doesn't require choosing an instance type.

Amazon claims that “SageMaker offers at least 54% lower total cost of ownership (TCO) over a three-year period compared to other cloud-based self-managed solutions”. To help with estimating costs, a table detailing the prices of available instance types can be found on the official pricing page. SageMaker Serverless Inference is billed based on the duration of the inference and the amount of data processed.
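As an illustration of that duration-plus-data pricing model, here is a toy estimator in plain Python. Note that the rates below are made-up placeholders for the sake of the example, not Amazon's actual prices – check the official pricing page for current numbers:

```python
# Hypothetical per-unit rates, for illustration only
PRICE_PER_COMPUTE_SECOND = 0.00004  # $ per second of inference (made up)
PRICE_PER_GB_PROCESSED = 0.016      # $ per GB of data in/out (made up)

def serverless_inference_cost(total_seconds, gb_processed):
    """Serverless inference bill: compute duration plus data processed."""
    return (total_seconds * PRICE_PER_COMPUTE_SECOND
            + gb_processed * PRICE_PER_GB_PROCESSED)

# e.g. 500,000 requests averaging 120 ms each, moving 20 GB of data:
cost = serverless_inference_cost(500_000 * 0.120, 20)
print(f"${cost:.2f}")  # → $2.72
```

The point of the exercise: with serverless pricing you pay for request volume rather than idle instance hours, so low-traffic models can come out much cheaper than an always-on instance.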

HuggingFace Inference Endpoints

HuggingFace is the biggest ML model and dataset repository out there, and they also offer their own production-ready hosting solution in the form of Inference Endpoints. Although it does not seem to affect pricing, you can choose which cloud provider to use for your endpoint – AWS, GCP or Azure – each with multiple possible regions.

Further configuration allows defining minimum and maximum replica counts to control autoscaling; a minimum of 0 means the endpoint can scale down completely when not in use. This has the potential to save quite a lot of money, at the cost of increased waiting time when calling the scaled-down endpoint, as it has to spin up an instance before it can process the request. It is also possible to configure SSL for your endpoint, or make it entirely private if you choose to.

Before you deploy your model, HuggingFace gives you an estimated monthly cost based on the chosen hardware, assuming that the endpoint will be up for the whole month and excluding any scaling. This can be quite handy for getting an idea of the baseline cost of having a model deployed. Once finished with the configuration, you get a URL where you can access the model, as well as an inference widget that allows you to test the endpoint.
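That kind of estimate is easy to reproduce yourself. The sketch below uses HuggingFace's listed AWS rate for a single A10G and compares an always-on endpoint with one that scales to zero and is only busy a few hours a day (the "4 busy hours" figure is an arbitrary example):

```python
A10G_RATE = 1.30  # $ / h, HuggingFace's listed AWS rate for one A10G

def monthly_cost(rate_per_hour, active_hours_per_day, days=30):
    """Estimated monthly endpoint cost for a given number of active hours."""
    return rate_per_hour * active_hours_per_day * days

always_on = monthly_cost(A10G_RATE, 24)  # endpoint never scales down
scaled = monthly_cost(A10G_RATE, 4)      # scale-to-zero, ~4 busy hours a day
print(f"always on: ${always_on:.2f}, scaled to zero: ${scaled:.2f}")
# → always on: $936.00, scaled to zero: $156.00
```

The gap between the two numbers is exactly the trade-off described above: a large saving in exchange for cold-start latency on the first request after an idle period.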

Pricing is tied to instance types, and while there are different paid services offered by HuggingFace, Inference Endpoints is the one to look out for when considering model hosting.

Available GPU instances on AWS:

| GPU | Memory | Cost |
| --- | --- | --- |
| NVIDIA T4 | 14 GB | $0.60 / h |
| NVIDIA A10G | 24 GB | $1.30 / h |
| NVIDIA T4 x4 | 56 GB | $4.50 / h |
| NVIDIA A100 | 80 GB | $6.50 / h |
| NVIDIA A10G x4 | 96 GB | $7.00 / h |
| NVIDIA A100 x2 | 160 GB | $13.00 / h |
| NVIDIA A100 x4 | 320 GB | $26.00 / h |
| NVIDIA A100 x8 | 640 GB | $45.00 / h |

CPU instances are available on both AWS and Azure, with the same hourly rates:

| Instance | Memory | Cost |
| --- | --- | --- |
| 1 Intel Xeon core | 2 GB | $0.06 / h |
| 2 Intel Xeon cores | 4 GB | $0.12 / h |
| 4 Intel Xeon cores | 8 GB | $0.24 / h |
| 8 Intel Xeon cores | 16 GB | $0.48 / h |


A quick comparison of all six services:

| | Modal | Self-managed Ray | Anyscale | SageMaker | Paperspace | HuggingFace |
| --- | --- | --- | --- | --- | --- | --- |
| Can be used locally | no | yes | no | no | no | no |
| Vendor lock-in | Modal | no | AWS or GCP | AWS | Paperspace / DigitalOcean | AWS, GCP or Azure |
| Containers | from Python | yes | yes | yes / from UI | yes / from UI | yes / from UI |
| Zero-ops | from Python | no | mostly UI | mostly UI | mostly UI | mostly UI |
| Pricing | transparent | cloud provider dependent | not disclosed | transparent but complicated | transparent | transparent |
| Easy to start | if you know Docker | yes | no | no | yes | yes |
| Free tier | available with $30 credit | hosting dependent, can also be run locally | possible, contact sales | possible, contact sales | available, free credit offered in confirmation email | only hub is free |

All in all, Modal could be a pretty good place to start: there is a little image building to get comfortable with, but all of it is done in Python, and the pricing is clear and easy to calculate.

Ray is open source and is the most flexible choice, but will likely require dedicated engineers to set up and maintain.

Anyscale could be a great and simple managed solution for Ray, but with no public pricing, it really depends on what kind of deal you get.

As for SageMaker, while it has published pricing information, actually figuring out how much you'll end up paying is as complicated as it gets with AWS. It has a whole ecosystem of tools for everything you might need in one place, accessible from a web UI, with the possibility to connect any other AWS service on top – and knowing AWS, you'll probably have to use a bunch of their other services eventually.

Paperspace’s 2023 acquisition sounds like a good opportunity for DigitalOcean to expand into the AI platform market, with an easy to start but still fairly customisable offering. The current prices on their instances seem better than the competitors, with the subscription fees generally being higher.

Then, there is also HuggingFace, the de-facto model repository, offering the most commonly used GPUs for model hosting at competitive prices and simple configuration that can be done from their UI as well.
