By Jeremy Lau, Chirp AI

Note to reader: This is drawn from my experiences and Chirp AI’s point of view. Do not rely solely on this article to shape your approach to choosing an option.

In the world of large language model hosting, there are primarily two options: self-hosted LLMs or proprietary managed LLMs.

A lot of organisations are taking their first and second steps with large language models, and so a natural question starts to bubble up: should you self-host your own LLM or leverage a proprietary managed LLM?

For the purpose of this blog, let’s quickly align on what a self-hosted LLM and a proprietary managed LLM are.

Self-hosted LLM TL;DR

This encapsulates running and managing the entire model and serving infrastructure yourself. Your costs are usually associated with running the infrastructure stack. This is typically broken down into these core areas:

  • Choosing a model and choosing a way to serve it. Optionally fine-tuning the model (we’ll dive into fine-tuning in a separate blog as it merits its own topic).
  • Choosing the underlying GPU hardware specs that your server will run on to host the LLM.
  • Managing the end-to-end lifecycle of the server and installing any supporting software and server utilities needed for operating the LLM and managing client-to-server or server-to-server operations.
  • Configuring the networking for your server so that your model can be served and used.

Proprietary managed LLM TL;DR

This is essentially offered as a managed service where the LLM is hosted and managed by a third-party provider. Azure OpenAI is a perfect example.

This approach allows you to leverage the power of LLMs without having to handle anything to do with infrastructure or model training.

Your costs are usually associated with how much you use the LLM, typically charged by the number of input and output tokens, or based on a provisioned throughput capacity you purchase upfront — essentially dedicated compute bandwidth just for your usage.
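
As a rough illustration of how per-token billing adds up, here is a minimal sketch; the prices are hypothetical placeholders, not any provider’s actual rates:

```python
# Rough token-based cost estimate for a managed LLM API.
# The per-1K-token prices below are hypothetical placeholders;
# substitute your provider's published rates.
PRICE_PER_1K_INPUT_TOKENS = 0.0005   # USD, hypothetical
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015  # USD, hypothetical

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call, with input and output tokens priced separately."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# Example: a 2,000-token prompt that produces a 500-token response.
print(f"${cost_per_call(2_000, 500):.4f} per call")
```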

So what are the drivers that will influence your decision?

At Chirp, we’ve been on the journey ourselves on whether we’re better off leveraging an enterprise-grade proprietary managed LLM or self-hosting an LLM.

Many organisations developing GenAI-powered applications such as agents (check out my fellow co-founder’s blog that introduces vertical agents), ourselves included, will typically consider these drivers when deciding between the two options:

  • Economic viability
  • Reduced latency (how fast you get back a response from the LLM)
  • Consistent latency
  • Rate limit flexibility and control
  • Quality of response
  • Model customisation
  • Data sensitivity and sovereignty
  • Infrastructure and model security

Important to note: not all GenAI-based tasks are created equal, so the drivers will carry different importance weightings depending on what you’re trying to do or solve for.

This also means that some of the complexities and considerations of either LLM hosting option can be seen as either benefits or drawbacks, depending on your context.

Key evaluation points

At Chirp, we framed our evaluation to streamline which areas we needed to assess and consider.

We knew that leveraging an enterprise-grade proprietary managed LLM service would likely satisfy our key drivers. The question to answer, therefore, was whether hosting our own LLM would be more economically beneficial: would the TCO (total cost of ownership, e.g. managing server maintenance, scalability, and availability) be outweighed by the margin of benefits we’d receive from hosting our own LLM, given our use case?

There are many factors that weigh into the evaluation, and their importance or relevance will differ depending on your drivers. The areas below are what I believe are typical to evaluate regardless of your problem.

Choosing the appropriate LLM to support your use case

Why does this matter

Choosing the right LLM directly determines the performance/quality, scalability, and cost-effectiveness of your application.

How to assess

This is subjective to your use case but consider:

  • How good is the model at text generation, instruction following, and tool calling?
  • Evaluate with benchmarks such as MMLU and HumanEval, or with evaluation tooling such as Promptfoo.
  • Parameter size is critical. Larger parameter sizes often mean higher infrastructure costs and higher latency, but typically higher-quality outputs. It's super important to right-size the model for your needs.
  • How big is the model’s context window? Does it fit your use case?
  • What architecture type do you need — such as a decoder-only model (text only) or a multi-modal model (accepts images, voice, and text)?
  • Does it support streaming, if your use case requires it?
  • Is it supported by serving frameworks like vLLM, or available in formats like GGUF? These give you a common client-side interface, meaning you can serve and interact with different models the same way at the code level. This is also one of the ways you can enable streaming of model outputs (see the sketch below).
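
To make the “common client-side interface” point concrete, here is a minimal sketch of streaming from a self-hosted model served behind an OpenAI-compatible endpoint (which vLLM provides); the base URL, port, and model name are placeholder assumptions:

```python
# Minimal sketch: streaming a response from a self-hosted model that sits
# behind an OpenAI-compatible endpoint (vLLM exposes one out of the box).
# The base_url, port, and model name are placeholder assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local vLLM server
    api_key="not-needed-locally",         # vLLM ignores the key unless you configure one
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model your server loaded
    messages=[{"role": "user", "content": "Summarise our refund policy in two sentences."}],
    stream=True,  # tokens arrive incrementally rather than in one final payload
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (e.g. the final stop chunk)
        print(delta, end="", flush=True)
```

Because the interface is the standard chat-completions shape, swapping the underlying model (or pointing at a managed endpoint that speaks the same protocol) doesn’t require changing this client code.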

Impacts

If not chosen properly, potential impacts include:

  • Poor task performance (low accuracy, struggles with long context or multi-turn conversations).
  • Latency bottlenecks due to a wrongly sized LLM for the use case.
  • High infrastructure costs due to oversizing.
  • Security and privacy risks (some open-source models are more susceptible).

Sizing the correct GPU specs to your chosen LLM

Why does this matter

For the purpose of this article, we’re choosing GPU specs to support inference, not LLM training, which has a different GPU specification profile.

Choosing the correct GPU specs to run inference for your chosen LLM will help you optimise key aspects of the deployment, such as inference speed, cost, and available throughput.

Impacts

Choosing the wrong GPU size impacts:

  • Model compatibility (e.g., 13B-parameter models typically require 28–40GB of VRAM to run comfortably at FP16).
  • Inference speed, which is one of the main contributors to the latency of getting back a response.
  • Risk of "out-of-memory" errors or excessive costs.
  • Scalability and how many sessions you can run in parallel. Depending on your use case, one GPU unit may not be enough.

How to assess

Understand:

  • The model’s parameter size and precision type (e.g., FP16). Together these let you infer the minimum GPU VRAM you’ll need (see the sketch after this list).
  • GPU type/model compatibility (refer to Hugging Face model cards, as they usually list minimum GPU requirements).
  • Your workload type — chat apps, batch processing, real-time usage — all influence the GPU specification requirements.
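
As a back-of-the-envelope aid, you can estimate the minimum VRAM from parameter count and precision. The sketch below assumes a simple 20% overhead factor and ignores KV cache growth from long contexts or many concurrent sessions, so treat the result as a floor:

```python
# Back-of-the-envelope minimum VRAM estimate for inference, assuming the
# model weights dominate memory use. The 20% overhead factor is an assumed
# rule of thumb; the KV cache, long contexts, and concurrent sessions will
# push the real requirement higher.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def min_vram_gb(params_billion: float, precision: str, overhead: float = 1.2) -> float:
    """Approximate minimum VRAM in GB: parameters x bytes per parameter x overhead."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead

# Example: a 13B-parameter model served at FP16.
print(f"{min_vram_gb(13, 'fp16'):.0f} GB")  # ~31 GB before KV cache growth
```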

Choosing where the hardware is located

Why does this matter

This will likely matter for two reasons. First, it helps reduce latency and improve network reliability for your users (customers), depending on where the majority of them are located. You’ll want your hardware closer to them to reduce the time data spends travelling across the internet and to avoid cross-region networking that can cause unpredictable increases in latency. Secondly, it will help you satisfy any data privacy, compliance, or sovereignty regulations that apply to your organisation or use case.

Impacts

  • Poor latency, meaning slow agent responses, which hurts user experience.
  • Inconsistent latency, meaning an inconsistently good or poor user experience.
  • Regulatory violations → legal risk, fines, and eroded customer trust and brand reputation.

How to assess

You should first consider where the majority of your users will be interacting with the LLM or application. With that question understood, you can start whittling down the cloud providers that:

  1. Operate in the region best suited to your customers’ locations.
  2. Provide the necessary GPUs that are compatible with your chosen LLM.
  3. Strike the right balance between price and location needs (low latency often trumps cost savings for real-time apps).

Calculate your projected production usage

Why does this matter

If you can build a realistic projection of production usage of the LLM, it will help you forecast the expected production cost of paying ‘on demand’ or for ‘reserved capacity’ via a proprietary managed LLM offering.

Typically the information you need at hand to help project estimated usage is the following:

  • Average output tokens per API call
  • Average input tokens per API call. Make sure to include all the tokens that fill the context window, not just the user’s message.
  • Average number of API calls to your LLM you’ll be making per minute or per hour.

Projecting a reliable view of production usage will also greatly help with right-sizing your GPU specs, which in turn means a more accurate cost comparison between the self-hosted and proprietary managed options.

Finally, you can work out what that cost looks like head to head between the two options, as in the sketch below.
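
Here’s a hedged sketch of that head-to-head. Every figure in it (usage, per-token prices, the GPU hourly rate) is a placeholder assumption to show the arithmetic, not a real quote, and the self-hosted line deliberately excludes wider TCO items such as engineering time:

```python
# Hypothetical head-to-head monthly cost comparison. Every figure below
# (usage, per-token prices, GPU hourly rate) is a placeholder assumption;
# replace them with your measured usage and your providers' actual pricing.

# Projected usage, built from the inputs listed above.
avg_input_tokens_per_call = 3_000    # includes everything filling the context window
avg_output_tokens_per_call = 500
calls_per_hour = 600
hours_per_month = 730

monthly_calls = calls_per_hour * hours_per_month
monthly_input_tokens = monthly_calls * avg_input_tokens_per_call
monthly_output_tokens = monthly_calls * avg_output_tokens_per_call

# Option 1: proprietary managed LLM, billed per token (hypothetical rates).
price_per_1k_input = 0.0005   # USD
price_per_1k_output = 0.0015  # USD
managed_monthly = (monthly_input_tokens / 1000) * price_per_1k_input \
                + (monthly_output_tokens / 1000) * price_per_1k_output

# Option 2: self-hosted, billed per GPU-hour (hypothetical rate), assuming a
# single always-on GPU instance covers the projected throughput. This excludes
# wider TCO items such as engineering time, monitoring, and redundancy.
gpu_hourly_rate = 2.50  # USD per GPU-hour
self_hosted_monthly = gpu_hourly_rate * hours_per_month

print(f"Managed (per-token):    ${managed_monthly:,.0f}/month")
print(f"Self-hosted (GPU rent): ${self_hosted_monthly:,.0f}/month")
```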

Impacts

If you haven’t forecasted accurately and cost is a significant driver, this could lead you to the wrong decision. It could also lead to incorrectly sized GPU specs, which again skews the cost-driven comparison or results in unnecessary costs.

BONUS: Investigating serverless options

ALWAYS GO SERVERLESS IF YOU CAN!

Focus on your product, not infrastructure. Always offload this to someone else if you can.

What is serverless

You don’t manage servers or infrastructure: you just deploy your code, and the cloud provider (e.g., AWS) runs it and handles everything under the hood.
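
As a minimal sketch of what that deployment model looks like, here is a function written in the common AWS Lambda handler shape that simply forwards a prompt to an LLM API; the endpoint URL and payload fields are hypothetical placeholders:

```python
# Minimal sketch of the serverless model: you ship only this function and the
# provider runs, scales, and patches everything underneath it. The handler
# signature follows the common AWS Lambda convention; the LLM endpoint URL and
# payload fields are hypothetical placeholders.
import json
import urllib.request

LLM_ENDPOINT = "https://example-llm-endpoint.invalid/v1/chat/completions"  # hypothetical

def handler(event, context):
    """Take a user prompt from the request, forward it to an LLM API, return the reply."""
    prompt = json.loads(event["body"])["prompt"]
    request = urllib.request.Request(
        LLM_ENDPOINT,
        data=json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return {"statusCode": 200, "body": json.dumps(reply)}
```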

Why does this matter

This matters because it can make a significant contribution to the total cost of ownership of running a self-hosted LLM: you also need to provision, scale, maintain, and manage the availability of the underlying compute resources (the server). These activities are rarely simple and add to both the complexity and the operating cost of the whole operation. If you can offload this part of the stack to someone else, you should.

Impacts

Improper server management can cause:

  • Poor networking and latency.
  • Poor scaling under increased or unexpected traffic.
  • System instability, downtime, or crashes, with no automatic recovery or failover.
  • Security and data risks if the mechanisms and controls that manage security and access aren’t properly configured.

TL;DR conclusion

Choosing between self-hosting an LLM or using a proprietary managed LLM ultimately depends on your specific needs around economics, latency, flexibility, control, and security. Proprietary managed options massively simplify infrastructure and operational overhead but often come with usage-based costs and arguably less flexibility.

Self-hosting gives you full control and potential cost advantages at scale but introduces significant technical complexity, maintenance, and infrastructure responsibilities.

You will first need to truly understand your use case and, potentially, your customer base’s geography. That will then help you identify the appropriate LLM for the job, right-size your GPU needs, and find the most suitable and economical cloud/hardware provider.

I’m confident that if you understand your use case very well and can nail down some of these core evaluation points, you’ve covered 80% of the decision paths to making an optimal decision.