
Private Infrastructure for LLMs: A Guide to Running AI Locally

  • April 29, 2025
  • 13 min read
Marius Sandbu is a cloud evangelist for Sopra Steria in Norway, who mainly focuses on end-user computing and cloud-native technology. He is the author of books such as Mastering Citrix NetScaler and Getting started on Citrix NetScaler. Marius is a Microsoft MVP for Azure, Veeam Vanguard, vExpert EUC Champion, NVIDIA GRID Community Advisor and Citrix CTP.

Public cloud LLMs are powerful, but they don’t always meet security and compliance needs. Learn from our detailed guide how to deploy Large Language Models (LLMs) on your own infrastructure and create a private AI environment – from GPU requirements to scaling for multiple users.

Many are tapping into the power of LLMs and Generative AI via the public cloud, utilizing platforms like Google Vertex or Azure OpenAI. Yet, a number of organizations find themselves needing to host these technologies in their own data centers to meet specific governance or compliance mandates.

In this article, we’ll delve into the nuances of operating LLMs (Large Language Models) within our own infrastructure. We’ll explore the benefits of on-premises deployment, outline the necessary hardware and software requirements, and share best practices for effectively managing these models at scale.

The Importance of private infrastructure for specific use cases

Organizations are increasingly seeking solutions that combine the power of LLMs with the security and control of processing data within their own infrastructure.

To run LLMs efficiently within our own infrastructure, we need GPUs, and a lot of them.
LLMs are resource-intensive, particularly in terms of computational power, and they demand significant GPU memory to store both the model parameters and intermediate data during inference (i.e., processing prompts).

GPUs are optimized for high-performance parallel computation, and their high memory speed and bandwidth are essential for running LLMs effectively, allowing them to handle complex computations without being bottlenecked by memory-to-processor data transfers. Hosting this within your own datacenter can also give you more control over performance and latency compared to shared cloud services.

Technical Requirements for Running LLMs

When sizing an LLM deployment, a general rule of thumb is that the size of the model (in billions of parameters) x 2 (roughly 2 bytes per parameter at FP16/BF16 precision) + 20% overhead is the amount of GPU memory in GB required to have the model loaded into memory.

As an example, a model like Llama 3.2 with 11 billion parameters would require at least 11 (billion parameters) x 2 = 22 GB + 20% overhead = 26.4 GB of GPU memory. Therefore, it is recommended to use at least a GPU like the NVIDIA A100, which has 40 GB of GPU memory. This leaves enough headroom to both store the model and handle some inference processing.
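
As a quick sanity check, below is a minimal sketch of that rule of thumb in Python; the function name and defaults are just for illustration.

def estimate_gpu_memory_gb(params_billion: float,
                           bytes_per_param: int = 2,
                           overhead: float = 0.20) -> float:
    """Rough GPU memory needed to load a model for inference.

    Assumes FP16/BF16 weights (2 bytes per parameter) plus ~20% overhead
    for the KV cache and runtime buffers, per the rule of thumb above.
    """
    return params_billion * bytes_per_param * (1 + overhead)

# Example: an 11-billion-parameter model such as Llama 3.2 11B
print(estimate_gpu_memory_gb(11))  # ~26.4 GB, which fits on a 40 GB A100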

However, if the use case is to build a service for a large number of users, the hardware requirements will be even higher.

Unfortunately, there isn’t an exact formula to calculate GPU requirements for a production workload, as it largely depends on the specific usage patterns of the service. A practical approach is to start by testing the model at a smaller scale, measuring the average GPU memory usage, and using that data to estimate the needs for an average user.
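
As an example of such a measurement, here is a small sketch that samples average GPU memory usage during a load test, assuming nvidia-smi is available on the host; the helper function is hypothetical and only meant to illustrate the approach.

import subprocess
import time

def average_gpu_memory_mib(samples: int = 10, interval_s: float = 5.0) -> float:
    """Poll nvidia-smi periodically and return the average used GPU memory (MiB)."""
    readings = []
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        # One line per GPU; sum them on a multi-GPU host
        readings.append(sum(int(line) for line in out.strip().splitlines()))
        time.sleep(interval_s)
    return sum(readings) / len(readings)

print(f"Average GPU memory used: {average_gpu_memory_mib():.0f} MiB")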

In addition, there are other scenarios to plan for, for instance if we intend to adapt the model using fine-tuning. Fine-tuning requires even more GPU memory and should ideally run on dedicated hardware so it does not impact the LLM service for regular users.

There are also smaller models that can run directly on a user’s device. This year has seen the release of many tools and services that enable us to run large language models (LLMs) locally. These tools not only run the models but also expose a set of standardized APIs, which allows us to build software on top, such as RAG applications.

How to run LLMs locally

Tools such as Ollama and LM Studio provide a simple way to download and host LLMs directly on a user’s device. For instance, Ollama runs as a service on Linux, Windows, and macOS and provides a simple CLI that allows us to download optimized models from its model library.

The following commands download Llama 3.2 using Ollama, which also exposes an API endpoint for the model, and then start an interactive session with it:

ollama pull llama3.2

ollama run llama3.2

In addition to this, Ollama provides an API which can be utilized by other tools such as AnythingLLM or Continue. Continue is an extension for Visual Studio Code that can be used as a replacement for GitHub Copilot. Below is a screenshot showing how AnythingLLM can use Ollama for LLM processing.

Anything LLM | LLM Preference
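
To illustrate that API: Ollama listens on localhost port 11434 by default and exposes a simple REST endpoint. Below is a minimal sketch using Python and the requests library, assuming the llama3.2 model has already been pulled.

import requests

# Ollama's default local endpoint; "stream": False returns one JSON object
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain retrieval-augmented generation in one sentence.",
        "stream": False,
    },
    timeout=120,
)
print(response.json()["response"])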

AnythingLLM also provides embedding of content, allowing us to add our own data such as PDFs, Word files, and pictures, so that we can chat with our own data. This gives users a simple way to have their own ChatGPT alternative on their own device.

Tools like Ollama work well for single-user sessions. However, if you have use cases where you want to provide a centralized service for multiple users, you need a service that can handle that scale.

Providing Private LLMs at scale

When it comes to providing a centralized service that supports multiple concurrent users, such as an internal chatbot, developer assistants, or a company-wide RAG (Retrieval-Augmented Generation) service, the architecture must be designed to handle that scale.

Therefore, some key considerations are important:

  1. Load Balancing and Model Parallelism
    In multi-user scenarios, a single instance of a model may not be sufficient to serve all incoming requests. To handle this, load balancing and model sharding can be employed. Model parallelism splits the model across multiple GPUs, while data parallelism replicates the model across GPUs so that different replicas handle different requests. Frameworks like vLLM are useful tools for managing inference at scale.
  2. Serving Frameworks
    Tools like vLLM (optimized for serving LLMs with high throughput and low latency) or Triton Inference Server by NVIDIA can serve multiple models simultaneously, manage batching efficiently, and even support dynamic model loading. These tools integrate with Kubernetes for orchestration and can expose APIs to frontend applications. One can also use turnkey platforms like VMware Private AI, Nutanix Enterprise AI, or Azure Arc Edge RAG. Most of these platforms also run natively on Kubernetes. (A short vLLM sketch follows this list.)
  3. Model Quantization and Optimization
    Running LLMs at scale can be optimized further through quantization techniques, such as INT4 or INT8 precision. This can reduce memory usage significantly with minimal accuracy loss.
  4. Multi-Tenant Architecture
    If the LLM service is shared across departments or teams, it’s important to isolate workloads and manage quotas. Techniques such as request throttling, per-user model sessions, or fine-tuned models for different user groups can be introduced. Using an API gateway with authentication and logging, with tools like OpenLLM, helps secure and monitor usage.
  5. Storage and Vector Databases for RAG
    RAG applications require document stores or vector databases to retrieve relevant content. Tools like Chroma, Weaviate, AstraDB, or Qdrant integrate well with private LLM infrastructure (a retrieval sketch also follows below).
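
To make points 1–3 more concrete, here is a minimal sketch of serving a model with vLLM's offline Python API, splitting it across two GPUs with tensor parallelism; the model name is only an example, and the quantization option assumes a pre-quantized checkpoint.

from vllm import LLM, SamplingParams

# Split the model across 2 GPUs (tensor parallelism); adjust to your hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model name
    tensor_parallel_size=2,
    # quantization="awq",  # optional: serve a quantized build to cut memory use
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these prompts automatically (continuous batching)
prompts = [
    "Summarize our VPN policy in two sentences.",
    "Write a SQL query that lists inactive users.",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)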
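
For point 5, here is a minimal retrieval sketch using Chroma's in-memory client; the collection name and documents are made up for illustration, and a production RAG service would use a persistent or server-based deployment and pass the retrieved text to the LLM as context.

import chromadb

client = chromadb.Client()  # in-memory client, for illustration only
collection = client.create_collection("internal-docs")

# Index a few documents; Chroma embeds them with its default embedding model
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Employees can request a VPN token through the IT self-service portal.",
        "Backups of the HR database run every night at 02:00 CET.",
    ],
)

# Retrieve the most relevant chunk for a user question
results = collection.query(query_texts=["How do I get VPN access?"], n_results=1)
print(results["documents"][0][0])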

So you can either go in the direction of turnkey platforms from vendors like Nutanix, Microsoft, VMware, and others, or go the open-source route and use frameworks like vLLM or TensorRT-LLM from NVIDIA.

Most tools and frameworks in the ecosystem today expose an OpenAI-compatible API, making it easy to switch between different services and LLMs.
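
As an illustration, the standard OpenAI Python client can simply be pointed at a local endpoint. The sketch below assumes a vLLM server exposing its OpenAI-compatible API on port 8000; the base URL and model name are assumptions and should be adjusted to your deployment.

from openai import OpenAI

# Point the standard client at a local OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model name
    messages=[{"role": "user", "content": "What is retrieval-augmented generation?"}],
)
print(reply.choices[0].message.content)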

Summary

Over the last year, we have seen more and more language models released as open source. Many of them rival the large models from OpenAI in terms of quality and “intelligence”. Because these models are open source, we can host them on our own infrastructure and use them as a base for:

  • Virtual Assistants
  • Chatbots
  • RAG services
  • Developer assistants

All of this while ensuring that data and processing remain safely contained within our own infrastructure. While the ecosystem is still young, a lot will happen this year that will make it even easier to provide these models and capabilities in our own private cloud.

Hey! Found Marius’s article helpful? Looking to deploy a new, easy-to-manage, and cost-effective hyperconverged infrastructure?
Alex Bykovskyi, StarWind Virtual HCI Appliance Product Manager
Well, we can help you with this one! Building a new hyperconverged environment is a breeze with StarWind Virtual HCI Appliance (VHCA). It’s a complete hyperconverged infrastructure solution that combines hypervisor (vSphere, Hyper-V, Proxmox, or our custom version of KVM), software-defined storage (StarWind VSAN), and streamlined management tools. Interested in diving deeper into VHCA’s capabilities and features? Book your StarWind Virtual HCI Appliance demo today!