Self-hosting AI LLMs

With all of the recent focus on DeepSeek, I decided to check out the fuss. With a degree of caution, I wanted to self-host DeepSeek, but also take the opportunity to look at other LLMs.

If you are reading this, you likely know what an LLM is. Simply put, it stands for large language model. LLMs use neural network techniques to process natural language. Not all models are equal, and the most famous one (until recently) is OpenAI’s ChatGPT.

Large organisations have been pushing the boundaries, at massive expense in both development and operation. To host and run a large LLM, you need a massive amount of processing infrastructure. This is where distilled models come into the mix. Distilled LLMs use a mechanism whereby one LLM is used to “train” another LLM, distilling its behaviour down into a smaller variant. For self-hosters, this brings the opportunity to run LLMs due to the reduction in both the size of the models and the amount of hardware required.

So, I’m going to walk through my experience of setting up and running some distilled LLMs, specifically deepseek-r1 and phi. To do this, I’m going to set up Docker containers running Ollama and Open WebUI.

Here is my docker compose file:

services:
  webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: webui
    ports:
      - 7000:8080/tcp
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      - ollama
    restart: unless-stopped
  ollama:
    image: ollama/ollama
    container_name: ollama
    expose:
      - 11434/tcp
    ports:
      - 11434:11434/tcp
    healthcheck:
      test: ollama --version || exit 1
    volumes:
      - ollama:/root/.ollama
    restart: unless-stopped
volumes:
  ollama:
  open-webui:
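Since the compose file publishes Ollama on port 11434, one way to confirm the API is answering is to ask it which models it has locally. This is a minimal sketch, assuming it runs on the Docker host itself (hence `localhost`) and that the `/api/tags` endpoint returns a `models` list; the function names are my own for illustration.

```python
import json
import urllib.request

# Assumption: the script runs on the Docker host, where port 11434 is published.
OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload: dict) -> list[str]:
    """Pull the model names out of the JSON returned by GET /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask Ollama which models are already downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

Calling `list_local_models()` on a fresh install should give an empty list, which is still a useful sign that the container is up and reachable.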

Once the container is brought up, you should see this:

You’ll then need to set up a user account:

And then you’ll get the following screen:

The home page looks something like this:

So we now need to add a model. Go up to the top left and click on the “Select a model” option:

Type in the model you want to add. To go small, type in “deepseek-r1:1.5b”:

The model will then download in the background. Once it is ready, you can start querying it.
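The same download can also be requested straight from Ollama rather than through the web interface. The sketch below assumes Ollama's `/api/pull` endpoint on the published port and a non-streaming request; the helper names are hypothetical, and the exact shape of the reply may vary between Ollama versions.

```python
import json
import urllib.request


def build_pull_request(model: str, base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a POST request asking Ollama to download a model."""
    body = json.dumps({"model": model, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/pull",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pull_model(model: str) -> dict:
    """Send the pull request and return whatever status JSON Ollama replies with."""
    # Large models can take a long time to download, so the timeout is generous.
    with urllib.request.urlopen(build_pull_request(model), timeout=3600) as resp:
        return json.load(resp)
```

For example, `pull_model("deepseek-r1:1.5b")` would request the small variant used here.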

Let’s start with a simple prompt:

“how many countries in the world are there?”
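The same prompt can be sent without the web interface at all. Here is a rough sketch, assuming Ollama's `/api/generate` endpoint on the published port and that the model has already been downloaded; `ask` and `build_generate_payload` are names I have made up for the example.

```python
import json
import urllib.request


def build_generate_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming request body for POST /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def ask(model: str, prompt: str, base_url: str = "http://localhost:11434") -> str:
    """Send one prompt to Ollama and return the generated text."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=build_generate_payload(model, prompt),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    # CPU-only inference can be slow, so allow plenty of time.
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["response"]
```

Something like `ask("deepseek-r1:1.5b", "how many countries in the world are there?")` would then return the model's reply as a string.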

On the VM I’m running this in, it responded really quickly (within a couple of seconds). With all AI-generated content, the main thing is to check how accurate it is. So how many countries are there? A quick check on Wikipedia shows that there are indeed 193 countries recognised by the UN:

And for those interested, here are the stats behind this prompt and response:
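The most useful number to pull out of stats like these is generation speed. As a hedged sketch: Ollama's API responses include `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), and I am assuming those are the figures behind what the web interface displays.

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    duration_s = stats["eval_duration"] / 1e9
    if duration_s == 0:
        return 0.0  # avoid dividing by zero on an empty response
    return stats["eval_count"] / duration_s


# Made-up figures purely to show the arithmetic: 100 tokens in 4 seconds.
example = {"eval_count": 100, "eval_duration": 4_000_000_000}
print(tokens_per_second(example))  # 25.0
```

Comparing this figure between models gives a fairer picture of speed than wall-clock time, since longer answers naturally take longer.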

Okay, so let’s move on to something a bit more complex. Let’s ask it a history question. Being Scottish, let’s see what it knows about Mary, Queen of Scots…

You’ll notice the context-aware part – the beauty of generative AI is that it remembers the conversation. So it then starts to “think” about my first question in relation to my second one. Interestingly, the facts about Mary, Queen of Scots are completely wrong.
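The “remembering” is worth a quick illustration: with a chat-style API, the model does not hold state itself, so the whole conversation so far is sent along with each new question. A minimal sketch of that, assuming the message format used by Ollama's `/api/chat` endpoint (the helper names are hypothetical):

```python
def add_turn(history: list[dict], role: str, content: str) -> list[dict]:
    """Return a new history with one more message appended."""
    return history + [{"role": role, "content": content}]


def build_chat_payload(model: str, history: list[dict]) -> dict:
    """Request body for POST /api/chat: every earlier turn is included."""
    return {"model": model, "messages": history, "stream": False}


history: list[dict] = []
history = add_turn(history, "user", "how many countries in the world are there?")
# ...the assistant's reply would be appended here with role "assistant"...
history = add_turn(history, "user", "What do you know about Mary, Queen of Scots?")
payload = build_chat_payload("deepseek-r1:1.5b", history)
print(len(payload["messages"]))  # 2
```

This is why the first question bleeds into the second answer, and why starting a new chat gives the model a clean slate.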

So I decided to start a new chat and see how it got on. The interesting part is how the model “thinks”.
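DeepSeek-R1 style models wrap their reasoning in `<think>` tags before giving the final answer. If you are handling the raw text yourself, the two parts can be separated along these lines. This is a sketch that assumes the tags appear in the raw output; some front ends and newer Ollama versions may already split the reasoning out for you.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

Keeping the reasoning separate makes it easier to see where a small model goes wrong before it commits to an answer.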

Let’s see what it comes up with:

Again, it looks totally legit and believable, but it is completely historically inaccurate 😦

This is the issue with distilled models, and the 1.5B variant of deepseek-r1 is very small. So let’s go bigger (think “Tim the tool man Taylor” and “more power” 😄). Let’s try the 32B model, adding it by following the steps above and entering “deepseek-r1:32b”.

This will test the hardware I’m using. Well, a VM running Docker, but you know what I mean. I don’t have a GPU in the system I’m using, so it isn’t very efficient.

So asking the same question inevitably results in:

But it starts to think, slowly:

This is much more accurate. However, it is really, really, really slow. To give an idea of the resource usage on the VM:

So it is absolutely hammering the vCPUs and using all of the 32GB allocated to the VM, which the model needs in order to run.
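A back-of-the-envelope calculation helps explain the memory pressure. The sketch below estimates the size of the weights alone from the parameter count, assuming roughly 4 bits per weight; that figure is my assumption about the quantisation of the downloaded build, and the real footprint will be higher once the context cache and runtime overhead are added.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights only, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: about 4 bits per weight for both variants.
print(round(weight_memory_gib(1.5, 4), 1))  # 0.7
print(round(weight_memory_gib(32, 4), 1))   # 14.9
```

On those assumptions the 32B model needs around twenty times the memory of the 1.5B one before it has generated a single token, which fits with how hard the VM was working.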

To compare, the public ChatGPT model, with all of the power behind it, replies in a split second (with historical accuracy):

So the main thing I took from this is that it is more than possible to self-host LLMs; however, you need to set expectations. Depending on what you are doing, it may well be fit for purpose. I’ve seen a lot of posts around using distilled models with Home Assistant for local voice recognition and processing. If you are looking for it to generate text on a topic, then you need to be realistic.

After playing with the deepseek distilled models, I wanted to try something a bit different. Mathematics, maths (not math 😃). To do that, I downloaded the phi LLM from Microsoft. I used the phi-4 model.

I then proceeded to ask it some GCSE maths questions and was pleasantly surprised by the accuracy of the responses. The model is pretty big (14.7B) by self-hosting standards, so it takes time to get a response on my VM.

Here is the problem I posed (the answer is in red):