Local DeepSeek - "how hard could it be?"

first workshopped ~February 2025, published September 2025

DeepSeek snagged a week of global tech headlines with the late-January 2025 release of their "R1" language model and chatbot app.

One of the most interesting things about its release was that the model weights were published fully open-source under an MIT License, so in theory anyone could run their own instance of it.

I've had a technical interest in neural network language models since implementing the Transformer architecture in TensorFlow 2 for an assignment during the exchange semester I was extremely lucky to arrange at ETH Zurich back in the first half of 2019. So I'm at least familiar, in one instance, with the work involved in getting neural network stuff running on a GPU.

This all led me to the question... if you really can "run it yourself", "...how hard could it be?"


The main point of release for the DeepSeek model weights was this page on the AI model index site HuggingFace: https://huggingface.co/deepseek-ai/DeepSeek-V3/tree/main. (The code is also on GitHub, but that repo excludes the weights.)

We can see that it's a git repo, but also that there's a whole stack of .safetensors files tagged as Git Large File Storage (LFS) objects, ranging from ~1GB to ~7GB in size - 163 of them in total.

It turns out this sums to 688GB of data. So you'd not only need the internet bandwidth and time to download that much data, you'd also need a spare 1TB disk, or a big chunk of free disk space, just to store your own copy of the weights.
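
If you did want to pull it all down, the huggingface_hub Python package can do both the "how big is this, really?" check and the actual download. A rough sketch (the local path is just illustrative, and you'd want a fast, resumable connection):

    from huggingface_hub import HfApi, snapshot_download

    # Add up the size of every file in the repo - the .safetensors shards dominate.
    api = HfApi()
    info = api.model_info("deepseek-ai/DeepSeek-V3", files_metadata=True)
    total_bytes = sum((f.size or 0) for f in info.siblings)
    print(f"~{total_bytes / 1e9:.0f} GB across {len(info.siblings)} files")

    # And if you really do have the bandwidth and the spare disk:
    snapshot_download(
        repo_id="deepseek-ai/DeepSeek-V3",
        local_dir="./DeepSeek-V3",  # illustrative path - needs ~700GB+ free
    )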

But hey, "storage is cheap" right? And we are well past the 256k ADSL days of my childhood...


Assuming you can get and store a copy of the weights, the next question is: what do you even use or run them with? Well, thankfully the DeepSeek repo has an entire README section on running it locally. Amongst other things it mentions the SGLang framework, which is pitched as "a fast serving framework for large language models" and which, as of version 0.4.1, apparently "fully supports running DeepSeek-V3 on both NVIDIA and AMD GPUs". Sounds pretty good!

SGLang has a whole page in its repo dedicated to DeepSeek V3 support, including some recommended minimum hardware requirements. Now, what minimum hardware do you think you might need for an LLM like this? Unfortunately the answer seems to be... eight NVIDIA H200 datacenter GPUs.
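
Assuming you somehow had a machine like that, actually serving the model looks deceptively simple: one launch command to shard the model across all eight GPUs, then an OpenAI-compatible HTTP endpoint to talk to. A rough Python sketch of the idea, assuming the flags and the default port (30000) from the docs I was reading haven't changed since:

    import subprocess
    import time

    import requests

    # Launch the SGLang server, sharding DeepSeek-V3 across the eight GPUs (--tp 8).
    # Flags as per the SGLang / DeepSeek docs at the time - they may have changed since.
    server = subprocess.Popen([
        "python3", "-m", "sglang.launch_server",
        "--model-path", "deepseek-ai/DeepSeek-V3",  # or a local path to the 688GB of weights
        "--tp", "8",
        "--trust-remote-code",
    ])

    # Loading 688GB of weights takes a while, so keep retrying until the
    # OpenAI-compatible endpoint (port 30000 by default) starts answering.
    payload = {
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [{"role": "user", "content": "How hard could it be?"}],
    }
    while True:
        try:
            resp = requests.post("http://localhost:30000/v1/chat/completions",
                                 json=payload, timeout=600)
            break
        except requests.ConnectionError:
            time.sleep(30)

    print(resp.json()["choices"][0]["message"]["content"])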

These aren't your garden-variety consumer GPUs that you'd be likely to find on the shelf at a Scorptec or CentreCom store - so how do you get them, and how much do they cost? Some quick searching led to this page from "TRG Datacenters":

https://www.trgdatacenters.com/resource/nvidia-h200-price-guide/

The NVIDIA H200 offers nearly double the GPU memory of the H100. (141GB instead of the previous 80-90 GB).

Well, that explains the size of the weights.
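
Or, to put some rough numbers on it (my own back-of-the-envelope arithmetic, not TRG's):

    # Roughly how the 688GB of weights maps onto H200 memory.
    weights_gb = 688       # total size of the .safetensors shards
    h200_hbm_gb = 141      # memory per H200

    print(weights_gb / h200_hbm_gb)   # ~4.9 - about five GPUs just to hold the weights
    print(8 * h200_hbm_gb)            # 1128GB across the recommended eight GPUs,
                                      # leaving headroom for KV cache and activations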

It turns out there are two main variants you can get - "SXM" or "NVL". (SXM is, interestingly, a custom socket for connecting GPU chips onto a motherboard - there are some good pictures showing it on the SXM Socket Wikipedia page. NVL appears to just be a more usual PCI Express card offering.)

The NVIDIA H200 SXM GPU comes on custom-built NVIDIA SXM boards in 4 or 8 GPU combinations. 900 GB/s NVLink interconnects the GPUs so all work together to provide high-speed performance. The power profile of the SXM is slightly higher at 700W, but the workload optimization it provides more than makes up for it.

A 4-GPU SXM board costs about $175,000.00, and an 8-GPU board costs anywhere from $308,000 to $315,000, depending on the manufacturer. A single SXM GPU chip is not available for purchase at the time of writing.

Good grief. What about the NVL?

The NVIDIA H200 NVL GPU offers the same memory and bandwidth as the SXM but a slightly better power profile at 600W. You also get a choice of 2-way or 4-way NVLink. A 2-way NVLink bridge connects two GPUs, but a 4-way NVLink bridge inter-connects four GPUs, offering more powerful parallel processing capabilities and allowing you to build cutting-edge multi-GPU systems.

A single NVL GPU card with 141GB of memory costs between $31,000 and $32,000.

Well, that's, uh, somewhat more manageable, maybe. But hang on, what's this next bit?

NVL H200 server boards must be custom-made from NVIDIA partners, and you can request the number of GPUs and NVLink connections. The price starts at $100,000 and can go up to $350,000, depending on your chosen configurations.

Cripes.

You can see in some photos here that the cards do seem extremely long - so they may not even fit in a normal desktop PC case anyway: https://www.servethehome.com/nvidia-h200-nvl-4-way-shown-at-ocp-summit-2024/

But, anyway, there you have it - for a mere ~$300,000, a 5-6kW power budget (that 700W figure is per GPU, and you need eight of them), and 688GB of downloading and file storage, DeepSeek's V3 LLM could be yours to ask whatever questions you feel like, without having to send them to a third-party web service.


Further questions unanswered as of this point:


Publishing note: This post was formulated and researched in a chat with friends the weekend of the DeepSeek release, but was formally written up later in July 2025. And then finally published in September 2025.

...it took me a while to come up with some kind of structure that I was happy with for this blog, OK!! And my planned automated build system for it all isn't even done yet (that will get a write-up of its own in a later post, of course...). I seem to have decided that this website is going to take a decidedly non-linear form, so... we'll see how that continues to play out, I guess.