Machine Learning Rig: The Build
Building a Machine Learning Rig from Misc Gaming Hardware
I’ve been following OpenAI and AI more generally for a few years now, and have done a few projects using LLMs and diffusion models (a GPT-3 Discord bot, Stable Diffusion videos). More recently I’ve developed an interest in the nitty-gritty details of how GPT-like models work, and what sort of hardware it takes to run them.
I love building things from scratch to really understand how they work, and I like trying to get software running on my own hardware (for example, I bought and set up three Raspberry Pis to learn about Kubernetes). Unfortunately, buying eight NVIDIA A100 GPUs was out of my price range, so I won’t be able to run or fine-tune models like BLOOM, but I did have three AMD RX 6800 GPUs lying around that I sniped off Trade Me real cheap after Ethereum’s proof-of-stake merge.
I figure this is enough hardware to get my feet wet, maybe train some smaller models or run some bigger ones highly quantised. They’re also AMD GPUs, meaning they don’t support CUDA natively, which could cause issues. However, ROCm seems to be coming along nicely and I’m always up for some additional challenge.
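To get a rough feel for what "smaller models" and "highly quantised" mean on 16 GB cards, here is a back-of-the-envelope sketch. It only counts the memory needed for the weights (parameters × bits per parameter ÷ 8); activations, the KV cache and framework overhead all come on top, so treat the numbers as a lower bound. The model sizes in the list are illustrative round numbers, not specific models I have tested.

```python
# Rough lower bound on VRAM needed just to hold a model's weights.
# Assumption: weights only; activations, KV cache and overhead are ignored.

VRAM_PER_CARD_GB = 16  # RX 6800
N_CARDS = 2            # cards currently in the rig


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Illustrative parameter counts (hypothetical, for the arithmetic only).
MODEL_SIZES = [("1.3B", 1.3e9), ("7B", 7e9), ("13B", 13e9), ("30B", 30e9)]

for name, n_params in MODEL_SIZES:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        if gb <= VRAM_PER_CARD_GB:
            verdict = "fits on one card"
        elif gb <= VRAM_PER_CARD_GB * N_CARDS:
            verdict = "needs both cards"
        else:
            verdict = "too big"
        print(f"{name:>5} @ {bits:>2}-bit: {gb:6.1f} GB  ({verdict})")
```

Anything in the "needs both cards" bucket also assumes the model can actually be split across GPUs, which is its own piece of work.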
The north star for this project is to get Copilot-like functionality in an IDE, running entirely on the local network. I’d also like to see if I can get suggestions on bespoke/private libraries by fine-tuning the model on new source code or using a vector database like Pinecone or Chroma.
This blog post outlines the hardware and early software setup.
Hardware list
- Motherboard
- Gigabyte: Z170X-Gaming 3
- RAM
- 2x 32 GB 2666 MHz DDR4 DIMM
- GPUs
- 2x AMD Radeon RX 6800 (16 GB VRAM)
- I also have a bunch of other GPUs that are less good
- CPU
- Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
- Storage
- 450 GB for the OS
- 2x 1 TB for storing training data / models
Build process
I roughly followed the design from this video
Here are some pics from the build:
Starting materials
Partial build
Computer hardware fitcheck
On-off button
Funny story: after putting it all together, I realised that I didn’t have a power button. I was turning it on and off by manually shorting the pins on the motherboard. Then I noticed the GPUs had an overclocking button on them, so I figured I could just use that. It was only attached to the GPU with some header pins, so I pulled it off and plugged it into the motherboard, and it worked without modification!
Finished product, integrated with the battlestation
OS and setup
- Ubuntu 20.04
- I went with Ubuntu because that seemed to have the best support for ROCm (based on a bit of googling)
- Software installation: ROCm, PyTorch, Jupyter
- I followed the installation guide for ROCm on Ubuntu 20.04
- I set up the two SSDs as a RAID 0 array using mdadm, and mounted them to /mnt/raid0.
- I also set up MinIO on the same RAID array, so I can use it as an S3-compatible bucket for storing training data and models, but for now I’m just using the drives as a local file system because it’s all on the same machine.
- Made a Python environment with PyTorch and Jupyter, with JupyterLab starting up on boot (see instructions here)
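Once the environment is set up, a quick sanity check is to ask PyTorch which GPUs it can see. The sketch below is a minimal version of that check, not something from the installation guide. It relies on the fact that ROCm builds of PyTorch expose AMD GPUs through the regular torch.cuda API; the helper name describe_gpus is just something I made up for this example.

```python
# Minimal check that PyTorch can see the GPUs.
# ROCm builds of PyTorch reuse the torch.cuda namespace for AMD cards.


def describe_gpus():
    """Return one summary line per visible GPU (empty list if none)."""
    try:
        import torch
    except ImportError:
        return []  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return []
    lines = []
    # torch.version.hip is set on ROCm builds and None on CUDA builds.
    hip = getattr(torch.version, "hip", None)
    backend = f"ROCm/HIP {hip}" if hip else "CUDA"
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        lines.append(
            f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB ({backend})"
        )
    return lines


for line in describe_gpus() or ["No GPU visible to PyTorch"]:
    print(line)
```

If this prints "No GPU visible to PyTorch" while rocm-smi lists the cards, the likely suspects are a CPU-only PyTorch wheel or a ROCm/PyTorch version mismatch.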
rocm-smi output
jerome@mlrig:~$ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 27.0c 5.0W 0Mhz 96Mhz 0% auto 203.0W 0% 0%
1 28.0c 6.0W 0Mhz 96Mhz 0% auto 203.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
I’m going to tinker with the GPU settings to optimise for machine learning performance, but for now I’m just using the default settings.
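As a first step towards monitoring, the concise table above is easy to turn into structured data. The sketch below parses text in that layout into one dictionary per GPU. It assumes the ten-column layout shown in my output; other ROCm versions may print different columns, in which case the FIELDS list (my own naming) needs adjusting.

```python
import re

# Column names for the concise table, in the order shown above.
# Assumption: the layout matches the output in this post.
FIELDS = [
    "gpu", "temp_c", "avg_power_w", "sclk_mhz", "mclk_mhz",
    "fan_pct", "perf", "power_cap_w", "vram_pct", "gpu_pct",
]

SAMPLE = """\
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 27.0c 5.0W 0Mhz 96Mhz 0% auto 203.0W 0% 0%
1 28.0c 6.0W 0Mhz 96Mhz 0% auto 203.0W 0% 0%
"""


def _number(token):
    """Strip the unit suffix ('27.0c' -> 27.0); keep non-numeric tokens as-is."""
    match = re.match(r"-?\d+(?:\.\d+)?", token)
    return float(match.group()) if match else token


def parse_concise(text):
    """Parse rocm-smi concise output into a list of per-GPU dicts."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Data rows start with the GPU index and have one token per field;
        # the header and the '====' separator lines are skipped.
        if len(tokens) != len(FIELDS) or not tokens[0].isdigit():
            continue
        row = {name: _number(tok) for name, tok in zip(FIELDS, tokens)}
        row["gpu"] = int(tokens[0])
        rows.append(row)
    return rows


for row in parse_concise(SAMPLE):
    print(row["gpu"], row["temp_c"], row["avg_power_w"], row["perf"])
```

To read live values instead of the sample, the text can come from subprocess.run(["rocm-smi"], capture_output=True, text=True).stdout.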
JupyterLab
Next steps
- Start training some models!
- Get good metrics/monitoring for the rig
Related links
Here are some links to things I referenced when building this rig:
- ROCm installation guide
- GPU rig build youtube video
- Tim Dettmers Blog
- This blog is amazing, so much good info on hardware (and in general) for machine learning
If you have any questions or comments, feel free to reach out to me at jeromeswannack@gmail.com, or add an issue to the repo for this project.