Friday, January 31, 2025

A tale of programming: Why is DeepSeek-R1 so powerful?

Introduction

In January 2025, DeepSeek R1 burst onto the scene as a new Large Language Model (LLM), with its developers claiming it had been built using substantially fewer resources (processing power, money, etc.) than comparable models on the market. Even more: it is open source, under the MIT license.

I use LLMs intensively for assistance with research and coding. I usually download these models to run them offline on my laptop, as a personal effort to reduce environmental impact, considering the consumption of resources such as power and water. I did so again here; given my laptop's limited processing power, I downloaded the smallest version of the model, DeepSeek-R1-Distill-Qwen-1.5B, from Hugging Face.

After installing DeepSeek R1, I was pleasantly surprised by how differently the model generates answers to prompts. See Figure 1 below.

Figure 1: How DeepSeek-R1 "thinks" before answering the prompt.

I then looked for additional references on this model, trying to understand how it achieved such notable performance with the resources used. In particular, I reviewed in detail how it was built by reading the "DeepSeek-V3 Technical Report".

Key factors

In my opinion, there are four factors that explain why DeepSeek R1 has achieved these outstanding results:

1. Research: Searching "DeepSeek" on arxiv.org shows about 77 papers published since September 2023.

2. Focus: DeepSeek R1 works with fewer parameters than other models, incentivizing reasoning capabilities in the model. Read Section 2, "Approach", in "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning".

3. Code optimization: The developers wrote significant sections of DeepSeek-R1 using Parallel Thread Execution (PTX) instead of programming with the CUDA toolkit. Notice that at the PTX level, the GPUs can be optimized as a Single Instruction, Multiple Data (SIMD) computer; and even more: the GPUs can transfer data among themselves without CPU intervention.
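To make the SIMD idea concrete, here is a minimal, purely illustrative sketch in Python: one logical instruction (here, a fused multiply-add) applied across several data "lanes" in lockstep. Real PTX exposes this at the GPU instruction level; this toy model only illustrates the concept, and the function and lane values are hypothetical.

```python
def simd_fma(a_lanes, b_lanes, c_lanes):
    """Apply one operation (a * b + c) across all lanes at once,
    mimicking how SIMD hardware executes a single instruction
    over multiple data elements in lockstep."""
    return [a * b + c for a, b, c in zip(a_lanes, b_lanes, c_lanes)]

# Four "lanes" processed by one logical instruction:
result = simd_fma([1.0, 2.0, 3.0, 4.0],
                  [10.0, 10.0, 10.0, 10.0],
                  [0.5, 0.5, 0.5, 0.5])
print(result)  # [10.5, 20.5, 30.5, 40.5]
```

On a GPU, thousands of such lanes run in parallel, which is why expressing the computation at this level can pay off so dramatically.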

4. Data compression: By using Vector Quantization (VQ) of the parameters, the volume of data transferred is significantly reduced. Note: I have personally used the VQ algorithm to compress echocardiographic video sequences; see my GitHub repository, CompressionVQ.
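The core of VQ can be sketched in a few lines: each parameter vector is replaced by the index of its nearest codebook entry, so only small integer indices (plus the shared codebook) need to be stored or transferred. This is a generic sketch, not DeepSeek's actual implementation; the codebook and data vectors below are made up for illustration.

```python
def squared_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def vq_encode(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)),
                key=lambda i: squared_dist(v, codebook[i]))
            for v in vectors]

def vq_decode(indices, codebook):
    """Reconstruct approximate vectors from their codebook indices."""
    return [codebook[i] for i in indices]

# Illustrative codebook of 3 entries and 3 data vectors:
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
data = [(0.1, -0.2), (0.9, 1.1), (-0.8, 0.9)]
indices = vq_encode(data, codebook)
print(indices)  # [0, 1, 2]
```

Each two-float vector (8 bytes in FP32) collapses into a single small index, which is where the savings in transferred volume come from; the price is the approximation error introduced when decoding.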

Conclusion

When solving a problem is mandatory and there are not enough resources, your only alternative is to make intensive use of the best and most powerful computers available: human brains!