Friday, January 31, 2025

A tale of programming: Why is DeepSeek-R1 so powerful?

Introduction

In January 2025, DeepSeek R1 burst onto the scene as a new Large Language Model (LLM), with its developers claiming it had been built using substantially fewer resources (processing power, money, etc.) than comparable models on the market. Even more: it is open source, under the MIT license.

I use LLMs intensively for assistance with research and coding. I usually download these models to run them offline on my laptop, as a personal effort to reduce environmental impact, considering the consumption of resources such as power and water. I did so again here; given my laptop's limited processing power, I downloaded the smallest version of the model, DeepSeek-R1-Distill-Qwen-1.5B, from Hugging Face.

After installing DeepSeek R1, I was pleasantly surprised by how differently the model generates answers to prompts. See Figure 1 below.

Figure 1: How DeepSeek-R1 "thinks" before answering the prompt.

I then looked for additional references on this model, trying to understand how it achieved such notable performance with the resources used. In particular, I reviewed in detail how it was built by reading the "DeepSeek-V3 Technical Report".

Key factors

In my opinion, there are four factors that explain why DeepSeek R1 has achieved these outstanding results:

1. Research: Searching "DeepSeek" on arxiv.org shows about 77 papers published since September 2023.

2. Focus: DeepSeek R1 works with fewer parameters than other models, incentivizing reasoning capabilities in the model. Read Section 2, "Approach", in "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning".

3. Code optimization: The developers wrote significant sections of DeepSeek-R1 using Parallel Thread Execution (PTX) instead of programming with the CUDA toolkit. Notice that at the PTX level, the GPUs can be optimized as a Single Instruction, Multiple Data (SIMD) computer; and even more: the GPUs can transfer data among themselves without CPU intervention.
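To make the SIMD idea concrete, here is a minimal, purely illustrative sketch in Python: one logical instruction (here, a fused multiply-add) applied across several data "lanes" in lockstep. Real PTX exposes this at the GPU instruction level; this toy model only illustrates the concept, and the function and lane values are hypothetical.

```python
def simd_fma(a_lanes, b_lanes, c_lanes):
    """Apply one operation (a * b + c) across all lanes at once,
    mimicking how SIMD hardware executes a single instruction
    over multiple data elements in lockstep."""
    return [a * b + c for a, b, c in zip(a_lanes, b_lanes, c_lanes)]

# Four "lanes" processed by one logical instruction:
result = simd_fma([1.0, 2.0, 3.0, 4.0],
                  [10.0, 10.0, 10.0, 10.0],
                  [0.5, 0.5, 0.5, 0.5])
print(result)  # [10.5, 20.5, 30.5, 40.5]
```

On a GPU, thousands of such lanes run in parallel, which is why expressing the computation at this level can pay off so dramatically.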

4. Data compression: By using Vector Quantization (VQ) of the parameters, the volume of data transferred is significantly reduced. Note: I have personally used the VQ algorithm to compress echocardiographic video sequences; see my GitHub repository, CompressionVQ.
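The core of VQ can be sketched in a few lines: each parameter vector is replaced by the index of its nearest codebook entry, so only small integer indices (plus the shared codebook) need to be stored or transferred. This is a generic sketch, not DeepSeek's actual implementation; the codebook and data vectors below are made up for illustration.

```python
def squared_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def vq_encode(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    return [min(range(len(codebook)),
                key=lambda i: squared_dist(v, codebook[i]))
            for v in vectors]

def vq_decode(indices, codebook):
    """Reconstruct approximate vectors from their codebook indices."""
    return [codebook[i] for i in indices]

# Illustrative codebook of 3 entries and 3 data vectors:
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
data = [(0.1, -0.2), (0.9, 1.1), (-0.8, 0.9)]
indices = vq_encode(data, codebook)
print(indices)  # [0, 1, 2]
```

Each two-float vector (8 bytes in FP32) collapses into a single small index, which is where the savings in transferred volume come from; the price is the approximation error introduced when decoding.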

Conclusion

When solving a problem is mandatory and there are not enough resources, your only alternative is to make intensive use of the best and most powerful computers available: human brains!