Model distillation appeared in the news recently when Chinese AI company DeepSeek caused a stir in the tech world, suspected of using this technique. What is distillation, and why is it important?
Whether it’s a chatbot answering customer queries or a large-scale AI generating human-like text, the quality, size, and diversity of data play a critical role. But data isn’t always readily available—privacy concerns, high costs, and access limitations often stand in the way.
This is where two techniques come in. Model distillation involves compressing a large model by transferring its knowledge to a smaller, more efficient and lightweight one. That’s what DeepSeek did: it distilled knowledge from its R1 model into Qwen 2.5 32B to create DeepSeek-R1-Distill-Qwen-32B. Data synthesis means creating realistic artificial data for AI training.
FLock’s research team is excited to analyse and break down the latest AI trends for you. Stay tuned for more educational blogs about why they matter for the future of DeAI.
Data: quality, size and variety
AI models rely on vast amounts of data to learn patterns and make predictions. But getting the right dataset is tricky.
Poor-quality data leads to poor model performance: garbage in, garbage out. And more isn’t automatically better; neither a larger dataset nor a bigger model guarantees improved results. Models perform best when trained on diverse, high-quality datasets.
The field of AI research has long debated the best way to scale models. OpenAI’s paper ‘Scaling Laws for Neural Language Models’ (2020) suggested that model size was the dominant factor in AI performance.
Meanwhile, DeepMind’s paper ‘Training Compute-Optimal Large Language Models’ (2022), which introduced the Chinchilla scaling law, countered that many models were undertrained: for a fixed compute budget, model size and dataset size should scale together.
The takeaway? Striking the right balance between data size and model complexity is crucial.
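As a back-of-the-envelope illustration, the Chinchilla paper’s headline rule of thumb is roughly 20 training tokens per parameter (the paper fits full loss curves; this ratio is just the memorable summary):

```python
# Back-of-the-envelope Chinchilla rule of thumb: ~20 training tokens per parameter.
# Illustrative only; the paper derives this from fitted loss curves.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

# Chinchilla itself: a 70B-parameter model trained on ~1.4T tokens
print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f}T tokens")
```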
Model distillation: making AI smaller and smarter
The problem with LLMs is their sheer size: training them requires thousands of GPUs and is costly and resource-intensive. Model distillation takes a big model and squishes it into a smaller version, transferring knowledge like a teacher to a student.
How does model distillation work?
- Knowledge transfer: A smaller AI model learns patterns from a larger one.
- Response-based distillation: The student model mimics the teacher model’s probability distributions (sketched in code after the techniques list below).
- Feature-based & relation-based learning: The student learns internal patterns and relationships within the data.
Techniques used in model distillation:
- Temperature scaling (adjusting how confident the model is in its predictions)
- Dataset optimisation (focusing on the most relevant data points)
- Generative distillation (using AI to generate synthetic training data)
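Concretely, response-based distillation plus temperature scaling often comes down to a single loss term. Here’s a minimal PyTorch sketch of the classic recipe from Hinton et al. (2015), not DeepSeek’s exact setup:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    # Temperature T > 1 softens both distributions, exposing the teacher's
    # "dark knowledge" about how classes/tokens relate to one another.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence pulls the student's distribution towards the teacher's;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
```

In practice this term is usually blended with a standard cross-entropy loss on the ground-truth labels, so the student learns from both the teacher and the data.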
For example, DeepSeek successfully applied model distillation through exactly this teacher-student training process, achieving near-state-of-the-art performance at a fraction of the computational cost.
Applying synthesis & distillation to build better AI
Say you need to create a small AI model (~3 billion parameters) that can roleplay as any persona. Here’s how you might approach it:
- Choose a base model: an existing pre-trained model to build on.
- Generate training data: Use multi-agent data synthesis (like CamelAI) to create diverse roleplaying scenarios.
- Optimise the model: Use model distillation techniques to make the AI lightweight yet effective.
The key challenge is deciding whether to train on a fixed set of personas, or develop a generalisable system for infinite personas. This decision will shape how the data synthesis process is designed.
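For the fixed-persona route, the data-generation step might look something like the sketch below. It assumes the OpenAI Python client (v1+) with an API key in the environment; the personas and prompts are invented purely for illustration:

```python
# Minimal sketch: synthesising roleplay data for a fixed persona set.
# Assumes the openai Python package (v1+) and OPENAI_API_KEY in the environment;
# personas and prompts here are illustrative, not any production setup.
from openai import OpenAI

client = OpenAI()
personas = ["a Victorian detective", "a cheerful ship's cook", "a stoic park ranger"]

def synthesise_example(persona: str, user_line: str) -> dict:
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Stay strictly in character as {persona}."},
            {"role": "user", "content": user_line},
        ],
    ).choices[0].message.content
    # Each pair becomes one instruction-tuning example for the ~3B student
    return {"persona": persona, "prompt": user_line, "response": reply}

dataset = [synthesise_example(p, "Introduce yourself.") for p in personas]
```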
Data synthesis: creating data from scratch
As a concept, data synthesis is nothing new, with key milestones stretching back to the 1980s. But in the LLM era it surged in popularity after models like ChatGPT and Gemini emerged. Here, we present an update on the newest methods.
Data synthesis is the process of using AI algorithms to generate artificial data that mimics real-world data, essentially creating a “synthetic dataset” for training and testing machine learning models without relying on real-world datasets.
Advantages of synthetic data
Data synthesis comes with several advantages related to privacy, ethics, scalability, and standardisation.
It avoids issues associated with personal data, which is especially beneficial for healthcare use cases where synthetic patient data can be generated for research. Also, you can generate as much data as needed, while ensuring consistency across datasets.
How is synthetic data generated?
There are several approaches:
- Rule-based generation (e.g., the Faker library for fake names, emails, etc.; see the sketch after this list)
- Data augmentation (e.g., NLP tools like NLPAug and TextAttack to tweak text data)
- LLM-based synthesis (e.g., GPT-4o generating chatbot dialogues)
- Simulated data (e.g., GANs creating realistic images, and SynthCity for privacy-preserving tabular data)
- Multi-agent AI generation (e.g., CamelAI to create diverse roleplaying scenarios)
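To make the first approach concrete, here’s a minimal rule-based sketch using the Faker library; the record schema is illustrative:

```python
# Rule-based synthetic records with Faker (pip install faker).
# The schema below is illustrative, not tied to any particular dataset.
from faker import Faker

fake = Faker()

def synthetic_user() -> dict:
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "signup_date": fake.date_this_decade().isoformat(),
    }

records = [synthetic_user() for _ in range(1000)]
```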
For example, CamelAI uses a structured, multi-agent system to create high-quality synthetic training data by simulating conversations, debates and reasoning tasks. It employs a research agent to generate topics, a writer agent to produce responses, and a reviewer agent to ensure quality.
This method has proven effective when real-world data is scarce, ensuring AI models are trained on diverse, high-quality inputs.
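As a rough illustration, that three-agent loop might be structured like the sketch below. This is schematic, not CamelAI’s actual API; call_llm is a placeholder for whatever chat-completion backend you use:

```python
# Schematic three-agent synthesis loop in the spirit of CamelAI's
# research -> writer -> reviewer pipeline (not CamelAI's real API).
from typing import Optional

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion provider."""
    raise NotImplementedError

def generate_sample(seed: str) -> Optional[dict]:
    # Research agent: propose a concrete topic
    topic = call_llm("You are a research agent proposing discussion topics.",
                     f"Propose one concrete topic related to: {seed}")
    # Writer agent: draft a response to the topic
    draft = call_llm("You are a writer agent producing thorough answers.", topic)
    # Reviewer agent: gatekeep quality; only accepted samples are kept
    verdict = call_llm("You are a reviewer agent. Reply ACCEPT or REJECT.",
                       f"Topic: {topic}\n\nDraft: {draft}")
    return {"topic": topic, "response": draft} if verdict.strip().startswith("ACCEPT") else None
```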
Closing thoughts
Data synthesis and model distillation are becoming essential tools. They help overcome real-world limitations, making AI more scalable, cost-effective, and adaptable.
At FLock, we’re closely following these advancements, and applying them to DeAI training to make it as equitable, efficient and innovative as possible. Stay tuned for upcoming research blogs!
About FLock
FLock.io is a community-driven platform facilitating the creation of private, on-chain AI models. By combining federated learning with blockchain technology, FLock offers a secure and collaborative environment for model training, ensuring data privacy and transparency. FLock’s ecosystem supports a diverse range of participants, including data providers, task creators, and AI developers, incentivising engagement through its native FLOCK token.