Change arxiv.org to alphaxiv.org in a paper's URL and chat away with AI. For example:
https://www.alphaxiv.org/pdf/2505.09343
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Trying to understand the picture:
What is cross-entropy loss?
"Cross-entropy loss is a loss function used in machine learning and optimization, particularly in classification problems. It quantifies the difference between two probability distributions: the predicted distribution from your model and the true distribution of the labels."
What is the difference between a shared expert vs routed expert?
In the context of Mixture of Experts (MoE) models, "shared experts" and "routed experts" refer to different ways of organizing and utilizing the expert sub-networks within the larger model.
- Routed experts allow the model to specialize, with different experts handling different types of inputs.
- Shared experts process all tokens, potentially capturing general features or providing a baseline level of processing.
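A minimal PyTorch sketch of the distinction (not DeepSeek-V3's actual MoE implementation; the dimensions, expert count, and top-k value are made-up toy numbers): every token goes through the shared expert, while the router sends each token to only its top-k routed experts.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Illustrative MoE layer: one shared expert plus top-k routed experts."""
    def __init__(self, d_model=16, n_routed=4, top_k=2):
        super().__init__()
        self.shared_expert = nn.Linear(d_model, d_model)       # processes every token
        self.routed_experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_routed)]  # specialize per input type
        )
        self.router = nn.Linear(d_model, n_routed)             # scores experts for each token
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, d_model)
        shared_out = self.shared_expert(x)             # shared expert: all tokens, no routing
        gate = self.router(x).softmax(dim=-1)          # routing probabilities per token
        weights, idx = gate.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        routed_out = torch.zeros_like(x)
        for t in range(x.shape[0]):                    # per-token dispatch, a loop for clarity
            for w, e in zip(weights[t], idx[t]):
                routed_out[t] += w * self.routed_experts[int(e)](x[t])
        return shared_out + routed_out

tokens = torch.randn(3, 16)                            # three toy token embeddings
print(ToyMoELayer()(tokens).shape)                     # torch.Size([3, 16])
```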
What is Grouped-Query Attention (GQA)?
The Problem GQA Solves: In multi-head attention, each attention head has its own separate Q, K, and V vectors. During inference, the key and value vectors of previous tokens are stored in a cache (the KV cache) so they don't have to be recomputed at every decoding step. This KV cache can consume a significant amount of memory, especially for long sequences.
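A back-of-the-envelope calculation (using assumed toy dimensions, not DeepSeek-V3's actual configuration) shows how quickly the cache grows with sequence length:

```python
# Rough KV-cache size for standard multi-head attention (assumed toy dimensions).
n_layers   = 32
n_kv_heads = 32          # in plain MHA, every head stores its own K and V
head_dim   = 128
seq_len    = 32_768      # tokens kept in the cache
bytes_per  = 2           # fp16/bf16

# Factor of 2 because both keys and values are cached.
cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per
print(f"{cache_bytes / 1e9:.1f} GB per sequence")   # ~17.2 GB
```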
How GQA Works: GQA reduces memory consumption by splitting the query heads into groups, with each group sharing a single set of Key and Value (KV) projections. Instead of maintaining a separate KV pair for every attention head, several query heads read from the same cached keys and values, which significantly compresses KV storage.
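A minimal sketch of the grouping idea (the head counts and sizes below are assumptions for illustration, not values from the paper): query heads are split into groups, each group attends against the same cached KV head, so the cache stores only n_kv_heads sets of keys and values instead of n_q_heads.

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq_len = 32, 8, 128, 16   # toy sizes
group = n_q_heads // n_kv_heads                              # 4 query heads per KV head

q = torch.randn(n_q_heads, 1, head_dim)          # queries for the current token
k = torch.randn(n_kv_heads, seq_len, head_dim)   # cached keys: only n_kv_heads sets
v = torch.randn(n_kv_heads, seq_len, head_dim)   # cached values: only n_kv_heads sets

# Expand each cached KV head so its group of query heads can attend to it.
k_exp = k.repeat_interleave(group, dim=0)        # (n_q_heads, seq_len, head_dim)
v_exp = v.repeat_interleave(group, dim=0)

scores = (q @ k_exp.transpose(-2, -1)) / head_dim ** 0.5
out = scores.softmax(dim=-1) @ v_exp             # (n_q_heads, 1, head_dim)

# The cache holds n_kv_heads instead of n_q_heads sets of K/V -> 4x smaller here.
print(out.shape, f"cache reduction: {n_q_heads / n_kv_heads:.0f}x")
```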