Thursday, May 15, 2025

alphaxiv

Change arxiv.org to alphaxiv.org in any paper URL and chat away with AI.

arxiv.org/pdf/2505.09343

to

https://www.alphaxiv.org/pdf/2505.09343


Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Trying to understand the picture:

What is cross-entropy loss?

"Cross-entropy loss is a loss function used in machine learning and optimization, particularly in classification problems. It quantifies the difference between two probability distributions: the predicted distribution from your model and the true distribution of the labels."

What is the difference between a shared expert vs routed expert?

In the context of Mixture of Experts (MoE) models, "shared experts" and "routed experts" refer to different ways of organizing and utilizing the expert sub-networks within the larger model. 

  • Routed experts allow the model to specialize, with different experts handling different types of inputs.
  • Shared experts process all tokens, potentially capturing general features or providing a baseline level of processing.
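Here's a rough Python sketch of how the two kinds of experts might combine in a single MoE layer. This is my own toy illustration under simplifying assumptions (linear experts, one token, a made-up router), not DeepSeek-V3's actual implementation:

```python
import numpy as np

def moe_layer(x, shared_expert, routed_experts, router_weights, top_k=2):
    # Shared expert: applied to every token unconditionally.
    out = shared_expert(x)

    # Router: score each routed expert for this token.
    scores = router_weights @ x                      # (num_experts,)
    top = np.argsort(scores)[-top_k:]                # indices of the top_k experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                               # softmax over the selected experts

    # Routed experts: only the selected ones run, weighted by the gate.
    for g, idx in zip(gate, top):
        out = out + g * routed_experts[idx](x)
    return out

# Tiny usage example with random linear "experts".
d = 8
rng = np.random.default_rng(0)
shared = lambda v, W=rng.normal(size=(d, d)): W @ v
experts = [lambda v, W=rng.normal(size=(d, d)): W @ v for _ in range(4)]
router = rng.normal(size=(4, d))
print(moe_layer(rng.normal(size=d), shared, experts, router).shape)  # (8,)
```

The key point is visible in the code: the shared expert runs on every token, while only top_k of the routed experts fire for any given token, which is what keeps the compute per token low even when the total number of experts is large.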

What is GQA?

GQA stands for Grouped-Query Attention. It's an attention mechanism used in transformer models, particularly designed to improve inference efficiency.
  • The Problem GQA Solves: In multi-head attention, each attention head has its own separate Q, K, and V vectors. During inference, the key and value vectors of previous tokens are stored in a cache (the KV cache) so they don't have to be recomputed at every generation step. This KV cache can consume a significant amount of memory, especially for long sequences.

  • How GQA Works: GQA reduces memory consumption by grouping the query heads so that each group shares a single set of Key and Value (KV) heads. Instead of maintaining separate KV pairs for every attention head, several query heads reuse the same ones, which significantly compresses the KV cache.
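A small Python sketch of the grouping idea (my own toy example with made-up shapes, not the paper's code): 8 query heads attend using only 2 cached KV heads, so the KV cache is 4x smaller than full multi-head attention.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    # q: (num_q_heads, head_dim)            query for the current token
    # k, v: (num_kv_heads, seq_len, head_dim)  cached keys/values
    # Each group of num_q_heads // num_kv_heads query heads shares one KV head.
    group_size = num_q_heads // num_kv_heads
    head_dim = q.shape[-1]
    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                       # which shared KV head this query uses
        scores = k[kv] @ q[h] / np.sqrt(head_dim)  # (seq_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over cached positions
        outputs.append(weights @ v[kv])            # (head_dim,)
    return np.stack(outputs)                       # (num_q_heads, head_dim)

# Usage: 8 query heads share only 2 cached KV heads (4x smaller KV cache).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))
k = rng.normal(size=(2, 16, 64))
v = rng.normal(size=(2, 16, 64))
print(grouped_query_attention(q, k, v).shape)  # (8, 64)
```

With num_kv_heads=1 this degenerates to multi-query attention (MQA); with num_kv_heads equal to num_q_heads it's ordinary multi-head attention, so GQA sits between the two.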

