Thursday, May 15, 2025

alphaxiv

Change arxiv.org to alphaxiv.org in any paper URL and chat away with AI.

arxiv.org/pdf/2505.09343

to

https://www.alphaxiv.org/pdf/2505.09343


Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Trying to understand the picture:

What is cross-entropy loss?

"Cross-entropy loss is a loss function used in machine learning and optimization, particularly in classification problems. It quantifies the difference between two probability distributions: the predicted distribution from your model and the true distribution of the labels."

What is the difference between a shared expert vs routed expert?

In the context of Mixture of Experts (MoE) models, "shared experts" and "routed experts" refer to different ways of organizing and utilizing the expert sub-networks within the larger model. 

  • Routed experts allow the model to specialize, with different experts handling different types of inputs.
  • Shared experts process all tokens, potentially capturing general features or providing a baseline level of processing.
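Here's a rough Python sketch of how the two kinds of experts might combine in a single MoE layer. This is my own toy illustration under simplifying assumptions (linear experts, one token, a made-up router), not DeepSeek-V3's actual implementation:

```python
import numpy as np

def moe_layer(x, shared_expert, routed_experts, router_weights, top_k=2):
    # Shared expert: applied to every token unconditionally.
    out = shared_expert(x)

    # Router: score each routed expert for this token.
    scores = router_weights @ x                      # (num_experts,)
    top = np.argsort(scores)[-top_k:]                # indices of the top_k experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                               # softmax over the selected experts

    # Routed experts: only the selected ones run, weighted by the gate.
    for g, idx in zip(gate, top):
        out = out + g * routed_experts[idx](x)
    return out

# Tiny usage example with random linear "experts".
d = 8
rng = np.random.default_rng(0)
shared = lambda v, W=rng.normal(size=(d, d)): W @ v
experts = [lambda v, W=rng.normal(size=(d, d)): W @ v for _ in range(4)]
router = rng.normal(size=(4, d))
print(moe_layer(rng.normal(size=d), shared, experts, router).shape)  # (8,)
```

The key point is visible in the code: the shared expert runs on every token, while only top_k of the routed experts fire for any given token, which is what keeps the compute per token low even when the total number of experts is large.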

What is GQA?

GQA stands for Grouped-Query Attention. It's an attention mechanism used in transformer models, particularly designed to improve inference efficiency.
  • The Problem GQA Solves: In multi-head attention, each attention head has its own separate Q, K, and V vectors. During inference, the key and value vectors of previous tokens are stored in a cache (the KV cache) so they don't have to be recomputed at every generation step. This KV cache can consume a significant amount of memory, especially for long sequences.

  • How GQA Works: GQA reduces memory consumption by grouping the query heads so that each group shares a single set of Key and Value (KV) heads. Instead of maintaining separate KV pairs for every attention head, several query heads reuse the same ones, which significantly compresses the KV cache.
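A small Python sketch of the grouping idea (my own toy example with made-up shapes, not the paper's code): 8 query heads attend using only 2 cached KV heads, so the KV cache is 4x smaller than full multi-head attention.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_q_heads=8, num_kv_heads=2):
    # q: (num_q_heads, head_dim)            query for the current token
    # k, v: (num_kv_heads, seq_len, head_dim)  cached keys/values
    # Each group of num_q_heads // num_kv_heads query heads shares one KV head.
    group_size = num_q_heads // num_kv_heads
    head_dim = q.shape[-1]
    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                       # which shared KV head this query uses
        scores = k[kv] @ q[h] / np.sqrt(head_dim)  # (seq_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over cached positions
        outputs.append(weights @ v[kv])            # (head_dim,)
    return np.stack(outputs)                       # (num_q_heads, head_dim)

# Usage: 8 query heads share only 2 cached KV heads (4x smaller KV cache).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))
k = rng.normal(size=(2, 16, 64))
v = rng.normal(size=(2, 16, 64))
print(grouped_query_attention(q, k, v).shape)  # (8, 64)
```

With num_kv_heads=1 this degenerates to multi-query attention (MQA); with num_kv_heads equal to num_q_heads it's ordinary multi-head attention, so GQA sits between the two.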

