Saturday, October 30, 2021

Data center

Things driving the data center

lower latency, faster access, less delay, lower power

“Typically, CPUs are optimized for capacity, while accelerators and GPUs are optimized for bandwidth. However, with the exponentially growing model sizes, we see constant demand for both capacity and bandwidth without tradeoffs. We are seeing more memory tiering, which includes support for software-visible HBM plus DDR, and software-transparent caching that uses HBM as a DDR-backed cache. Beyond CPUs and GPUs, HBM is also popular for data center FPGAs.”
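To make the "HBM as a DDR-backed cache" idea concrete, here is a minimal sketch, assuming a page-granular cache with LRU replacement. The class name HbmCache, the page granularity, and the LRU policy are my own assumptions for illustration; real hardware may use a very different organization.

```python
from collections import OrderedDict


class HbmCache:
    """Toy model of HBM used as a software-transparent, DDR-backed cache."""

    def __init__(self, capacity_pages):
        self.capacity_pages = capacity_pages
        self.pages = OrderedDict()  # resident page numbers, kept in LRU order
        self.hbm_hits = 0
        self.ddr_fills = 0

    def access(self, page):
        """Return which tier served the access: "hbm" or "ddr"."""
        if page in self.pages:
            self.pages.move_to_end(page)  # mark as most recently used
            self.hbm_hits += 1
            return "hbm"
        # Miss: fetch the page from DDR and install it in HBM.
        self.ddr_fills += 1
        if len(self.pages) >= self.capacity_pages:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[page] = None
        return "ddr"


cache = HbmCache(capacity_pages=2)
trace = [0, 1, 0, 2, 1]  # hypothetical page access trace
served_from = [cache.access(p) for p in trace]
print(served_from, cache.hbm_hits, cache.ddr_fills)
```

The software never names a tier here, which is the "transparent" part. In the software-visible variant, the allocator would instead choose HBM or DDR explicitly for each buffer.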

"GDDR is a very power-hungry interface, but HBM is a super power-efficient interface."

Interposers

Interconnects

Shareable and partitionable

"Arm has developed memory system resource partitioning and monitoring (MPAM) framework to tie resource controls to the software that accesses the memory system."
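A rough way to picture "tying resource controls to software" is a cache set whose ways are split by a partition ID attached to each request. The sketch below is a toy model only: PartitionedCacheSet, way_masks, and the round-robin victim choice are hypothetical names and policies of mine, not Arm's register interface.

```python
class PartitionedCacheSet:
    """One cache set whose ways are split between partition IDs (toy model)."""

    def __init__(self, num_ways, way_masks):
        self.num_ways = num_ways
        self.way_masks = way_masks        # partition ID -> bitmap of allowed ways
        self.tags = [None] * num_ways     # tag held in each way
        self.owner = [None] * num_ways    # partition ID that filled each way
        self.next_victim = {pid: 0 for pid in way_masks}

    def allowed_ways(self, pid):
        mask = self.way_masks[pid]
        return [w for w in range(self.num_ways) if mask & (1 << w)]

    def access(self, pid, tag):
        """Return True on a hit. On a miss, fill only a way allowed for pid."""
        if tag in self.tags:
            return True
        ways = self.allowed_ways(pid)
        victim = ways[self.next_victim[pid] % len(ways)]  # round-robin choice
        self.next_victim[pid] += 1
        self.tags[victim] = tag
        self.owner[victim] = pid
        return False

    def occupancy(self, pid):
        """Monitoring side: how many ways this partition currently holds."""
        return sum(1 for o in self.owner if o == pid)


# Partition 0 may use ways 0-1, partition 1 may use ways 2-3.
cache_set = PartitionedCacheSet(num_ways=4, way_masks={0: 0b0011, 1: 0b1100})
cache_set.access(0, "a")
cache_set.access(0, "b")
for streaming_tag in range(100, 110):  # partition 1 streams through the set
    cache_set.access(1, streaming_tag)
still_hits = (cache_set.access(0, "a"), cache_set.access(0, "b"))
print(still_hits, cache_set.tags, cache_set.occupancy(1))
```

The streaming partition can only evict its own ways, so partition 0 keeps its two lines.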

DVFS and AVFS (dynamic and adaptive voltage and frequency scaling)
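Why scaling voltage along with frequency pays off can be seen from the usual first-order dynamic power relation, P = alpha * C * V^2 * f. A minimal sketch follows; the activity factor, capacitance, and both operating points are made-up numbers for illustration, and leakage is ignored.

```python
def dynamic_power(alpha, c_eff_farads, vdd_volts, freq_hz):
    """First-order switching power: alpha * C * V^2 * f (leakage ignored)."""
    return alpha * c_eff_farads * vdd_volts ** 2 * freq_hz


# Hypothetical operating points, not taken from any datasheet.
nominal = dynamic_power(alpha=0.2, c_eff_farads=1e-9, vdd_volts=1.0, freq_hz=2.0e9)
scaled = dynamic_power(alpha=0.2, c_eff_farads=1e-9, vdd_volts=0.8, freq_hz=1.5e9)
ratio = scaled / nominal
print(f"nominal={nominal:.3f} W scaled={scaled:.3f} W ratio={ratio:.2f}")
```

Dropping frequency by 25% alone would save 25% of dynamic power. Dropping voltage by 20% as well brings the total to roughly half, because voltage enters squared.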

“If you miss that (use case), then you’re planning for some other chip,” Mijatovic said.

Pick a set of the most common use cases to optimize the data lines and the place-and-route.

"every power domain costs wiring and isolation cells."


DRAM Low power

 Deep power down and clock stop

Logic Design Engineer

The QAT (QuickAssist Technology) hardware design team enables data center technology through a set of scalable hardware accelerators for lossless compression, network security (secure key establishment, IPsec, SSL/TLS), firewalls, and data center virtualization.

QAT team, the CPM (Content Processing Module) front-end design team, where you will work on RTL/DFX development and integration activities within the Custom Logic
Responsibilities will include, but are not limited to:

  • Perform logic design, Register Transfer Level (RTL) coding, and simulation to generate cell libraries, functional units, and subsystems for inclusion in full chip designs.

  • Participate in the development of Architecture and Microarchitecture specifications for the Logic components.

  • Provide IP integration support to SoC customers and represent the RTL team.

  • Implement RTL in SystemVerilog, validate the design, synthesize it, and close timing.

  • Work from high-level architecture through to the details of timing.

  • Work with specifications at multiple levels, including the HAS (high-level architecture spec) and MAS (microarchitecture spec).

  • Balance design trade-offs with modularity, scalability, DFX requirements, power, area, and performance.


Friday, October 29, 2021

Systolic arrays and beyond

 


Each PE can store multiple weights.
Weights can be selected on the fly.
Pipelined parallel programs
Pipelined file compression
DAE - Decoupled Access and Execute - Modern example Pentium 4
Queue reduces the need for registers
Professor's simplified version of Pentium 4.
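
To remind myself how the decoupling works, here is a cycle-stepped toy model, assuming one load issued per cycle, a bounded load queue, and an execute side that takes two cycles per operand. The function name run_dae and all of the parameters are hypothetical, and this is not a model of the Pentium 4.

```python
from collections import deque


def run_dae(memory, addresses, queue_depth=4, execute_latency=2):
    """Cycle-stepped toy model of a decoupled access/execute pair."""
    load_queue = deque()
    next_addr = 0      # index of the next address the access side will issue
    busy = 0           # cycles left on the execute side's current operation
    total = 0
    consumed = 0
    max_run_ahead = 0
    cycles = 0
    while consumed < len(addresses):
        cycles += 1
        # Access side: issue one load per cycle while the queue has room.
        if next_addr < len(addresses) and len(load_queue) < queue_depth:
            load_queue.append(memory[addresses[next_addr]])
            next_addr += 1
        # Execute side: operands come from the queue head, not a named register.
        if busy == 0 and load_queue:
            total += load_queue.popleft()
            consumed += 1
            busy = execute_latency - 1
        elif busy > 0:
            busy -= 1
        max_run_ahead = max(max_run_ahead, next_addr - consumed)
    return total, cycles, max_run_ahead


memory = [10, 20, 30, 40, 50, 60]
addresses = [5, 0, 3, 1, 4, 2]
total, cycles, run_ahead = run_dae(memory, addresses)
print(total, cycles, run_ahead)
```

The access side finishes issuing loads well before the execute side catches up, and the queue holds the in-flight operands that would otherwise each need a register.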

Friday, October 22, 2021

28nm STA

28nm STA using PBA (path-based analysis) and GBA (graph-based analysis)

Understanding 28-nm SoC Design With ARM-Based Cores

Flexible abstract model to decrease the size of the netlist.

Implementation challenges for large 28nm SoCs

A global clock channel could limit rerouting of the design.

Memory-to-flop paths have high logic levels and a memory delay of ~400ps. 

Create placement regions near the memories to reduce buffering in the path and achieve timing closure.
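
A back-of-the-envelope setup-slack check shows why the buffering matters on these paths. Only the ~400 ps memory delay comes from the notes above; the 1 GHz clock, the 12 logic levels at 40 ps each, the setup time, and both buffer delays are assumed numbers for illustration.

```python
def setup_slack_ps(clock_period_ps, launch_delay_ps, logic_levels,
                   delay_per_level_ps, buffer_delay_ps, setup_ps, skew_ps=0):
    """Setup slack = required time - arrival time, all in picoseconds."""
    arrival = launch_delay_ps + logic_levels * delay_per_level_ps + buffer_delay_ps
    required = clock_period_ps + skew_ps - setup_ps
    return required - arrival


# Assumed 1 GHz target with 12 logic levels after the memory output.
common = dict(clock_period_ps=1000, launch_delay_ps=400, logic_levels=12,
              delay_per_level_ps=40, setup_ps=50)
slack_far = setup_slack_ps(buffer_delay_ps=150, **common)   # flops placed far away
slack_near = setup_slack_ps(buffer_delay_ps=50, **common)   # flops in a nearby region
print(slack_far, slack_near)
```

With the memory alone eating 40% of the cycle, the extra 100 ps of buffering is the difference between a violating path and a passing one in this example.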

Clock gating issues also cause high congestion in the localized area.

Different placement rules for different types of standard cells/macros, local power-drop targets on top of the global targets, and a special clock tree design to manage skew and minimize the number of buffers on the clock tree.

Clock tree synthesis for low power

"Optimal grouping clock tree sinks for clock gating during both the RTL design and synthesis stages offer significant advantages for power saving. The grouping and clock gating that can be coded manually by the RTL designer who knows the architecture and typical application scenarios for the device is the most critical part that contributes most to the power savings."

At the block level, all of the block's clocks can also be switched on or off at once through a single select.
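
The effect of sink grouping can be sketched by counting how many sink-cycles actually receive a clock edge, assuming one shared gate per group that must open whenever any sink in the group needs a clock. The function name, the sink names, and the enable traces below are hypothetical, and the count is only a crude proxy for clock power.

```python
def clocked_sink_cycles(groups, enables):
    """Count sink-cycles that receive a clock edge.

    groups: list of lists of sink names that share one clock gate.
    enables: sink name -> list of 0/1 per cycle (1 = the sink needs the clock).
    """
    total = 0
    for group in groups:
        num_cycles = len(enables[group[0]])
        for t in range(num_cycles):
            # The shared gate must open if any sink in the group needs a clock.
            if any(enables[sink][t] for sink in group):
                total += len(group)
    return total


# Hypothetical activity: a and b are busy together, c and d are busy together.
enables = {
    "a": [1, 1, 0, 0],
    "b": [1, 1, 0, 0],
    "c": [0, 0, 1, 0],
    "d": [0, 0, 1, 0],
}
by_activity = clocked_sink_cycles([["a", "b"], ["c", "d"]], enables)
mixed = clocked_sink_cycles([["a", "c"], ["b", "d"]], enables)
one_block_gate = clocked_sink_cycles([["a", "b", "c", "d"]], enables)
print(by_activity, mixed, one_block_gate)
```

Grouping sinks that are active together halves the clocked sink-cycles in this example, which is the kind of knowledge the RTL designer has and the tools do not. A single block-level gate does no better than the badly mixed grouping.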