### Integration, Specialization and Approximation: the "ISA" of Post-Moore Servers

### Babak Falsafi





parsa.epfl.ch

# DATACENTER GROWTH





# Data → fuel for digital economy

- Exponential demand for digital services
- Many apps (e.g., AI) with higher exponential demand



### DATACENTERS ARE BACKBONE OF CLOUD

- 100s of 1000s of commodity or homebrewed servers
- Centralized to exploit economies of scale
- Network fabric w/ µ-second connectivity
- Often limited by
  - Electricity
  - Network
  - Cooling



350MW, Bedford

# CLOUDS AT VARIOUS SCALES





Temporal/Sensitive/Local Data

Persistent/Global Data

### DATACENTERS NOT GETTING DENSER





### End of Moore's Law (of Silicon)

- Five decades of doubling density
- Recent slowdown in density
- Chip density limited by physics

### Growth means building more

- 41%/year  $\rightarrow$  28x in ten years
- At 15%/year  $\rightarrow$  7x more DCs
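The 7x figure can be reproduced with a short back-of-the-envelope calculation. This is a sketch under one assumption that the slide does not state explicitly: that "15%/year" is the yearly density (capacity per datacenter) improvement. The 28x demand growth is taken as given.

```python
# Figures from the slide; the meaning of "15%/year" is an assumption here.
demand_growth_10y = 28.0        # demand growth over ten years, as stated
density_gain_per_year = 0.15    # assumed: yearly density improvement per datacenter
years = 10

# Compound the yearly density gain over the decade.
density_gain_10y = (1 + density_gain_per_year) ** years

# Whatever demand growth is not absorbed by density must come from more datacenters.
more_datacenters = demand_growth_10y / density_gain_10y

print(f"density gain over {years} years: {density_gain_10y:.2f}x")
print(f"datacenters needed: {more_datacenters:.1f}x")
```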



### Training a single AI model can emit as much carbon as five cars in their lifetimes

Deep learning has a terrible carbon footprint.

by Karen Hao

Jun 6, 2019

# POST-MOORE DATACENTERS



- Design for "ISA"
- Integration
  - Move data less frequently
  - Move data less distance
- Specialization
  - Customize resources
  - Less work/computation
- Approximation
  - Adjust precision



# INTEGRATED COOLING [Thome, Atienza]



3D server chip

- Two-phase liquid cooling
  - Uniform higher thermals
  - Higher heat removal
  - Localized cooling








Data plane principles: zero-copy, run-to-completion, coherence-free

- Protected operating system with clean-slate API
- Accelerates object sharing in datacenters
- IX Kernel  $\rightarrow$  best paper at OSDI'14
- Follow-on work → SIGOPS'21 dissertation award


3.6x throughput with <50% latency @ 99<sup>th</sup> percentile

#### EcoCloud - Copyright 2022

# SPECIALIZED NETWORKS [Bugnion]





# SPECIALIZED DATABASES [Koch]





### QUANTIFYING EFFICIENCY/EMISSIONS BEYOND "PUE" (sdea.ch)

### DC INFRASTRUCTURE EFFICIENCY (PUE+)

- electrical, cooling and heat recycling components

### IT INFRASTRUCTURE EFFICIENCY

- compute, storage, network and workloads

### DC CARBON FOOTPRINT

- emissions from input electricity sources












# OUTLINE



Post-Moore Server Architecture

- 80's Desktops
- Specialized CPUs
- Integrated logic/memory
- Integrated networks
- Approximating AI

Summary



# SCALE-OUT DATACENTERS



- Cost is the primary metric
- Online services hosted in memory
- Divide data up across servers
- Design server for low cost, scale out



## TODAY'S SERVERS



Today's platforms are PCs of the 80's

- CPU "owns" and manages memory
- OS moves data back/forth from peripherals
- Legacy interfaces connecting the CPU/mem to outside
- Legacy POSIX abstractions

Fragmented logic/memory:

- Manycore network cards w/ own memory
- Flash controllers with embedded cores and memory
- Discrete accelerators with own memory

### 80'S DESKTOP







- 33 MHz 386 CPU, 250ns DRAM
- OS: Windows, Unix BSD (or various flavors)
- Focus: multiprogrammed in-memory compute

### TODAY'S SERVER: 80'S DESKTOP





- Dual 2GHz CPUs, 50ns DRAM
- OS: Linux (and various distributions)



- Bottlenecked by legacy interfaces
  - Fragmented silicon






# IDEAL POST-MOORE SERVER





- Think of the server as a network
- Control plane: set up via CPU & OS
- Data plane: protected access to memory
- Eliminates silicon fragmentation

# OUTLINE

### Post-Moore servers

- 80's Desktops
- Specialized CPUs
- Integrated logic/memory
- Integrated networks
- Approximating AI

Summary



#### THE SPECIALIZATION FUNNEL

Specialized

- Thunder X/TPU ASIC
- DBToaster
- Crypto
- IX Kernel
- Network logic
- PyTorch
- Analog NN

General Purpose

- Intel CPU
- Oracle DB
- Linux OS
- Python/C PL

From general purpose to specialized:

- Domain-specific languages to platforms
- New interfaces (i.e., IRs, hardware abstractions)

# THE LIMITS OF CPUS



CPUs follow the von Neumann machine organization

- Machine instructions fetched from memory
- Operands fetched/written to memory
- Referred to as von Neumann bottleneck

### Only 6% power in Pentium 4 spent in arithmetic (ALU)



[src: Chen et al., IEEE Transactions, 2006]

# WORKLOAD-OPTIMIZED CPUS



Three classes of workloads in datacenters

- First-party workloads (e.g., search, retail, media)
  - 1. Data management
  - 2. Analytics
  - Multi-tier to microservices
- Third-party workloads (cloud)
  - 3. Containerized
  - Emerging serverless

### **CloudSuite** (cloudsuite.ch, 4.0 coming)





Supports x86, ARM64, RISC-V



# SERVICES STUCK IN MEMORY [ASPLOS'12]





### Cache overprovisioned

### Instruction supply bottlenecked

# SCALE-OUT PROCESSOR (SOP)





General-purpose CPU:

- ✗ Logic 60% of silicon
- ✗ 6x bigger cores



# CUSTOM SERVER CPU [ca. 2014]





Case for Workload Optimized Processors For Next Generation Data Center & Cloud

Gopal Hegde VP/GM, Data Center Processing Group


### Thunder X

- Based on SOP blueprint
- Designed to serve data
- 7x more core than cache
- Optimizes instruction supply
- Ran stock software
- 10x throughput over Xeon



# CHASING POINTERS W/ WALKERS [MICRO'13]



- Traverse data structures (e.g., hash table, B-tree)
- Parallelize pointer chains
- Overlap pointer access across chains



### 15x better performance/Watt over Xeon

# WALKERS IN SOFTWARE [VLDB'16]



Use insights to help CPUs

- Decouple hash & walk(s) in software
- Schedule off-chip pointer access with co-routines

### 2.3x speedup on Xeon

- Unclogs dependences in microarchitecture
- Maximizes memory level parallelism
- DSL w/ co-routines
- Integrated in SAP HANA [VLDB'18]
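The idea of decoupling the hash from the walk and interleaving several walks with co-routines can be sketched with Python generators. This is only an illustration of the scheduling structure, not the VLDB'16 implementation: `Node`, `build_table`, `walker` and `lookup_batch` are hypothetical names, and Python cannot issue real prefetches, so each `yield` merely marks where native code would prefetch the next node and switch to another walker.

```python
from collections import deque

class Node:
    """One entry in a hash bucket's linked chain."""
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node

def build_table(pairs, num_buckets):
    """Build a chained hash table as a list of bucket head pointers."""
    buckets = [None] * num_buckets
    for key, value in pairs:
        idx = hash(key) % num_buckets
        buckets[idx] = Node(key, value, buckets[idx])
    return buckets

def walker(buckets, key):
    """Coroutine for one lookup: hash once, then walk the chain.

    Each yield stands for a pointer dereference that may miss in the cache;
    native code would prefetch the node here and switch to another walker.
    """
    node = buckets[hash(key) % len(buckets)]    # hash step
    while node is not None:
        yield                                    # suspension point per pointer access
        if node.key == key:
            return node.value
        node = node.next                         # walk step
    return None

def lookup_batch(buckets, keys, width=4):
    """Round-robin up to `width` walkers so their pointer accesses interleave."""
    results = {}
    pending, active = deque(keys), deque()
    while pending or active:
        # Keep the set of in-flight walkers full.
        while pending and len(active) < width:
            k = pending.popleft()
            active.append((k, walker(buckets, k)))
        k, w = active.popleft()
        try:
            next(w)                  # advance this walker by one pointer access
            active.append((k, w))    # not finished: go to the back of the queue
        except StopIteration as done:
            results[k] = done.value  # the generator's return value

    return results

table = build_table([(i, i * i) for i in range(100)], num_buckets=8)
found = lookup_batch(table, [3, 50, 99, 1000])
```

With `width` walkers in flight, up to `width` independent pointer chains are outstanding at once, which is the memory-level parallelism the slide refers to.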

### POST-MOORE VIRTUAL MEMORY [ISCA'21]



- Keeps POSIX (VMA) interface to apps
  - Linux, MacOS/iOS, Android
- Eliminates page-based translation in \$
- Unclogs virtual memory for security, virtualization, accelerators

[Figure: VM overhead (%) vs. cache hierarchy (\$) capacity, comparing Midgard with 4K and 2M pages]

midgard.epfl.ch

# OUTLINE

### Post-Moore servers

- 80's Desktops
- Specialized CPUs
- Integrated logic/memory
- Integrated networks
- Approximating AI

Summary



### INTEGRATED LOGIC/MEMORY



Memory chip stack w/ nearby logic

- Minimize data movement
- Massive internal bandwidth

[source: AMD]



Opportunities for algorithm/hardware co-design

### COST OF MOVING DATA





### Data access much more expensive than arithmetic operation

### MEMORY B/W BOTTLENECK





### Internal DRAM BW presents big opportunity

### NMP COMMANDMENTS [IEEE Micro on Big Data '16]



- Not (CPU) business as usual
- 1. DRAM favors streaming over random access
- 2. DRAM favors parallelism over arithmetic speed
- 3. NMP DRAM must maintain CPU memory semantics

### Co-design algorithm/HW for NMP

# WHY NOT RANDOM ACCESS?



Internally DRAM is a block device

- Activating a 1KB row
- High latency & energy per row
- Exploit row locality for efficiency



### Example:

- For DRAM with 128 GB/s internal bandwidth
- Optimal (parallel) random access only captures ~8 GB/s
- Requires 5x more power
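One plausible way to arrive at the ~8 GB/s figure is sketched below. Both the 1KB row size and the 64-byte access granularity are assumptions made for this illustration: if every random access activates a full row but uses only one 64-byte block of it, only 1/16 of the internal bandwidth does useful work.

```python
internal_bw_gb_s = 128   # GB/s, from the example
row_bytes = 1024         # assumed: bytes brought into the row buffer per activation
access_bytes = 64        # assumed: one cache-block-sized random access per row

# Fraction of each activated row that a random access actually uses.
useful_fraction = access_bytes / row_bytes

# Bandwidth that reaches the computation under purely random access.
random_bw_gb_s = internal_bw_gb_s * useful_fraction

print(f"useful fraction of each row: {useful_fraction:.4f}")
print(f"effective random-access bandwidth: {random_bw_gb_s:.0f} GB/s")
```

Streaming access, by contrast, consumes the whole row after each activation, which is why it can approach the full internal bandwidth.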

### Use algorithms that favor streaming access

## CASE STUDY: DB JOIN ON MONDRIAN



Revisiting sort join:

- Sort join (O(n log n)) vs. hash join (O(n))
- Sort tables and then merge join
- Streaming vs. random access

Performs way more work, but finishes faster and uses less power!

Trade off algorithm complexity for sequential memory accesses
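The two join strategies can be contrasted in a minimal sketch. This is plain Python on lists of `(key, value)` tuples, not the Mondrian implementation, and the function names are illustrative. The hash join probes a table in data-dependent (effectively random) order, while the sort-merge join does more work up front but then only scans both inputs sequentially.

```python
def hash_join(left, right):
    """Expected O(n): build a hash table on `left`, probe it with `right`.

    Every probe lands at a data-dependent location, i.e. random access.
    """
    table = {}
    for key, lval in left:
        table.setdefault(key, []).append(lval)
    return [(key, lval, rval)
            for key, rval in right
            for lval in table.get(key, [])]

def sort_merge_join(left, right):
    """O(n log n): sort both inputs by key, then merge with sequential scans."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lk, lval, rval))
            i, j = i_end, j_end
    return out

orders = [(2, "b"), (1, "a"), (2, "c")]
items = [(2, "x"), (3, "y"), (1, "z")]
joined_hash = sorted(hash_join(orders, items))
joined_sort = sorted(sort_merge_join(orders, items))
```

Both functions return the same rows; the difference that matters near memory is the access pattern, not the result.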

## THE MONDRIAN ENGINE [ISCA'17]



#### SIMD cores + data streaming

- Saturates b/w with parallel SIMD streams
- 1024-bit SIMD @ 1 GHz
- No caches
- Runs Spark Analytic Ops

50x over Xeon



#### Algorithm/hardware co-design maximizes near-memory performance

# OUTLINE

## Post-Moore servers

- 80's Desktops
- Specialized CPUs
- Integrated logic/memory
- Integrated networks
- Approximating AI

Summary



## NETWORKS

Network stack bottleneck:

- B/W growing faster than silicon
- Emerging µServices + serverless
- RPC, orchestration, ....

Key challenges:

- New abstractions
- Co-design of network stacks







## SCALE-OUT NUMA [ASPLOS'14'19,ISCA'15, MICRO'16]





soNUMA:

- Socket-integrated network interface
- Protected global memory read/write + synch
- Fine-grain (~64B) & bulk objects (~1MB)
- Remote memory ~ 2x local memory latency
- Extensions for messaging & RPC

300ns round-trip latency to remote memory





- Wire time and protocol stacks have shrunk
- RPC dominates CPU cycles in µServices
- E.g., data transformation @ ~2.4 Gbps w/ Thrift on Xeon

## CEREBROS & NEBULA [ASPLOS'20, ISCA'20, MICRO'21]

RPC processing at line rate:

- A "schema" (not instructions) interface to an RPC core
- Implements load balancing/affinity scheduling for µServices







# OUTLINE

## Post-Moore servers

- 80's Desktops
- Specialized CPUs
- Integrated logic/memory
- Integrated networks
- Approximating AI

Summary



## COST OF LOGIC VS. MEMORY





[src: Gholami et al.]

# DNN PLATFORM DIVERGENCE



Inference platforms:

- Tight latency constraints
- Ubiquitous deployment
- Relies on fixed-point arithmetic





## Training platforms:

- Throughput optimized
- Server deployment
- Requires floating-point arithmetic

## FLOATING VS. FIXED POINT



- Floating point
  - Mantissa + exponent
  - Wide representable range
  - Each value has an independent range
- Fixed point
  - Narrow representable range
  - Value range is pre-determined





# HYBRID BLOCK FLOATING POINT (HBFP) [NeurIPS'18]

- 1. Block floating point (BFP): one exponent/tensor
  - Low magnitude variation in tensor products
  - More than 90% of all arithmetic operations
- 2. FP32 for all activations
  - High magnitude variation in gradient updates
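The block floating point idea can be sketched in a few lines: every value in a block shares one exponent, so the dot product itself runs on integer mantissas and the exponents are combined once at the end. This is an illustrative sketch, not the NeurIPS'18 implementation. The function names, the 8-bit mantissa default and the rounding and clamping choices are all assumptions.

```python
import math

def bfp_quantize(block, mantissa_bits=8):
    """Quantize floats to one shared exponent plus signed integer mantissas."""
    max_mag = max(abs(x) for x in block)
    if max_mag == 0.0:
        return 0, [0] * len(block)
    # frexp returns max_mag = m * 2**e with 0.5 <= m < 1, so every |x| < 2**e.
    _, shared_exp = math.frexp(max_mag)
    scale = 2 ** (mantissa_bits - 1)   # one bit is reserved for the sign
    limit = scale - 1
    mantissas = [max(-limit, min(limit, round(x / 2 ** shared_exp * scale)))
                 for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=8):
    """Map integer mantissas back to floats using the shared exponent."""
    scale = 2 ** (mantissa_bits - 1)
    return [m / scale * 2 ** shared_exp for m in mantissas]

def bfp_dot(a, b, mantissa_bits=8):
    """Dot product with integer multiply-accumulate; exponents combine once."""
    exp_a, man_a = bfp_quantize(a, mantissa_bits)
    exp_b, man_b = bfp_quantize(b, mantissa_bits)
    acc = sum(x * y for x, y in zip(man_a, man_b))   # fixed-point arithmetic only
    scale = 2 ** (mantissa_bits - 1)
    return acc / (scale * scale) * 2 ** (exp_a + exp_b)

a = [0.5, -0.25, 0.125, 1.0]
b = [1.0, 2.0, -4.0, 0.5]
approx = bfp_dot(a, b)
exact = sum(x * y for x, y in zip(a, b))

exp_c, man_c = bfp_quantize([0.3, -1.7, 0.05])
restored = bfp_dequantize(exp_c, man_c)
```

Values much smaller than the largest one in the block lose precision, which is why the scheme suits operands with low magnitude variation.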

Co-Located Training & Inference (ColTraIn)

- ✓ One accelerator for training and inference
- ✓ Eliminates quantization
- ✓ Enables online learning





Exponent



Block of Mantissas

## HBFP vs. FP32



Resnet-50 on ImageNet

#### FP32 performance with 8-bit logic for CNN, LSTM, BERT







# SUMMARY

#### Trends:

- Demand is growing faster than Moore
- Moore's law is slowing down

## Post-Moore servers:

- Revisit legacy abstractions, SW/HW interfaces
- Holistic algorithm/SW/HW co-design
- Division of control vs. data plane

## Integration + Specialization + Approximation





# For more information, please visit us at parsa.epfl.ch

