HARNESS (Heterogeneous ARchitectures for NExt-Generation Server Systems)

Servers for the Post-Moore Era

Server architecture is entering an age of heterogeneity, as silicon performance scaling approaches its limits and economies of scale enable cost-effective deployment of custom silicon in the datacentre. Traditionally, customized components have been deployed as discrete expansion boards, reducing cost and design complexity while ensuring compatibility with rigidly designed CPU silicon and its surrounding infrastructure. A prime example of this pattern is the tethering of today's Remote Direct Memory Access (RDMA) network interface cards to commodity PCIe interconnects. Although using a commodity I/O interconnect has enabled RDMA to be deployed at large scale in today's datacentres, our prior work on the "Scale-Out NUMA" project has shown that judiciously integrating architectural support directly on the CPU silicon provides significant benefits: integration affords lower RDMA latency and the ability to perform richer operations, such as atomic accesses to software objects and remote procedure calls. The HARNESS project therefore aims to co-design server silicon with software to support the performance-critical primitives of the datacentre, in particular those pertaining to networked systems and storage stacks.
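The benefit of on-die integration can be illustrated with a back-of-the-envelope latency budget. All component latencies below are illustrative assumptions chosen for the sake of the arithmetic, not measurements from Scale-Out NUMA; the point is only that the host I/O crossing comes to dominate a small remote access.

```python
# Back-of-the-envelope latency budget for a small remote read, contrasting a
# PCIe-attached RDMA NIC with an on-die integrated network interface.
# All numbers are illustrative assumptions, not project measurements.

def remote_read_latency_ns(wire_ns, nic_ns, io_ns):
    """Request plus response: wire + NIC processing + host I/O crossing,
    incurred once on the requesting side and once on the serving side."""
    return 2 * (wire_ns + nic_ns + io_ns)

# Assumed component latencies (nanoseconds).
WIRE = 200      # propagation + switching within a rack
NIC = 100      # NIC pipeline processing
PCIE = 500      # PCIe DMA round-trip to host memory
ON_DIE = 50       # on-die interconnect hop for an integrated NI

pcie_attached = remote_read_latency_ns(WIRE, NIC, PCIE)
integrated = remote_read_latency_ns(WIRE, NIC, ON_DIE)

print(f"PCIe-attached: {pcie_attached} ns, integrated: {integrated} ns")
```

Under these assumed numbers, the PCIe crossings alone account for more than half of the end-to-end latency, which is the gap that on-chip integration closes.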


RPC-Optimized Server Architecture

The complex software stacks responsible for delivering today's online services are often structured into numerous tiers, and the interactions between a service's tiers take the form of Remote Procedure Calls (RPCs). With the vast improvements in network bandwidth and the advent of kernel-bypass networking, RPC performance will inevitably be dominated by the endpoints themselves. This is especially true as typical RPC service times continue to shrink to the microsecond scale, where even small hardware inefficiencies measurably degrade an RPC's latency. Prime examples of such inefficiencies are the increased memory latency introduced by large on-chip interconnects and the NUMA effects of multi-socket servers. In the NEBULA project, we aim to design an RPC-optimized server architecture that addresses these challenges, enabling RPCs to be delivered to CPU cores with minimal latency. Furthermore, with today's server network cards already delivering bandwidths of 200Gbps, traditional CPU-centric approaches to RPC processing can no longer keep up. NEBULA therefore also aims to push the appropriate RPC semantics into the network interface, saving CPU cycles and allowing the server to keep pace with rising packet rates.
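Why the dispatch policy matters so much at microsecond scale can be seen in a toy discrete-event simulation. Everything here is an illustrative sketch under assumed parameters (Poisson arrivals, exponential service times, eight cores), not NEBULA's actual mechanism: a single shared queue, loosely approximating NI-driven dynamic dispatch as in RPCValet, is compared against randomly assigning each RPC to a per-core queue.

```python
import heapq
import random

def simulate(arrivals, services, cores, single_queue):
    """Return per-RPC latencies (queueing + service) under one of two
    dispatch policies: one shared queue drained by all cores, or
    random assignment to per-core FCFS queues."""
    lat = []
    if single_queue:
        free = [0.0] * cores           # heap of times at which cores free up
        heapq.heapify(free)
        for t, s in zip(arrivals, services):
            start = max(t, heapq.heappop(free))
            heapq.heappush(free, start + s)
            lat.append(start + s - t)
    else:
        free = [0.0] * cores           # each per-core queue drains FCFS
        for t, s in zip(arrivals, services):
            q = random.randrange(cores)
            start = max(t, free[q])
            free[q] = start + s
            lat.append(start + s - t)
    return lat

random.seed(42)
CORES, N, MEAN_SVC = 8, 20000, 1.0     # mean service time: 1 microsecond
LOAD = 0.8                             # target utilization
rate = LOAD * CORES / MEAN_SVC
t, arrivals = 0.0, []
for _ in range(N):
    t += random.expovariate(rate)      # Poisson arrival process
    arrivals.append(t)
services = [random.expovariate(1.0 / MEAN_SVC) for _ in range(N)]

def p99(xs):
    return sorted(xs)[int(0.99 * len(xs))]

tail_shared = p99(simulate(arrivals, services, CORES, single_queue=True))
tail_split = p99(simulate(arrivals, services, CORES, single_queue=False))
print(f"p99 latency: shared queue {tail_shared:.2f}us, "
      f"per-core queues {tail_split:.2f}us")
```

At 80% load the shared queue's 99th-percentile latency is several times lower, because no core sits idle while RPCs wait in another core's queue; this is the kind of queueing effect that motivates moving dispatch decisions into the network interface.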


Cost-Effective TB-Scale Memory Hierarchy

With DRAM density scaling having slowed down, the only path to substantially denser server memory is tighter integration of denser technologies. Solid State Disks (SSDs) offer a ~50x advantage over DRAM in density and cost per capacity, but at a ~500x longer access latency (~50µs). As such, SSDs have historically been treated as storage devices, attached as peripherals and interfaced through a host OS. These legacy interfaces were not designed for the tight tail-latency requirements of today's online services. We therefore believe that a careful hardware/software co-design, tightly integrating SSDs with the host CPU and a DRAM cache, can deliver a substantial increase in capacity while maintaining tail latency. AstriFlash is such a hardware/software co-design: it maps the SSD directly into the address space with hardware support for address translation, and manages a DRAM cache that allows high-bandwidth lookups by the CPU, with custom hardware support for miss handling.
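The DRAM-over-flash trade-off can be sketched with a toy model of a direct-mapped DRAM cache in front of a memory-mapped flash device. The cache organization, parameters, and workload skew below are illustrative assumptions for the sketch, not AstriFlash's actual design.

```python
import random

# Toy model: a direct-mapped DRAM cache in front of a memory-mapped SSD.
# All parameters are illustrative assumptions, not AstriFlash's design.

DRAM_NS = 100          # assumed DRAM cache hit latency
FLASH_NS = 50_000      # assumed SSD access latency (~500x DRAM, ~50us)
PAGES = 1 << 20        # flash pages in the mapped address space
CACHE_PAGES = 1 << 16  # DRAM cache capacity, in pages

tags = [-1] * CACHE_PAGES            # direct-mapped: one tag per cache set

def access(page):
    """Look up a flash page in the DRAM cache; return latency in ns."""
    idx = page % CACHE_PAGES
    if tags[idx] == page:
        return DRAM_NS               # hit: served from DRAM
    tags[idx] = page                 # miss: fill from flash, evict old page
    return FLASH_NS

random.seed(1)
# Skewed workload: 90% of accesses target a hot set that fits in the cache.
hot = list(range(CACHE_PAGES // 2))
N = 100_000
total, hits = 0, 0
for _ in range(N):
    page = random.choice(hot) if random.random() < 0.9 else random.randrange(PAGES)
    lat = access(page)
    total += lat
    hits += lat == DRAM_NS

print(f"hit rate: {hits / N:.2%}, average latency: {total / N:.0f} ns")
```

Even at a high hit rate, the ~500x hit/miss latency gap means average access time is dominated by the miss path, which is why efficient hardware miss handling, not just a large cache, is central to keeping tail latency in check.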


M. Sutherland, S. Gupta, B. Falsafi, V. Marathe, D. Pnevmatikatos, and A. Daglis, The NeBuLa RPC-Optimized Architecture, The 47th International Symposium on Computer Architecture (ISCA '20), Valencia, Spain, May 30 – June 3, 2020.

A. Pourhabibi, S. Gupta, H. Kassir, M. Sutherland, Z. Tian, M. Drumond, B. Falsafi, and C. Koch, Optimus Prime: Accelerating Data Transformation in Servers, The 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20), Lausanne, Switzerland, March 16–20, 2020.

S. Gupta, A. Daglis, and B. Falsafi, Distributed Logless Atomic Durability with Persistent Memory, The 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), Columbus, OH, USA, October 12–16, 2019.

A. Daglis, M. Sutherland, and B. Falsafi, RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs, The 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '19), Providence, Rhode Island, USA, April 13–17, 2019.


Siddharth Gupta

Yunho Oh

Arash Pourhabibi

Mark Sutherland

Abhishek Bhattacharjee

Babak Falsafi

Peter Hsu