Reader 1: Introduction | ||
Chapter 6 (Sections 6.1 & 6.2) of Hennessy & Patterson's Computer Architecture | ||
Reader 2: Evaluation | ||
Gupta, Udit, et al. Chasing carbon: The elusive environmental footprint of computing | ||
Wunderlich, R.E., Wenisch, T.F., Falsafi, B. and Hoe, J.C., 2003, May. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling | ||
Reader 3: Parallel Software Construction | ||
Dean, Jeffrey, and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters | ||
Abadi, Martín, et al. TensorFlow: a system for Large-Scale machine learning | ||
Reader 4: Communication Model | ||
Background: Chapters 2, 18, and 20 of Maurice Herlihy's The Art of Multiprocessor Programming | ||
Birrell, Andrew D., and Bruce Jay Nelson. Implementing remote procedure calls | ||
Reader 5: Workload I | ||
Ferdman, Michael, et al. Clearing the clouds: a study of emerging scale-out workloads on modern hardware | ||
Gan, Yu, et al. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems | ||
Reader 6: Workload II | ||
Ustiugov, Dmitrii, et al. Benchmarking, analysis, and optimization of serverless function snapshots | ||
Reddi, Vijay Janapa, et al. MLPerf inference benchmark | ||
Reader 7: Coherence | ||
Background: Chapters 6 and 7 of Nagarajan, Sorin, Hill, and Wood's A Primer on Memory Consistency and Cache Coherence | ||
Background slide from CS-307 regarding coherence | ||
Moshovos, Andreas, et al. JETTY: Filtering snoops for reduced energy consumption in SMP servers | ||
Ferdman, Michael, et al. Cuckoo directory: A scalable directory for many-core systems | ||
Reader 8: Memory Ordering | ||
Background slide from CS-307 regarding hardware memory reordering | ||
Background slide from CS-307 regarding compiler memory reordering | ||
Adve, Sarita V., and Kourosh Gharachorloo. Shared memory consistency models: A tutorial | ||
Blundell, Colin, Milo MK Martin, and Thomas F. Wenisch. InvisiFence: performance-transparent memory ordering in conventional multiprocessors | ||
Reader 9: CMP Caches | ||
Xie, Yuejian, and Gabriel H. Loh. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches | ||
Hardavellas, Nikos, et al. Reactive NUCA: near-optimal block placement and replication in distributed caches | ||
Reader 10: GPUs | ||
Background slide from CS-307 regarding GPU introduction | ||
Background slide from CS-307 regarding GPU programming | ||
Chapters 1 and 2 of Nemirovsky & Tullsen's Multithreading Architecture | ||
Choquette, Jack. NVIDIA Hopper H100 GPU: Scaling performance | ||
Reader 11: DRAM Caches | ||
Volos, Stavros, et al. Fat caches for scale-out servers | ||
Sodani, Avinash, et al. Knights Landing: Second-generation Intel Xeon Phi product | ||
Reader 12: Interconnects | ||
Chapters 1, 2, and 6 of Jerger & Peh's On-Chip Networks | ||
Reader 13: Cloud-Native CPUs | ||
Lotfi-Kamran, Pejman, et al. Scale-out processors | ||
Lotfi-Kamran, Pejman, Boris Grot, and Babak Falsafi. NOC-Out: Microarchitecting a scale-out processor | ||
Reader 14: Cloud-Native Accelerators | ||
Biswas, Arijit, and Sailesh Kottapalli. Next-Gen Intel Xeon CPU-Sapphire Rapids | ||
Kocberber, Onur, et al. Meet the walkers: Accelerating index traversals for in-memory databases | ||
Reader 15: AI Accelerators | ||
Jouppi, Norman P., et al. Ten lessons from three generations shaped Google's TPUv4i | ||
Drumond, Mario, et al. Equinox: Training (for free) on a custom inference accelerator | ||
Reader 16: Near-Memory Computing | ||
Drumond, Mario, et al. The mondrian data engine | ||
Chi, Ping, et al. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory | ||
Reader 17: Cloud-Native Memory I | ||
Gupta, Siddharth, et al. AstriFlash: a flash-based system for online services | ||
Gupta, Siddharth, et al. Rebooting virtual memory with Midgard | ||
Reader 18: Cloud-Native Memory II | ||
Novakovic, Stanko, et al. Scale-out NUMA | ||
Li, Huaicheng, et al. Pond: CXL-based memory pooling systems for cloud platforms | ||
Reader 19: Cloud-Native Networks I | ||
Daglis, Alexandros, Mark Sutherland, and Babak Falsafi. RPCValet: NI-driven tail-aware balancing of µs-scale RPCs | ||
Sutherland, Mark, et al. The NeBuLa RPC-optimized architecture | ||
Reader 20: Cloud-Native Networks II | ||
Karandikar, Sagar, et al. A hardware accelerator for protocol buffers | ||
Pourhabibi, Arash, et al. Cerebros: Evading the RPC tax in datacenters | ||
Reader 21: Datacenters I | ||
Chapters 1 and 2 of Barroso & Hölzle's The Datacenter as a Computer - An Introduction to the Design of Warehouse-Scale Machines | ||
Reader 22: Datacenters II | ||
Acun, Bilge, et al. Carbon Dependencies in Datacenter Design and Management