FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs (2023)

by Yujia Zhai, Elisabeth Giem, Kai Zhao, Jinyang Liu, Jiajun Huang, Bryan M. Wong, Christian R. Shelton, and Zizhong Chen

Abstract: Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific computing and machine learning. In this paper, we present a new BLAS implementation, FT-BLAS, that provides performance comparable to or better than state-of-the-art BLAS libraries, while being capable of tolerating soft errors on-the-fly. At the algorithmic level, we propose a hybrid strategy to incorporate fault-tolerant functionality. For memory-bound Level-1 and Level-2 BLAS routines, we duplicate computing instructions and re-use data at the register level to avoid memory overhead when validating runtime correctness. We propose a novel use of mask registers on AVX512-enabled processors and SIMD registers on AVX2-enabled processors to store intermediate comparison results. For compute-bound Level-3 BLAS routines, we fuse memory-intensive operations such as checksum encoding and verification into the GEMM assembly kernels to optimize the memory footprint. We also design cache-friendly parallel algorithms for our fault-tolerant library. Through a series of architecture-aware optimizations, we keep the fault-tolerance overhead negligible (<3%). Experimental results obtained on widely used processors such as Intel Skylake, Intel Cascade Lake, and AMD Zen2 demonstrate that FT-BLAS offers high reliability and high performance: it is faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for both serial and parallel routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.
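
Illustration: the two strategies named in the abstract can be sketched in a few lines of Python. This is a minimal sketch and not FT-BLAS code; the library implements these ideas inside assembly kernels using SIMD and mask registers. The function names (dmr_scale, abft_gemm), the simulated fault parameters, the recompute-on-mismatch policy, and the single-element correction step are all assumptions made for illustration, following the general duplication and checksum (algorithm-based fault tolerance) approaches rather than the paper's exact scheme.

```python
# Illustrative sketch only; names and fault-injection hooks are hypothetical.

def dmr_scale(alpha, x, fault_at=None):
    """Scale x by alpha, computing each product twice and comparing the results.

    Returns (result, number_of_detected_mismatches).
    """
    out, detected = [], 0
    for i, xi in enumerate(x):
        first = alpha * xi
        if i == fault_at:
            first += 1.0          # simulated soft error in the first computation
        second = alpha * xi       # duplicated computation on the same loaded value
        if first != second:       # mismatch: an error occurred, so recompute
            detected += 1
            first = alpha * xi
        out.append(first)
    return out, detected


def abft_gemm(a, b, fault=None, tol=1e-9):
    """Compute C = A * B and verify it against row and column checksums.

    fault is an optional (row, col, delta) tuple that corrupts one element of C.
    Returns (C, corrected), where corrected is True if one element was repaired.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    if fault is not None:
        i, j, delta = fault
        c[i][j] += delta          # simulated soft error in the output

    # Expected checksums derived from the inputs only:
    # row sums of C equal A * (B e); column sums of C equal (e^T A) * B.
    b_row_sums = [sum(row) for row in b]
    a_col_sums = [sum(a[i][p] for i in range(m)) for p in range(k)]
    expected_row = [sum(a[i][p] * b_row_sums[p] for p in range(k)) for i in range(m)]
    expected_col = [sum(a_col_sums[p] * b[p][j] for p in range(k)) for j in range(n)]

    row_diff = [sum(c[i]) - expected_row[i] for i in range(m)]
    col_diff = [sum(c[i][j] for i in range(m)) - expected_col[j] for j in range(n)]
    bad_rows = [i for i, d in enumerate(row_diff) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_diff) if abs(d) > tol]

    corrected = False
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # A single corrupted element sits at the intersection of the bad row
        # and bad column; subtracting the checksum difference restores it.
        c[bad_rows[0]][bad_cols[0]] -= row_diff[bad_rows[0]]
        corrected = True
    return c, corrected
```

For example, calling abft_gemm(a, b, fault=(0, 1, 10)) corrupts one output element, locates it from the mismatching row and column checksums, and repairs it. The sketch verifies after the multiplication finishes, whereas the abstract describes fusing checksum encoding and verification into the GEMM kernel itself to avoid extra passes over memory.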

Download Information

Yujia Zhai, Elisabeth Giem, Kai Zhao, Jinyang Liu, Jiajun Huang, Bryan M. Wong, Christian R. Shelton, and Zizhong Chen (2023). "FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs." IEEE Transactions on Parallel and Distributed Systems. pdf

Bibtex citation

   author = "Yujia Zhai and Elisabeth Giem and Kai Zhao and Jinyang Liu and Jiajun Huang and Bryan M. Wong and Christian R. Shelton and Zizhong Chen",
   title = "{FT-BLAS}: A Fault Tolerant High Performance {BLAS} Implementation on {x86} {CPU}s",
   year = 2023,
   journal = "{IEEE} Transactions on Parallel and Distributed Systems",
   journalabbr = "TPDS",