Supercomputing Laboratory - Home

University of California, Riverside

CSM Supercomputing Laboratory



Publications (with students underlined)



- 2013 -

  • "Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach."
    Dong Li, Zizhong Chen, Panruo Wu, and Jeffrey Vetter.
    Proceedings of the ACM/IEEE SC'13 Conference, Denver, CO, November 17-22, 2013. ACM Press.
  • "Energy-Efficient Scheduling for Multicore Systems with Bounded Resources."
    Ziliang Zong, Jonathan Bush, Rong Ge, Xin Li, and Zizhong Chen.
    Proceedings of the 2013 IEEE International Conference on Green Computing and Communications (GreenCom 2013), Beijing, China. August 20-23, 2013.
  • "On-line Soft Error Correction in Matrix-Matrix Multiplication."
    Panruo Wu, Chong Ding, Longxiang Chen, Teresa Davies, Christer Karlsson, and Zizhong Chen.
    Journal of Computational Science. June, 2013.
  • "Characterizing Power and Energy Consumption of MapReduce Data Movements."
    Thomas Wirtz, Rong Ge, Ziliang Zong and Zizhong Chen.
    Proceedings of the Fourth International Green Computing Conference (IGCC 2013), Work-in-Progress, Arlington, Virginia, USA. June 27-29, 2013.
  • "Correcting Soft Errors Online in LU Factorization."
    Teresa Davies and Zizhong Chen.
    Proceedings of the 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2013), New York City, NY, USA. June 17-21, 2013. ACM Press.
  • "Multi-Level Diskless Checkpointing."
    Doug Hakkarinen and Zizhong Chen.
    IEEE Transactions on Computers. Vol. 62, No. 4, pp. 772-783, April, 2013.
  • "Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods."
    Zizhong Chen.
    Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2013), Shenzhen, China, February 23-27, 2013. ACM Press.

- 2012 -

  • "Optimizing Process-to-Core Mappings for Application Level Multi-dimensional MPI Communications."
    Christer Karlsson, Teresa Davies, and Zizhong Chen.
    Proceedings of the 2012 IEEE International Conference on Cluster Computing (Cluster 2012), Beijing, China, September 24-28, 2012.
  • "Energy Efficient Parallel Matrix-Matrix Multiplication for DVFS-Enabled Clusters."
    Longxiang Chen, Panruo Wu, Zizhong Chen, Rong Ge, and Ziliang Zong.
    Proceedings of the 2012 International Workshop on Power-Aware Systems and Architectures (PASA 2012), in conjunction with the 41st Annual International Conference on Parallel Processing (ICPP 2012), Pittsburgh, USA, September 10-13, 2012.
  • "eTune: A Power Analysis Framework for Data-Intensive Computing."
    Rong Ge, Xizhou Feng, Thomas Wirtz, Ziliang Zong, and Zizhong Chen.
    Proceedings of the 2012 International Workshop on Power-Aware Systems and Architectures (PASA 2012), in conjunction with the 41st Annual International Conference on Parallel Processing (ICPP 2012), Pittsburgh, USA, September 10-13, 2012.
  • "Runtime Optimization of Broadcast Communications using Dynamic Network Topology Information from MPI."
    Jeffrey Godwin, Christer Karlsson, and Zizhong Chen.
    Proceedings of the 14th IEEE International Conference on High Performance Computing and Communications (HPCC-2012), Liverpool, UK, June 25-27, 2012.
  • "Energy Consumption Analysis of Parallel Sorting Algorithms Running on Multicore Systems."
    Ivan Zecena, Ziliang Zong, Rong Ge, Tongdan Jin, and Zizhong Chen.
    Proceedings of the 2nd International Workshop on Power Measurement and Profiling (PMP 2012), San Jose, California, USA, June 5-8, 2012.
  • "Reduced Data Communication for Parallel CMA-ES for REACTS."
    Doug Hakkarinen, Tracy Camp, Zizhong Chen, and Allan Haas.
    Proceedings of the 20th EUROMICRO International Conference on Parallel, Distributed and Network-based Processing (PDP 2012), Garching, Germany, February 15-17, 2012.

- 2011 -

  • "Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line."
    Panruo Wu, Chong Ding, Longxiang Chen, Feng Gao, Teresa Davies, Christer Karlsson, and Zizhong Chen.
    Proceedings of the 2011 Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) held in conjunction with the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC11), Seattle, WA, USA, November 14 - 18, 2011. ACM Press.
  • "Optimizing Process-to-Core Mappings for Two Dimensional Broadcast/Reduce on Multicore Architectures."
    Christer Karlsson, Teresa Davies, Chong Ding, Hui Liu, and Zizhong Chen.
    Proceedings of the 40th IEEE International Conference on Parallel Processing (ICPP 2011), Taipei, Taiwan, September 13-16, 2011. IEEE Computer Society Press. Acceptance Rate 22.3%, 81/363.
  • "Algorithm-Based Recovery for Iterative Methods without Checkpointing."
    Zizhong Chen.
    Proceedings of the 20th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2011), San Jose, California, June 8-11, 2011. ACM Press. Full Paper Acceptance Rate 12.9%, 22/170.
  • "High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing."
    Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, and Zizhong Chen.
    Proceedings of the 25th ACM International Conference on Supercomputing (ICS 2011), Tucson, Arizona, May 31 - June 4, 2011. ACM Press. Acceptance Rate 21.7%, 35/161.
  • "Matrix Multiplication on GPUs with On-Line Fault Tolerance."
    Chong Ding, Christer Karlsson, Hui Liu, Teresa Davies, and Zizhong Chen.
    Proceedings of the 9th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2011), Busan, Korea, May 26-28, 2011. IEEE Computer Society Press.
  • "Algorithm-Based Recovery for Newton's Method without Checkpointing."
    Hui Liu, Teresa Davies, Chong Ding, Christer Karlsson, and Zizhong Chen.
    Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium, DPDNS'11 Workshop, Anchorage, Alaska, USA, May 16-20, 2011. IEEE Computer Society Press.
  • "Algorithm-Based Recovery for HPL."
    Teresa Davies, Christer Karlsson, Hui Liu, and Zizhong Chen.
    Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2011), poster paper, pages 303-304, San Antonio, TX, USA, February 12-16, 2011. ACM Press.

- 2010 -

  • "Constructing Numerically Stable Real Number Codes using Evolutionary Computation."
    Aaron Garrett, Zizhong Chen, and Daniel Smith.
    Proceedings of the 12th ACM Annual Conference on Genetic and Evolutionary Computation (GECCO 2010), Portland, OR, USA, July 7-11, 2010. ACM Press.
  • "Algorithmic Cholesky Factorization Fault Recovery."
    Doug Hakkarinen and Zizhong Chen.
    Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2010), Atlanta, GA, USA, April 19-23, 2010. IEEE Computer Society Press.
  • "Fault Tolerant Linear Algebra: Recovering from Fail-Stop Failures without Checkpointing."
    Teresa Davies and Zizhong Chen.
    Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, PhD Forum, Atlanta, GA, USA, April 19-23, 2010. IEEE Computer Society Press. Best Paper Award.
  • "Highly Scalable Checkpointing for Exascale Computing."
    Christer Karlsson and Zizhong Chen.
    Proceedings of the 24th IEEE International Parallel & Distributed Processing Symposium, PhD Forum, Atlanta, GA, USA, April 19-23, 2010. IEEE Computer Society Press.
  • "Adaptive Checkpointing."
    Zizhong Chen.
    Journal of Communications, Volume 5, Number 1, January 2010.

- 2009 -

  • "Optimal Real Number Codes for Fault Tolerant Matrix Operations."
    Zizhong Chen.
    Proceedings of the ACM/IEEE SC09 Conference, Portland, OR, November 14-20, 2009. ACM Press.
  • "N-Level Diskless Checkpointing."
    Doug Hakkarinen and Zizhong Chen.
    Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications (HPCC-09), Seoul, Korea, June 25-27, 2009. IEEE Computer Society Press.
  • "Scalable Fault Tolerance for Large-Scale Parallel and Distributed Computing."
    Zizhong Chen.
    Handbook of Research on Scalable Computing Technologies, Chapter 34, IGI Global, 2009.
  • "Self Adaptive Application Level Fault Tolerance for High Performance Computing."
    Zizhong Chen.
    Bulletin of Advanced Technology Research, Vol. 3, No. 8, Aug., 2009.
  • "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing."
    Zizhong Chen and Jack Dongarra.
    IEEE Transactions on Computers. Vol. 58, No. 11, November, 2009.
  • "Pipelining Parallel Image Compositing and Delivery for Efficient Remote Visualization."
    Qishi Wu, Jinzhu Gao, Zizhong Chen, and Mengxia Zhu.
    Journal of Parallel and Distributed Computing, Vol. 69, No. 3, March, 2009.

- 2008 -

  • "Algorithm-Based Fault Tolerance for Fail-Stop Failures."
    Zizhong Chen and Jack Dongarra.
    IEEE Transactions on Parallel and Distributed Systems, Vol. 19, No. 12, December, 2008.
  • "A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing."
    Zizhong Chen and Jack Dongarra.
    Proceedings of the 11th IEEE High Assurance Systems Engineering Symposium (HASE'08), Nanjing, China, December 3-5, 2008. IEEE Computer Society Press.
  • "Extending Algorithm-based Fault Tolerance to Tolerate Fail-stop Failures in High Performance Distributed Environments."
    Zizhong Chen.
    Proceedings of the 22nd IEEE International Parallel & Distributed Processing Symposium, DPDNS'08 Workshop, Miami, FL, USA, April 14-18, 2008. IEEE Computer Society Press.
  • "Performance of MPI Broadcast Algorithms."
    Daniel M. Wadsworth and Zizhong Chen.
    Proceedings of the 22nd IEEE International Parallel & Distributed Processing Symposium, PDSEC'08 Workshop, Miami, FL, USA, April 14-18, 2008. IEEE Computer Society Press.

- 2007 -

  • "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment."
    Julien Langou, Zizhong Chen, George Bosilca, and Jack Dongarra.
    SIAM Journal on Scientific Computing, 30(1):102-116, 2007.
  • "Disaster Survival Guide in Petascale Computing: An Algorithmic Approach."
    Jack J. Dongarra, Zizhong Chen, George Bosilca, and Julien Langou.
    Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Press, 2007.
  • "An Efficient Packet Loss Recovery Methodology for Video-over-IP."
    Ming Yang, Nikolaos Bourbakis, Zizhong Chen, and Guillermo Francia, III.
    Proceedings of the 9th IASTED International Conference on Signal and Image Processing (SIP2007), Honolulu, Hawaii, USA, August 20-22, 2007.
  • "An Efficient Recovery Scheme for Supercomputing Clusters and Grids."
    Zizhong Chen, Ming Yang, Monica Trifas, and Jack Dongarra.
    Proceedings of the 6th International Conference on Distributed Computing and Applications for Business, Engineering and Sciences (DCABES2007), Yichang, Hubei, P. R. China, August 14-17, 2007.
  • "An Efficient Audio-Video Synchronization Methodology."
    Ming Yang, Nikolaos Bourbakis, Zizhong Chen, and Monica Trifas.
    Proceedings of the 2007 IEEE International Conference on Multimedia & Expo (ICME 2007), Beijing, P. R. China, July 2-5, 2007. IEEE Computer Society Press.
  • "Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing."
    Zizhong Chen, Ming Yang, Guillermo Francia, III, and Jack Dongarra.
    Proceedings of the 21st IEEE International Parallel & Distributed Processing Symposium, DPDNS'07 Workshop, Long Beach, CA, USA, March 26-29, 2007. IEEE Computer Society Press.

- 2006 and before -

  • "Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources."
    Zizhong Chen and Jack Dongarra.
    Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, April 25-29, 2006. IEEE Computer Society Press.
  • "Self Adapting Numerical Software (SANS) Effort."
    Jack Dongarra, George Bosilca, Zizhong Chen, Victor Eijkhout, Graham Fagg, Erika Fuentes, Julien Langou, Piotr Luszczek, Jelena Pjesivac-Grbovic, Keith Seymour, Haihang You, and Sathish S. Vadhiyar.
    IBM Journal of Research and Development. Volume 50, Number 2/3, Pages 223-238, 2006.
  • "Condition Numbers of Gaussian Random Matrices."
    Zizhong Chen and Jack J. Dongarra.
    SIAM Journal on Matrix Analysis and Applications, Volume 27, Number 3, Pages 603-620, 2005.
  • "Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing."
    Graham E. Fagg, Edgar Gabriel, Zizhong Chen, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, and Jack Dongarra.
    International Journal of High Performance Computing Applications, Volume 19, Number 4, Pages 465-477, Winter, 2005.
  • "Fault Tolerant High Performance Computing by a Coding Approach."
    Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack J. Dongarra.
    Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'05), Chicago, Illinois, USA, June 15-17, 2005. ACM Press.
  • "Numerically Stable Real Number Codes Based on Random Matrices."
    Zizhong Chen and Jack J. Dongarra.
    Proceedings of the 5th International Conference on Computational Science (ICCS2005), Atlanta, Georgia, USA, May 22-25, 2005. LNCS 3514, Springer-Verlag.
  • "Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems."
    Graham E. Fagg, Edgar Gabriel, George Bosilca, Thara Angskun, Zizhong Chen, Jelena Pjesivac-Grbovic, Kevin London and Jack J. Dongarra.
    Proceedings of the 19th International Supercomputer Conference (ISC2004), Heidelberg, Germany, June 21-24, 2004. Best Paper Award.
  • "LAPACK for Clusters Project: An Example of Self Adapting Numerical Software."
    Zizhong Chen, Jack Dongarra, Piotr Luszczek, and Kenneth Roche.
    Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS-37), Kauai, Hawaii, USA, January 5-8, 2004. IEEE Computer Society Press.
  • "Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters."
    Zizhong Chen, Jack Dongarra, Piotr Luszczek, and Kenneth Roche.
    Parallel Computing, Volume 29, Number 11-12, Pages 1723-1743, November-December, 2003.
  • "Fault Tolerant Communication Library and Applications for High Performance Computing."
    Graham E. Fagg, Edgar Gabriel, Zizhong Chen, Thara Angskun, George Bosilca, Antonin Bukovsky, and Jack J. Dongarra.
    Proceedings of the 4th Los Alamos Computer Science Institute Symposium (LACSI'03), Santa Fe, NM, USA, October 27-29, 2003.
  • "Self Adaptive Software for Numerical Linear Algebra Library Routines on Clusters."
    Zizhong Chen, Jack Dongarra, Piotr Luszczek, and Kenneth Roche.
    Proceedings of the 3rd International Conference on Computational Science, WoPLA'03 Workshop, Melbourne, Australia, June 7-9, 2003. LNCS 2659, Springer-Verlag.