What supercomputers say: A study of five system logs A Oliner, J Stearley 37th annual IEEE/IFIP international conference on dependable systems and …, 2007 | 723 | 2007 |
Addressing failures in exascale computing M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ... The International Journal of High Performance Computing Applications 28 (2 …, 2014 | 534 | 2014 |
Memory errors in modern systems: The good, the bad, and the ugly V Sridharan, N DeBardeleben, S Blanchard, KB Ferreira, J Stearley, ... ACM SIGARCH Computer Architecture News 43 (1), 297-310, 2015 | 390 | 2015 |
Evaluating the viability of process replication reliability for exascale systems K Ferreira, J Stearley, JH Laros III, R Oldfield, K Pedretti, R Brightwell, ... Proceedings of 2011 International Conference for High Performance Computing …, 2011 | 335 | 2011 |
Feng shui of supercomputer memory: Positional effects in DRAM and SRAM faults V Sridharan, J Stearley, N DeBardeleben, S Blanchard, S Gurumurthi Proceedings of the International Conference on High Performance Computing …, 2013 | 243 | 2013 |
Towards informatic analysis of syslogs J Stearley 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No …, 2004 | 195 | 2004 |
Alert detection in system logs AJ Oliner, A Aiken, J Stearley 2008 Eighth IEEE International Conference on Data Mining, 959-964, 2008 | 132 | 2008 |
Bad words: Finding faults in Spirit's syslogs J Stearley, AJ Oliner 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid …, 2008 | 90 | 2008 |
Bridging the gaps: Joining information sources with Splunk J Stearley, S Corwell, K Lord Workshop on Managing Systems via Log Analysis and Machine Learning …, 2010 | 52 | 2010 |
Inter-agency workshop on HPC resilience at extreme scale J Daly, B Harrod, T Hoang, L Nowell, B Adolf, S Borkar, N DeBardeleben, ... National Security Agency Advanced Computing Systems, 2012 | 45 | 2012 |
Redundant computing for exascale systems JR Stearley, RE Riesen, JH Laros III, KB Ferreira, KTT Pedretti, ... Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA …, 2010 | 38 | 2010 |
Increasing fault resiliency in a message-passing environment K Ferreira, R Riesen, R Oldfield, J Stearley, J Laros, K Pedretti, ... Sandia National Laboratories, Technical report SAND2009-6753, 2009 | 38 | 2009 |
Defining and measuring supercomputer Reliability, Availability, and Serviceability (RAS) J Stearley Proceedings of the Linux clusters institute conference, 2005 | 37 | 2005 |
Does partial replication pay off? J Stearley, K Ferreira, D Robinson, J Laros, K Pedretti, D Arnold, ... IEEE/IFIP International Conference on Dependable Systems and Networks …, 2012 | 36 | 2012 |
See applications run and throughput jump: The case for redundant computing in HPC R Riesen, K Ferreira, J Stearley 2010 International Conference on Dependable Systems and Networks Workshops …, 2010 | 29 | 2010 |
rMPI: increasing fault resiliency in a message-passing environment K Ferreira, R Riesen, R Oldfield, J Stearley, J Laros, K Pedretti, ... Sandia National Laboratories, Albuquerque, NM, Tech. Rep. SAND2011-2488, 2011 | 25 | 2011 |
JHL III, R K Ferreira, R Riesen, P Bridges, D Arnold, J Stearley Oldfield, K. Pedretti, and R. Brightwell,“Evaluating the viability of …, 2011 | 23 | 2011 |
Extra bits on SRAM and DRAM errors–more data from the field N DeBardeleben, S Blanchard, V Sridharan, S Gurumurthi, J Stearley, ... IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE), 2014 | 19 | 2014 |
Sisyphus log data mining toolkit J Stearley Accessed from the Web, 2009 | 13 | 2009 |
A State-Machine Approach to Disambiguating Supercomputer Event Logs J Stearley, R Ballance, L Bauman | 10 | 2012 |