Nathan DeBardeleben
Research Scientist, Los Alamos National Laboratory
Verified email at lanl.gov
Title
Cited by
Year
Addressing failures in exascale computing
M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ...
The International Journal of High Performance Computing Applications 28 (2 …, 2014
Cited by: 534; Year: 2014
Memory errors in modern systems: The good, the bad, and the ugly
V Sridharan, N DeBardeleben, S Blanchard, KB Ferreira, J Stearley, ...
ACM SIGARCH Computer Architecture News 43 (1), 297-310, 2015
Cited by: 390; Year: 2015
Feng shui of supercomputer memory: Positional effects in DRAM and SRAM faults
V Sridharan, J Stearley, N DeBardeleben, S Blanchard, S Gurumurthi
Proceedings of the International Conference on High Performance Computing …, 2013
Cited by: 243; Year: 2013
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
D Tiwari, S Gupta, J Rogers, D Maxwell, P Rech, S Vazhkudai, D Oliveira, ...
2015 IEEE 21st International Symposium on High Performance Computer …, 2015
Cited by: 202; Year: 2015
On the diversity of cluster workloads and its impact on research results
G Amvrosiadis, JW Park, GR Ganger, GA Gibson, E Baseman, ...
2018 USENIX Annual Technical Conference (USENIX ATC 18), 533-546, 2018
Cited by: 161; Year: 2018
BinFI: an efficient fault injector for safety-critical machine learning systems
Z Chen, G Li, K Pattabiraman, N DeBardeleben
Proceedings of the International Conference for High Performance Computing …, 2019
Cited by: 119; Year: 2019
F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
Q Guan, N DeBardeleben, S Blanchard, S Fu
Proceedings of the 2014 IEEE 28th International Parallel and Distributed …, 2014
Cited by: 89; Year: 2014
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development
N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod
Whitepaper, Dec, 2009
Cited by: 88; Year: 2009
GPGPUs: How to Combine High Computational Power with High Reliability
LB Gomez, F Cappello, L Carro, N DeBardeleben, B Fang, S Gurumurthi, ...
Cited by: 87; Year: 2014
TensorFI: A flexible fault injection framework for TensorFlow applications
Z Chen, N Narayanan, B Fang, G Li, K Pattabiraman, N DeBardeleben
2020 IEEE 31st International Symposium on Software Reliability Engineering …, 2020
Cited by: 79; Year: 2020
TensorFI: A configurable fault injector for TensorFlow applications
G Li, K Pattabiraman, N DeBardeleben
2018 IEEE International Symposium on Software Reliability Engineering …, 2018
Cited by: 73; Year: 2018
Experimental and analytical study of Xeon Phi reliability
D Oliveira, L Pilla, N DeBardeleben, S Blanchard, H Quinn, I Koren, ...
Proceedings of the International Conference for High Performance Computing …, 2017
Cited by: 62; Year: 2017
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters
WM Jones, JT Daly, N DeBardeleben
Proceedings of the 19th ACM International Symposium on High Performance …, 2010
Cited by: 58; Year: 2010
Towards practical algorithm based fault tolerance in dense linear algebra
P Wu, Q Guan, N DeBardeleben, S Blanchard, D Tao, X Liang, J Chen, ...
Proceedings of the 25th ACM International Symposium on High-Performance …, 2016
Cited by: 53; Year: 2016
Application monitoring and checkpointing in HPC: looking towards exascale systems
WM Jones, JT Daly, N DeBardeleben
Proceedings of the 50th Annual Southeast Regional Conference, 262-267, 2012
Cited by: 49; Year: 2012
Inter-agency workshop on HPC resilience at extreme scale
J Daly, B Harrod, T Hoang, L Nowell, B Adolf, S Borkar, N DeBardeleben, ...
National Security Agency Advanced Computing Systems, 2012
Cited by: 45; Year: 2012
Lessons learned from memory errors observed over the lifetime of Cielo
S Levy, KB Ferreira, N DeBardeleben, T Siddiqua, V Sridharan, ...
SC18: International Conference for High Performance Computing, Networking …, 2018
Cited by: 44; Year: 2018
GPU behavior on a large HPC cluster
N DeBardeleben, S Blanchard, L Monroe, P Romero, D Grunau, C Idler, ...
Euro-Par 2013: Parallel Processing Workshops: BigDataCloud, DIHC, FedICI …, 2014
Cited by: 44; Year: 2014
TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs
J Chen, N Xiong, X Liang, D Tao, S Li, K Ouyang, K Zhao, ...
Proceedings of the ACM International Conference on Supercomputing, 106-116, 2019
Cited by: 42; Year: 2019
Interpretable anomaly detection for monitoring of high performance computing systems
E Baseman, S Blanchard, N DeBardeleben, A Bonnie, A Morrow
Outlier Definition, Detection, and Description on Demand Workshop at ACM …, 2016
Cited by: 37; Year: 2016