2016-04-25
Runtime interval optimization and dependable performance for application-level checkpointing
Publication
As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.
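The abstract does not say how DepMan derives its checkpoint interval. As a rough illustration of why the interval should follow the failure rate, the sketch below uses Young's classic first-order approximation, which is a stand-in and not necessarily the method used in the paper. The function names, the 2-second checkpoint cost, and the MTBF values are all hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Illustrative stand-in only; the paper's own optimizer may differ.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order expected overhead: checkpointing cost plus rework after a failure.

    Assumes the checkpoint cost is much smaller than the MTBF.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    CHECKPOINT_COST_S = 2.0  # hypothetical cost of writing one checkpoint
    # Hypothetical time-varying failure rate: the MTBF shrinks as errors become more frequent.
    for mtbf_s in (3600.0, 900.0, 100.0):
        interval = young_interval(CHECKPOINT_COST_S, mtbf_s)
        overhead = overhead_fraction(interval, CHECKPOINT_COST_S, mtbf_s)
        print(f"MTBF {mtbf_s:7.1f} s -> interval {interval:6.1f} s, overhead {overhead:.1%}")
```

Under this simplified model, a shorter mean time between failures calls for more frequent checkpoints, and the remaining overhead is the kind of timing cost the abstract proposes to reclaim with DVFS.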
| Additional Metadata | |
|---|---|
| Persistent URL | hdl.handle.net/1765/97414 |
| Conference | 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016 |
| Organisation | Erasmus MC: University Medical Center Rotterdam |

Kokolis, A. (Apostolos), Mavrogiannis, A. (Alexandros), Rodopoulos, D., Strydis, C., & Soudris, D. (2016). Runtime interval optimization and dependable performance for application-level checkpointing. In Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016 (pp. 594–599). Retrieved from http://hdl.handle.net/1765/97414