On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows


Presentation held at the 12th Workflows in Support of Large-Scale Science, 2017
Denver, CO, USA – SuperComputing’17

Abstract – Science applications frequently produce and consume large volumes of data, but delivering this data to and from compute resources can be challenging, as parallel file system performance is not keeping up with compute and memory performance. To mitigate this I/O bottleneck, some systems have deployed burst buffers, but their impact on performance for real-world workflow applications is not always clear. In this paper, we examine the impact of burst buffers through the remote-shared, allocatable burst buffers on the Cori system at NERSC. By running a subset of the SCEC CyberShake workflow, a production seismic hazard analysis workflow, we find that using burst buffers offers read and write improvements of about an order of magnitude, and these improvements lead to increased job performance, even for long-running CPU-bound jobs.

 

Related Publication

  • [PDF] [DOI] R. Ferreira da Silva, S. Callaghan, and E. Deelman, “On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows,” in 12th Workshop on Workflows in Support of Large-Scale Science (WORKS’17), 2017.
    [Bibtex]
    @inproceedings{ferreiradasilva-works-2017,
    title = {On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows},
    author = {Ferreira da Silva, Rafael and Callaghan, Scott and Deelman, Ewa},
    booktitle = {12th Workshop on Workflows in Support of Large-Scale Science (WORKS'17)},
    year = {2017},
    pages = {},
    doi = {10.1145/3150994.3151000}
    }

 

88 views

Continue Reading

Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows


Presentation held at the 11th Workflows in Support of Large-Scale Science, 2016
Salt Lake City, UT, USA – SuperComputing’16

Abstract – Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflows in distributed systems is failure prediction, detection, and recovery. In this paper, we propose an approach to use control theory developed as part of autonomic computing to predict failures before they happen, and mitigated them when possible. The proposed approach applying the proportional-integral-derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the inputs of the controller. The PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era—data storage overload and memory overflow. We define, implement, and evaluate simple PID controllers to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB of data, and requires over 24TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that nearly-optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence.

 

Related Publication

  • [PDF] R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, and M. Atkinson, “Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows,” in 11th Workflows in Support of Large-Scale Science, 2016, pp. 15-24.
    [Bibtex]
    @inproceedings{ferreiradasilva-works-2016,
    author = {Ferreira da Silva, Rafael and Filgueira, Rosa and Deelman, Ewa and Pairo-Castineira, Erola and Overton, Ian Michael and Atkinson, Malcolm},
    title = {Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows},
    year = {2016},
    booktitle = {11th Workflows in Support of Large-Scale Science},
    series = {WORKS'16},
    pages = {15--24}
    }

 

687 views

Continue Reading