Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows


Presentation held at the 11th Workflows in Support of Large-Scale Science, 2016
Salt Lake City, UT, USA – SuperComputing’16

Abstract – Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflows in distributed systems is failure prediction, detection, and recovery. In this paper, we propose an approach that uses control theory developed as part of autonomic computing to predict failures before they happen, and to mitigate them when possible. The proposed approach applies the proportional-integral-derivative (PID) controller control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the controller's inputs. The PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era: data storage overload and memory overflow. We define, implement, and evaluate simple PID controllers to autonomously manage the data and memory usage of a bioinformatics workflow that consumes/produces over 4.4 TB of data and requires over 24 TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that near-optimal executions (slowdown of 1.01) can be attained with our proposed method, and that faults are detected and mitigated far in advance of their occurrence.
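The control loop at the core of this approach can be sketched in a few lines. The following minimal discrete PID controller (Python; the gains, setpoint, and usage samples are illustrative assumptions, not the controllers evaluated in the paper) shows how a measured resource level such as disk usage is turned into a control signal that a workflow engine could use to throttle task submissions or trigger cleanup:

    # Minimal discrete PID controller sketch (illustrative, not the paper's code).
    class PIDController:
        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint        # e.g., target fraction of disk in use
            self.integral = 0.0
            self.prev_error = None

        def update(self, measurement, dt):
            """Return a control signal from the current measurement."""
            error = self.setpoint - measurement
            self.integral += error * dt
            derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    # Hypothetical usage: a negative output means usage exceeds the setpoint,
    # so the engine could pause submissions or remove already staged-out data.
    pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=0.8)
    for usage in (0.60, 0.72, 0.85, 0.91):   # sampled disk-usage fractions
        print(f"usage={usage:.2f} control={pid.update(usage, dt=1.0):+.3f}")

In the paper, analogous controllers monitor projected disk and memory usage so that corrective actions are taken before the fault actually occurs.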

 

Related Publication

  • [PDF] R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, and M. Atkinson, “Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows,” in 11th Workflows in Support of Large-Scale Science, 2016, pp. 15-24.
    [Bibtex]
    @inproceedings{ferreiradasilva-works-2016,
    author = {Ferreira da Silva, Rafael and Filgueira, Rosa and Deelman, Ewa and Pairo-Castineira, Erola and Overton, Ian Michael and Atkinson, Malcolm},
    title = {Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows},
    year = {2016},
    booktitle = {11th Workflows in Support of Large-Scale Science},
    series = {WORKS'16},
    pages = {15--24}
    }

 


Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows


Presentation held at the Workshop of Environmental Computing Applications, 2016
Baltimore, MD, USA – IEEE 12th International Conference on eScience

Abstract – In order to support the computational and data needs of today’s science, new knowledge must be gained on how to deliver the growing capabilities of the national cyberinfrastructures and, more recently, commercial clouds to the scientist’s desktop in an accessible, reliable, and scalable way. In over a decade of working with domain scientists, the Pegasus workflow management system has been used by researchers to model seismic wave propagation, to discover new celestial objects, to study RNA critical to human brain development, and to investigate other important research questions. Recently, the Pegasus and dispel4py teams have collaborated to enable automated processing of real-time seismic interferometry and earthquake “repeater” analysis using data collected from the IRIS database. The proposed integrated solution enables real-time stream-based workflows to seamlessly run on different distributed infrastructures (or in the wide area), where data is automatically managed by a task-oriented workflow system that orchestrates the distributed execution. We have demonstrated the feasibility of this approach by using Docker containers to deploy the workflow management systems and two different computing infrastructures: an Apache Storm cluster for real-time processing, and an MPI-based cluster for shared-memory computing. Stream-based execution is managed by dispel4py, while data movement between the clusters and the workflow engine (submit host) is managed by Pegasus.
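As a rough illustration of the stream-based pattern described above (plain Python generators, not the dispel4py API; the function names, data, and threshold are assumptions), processing elements can be chained so that data flows through the pipeline as it is produced:

    # Illustrative producer -> transform -> consumer streaming pipeline.
    import random
    import time

    def source(n_samples):
        """Hypothetical producer: emits (timestamp, amplitude) readings."""
        for _ in range(n_samples):
            yield (time.time(), random.gauss(0.0, 1.0))

    def detrend(stream):
        """Transform: subtract a running mean from each sample."""
        total, count = 0.0, 0
        for ts, amp in stream:
            total += amp
            count += 1
            yield (ts, amp - total / count)

    def detect(stream, threshold=2.5):
        """Consumer: flag samples whose detrended amplitude exceeds a threshold."""
        for ts, amp in stream:
            if abs(amp) > threshold:
                print(f"possible event at {ts:.3f}: amplitude {amp:.2f}")

    # Chain the processing elements into a single streaming pipeline.
    detect(detrend(source(10_000)))

In the integrated solution, dispel4py maps this kind of pipeline onto execution engines such as Apache Storm or MPI, while Pegasus handles data movement between the clusters and the submit host.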

 

Related Publication

  • [PDF] [DOI] R. Ferreira da Silva, E. Deelman, R. Filgueira, K. Vahi, M. Rynge, R. Mayani, and B. Mayer, “Automating Environmental Computing Applications with Scientific Workflows,” in Environmental Computing Workshop, IEEE 12th International Conference on e-Science, 2016, pp. 400-406.
    [Bibtex]
    @inproceedings{ferreiradasilva-ecw-2016,
    author = {Ferreira da Silva, Rafael and Deelman, Ewa and Filgueira, Rosa and Vahi, Karan and Rynge, Mats and Mayani, Rajiv and Mayer, Benjamin},
    title = {Automating Environmental Computing Applications with Scientific Workflows},
    year = {2016},
    booktitle = {Environmental Computing Workshop, IEEE 12th International Conference on e-Science},
    series = {ECW'16},
    doi = {10.1109/eScience.2016.7870926},
    pages = {400--406}
    }

 


Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud and Amazon Web Services


Presentation held at the 18th Workshop on Advances in Parallel and Distributed Computational Models, 2016
Chicago, IL, USA – 30th IEEE International Parallel and Distributed Processing Symposium

Abstract – Scientific workflows have become mainstream for conducting large-scale scientific research. Meanwhile, cloud computing has emerged as an alternative computing paradigm. In this paper, we analyze the performance of a real I/O-intensive scientific workflow in cloud environments, using makespan (the turnaround time for a workflow to complete its execution) as the key performance metric. In particular, we assess the impact of varying storage configurations on workflow performance when executing on Google Cloud and Amazon Web Services, aiming to understand the performance bottlenecks of these popular cloud-based execution environments. Experimental results show significant differences in application performance across configurations. They also reveal that Amazon Web Services outperforms Google Cloud with equivalent application and system configurations. We then investigate the root cause of these results using provenance data and by benchmarking disk and network I/O on both infrastructures. Lastly, we suggest modifications to the standard cloud storage APIs that would reduce the makespan for I/O-intensive workflows.
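The disk-I/O side of such a comparison can be approximated with a small micro-benchmark like the sketch below (Python; the file path, transfer size, and block size are assumptions, and this is not the instrumentation used in the paper):

    # Rough sequential write/read throughput benchmark (illustrative only).
    import os
    import time

    def write_read_benchmark(path, size_mb=256, block_kb=1024):
        """Time a sequential write followed by a sequential read of size_mb MB."""
        block = os.urandom(block_kb * 1024)
        n_blocks = size_mb * 1024 // block_kb

        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_kb * 1024):
                pass
        # Note: the OS page cache may inflate read throughput for small files.
        read_s = time.perf_counter() - start

        os.remove(path)
        return size_mb / write_s, size_mb / read_s  # MB/s

    w, r = write_read_benchmark("/tmp/io_bench.dat")
    print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")

Running the same benchmark on instances with equivalent specifications on both providers is one way to separate storage-layer effects from application behavior.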

 

Related Publication

  • [PDF] [DOI] H. Nawaz, G. Juve, R. Ferreira da Silva, and E. Deelman, “Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud and Amazon Web Services,” in 18th Workshop on Advances in Parallel and Distributed Computational Models, 2016, pp. 535-544.
    [Bibtex]
    @inproceedings{nawaz-apdcm-2016,
    author = {Nawaz, Hassan and Juve, Gideon and Ferreira da Silva, Rafael and Deelman, Ewa},
    title = {Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud and Amazon Web Services},
    booktitle = {18th Workshop on Advances in Parallel and Distributed Computational Models},
    series = {APDCM'16},
    year = {2016},
    doi = {10.1109/IPDPSW.2016.90},
    pages = {535--544}
    }

 
