Running Accurate, Scalable, and Reproducible Simulations of Distributed Systems with WRENCH

Scientific workflows are used routinely in numerous scientific domains, and Workflow Management Systems (WMSs) have been developed to orchestrate and optimize workflow executions on distributed platforms. WMSs are complex software systems that interact with complex software infrastructures. Most WMS research and development activities rely on empirical experiments conducted with full-fledged software stacks on actual hardware platforms. Such experiments, however, are limited to hardware and software infrastructures at hand and can be labor- and/or time-intensive. As a result, relying solely on real-world experiments impedes WMS research and development. An alternative is to conduct experiments in simulation.

In this work, we present WRENCH, a WMS simulation framework, whose objectives are (i) accurate and scalable simulations; and (ii) easy simulation software development. WRENCH achieves its first objective by building on the SimGrid framework. While SimGrid is recognized for the accuracy and scalability of its simulation models, it only provides low-level simulation abstractions, so implementing simulators of complex systems requires large software development efforts. WRENCH achieves its second objective by providing high-level and directly reusable simulation abstractions on top of SimGrid. After describing and giving rationales for WRENCH’s software architecture and APIs, we present a case study in which we apply WRENCH to simulate the Pegasus production WMS. We report on ease of implementation, simulation accuracy, and simulation scalability so as to determine to what extent WRENCH achieves these two objectives. We also draw both qualitative and quantitative comparisons with a previously proposed workflow simulator.

Empirical cumulative distribution function of task submit times (left) and task completion times (right) for sample real-world (“pegasus”) and simulated (“wrench” and “workflowsim”) executions of Montage-2.0 on AWS-m5.xlarge.
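For readers unfamiliar with the plot described in the caption above, the following is a minimal sketch of how an empirical cumulative distribution function can be computed and how two such curves can be compared. The sample values and the names `ecdf`, `real`, and `simulated` are illustrative assumptions, not data or code from the paper.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return a function F where F(t) is the fraction of samples <= t."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("ecdf needs at least one sample")

    def F(t):
        # bisect_right counts how many sorted samples are <= t
        return bisect_right(ordered, t) / n

    return F


# Hypothetical task completion times in seconds (NOT data from the paper).
real = [12.0, 15.5, 15.5, 20.0, 31.0]
simulated = [11.0, 16.0, 17.0, 21.0, 30.0]

F_real, F_sim = ecdf(real), ecdf(simulated)

# Largest vertical gap between the two step curves. Both curves only
# change at sample points, so checking those points is sufficient.
gap = max(abs(F_real(t) - F_sim(t)) for t in real + simulated)
print(f"largest gap between the two ECDFs: {gap:.2f}")
```

A small gap means the simulated distribution of times tracks the real one closely; whether the paper uses this particular summary is not stated here.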

 

Reference to the paper:

  • [PDF] [DOI] H. Casanova, S. Pandey, J. Oeth, R. Tanaka, F. Suter, and R. Ferreira da Silva, “WRENCH: A Framework for Simulating Workflow Management Systems,” in 13th Workshop on Workflows in Support of Large-Scale Science (WORKS’18), 2018, p. 74–85.
    [Bibtex]
    @inproceedings{casanova-works-2018,
    title = {WRENCH: A Framework for Simulating Workflow Management Systems},
    author = {Casanova, Henri and Pandey, Suraj and Oeth, James and Tanaka, Ryan and Suter, Frederic and Ferreira da Silva, Rafael},
    booktitle = {13th Workshop on Workflows in Support of Large-Scale Science (WORKS'18)},
    year = {2018},
    pages = {74--85},
    doi = {10.1109/WORKS.2018.00013}
    }

 


 


WRENCH: Workflow Management System Simulation Workbench


Abstract – WRENCH enables novel avenues for scientific workflow use, research, development, and education. WRENCH capitalizes on recent and critical advances in the state of the art of distributed platform/application simulation. WRENCH builds on top of the open-source SimGrid simulation framework. SimGrid enables the simulation of large-scale distributed applications in a way that is accurate (via validated simulation models), scalable (low ratio of simulation time to simulated time, ability to run large simulations on a single computer with low compute, memory, and energy footprints), and expressive (ability to simulate arbitrary platform, application, and execution scenarios). WRENCH provides directly usable high-level simulation abstractions using SimGrid as a foundation. More information is available at https://wrench-project.org.

In a nutshell, WRENCH makes it possible to:

  • Prototype implementations of Workflow Management System (WMS) components and underlying algorithms;
  • Quickly, scalably, and accurately simulate arbitrary workflow and platform scenarios for a simulated WMS implementation; and
  • Run extensive experimental campaigns to conclusively compare workflow executions, platform architectures, and WMS algorithms and designs.
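To make the idea of simulating a workflow execution concrete, here is a toy sketch in plain Python. It does not use the WRENCH or SimGrid APIs and models neither networks nor storage; the function `simulate`, the task names, and the runtimes are all hypothetical. It only illustrates how a simulated clock advances from one task completion to the next while a greedy scheduler fills free cores.

```python
import heapq


def simulate(tasks, deps, num_cores):
    """Simulate greedy execution of a task graph on identical cores.

    tasks: {name: runtime in seconds}; deps: {name: [parent names]}.
    Returns {name: (start, end)} in simulated seconds.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    children = {t: [] for t in tasks}
    for child, parents in deps.items():
        for parent in parents:
            children[parent].append(child)

    ready = sorted(t for t, n in remaining.items() if n == 0)
    running = []  # min-heap of (end_time, task)
    schedule, now, free = {}, 0.0, num_cores
    while ready or running:
        # Start as many ready tasks as there are free cores.
        while ready and free > 0:
            task = ready.pop(0)
            end = now + tasks[task]
            schedule[task] = (now, end)
            heapq.heappush(running, (end, task))
            free -= 1
        # Jump the simulated clock to the next task completion.
        now, done = heapq.heappop(running)
        free += 1
        for child in children[done]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return schedule


# Hypothetical four-task workflow: c needs a and b, d needs a.
tasks = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 4.0}
deps = {"c": ["a", "b"], "d": ["a"]}
schedule = simulate(tasks, deps, num_cores=2)
makespan = max(end for _, end in schedule.values())
print(schedule, makespan)
```

Because the clock jumps between events rather than ticking in real time, the simulation cost depends on the number of events, not on the simulated duration. A framework such as WRENCH adds validated platform models and reusable service abstractions on top of this basic idea.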

 


 


The Interplay of Workflow Execution and Resource Provisioning


Presentation held at the 18th SIAM Conference on Parallel Processing for Scientific Computing, 2018
Resource Management, Scheduling, Workflows: Critical Middleware for HPC and Clouds
Tokyo, Japan

Abstract – This talk will examine issues of workflow execution, in particular using the Pegasus Workflow Management System, on distributed resources and how these resources can be provisioned ahead of the workflow execution. Pegasus was designed, implemented and supported to provide abstractions that enable scientists to focus on structuring their computations without worrying about the details of the target cyberinfrastructure. To support these workflow abstractions Pegasus provides automation capabilities that seamlessly map workflows onto target resources, sparing scientists the overhead of managing the data flow, job scheduling, fault recovery and adaptation of their applications. In some cases, it is beneficial to provision the resources ahead of the workflow execution, enabling the re-use of resources across workflow tasks. The talk will examine the benefits of resource provisioning for workflow execution.

 


On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows


Presentation held at the 12th Workshop on Workflows in Support of Large-Scale Science, 2017
Denver, CO, USA – SuperComputing’17

Abstract – Science applications frequently produce and consume large volumes of data, but delivering this data to and from compute resources can be challenging, as parallel file system performance is not keeping up with compute and memory performance. To mitigate this I/O bottleneck, some systems have deployed burst buffers, but their impact on performance for real-world workflow applications is not always clear. In this paper, we examine the impact of burst buffers through the remote-shared, allocatable burst buffers on the Cori system at NERSC. By running a subset of the SCEC CyberShake workflow, a production seismic hazard analysis workflow, we find that using burst buffers offers read and write improvements of about an order of magnitude, and these improvements lead to increased job performance, even for long-running CPU-bound jobs.

 

Related Publication

  • [PDF] [DOI] R. Ferreira da Silva, S. Callaghan, and E. Deelman, “On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows,” in 12th Workshop on Workflows in Support of Large-Scale Science (WORKS’17), 2017.
    [Bibtex]
    @inproceedings{ferreiradasilva-works-2017,
    title = {On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows},
    author = {Ferreira da Silva, Rafael and Callaghan, Scott and Deelman, Ewa},
    booktitle = {12th Workshop on Workflows in Support of Large-Scale Science (WORKS'17)},
    year = {2017},
    pages = {},
    doi = {10.1145/3150994.3151000}
    }

 


Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows


Presentation held at the 11th Workflows in Support of Large-Scale Science, 2016
Salt Lake City, UT, USA – SuperComputing’16

Abstract – Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. In spite of many success stories, a key challenge for running workflows in distributed systems is failure prediction, detection, and recovery. In this paper, we propose an approach that uses control theory, developed as part of autonomic computing, to predict failures before they happen and mitigate them when possible. The proposed approach applies the proportional-integral-derivative controller (PID controller) control loop mechanism, which is widely used in industrial control systems, to mitigate faults by adjusting the inputs of the controller. The PID controller aims at detecting the possibility of a fault far enough in advance so that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of the Big Data era: data storage overload and memory overflow. We define, implement, and evaluate simple PID controllers to autonomously manage data and memory usage of a bioinformatics workflow that consumes/produces over 4.4TB of data, and requires over 24TB of memory to run all tasks concurrently. Experimental results indicate that workflow executions may significantly benefit from PID controllers, in particular under online and unknown conditions. Simulation results show that nearly optimal executions (slowdown of 1.01) can be attained when using our proposed method, and faults are detected and mitigated far in advance of their occurrence.
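As an illustration of the control loop the abstract refers to, the following is a minimal sketch of a textbook discrete PID controller applied to a toy storage-usage model. The class `PID`, the gains, the 80 GB setpoint, and the 5 GB per step of produced data are assumptions chosen for the example; they are not the controllers, parameters, or workload evaluated in the paper.

```python
class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        """Return the control output for one new measurement."""
        error = self.setpoint - measurement
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample to difference against
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


# Toy model (hypothetical numbers): storage usage in GB starts above the
# target, tasks write 5 GB per step, and the controller output is applied
# as a change in usage (a negative value means data is cleaned up).
pid = PID(kp=0.5, ki=0.1, kd=0.0, setpoint=80.0)
usage = 100.0
for _ in range(50):
    adjustment = pid.update(usage)
    usage += 5.0 + adjustment
print(f"usage after 50 steps: {usage:.3f} GB")
```

In this toy model the integral term is what absorbs the constant 5 GB per step of new data, so usage settles at the setpoint instead of hovering above it. In a real WMS the output would have to be mapped to concrete actions such as triggering data cleanup or limiting the number of concurrent tasks.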

 

Related Publication

  • [PDF] R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I. M. Overton, and M. Atkinson, “Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows,” in 11th Workflows in Support of Large-Scale Science, 2016, p. 15–24.
    [Bibtex]
    @inproceedings{ferreiradasilva-works-2016,
    author = {Ferreira da Silva, Rafael and Filgueira, Rosa and Deelman, Ewa and Pairo-Castineira, Erola and Overton, Ian Michael and Atkinson, Malcolm},
    title = {Using Simple PID Controllers to Prevent and Mitigate Faults in Scientific Workflows},
    year = {2016},
    booktitle = {11th Workflows in Support of Large-Scale Science},
    series = {WORKS'16},
    pages = {15--24}
    }

 


Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows


Presentation held at the Workshop of Environmental Computing Applications, 2016
Baltimore, MD, USA – IEEE 12th International Conference on eScience

Abstract – In order to support the computational and data needs of today’s science, new knowledge must be gained on how to deliver the growing capabilities of the national cyberinfrastructures and more recently commercial clouds to the scientist’s desktop in an accessible, reliable, and scalable way. In over a decade of working with domain scientists, the Pegasus workflow management system has been used by researchers to model seismic wave propagation, to discover new celestial objects, to study RNA critical to human brain development, and to investigate other important research questions. Recently, the Pegasus and the dispel4py teams have collaborated to enable automated processing of real-time seismic interferometry and earthquake “repeater” analysis using data collected from the IRIS database. The proposed integrated solution empowers real-time stream-based workflows to seamlessly run on different distributed infrastructures (or in the wide area), where data is automatically managed by a task-oriented workflow system, which orchestrates the distributed execution. We have demonstrated the feasibility of this approach by using Docker containers to deploy the workflow management systems and two different computing infrastructures: an Apache Storm cluster for real-time processing, and an MPI-based cluster for shared memory computing. Stream-based execution is managed by dispel4py, while the data movement between the clusters and the workflow engine (submit host) is managed by Pegasus.

 

Related Publication

  • [PDF] [DOI] R. Ferreira da Silva, E. Deelman, R. Filgueira, K. Vahi, M. Rynge, R. Mayani, and B. Mayer, “Automating Environmental Computing Applications with Scientific Workflows,” in Environmental Computing Workshop, IEEE 12th International Conference on e-Science, 2016, p. 400–406.
    [Bibtex]
    @inproceedings{ferreiradasilva-ecw-2016,
    author = {Ferreira da Silva, Rafael and Deelman, Ewa and Filgueira, Rosa and Vahi, Karan and Rynge, Mats and Mayani, Rajiv and Mayer, Benjamin},
    title = {Automating Environmental Computing Applications with Scientific Workflows},
    year = {2016},
    booktitle = {Environmental Computing Workshop, IEEE 12th International Conference on e-Science},
    series = {ECW'16},
    doi = {10.1109/eScience.2016.7870926},
    pages = {400--406}
    }

 


Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud and Amazon Web Services


Presentation held at the 18th Workshop on Advances in Parallel and Distributed Computational Models, 2016
Chicago, IL, USA – 30th IEEE International Parallel and Distributed Processing Symposium

Abstract – Scientific workflows have become mainstream for conducting large-scale scientific research. In the meantime, cloud computing has emerged as an alternative computing paradigm. In this paper, we conduct an analysis of the performance of an I/O-intensive real scientific workflow on cloud environments using makespan (the turnaround time for a workflow to complete its execution) as the key performance metric. In particular, we assess the impact of varying the storage configurations on workflow performance when executing on Google Cloud and Amazon Web Services. We aim to understand the performance bottlenecks of the popular cloud-based execution environments. Experimental results show significant differences in application performance for different configurations. They also reveal that Amazon Web Services outperforms Google Cloud with equivalent application and system configurations. We then investigate the root cause of these results using provenance data and by benchmarking disk and network I/O on both infrastructures. Lastly, we suggest modifications to the standard cloud storage APIs that would reduce the makespan for I/O-intensive workflows.
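The makespan metric defined in the abstract can be computed directly from task start and end timestamps. The sketch below assumes the timestamps are available as `(start, end)` pairs in seconds; the function name, the two configuration labels, and all numbers are hypothetical and do not come from the paper's experiments.

```python
def makespan(task_times):
    """Turnaround time of a workflow: last task end minus first task start.

    task_times: iterable of (start, end) timestamps in seconds.
    """
    task_times = list(task_times)
    if not task_times:
        raise ValueError("makespan needs at least one task")
    first_start = min(start for start, _ in task_times)
    last_end = max(end for _, end in task_times)
    return last_end - first_start


# Hypothetical timestamps for the same workflow under two storage setups.
runs = {
    "config-A": [(0.0, 40.0), (5.0, 90.0), (90.0, 130.0)],
    "config-B": [(0.0, 35.0), (5.0, 70.0), (70.0, 100.0)],
}
results = {name: makespan(times) for name, times in runs.items()}
for name, value in results.items():
    print(f"{name}: makespan = {value:.1f} s")
```

Makespan captures idle gaps and I/O waits between tasks as well as compute time, which is why it is sensitive to the storage configuration.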

 

Related Publication

  • [PDF] [DOI] H. Nawaz, G. Juve, R. Ferreira da Silva, and E. Deelman, “Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud and Amazon Web Services,” in 18th Workshop on Advances in Parallel and Distributed Computational Models, 2016, p. 535–544.
    [Bibtex]
    @inproceedings{nawaz-apdcm-2016,
    author = {Nawaz, Hassan and Juve, Gideon and Ferreira da Silva, Rafael and Deelman, Ewa},
    title = {Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud and Amazon Web Services},
    booktitle = {18th Workshop on Advances in Parallel and Distributed Computational Models},
    series = {APDCM'16},
    year = {2016},
    doi = {10.1109/IPDPSW.2016.90},
    pages = {535--544}
    }

 


Pegasus: automate, recover, and debug scientific computations


Automate scientific computational work as portable workflows. Pegasus automatically locates the necessary input data and computational resources, and manages storage space for executing data-intensive workflows on storage-constrained resources.

Recover from failures at runtime. Tasks are automatically retried in the presence of errors. A rescue workflow containing a description of only the work that remains is provided. Provenance is also captured (data, software, parameters, etc.).

Debug failures in computations using a set of system-provided debugging tools and an online workflow monitoring dashboard.

 

Related Publications

  • [DOI] E. Deelman, K. Vahi, M. Rynge, G. Juve, R. Mayani, and R. Ferreira da Silva, “Pegasus in the Cloud: Science Automation through Workflow Technologies,” IEEE Internet Computing, vol. 20, iss. 1, p. 70–76, 2016.
    [Bibtex]
    @article{deelman-ic-2016,
    title = {Pegasus in the Cloud: Science Automation through Workflow Technologies},
    author = {Deelman, Ewa and Vahi, Karan and Rynge, Mats and Juve, Gideon and Mayani, Rajiv and Ferreira da Silva, Rafael},
    journal = {{IEEE} Internet Computing},
    volume = {20},
    number = {1},
    pages = {70--76},
    year = {2016},
    doi = {10.1109/MIC.2016.15}
    }
  • [PDF] [DOI] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, and K. Wenger, “Pegasus, a Workflow Management System for Science Automation,” Future Generation Computer Systems, vol. 46, p. 17–35, 2015.
    [Bibtex]
    @article{deelman-fgcs-2015,
    title = {Pegasus, a Workflow Management System for Science Automation},
    journal = {Future Generation Computer Systems},
    volume = {46},
    number = {0},
    pages = {17--35},
    year = {2015},
    doi = {10.1016/j.future.2014.10.008},
    author = {Deelman, Ewa and Vahi, Karan and Juve, Gideon and Rynge, Mats and Callaghan, Scott and Maechling, Phil J. and Mayani, Rajiv and Chen, Weiwei and Ferreira da Silva, Rafael and Livny, Miron and Wenger, Kent},
    }


Task Resource Consumption Prediction for Scientific Applications and Workflows


Presentation held at the Algorithms and Scheduling Techniques to Manage Resilience and Power Consumption in Distributed Systems 2015
Dagstuhl, Germany

Abstract – Estimates of task runtime, disk space usage, and memory consumption are commonly used by scheduling and resource provisioning algorithms to support efficient and reliable scientific application executions. Such algorithms often assume that accurate estimates are available, but such estimates are difficult to generate in practice. In this work, we first profile real scientific applications and workflows, collecting fine-grained information such as process I/O, runtime, memory usage, and CPU utilization. We then propose a method to automatically characterize task requirements based on these profiles. Our method estimates task runtime, disk space, and peak memory consumption. It looks for correlations between the parameters of a dataset, and if no correlation is found, the dataset is divided into smaller subsets using the statistical recursive partitioning method and conditional inference trees to identify patterns that characterize particular behaviors of the workload. We then propose an estimation process to predict task characteristics of scientific applications based on the collected data. For scientific workflows, we propose an online estimation process based on the MAPE-K loop, where task executions are monitored and estimates are updated as more information becomes available. Experimental results show that our online estimation process results in much more accurate predictions than an offline approach, where all task requirements are estimated prior to workflow execution.
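The difference between offline and online estimation can be illustrated with a deliberately simplified sketch. It replaces the correlation analysis and conditional inference trees described in the abstract with a plain running mean per task type, so it shows only the monitor-then-update pattern of an online loop, not the authors' method. The class `OnlineEstimator`, the task type name, and the runtimes are hypothetical.

```python
class OnlineEstimator:
    """Running-mean estimate per task type, updated after each execution."""

    def __init__(self):
        self._count = {}
        self._mean = {}

    def observe(self, task_type, value):
        """Monitor step: fold a newly measured value into the estimate."""
        n = self._count.get(task_type, 0) + 1
        mean = self._mean.get(task_type, 0.0)
        # Incremental mean update, no need to store past observations.
        self._mean[task_type] = mean + (value - mean) / n
        self._count[task_type] = n

    def estimate(self, task_type, default):
        """Plan step: current estimate, or a prior guess if nothing seen."""
        return self._mean.get(task_type, default)


# Hypothetical runtimes (seconds) of successive "align" tasks.
observed = [10.0, 20.0, 30.0]
offline_guess = 60.0  # estimate fixed before the workflow starts

est = OnlineEstimator()
predictions = []
for runtime in observed:
    predictions.append(est.estimate("align", default=offline_guess))
    est.observe("align", runtime)
print(predictions, est.estimate("align", default=offline_guess))
```

The first prediction falls back to the offline guess, and each later one reflects the executions observed so far, which is the behavior the abstract contrasts with estimating all requirements before the workflow starts.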

 

Related Publications

  • [PDF] [DOI] R. Ferreira da Silva, G. Juve, M. Rynge, E. Deelman, and M. Livny, “Online Task Resource Consumption Prediction for Scientific Workflows,” Parallel Processing Letters, vol. 25, iss. 3, 2015.
    [Bibtex]
    @article{ferreiradasilva-ppl-2015,
    title = {Online Task Resource Consumption Prediction for Scientific Workflows},
    author = {Ferreira da Silva, Rafael and Juve, Gideon and Rynge, Mats and Deelman, Ewa and Livny, Miron},
    journal = {Parallel Processing Letters},
    volume = {25},
    number = {3},
    pages = {},
    year = {2015},
    doi = {10.1142/S0129626415410030}
    }
  • [PDF] [DOI] R. Ferreira da Silva, M. Rynge, G. Juve, I. Sfiligoi, E. Deelman, J. Letts, F. Würthwein, and M. Livny, “Characterizing a High Throughput Computing Workload: The Compact Muon Solenoid (CMS) Experiment at LHC,” Procedia Computer Science, vol. 51, p. 39–48, 2015.
    [Bibtex]
    @article{ferreiradasilva-iccs-2015,
    title = {Characterizing a High Throughput Computing Workload: The Compact Muon Solenoid ({CMS}) Experiment at {LHC}},
    author = {Ferreira da Silva, Rafael and Rynge, Mats and Juve, Gideon and Sfiligoi, Igor and Deelman, Ewa and Letts, James and W\"urthwein, Frank and Livny, Miron},
    journal = {Procedia Computer Science},
    year = {2015},
    volume = {51},
    pages = {39--48},
    note = {International Conference On Computational Science, \{ICCS\} 2015 Computational Science at the Gates of Nature},
    doi = {10.1016/j.procs.2015.05.190}
    }
  • [PDF] [DOI] R. Ferreira da Silva, G. Juve, E. Deelman, T. Glatard, F. Desprez, D. Thain, B. Tovar, and M. Livny, “Toward fine-grained online task characteristics estimation in scientific workflows,” in 8th Workshop on Workflows in Support of Large-Scale Science, 2013, p. 58–67.
    [Bibtex]
    @inproceedings{ferreiradasilva-works-2013,
    author = {Ferreira da Silva, Rafael and Juve, Gideon and Deelman, Ewa and Glatard, Tristan and Desprez, Fr{\'e}d{\'e}ric and Thain, Douglas and Tovar, Benjamin and Livny, Miron},
    title = {Toward fine-grained online task characteristics estimation in scientific workflows},
    booktitle = {8th Workshop on Workflows in Support of Large-Scale Science},
    series = {WORKS '13},
    year = {2013},
    pages = {58--67},
    doi = {10.1145/2534248.2534254},
    }

 


Characterizing a High Throughput Computing Workload: The Compact Muon Solenoid (CMS) Experiment at LHC


Presentation held at ICCS 2015 Conference, 2015
Reykjavik, Iceland

High throughput computing (HTC) has aided the scientific community in the analysis of vast amounts of data and computational jobs in distributed environments. To manage these large workloads, several systems have been developed to efficiently allocate and provide access to distributed resources. Many of these systems rely on estimates of job characteristics (e.g., job runtime) to characterize the workload behavior, but such estimates are hard to obtain in practice. In this work, we perform an exploratory analysis of the CMS experiment workload using the statistical recursive partitioning method and conditional inference trees to identify patterns that characterize particular behaviors of the workload. We then propose an estimation process to predict job characteristics based on the collected data. Experimental results show that our process estimates job runtime with 75% accuracy on average, and produces nearly optimal predictions for disk and memory consumption.

 

Related Publication

  • [PDF] [DOI] R. Ferreira da Silva, M. Rynge, G. Juve, I. Sfiligoi, E. Deelman, J. Letts, F. Würthwein, and M. Livny, “Characterizing a High Throughput Computing Workload: The Compact Muon Solenoid (CMS) Experiment at LHC,” Procedia Computer Science, vol. 51, p. 39–48, 2015.
    [Bibtex]
    @article{ferreiradasilva-iccs-2015,
    title = {Characterizing a High Throughput Computing Workload: The Compact Muon Solenoid ({CMS}) Experiment at {LHC}},
    author = {Ferreira da Silva, Rafael and Rynge, Mats and Juve, Gideon and Sfiligoi, Igor and Deelman, Ewa and Letts, James and W\"urthwein, Frank and Livny, Miron},
    journal = {Procedia Computer Science},
    year = {2015},
    volume = {51},
    pages = {39--48},
    note = {International Conference On Computational Science, \{ICCS\} 2015 Computational Science at the Gates of Nature},
    doi = {10.1016/j.procs.2015.05.190}
    }

 
