A major challenge for the European electronic components and systems (ECS) industry is to increase productivity and reduce costs while ensuring safety and quality. Model-Driven Engineering (MDE) principles have already shown valuable capabilities for the development of ECSs but still need to scale to support real-world scenarios implied by the full deployment and use of complex electronic systems, such as Cyber-Physical Systems, and real-time systems. Moreover, maintaining efficient traceability, integration and communication between fundamental stages of the development lifecycle (i.e., design time and runtime) is another challenge to the scalability of MDE tools and techniques. This paper presents "MegaModelling at runtime – Scalable model-based framework for continuous development and runtime validation of complex systems" (MegaM@Rt2), an ECSEL–JU project whose main goal is to address the above mentioned challenges. Driven by both large and small industrial enterprises, with the support of research partners and technology providers, MegaM@Rt2aims to deliver a framework of tools and methods for: (i) system engineering/design and continuous development, (ii) related runtime analysis, and (iii) global model and traceability management. ; This project has received funding from the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No. 737494. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation program and from Sweden, France, Spain, Italy, Finland & Czech Republic.
OmpSs is a programming model that provides a simple and powerful way of annotating sequential programs to exploit heterogeneity and task parallelism based on runtime data dependency analysis, dataflow scheduling and out-of-order task execution; it has greatly influenced Version 4.0 of the OpenMP standard. The current implementation of OmpSs achieves those capabilities with a pure-software runtime library: Nanos++. Therefore, although powerful and easy to use, the performance benefits of exploiting fine-grained (pico) task parallelism are limited by the software runtime overheads. To overcome this handicap we propose Picos, an implementation of the Task Superscalar (TSS) architecture that provides hardware support to the OmpSs programming model. Picos is a novel hardware dataflow-based task scheduler that dynamically analyzes inter-task dependencies and identifies task-level parallelism at run-time. In this paper, we describe the Picos Hardware Design and the latencies of the main functionality of its components, based on the synthesis of their VHDL design. We have implemented a full cycle-accurate simulator based on those latencies to perform a design exploration of the characteristics and number of its components in a reasonable amount of time. Finally, we present a comparison of the Picos and Nanos++ runtime performance scalability with a set of real benchmarks. With Picos, a programmer can achieve ideal scalability using aggressive parallel strategies with a large number of fine granularity tasks. ; This work is supported by the Spanish Government through Programa Severo Ochoa (SEV-2011-0067), by the Spanish Ministry of Science and Technology through TIN2012-34557 project, by the Generalitat de Catalunya (contract 2009-SGR-980), by the European FP7 project TERAFLUX id. 249013 and by the European Research Council under the European Union's 7th FP, ERC Grant Agreement number 321253. We also thank the Xilinx University Program for its hardware and software donations. ; Peer Reviewed ; Postprint (author's final draft)
OmpSs is a programming model that provides a simple and powerful way of annotating sequential programs to exploit heterogeneity and task parallelism based on runtime data dependency analysis, dataflow scheduling and out-of-order task execution; it has greatly influenced Version 4.0 of the OpenMP standard. The current implementation of OmpSs achieves those capabilities with a pure-software runtime library: Nanos++. Therefore, although powerful and easy to use, the performance benefits of exploiting fine-grained (pico) task parallelism are limited by the software runtime overheads. To overcome this handicap we propose Picos, an implementation of the Task Superscalar (TSS) architecture that provides hardware support to the OmpSs programming model. Picos is a novel hardware dataflow-based task scheduler that dynamically analyzes inter-task dependencies and identifies task-level parallelism at run-time. In this paper, we describe the Picos Hardware Design and the latencies of the main functionality of its components, based on the synthesis of their VHDL design. We have implemented a full cycle-accurate simulator based on those latencies to perform a design exploration of the characteristics and number of its components in a reasonable amount of time. Finally, we present a comparison of the Picos and Nanos++ runtime performance scalability with a set of real benchmarks. With Picos, a programmer can achieve ideal scalability using aggressive parallel strategies with a large number of fine granularity tasks. ; This work is supported by the Spanish Government through Programa Severo Ochoa (SEV-2011-0067), by the Spanish Ministry of Science and Technology through TIN2012-34557 project, by the Generalitat de Catalunya (contract 2009-SGR-980), by the European FP7 project TERAFLUX id. 249013 and by the European Research Council under the European Union's 7th FP, ERC Grant Agreement number 321253. We also thank the Xilinx University Program for its hardware and software donations. ; Peer Reviewed ; Postprint (author's final draft)
Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The standard OpenMP 4.0 introduced the possibility of describing a set of data dependences per task that the runtime uses to order the tasks execution. This order is calculated using shared graphs, which are updated by all threads in exclusive access using synchronization mechanisms (locks) to ensure the dependence management correctness. The contention in the access to these structures becomes critical in many-core systems because several threads may be wasting computation resources waiting their turn. This paper proposes an asynchronous management of the runtime structures, like task dependence graphs, suitable for task-based programming model runtimes. In such organization, the threads request actions to the runtime instead of doing them directly. The requests are then handled by a distributed runtime manager (DDAST) which does not require dedicated resources. Instead, the manager uses the idle threads to modify the runtime structures. The paper also presents an implementation, analysis and performance evaluation of such runtime organization. The performance results show that the proposed asynchronous organization outperforms the speedup obtained by the original runtime for different benchmarks and different many-core architectures. ; This work is partially supported by the European Union H2020 Research and Innovation Action (projects 801051, 754337 and 780681), by the Spanish Government (projects SEV-2015-0493 and TIN2015-65316-P, grant BES-2016-078046), and by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328). ; Peer Reviewed ; Postprint (author's final draft)
The growing trend to support parallel computation to enable the performance gains of the recent hardware architectures is increasingly present in more conservative domains, such as safety-critical systems. Applications such as autonomous driving require levels of performance only achievable by fully leveraging the potential parallelism in these architectures. To address this requirement, the Ada language, designed for safety and robustness, is considering to support parallel features in the next revision of the standard (Ada 202X). Recent works have motivated the use of OpenMP, a de facto standard in high-performance computing, to enable parallelism in Ada, showing the compatibility of the two models, and proposing static analysis to enhance reliability. This paper summarizes these previous efforts towards the integration of OpenMP into Ada to exploit its benefits in terms of portability, programmability and performance, while providing the safety benefits of Ada in terms of correctness. The paper extends those works proposing and evaluating an application transformation that enables the OpenMP and the Ada runtimes to operate (under certain restrictions) as they were integrated. The objective is to allow Ada programmers to (naturally) experiment and evaluate the benefits of parallelizing concurrent Ada tasks with OpenMP while ensuring the compliance with both specifications. ; This work was supported by the Spanish Ministry of Science and Innovation under contract TIN2015-65316-P, by the European Union's Horizon 2020 Research and Innovation Programme under grant agreements no. 611016 and No 780622, and by the FCT (Portuguese Foundation for Science and Technology) within the CISTER Research Unit (CEC/04234). ; Peer Reviewed ; Postprint (published version)
Producción Científica ; Many real-world applications feature data accesses on periodic domains. Manually implementing the synchronizations and communications associated to the data dependences on each case is cumbersome and error-prone. It is increasingly interesting to support these applications in high-level parallel programming languages or parallelizing compilers. In this paper, we present a technique that, for distributed-memory systems, calculates the specific communications derived from data-parallel codes with or without periodic boundary conditions on affine access expressions. It makes transparent to the programmer the management of aggregated communications for the chosen data partition. Our technique moves to runtime part of the compile-time analysis typically used to generate the communication code for affine expressions, introducing a complete new technique that also supports the periodic boundary conditions. We present an experimental study to evaluate our proposal using several study cases. Our experimental results show that our approach can automatically obtain communication codes as efficient as those found in MPI reference codes, reducing the development effort. ; 2019-01-01 ; MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 Network (TIN2016-81840-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS). By the computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
The invasion of multi-core and multi-processor platforms on all aspects of computing makes shared memory parallel programming mainstream. Yet, the fundamental problems of exploiting parallelism efficiently and correctly have not been fully addressed. Moreover, the execution model of these platforms (notably the relaxed memory models they implement) introduces new challenges to static and dynamic program analysis. In this work we address 1) the optimization of pessimistic implementations of critical sections and 2) the dynamic information flow analysis for parallel executions of multi-threaded programs. Critical sections are excerpts of code that must appear as executed atomically. Their pessimistic implementation reposes on synchronization mechanisms, such as mutexes, and consists into obtaining and releasing them at the beginning and end of the critical section respectively. We present a general algorithm for the acquisition/release of synchronization mechanisms and define on top of it several policies aiming to reduce contention by minimizing the possession time of synchronization mechanisms. We demonstrate the correctness of these policies (i.e. they preserve atomicity and guarantee deadlock freedom) and evaluate them experimentally. The second issue tackled is dynamic information flow analysis of parallel executions. Precisely tracking information flow of a parallel execution is infeasible due to non-deterministic accesses to shared memory. Most existing solutions that address this problem enforce a serial execution of the target application. This allows to obtain an explicit serialization of memory accesses but incurs both an execution-time overhead and eliminates the effects of relaxed memory models. In contrast, the technique we propose allows to predict the plausible serializations of a parallel execution with respect to the memory model. We applied this approach in the context of taint analysis , a dynamic information flow analysis widely used in vulnerability detection. To improve precision of taint analysis we further take into account the semantics of synchronization mechanisms such as mutexes, which restricts the predicted serializations accordingly. The solutions proposed have been implemented in proof of concept tools which allowed their evaluation on some hand-crafted examples. ; L'utilisation massive des plateformes multi-cœurs et multi-processeurs a pour effet de favoriser la programmation parallèle à mémoire partagée. Néanmoins, exploiter efficacement et de manière correcte le parallélisme sur ces plateformes reste un problème de recherche ouvert. De plus, leur modèle d'exécution sous-jacent, et notamment les modèles de mémoire "relâchés", posent de nouveaux défis pour les outils d'analyse statiques et dynamiques. Dans cette thèse nous abordons deux aspects importants dans le cadre de la programmation sur plateformes multi-cœurs et multi-processeurs: l'optimisation de sections critiques implémentées selon l'approche pessimiste, et l'analyse dynamique de flots d'informations. Les sections critiques définissent un ensemble d'accès mémoire qui doivent être exécutées de façon atomique. Leur implémentation pessimiste repose sur l'acquisition et le relâchement de mécanismes de synchronisation, tels que les verrous, en début et en fin de sections critiques. Nous présentons un algorithme générique pour l'acquisition/relâchement des mécanismes de synchronisation, et nous définissons sur cet algorithme un ensemble de politiques particulier ayant pour objectif d'augmenter le parallélisme en réduisant le temps de possession des verrous par les différentes threads. Nous montrons alors la correction de ces politiques (respect de l'atomicité et absence de blocages), et nous validons expérimentalement leur intérêt. Le deuxième point abordé est l'analyse dynamique de flot d'information pour des exécutions parallèles. Dans ce type d'analyse, l'enjeu est de définir précisément l'ordre dans lequel les accès à des mémoires partagées peuvent avoir lieu à l'exécution. La plupart des travaux existant sur ce thème se basent sur une exécution sérialisée du programme cible. Ceci permet d'obtenir une sérialisation explicite des accès mémoire mais entraîne un surcoût en temps d'exécution et ignore l'effet des modèles mémoire relâchées. A contrario, la technique que nous proposons permet de prédire l'ensemble des sérialisations possibles vis-a-vis de ce modèle mémoire à partir d'une seule exécution parallèle ("runtime prediction"). Nous avons développé cette approche dans le cadre de l'analyse de teinte, qui est largement utilisée en détection de vulnérabilités. Pour améliorer la précision de cette analyse nous prenons également en compte la sémantique des primitives de synchronisation qui réduisent le nombre de sérialisations valides. Les travaux proposé ont été implémentés dans des outils prototype qui ont permit leur évaluation sur des exemples représentatifs.
Energy efficiency and consumption are now the most important and challenging issues in current Petascale and in designing future Exascale computing systems. The European Union Horizon 2020 READEX project uses an online approach to exploit application dynamism and tune large-scale HPC applications to improve energy efficiency and performance. The paper presents the READEX methodology, consisting of the Design-Time Analysis and Runtime Application Tuning, and describes the pre-analysis steps involving application dynamism and significant region detection. During design-time, the READEX tuning plugin evaluates configurations of hardware and software tuning parameters to determine the best settings for instances of application regions. The runtime tuning dynamically switches to the best configuration for an application region during production runs. Finally, the energy savings obtained for LULESH on the Taurus supercomputer highlight the effectiveness of this methodology.
This research has received funding from the European Union's Horizon 2020 research and innovation programme under grant number 666363. ; Where full static analysis of systems fails to scale up due to system size, dynamic monitoring has been increasingly used to ensure system correctness. The downside is, however, runtime overheads which are induced by the additional monitoring code instrumented. To address this issue, various approaches have been proposed in the literature to use static analysis in order to reduce monitoring overhead. In this paper we generalise existing work which uses control-flow static analysis to optimise properties specified as automata, and prove how similar analysis can be applied to more expressive symbolic automata - enabling reduction of monitoring instrumentation in the system, and also monitoring logic. We also present empirical evidence of the effectiveness of this approach through an analysis of the effect of monitoring overheads in a financial transaction system. ; peer-reviewed
Information systems are widespread and used by anyone with computing devices as well as corporations and governments. It is often the case that security leaks are introduced during the development of an application. Reasons for these security bugs are multiple but among them one can easily identify that it is very hard to define and enforce relevant security policies in modern software. This is because modern applications often rely on container sharing and multi-tenancy where, for instance, data can be stored in the same physical space but is logically mapped into different security compartments or data structures. In turn, these security compartments, to which data is classified into in security policies, can also be dynamic and depend on runtime data. In this thesis we introduce and develop the novel notion of dependent information flow types, and focus on the problem of ensuring data confidentiality in data-centric software. Dependent information flow types fit within the standard framework of dependent type theory, but, unlike usual dependent types, crucially allow the security level of a type, rather than just the structural data type itself, to depend on runtime values. Our dependent function and dependent sum information flow types provide a direct, natural and elegant way to express and enforce fine grained security policies on programs. Namely programs that manipulate structured data types in which the security level of a structure field may depend on values dynamically stored in other fields The main contribution of this work is an efficient analysis that allows programmers to verify, during the development phase, whether programs have information leaks, that is, it verifies whether programs protect the confidentiality of the information they manipulate. As such, we also implemented a prototype typechecker that can be found at http://ctp.di.fct.unl.pt/DIFTprototype/.
Part 1: Distributed Protocols ; International audience ; The shift to cloud technologies is a paradigm change that offers considerable financial and administrative gains. However governmental and business institutions wanting to tap into these gains are concerned with security issues. The cloud presents new vulnerabilities and is dominated by new kinds of applications, which calls for new security solutions.Intuitively, Byzantine fault tolerant (BFT) replication has many benefits to enforce integrity and availability in clouds. Existing BFT systems, however, are not suited for typical "data-flow processing" cloud applications which analyze large amounts of data in a parallelizable manner: indeed, existing BFT solutions focus on replicating single monolithic servers, whilst data-flow applications consist in several different stages, each of which may give rise to multiple components at runtime to exploit cheap hardware parallelism; similarly, BFT replication hinges on comparison of redundant outputs generated, which in the case of data-flow processing can represent huge amounts of data. In fact, current limits of data processing directly depend on the amount of data that can be processed per time unit.In this paper we present ClusterBFT, a system that secures computations being run in the cloud by leveraging BFT replication coupled with fault isolation. In short, ClusterBFT leverages a combination of variable-degree clustering, approximated and offline output comparison, smart deployment, and separation of duty, to achieve a parameterized tradeoff between fault tolerance and overhead in practice. We demonstrate the low overhead achieved with ClusterBFT when securing data-flow computations expressed in Apache Pig, and Hadoop. Our solution allows assured computation with less than 10 percent latency overhead as shown by our evaluation.
Quantitative analysis of digital topographic data is an increasingly important part of many studies in the geosciences. Initially, performing these analyses was a niche endeavor, requiring detailed domain knowledge and programming skills, but increasingly broad, flexible, open-source code bases have been developed to increasingly democratize topographic analysis. However, many of these analyses still require specific computing environments and/or moderate levels of knowledge of both the relevant programming language and the correct way to take these fundamental building blocks and conduct an efficient and effective topographic analysis. To partially address this, we have written the Topographic Analysis Kit (TAK), which leverages the power of one of these open code bases, TopoToolbox, to build a series of high-level topographic analysis tools to perform a variety of common topographic analyses. These analyses include the generation of maps of normalized channel steepness, or χ , and selection and statistical analysis of populations of watersheds. No programming skills or advanced mastery of MATLAB is required for effective use of TAK. In addition – to expand the utility of TAK along with the primary functions, which like the underlying TopoToolbox functions require MATLAB and several proprietary toolboxes to run – we provide compiled versions of these functions that use the free MATLAB Runtime Environment for users who do not have institutional access to MATLAB or all of the required toolboxes.
Quantitative analysis of digital topographic data is an increasingly important part of many studies in the geosciences. Initially, performing these analyses was a niche endeavor, requiring detailed domain knowledge and programming skills, but increasingly broad, flexible, open-source code bases have been developed to increasingly democratize topographic analysis. However, many of these analyses still require specific computing environments and/or moderate levels of knowledge of both the relevant programming language and the correct way to take these fundamental building blocks and conduct an efficient and effective topographic analysis. To partially address this, we have written the Topographic Analysis Kit (TAK), which leverages the power of one of these open code bases, TopoToolbox, to build a series of high-level topographic analysis tools to perform a variety of common topographic analyses. These analyses include the generation of maps of normalized channel steepness, or χ, and selection and statistical analysis of populations of watersheds. No programming skills or advanced mastery of MATLAB is required for effective use of TAK. In addition – to expand the utility of TAK along with the primary functions, which like the underlying TopoToolbox functions require MATLAB and several proprietary toolboxes to run – we provide compiled versions of these functions that use the free MATLAB Runtime Environment for users who do not have institutional access to MATLAB or all of the required toolboxes.
The article of record as published may be found at http://dx.doi.org/10.1109/TSE/2014.2302433 ; This paper presents a novel application of computer-assisted formal methods for systematically specifying, documenting, statically and dynamically checking, and maintaining human-centered workflow processes. This approach provides for end-to-end verification and validation of process workflows, which is needed for process workflows that are intended for use in developing and maintaining high-integrity systems. We demonstrate the technical feasibility of our approach by applying it on the development of the US government's process workflow for implementing, certifying, and accrediting cross-domain computer security solutions. Our approach involves identifying human-in-the-loop decision points in the process activities and then modeling these via statechart assertions. We developed techniques to specify and enforce workflow hierarchies, which was a challenge due to the existence of concurrent activities within complex workflow processes. Some of the key advantages of our approach are: it results in development of a model that is executable, supporting both upfront and runtime checking of process-workflow requirements; aids comprehension and communication among stakeholders and process engineers; and provides for incorporating accountability and risk management into the engineering of process workflows.
With the appearance of multi-many core machines, applications and runtime systems evolved in order to exploit the new on-node concurrency that brought new software paradigms. POSIX threads (Pthreads) was widely-adopted for that purpose and it remains as the most used threading solution in current hardware. Lightweight thread (LWT) libraries emerged offering lighter mechanisms to tackle the massive concurrency that current hardware is offering. In this paper, we analyze in detail the most representative threading libraries including Pthread- and LWT-based solutions. In addition, to examine the suitability of LWTs for different use cases, we develop a set of microbenchmarks consisting of commonly found OpenMP patterns in current parallel codes, and we compare the results using threading libraries and OpenMP implementations. Moreover, we study the semantics offered by threading libraries in order to expose the similarities among different LWT application programming interfaces and their advantages over Pthreads. This study reveals that LWT libraries outperform solutions based on operating system threads in cases where tasks and nested parallelism are required. ; The researchers from the Universitat Jaume I and UniversitatPolitècnica de València were supported by project TIN2014-53495-R of the MINECO and FEDER, and the Generalitat Valenciana fellowship programme Vali+d 2015. Antonio J. Peña is financed by the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant No. 749516. This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DE-AC02-06CH11357. ; Peer Reviewed ; Postprint (author's final draft)