A performance prediction tool for irregular parallel programs
Advisor: Prof. J.-D. Nicoud, LAMI (LAboratoire de MicroInformatique)
Members of the jury (industry):
Dr. P. Ienne, Siemens
Dr. T. Cornu, Simulog
Members of the jury (EPFL):
Prof. G. Coray, LITH (LAboratoire d'Informatique THéorique)
Dr. R. Gruber, CAPA EPFL
Dr. P. Kuonen, LITH (LAboratoire d'Informatique THéorique)
1.Introduction
During the last decade, performance prediction has
been repeatedly quoted as a key factor to developing
parallel systems [1]. Predicting the behaviour of
a program performance as a function of the number of
processors and of the problem size is essential to users:
in order to choose the right implementation method.
to manage execution of
processes in supercomputer systems or in nondedicated networks of workstations/PC.
for optimizations in parallelizing compilers.
Numerous prediction tools have been recently proposed in the literature[2,3].
Most of these works focus mainly on applications with regular and
static data structures. In contrast, the present work focusses on irregular
applications [4] and especially those exhibiting dynamic data structures
and a data-dependent execution scheme. For this type of application,
performances does not depend only on the number of processor and on the
problem size. Other parameters have been taken into account, the value of
which are data-dependent, and therefore unpredictable.
In this doctoral thesis, we present a performance prediction model for irregular (or regular)
parallel applications on parallel machines and on nondedicated networks of workstations/PC.
We are developing a performance prediction tool to show that our model is
intended for automatic use.
2. Modeling irregular applications
In our approach, we began by assuming that the total execution time of
an application is the sum of different times spent in computation,
communication, synchronization, etc.
In order to predict the behaviour of irregularly (or regularly)
structured algorithms, it
is necessary to find well defined templates or frameworks that can capture
their behaviour. We have assumed that the algorithms to be modeled
can be decomposed into a sequence of blocks, where each blocks consists
of either local computation (performed in parallel by the processors),
or of a global communication operation, or a synchronization barrier.
There are two main categories of blocks: computation blocks and
communication blocks (Table 1).
Computation blocks:
Type of block
Symbol
Sequential block
Parallel block
Iterative block
Communication blocks:
Type of block
Symbol
Point-to-point
Broadcast
Scatter
Gather
Barrier
Table 1: Model of blocks
For each of the proposed block types, a simple performance model is established as
a function of basic parameters.
Sequential block: A simple way of estimating single-processor execution time is
to measure it on one of the processors of the target parallel machine.
Parallel block: A global synchronization is supposed to occur at the end of the block. The time of
the block is the time spent in the slower concurrent process of the parallel block.
Iteration block: For N succesive execution of code, time for loop is modeled as
N times the duration of the loop body.
Communication block: A set of benchmarks is established, to time
these communication operations beforehand on the target multi-processor architectures or
on networks of workstations/PC, for different numbers of processors
and different message sizes. These measurements have to be done once for all
for a given architecture and stored into a database.
Execution time of the whole program can then be
modeled by the sum of the execution times for the consecutive blocks.
In order to predict performance of irregular applications we add the
possibility to model blocks where the length varies during the execution[4].
3. Results
In order to verify the utilisation of the model in practice we are carrying
out prediction experiments on custom-made, irregular, parallel programs,
then on real programs. Included among the programs used for these performance
validation experiments are:
application simulation of wave diffusion in urban areas (parallel C++ application used for a European project: EPFL, University of Geneva, Swisscom, ...) [5].
optimal networking decomposition algorithm in domains (parallel C application presented in NASA National Symposium) [6].
a genetic algorithm based on the concept of individual islands (parallel C++ application used for a European project: TDF, EPFL, ...).
The method developed in this thesis has allowed to model real irregular programs.
The extremely satisfying results of the validation experiments confirm the
validity of the method [6,7].
For example, with the application of wave diffusion in urban areas,
we can automatically predict with very good precision how efficient it is
to decompose the application in parallel (Figure 1).
Figure 1: Result of prediction and measurement with an application of wave diffusion in urban areas (Cray T3D).
We have invented the tool, planned and directed the efforts of 7 senior students to implement a prediction performance tool
named IP3T (Irregular Parallel Performance Prediction Tool) useful on networks of nondedicated workstations/PC
and on parallel machines [7].
IP3T can be also used for sequential machines like single processor PCs.
The analysed application can be programmed in many languages: C, C++, Fortran and so on.
The communication libraries can be: PVM, MPI and so on.
IP3T permits to demonstrate that my model is intended for automatic use.
It work in 3 phases (Figure 2):
Analysis phase: in which the structure of the applications is recognized.
Measurement phase: in which the performances of sequential blocks are measured
on one of the processors of the target parallel machine.
Prediction phase: in which the tool can compute the prediction
with the model using
the performance measurements of the sequential blocks and the database of the
performances of communication blocks.
Figure 2: Phases of IP3T.
In the first phase, IP3T can automatically recognize the block structure of applications
and show it in a window (figure 3). This window permits also directly
to add or delete new blocks and/or change some parameters of blocks.
After changing you can regenerate a new source code.
Morever, in this window, the user can select a subpart of the program to be analyzed.
Figure 3: Window with block decomposition of the analysed application.
In order to predict performance of applications with many irregularities,
the user of the tool can define some personal global block parameters and
can interact on them for performance prediction through a dialog box generated dynamically.
For example, we obtain the dialog box of the figure 4 with our defined
parameters of the wave diffusion application.
Figure 4: Dialog box with user defined parameters with the wave diffusion application.
IP3T has been successfully tested on the real applications already mentioned in the results section.
3. Conclusions
In this Ph.D. project, we have done successfuly:
A performance prediction model for irregular (or regular) real parallel (or sequential) applications.
A validation through experiments with industry-oriented applications in order
to confirm the validity of the method.
A complete prediction tool to show that our model is intended for automatic use
on parallel machines or on nondedicated networks of workstations/PC.
Now, the the prediction tool can be integrated in a development environment or in a parallel
operating system.
2. Mustafa Uysal, Tahsin M. Kurc, Alan Sussman, and Joel Saltz.
A performance prediction framework for data intensive applications on large scale
parallel machines. Technical Report CS-TR-3918, University of Maryland, College Park, July 1998.
3. Yu-Kwong Kwok, Ishfaq Ahmad, Min-You Wu, and Wei Shu. Graphical tool for
automatic parallelization and scheduling of programs on multiprocessors.
In Proceedings of Euro-Par'97, pages 294-301, August 1997.
4. M. Pahud and T. Cornu. Predicting the behaviour of irregular parallel applications using code block decomposition. In Proceedings of the
SIPAR Workshop'96 on Parallel and Distributed Systems, Geneva, Oct. 1996.
Available on line, pdf format
5. M. Pahud, F. Guidec and T. Cornu. Performance evaluation of a radio wave propagation
parallel simulator. In Proceedings of the Third International
Conference on Massively Parallel Computing System, MPCS'98, Colorado Springs,
USA, April 1998.
6. N. Bouhmala and M. Pahud. A parallel variant of simulated annealing for optimizing mesh partitions
on workstations. In 4th NASA National Symposium, Williamsburg, October 1997.
7. M. Pahud. Une méthode de prédiction de performance pour
les programmes parallèles irréguliers. PhD thesis,
École Polytechnique Fédérale de Lausanne, Switzerland, 1998.
Number 1911.
Available on line, pdf format