TriDenT CoLUMBO

Trees in HPF2

trident

My Research Team was in the University of Bordeaux I in the laboratory named LABRI (LABoratory of Research in Informatique/Computer Science). Our works began with the study in a new way to describe irregular data structures in HPF. This method is based on a treelike writing of the data structures by way of the derived types of Fortran 90/95 and the extensions approved of HPF 2.0. These notations allow to remove present necessary indirect accesses in the writing of the classic irregular data structures as the CCS (Compressed Column Storage). Indirect accesses prevent the compiler from generating parallel effective codes because of the not knowledge of the values of these indirect indices.

Following example shows a sparse matrix with accesses directed by column, its show under shape of a tree and its writing in HPF 2.0 as well as its distribution.

tree

We also suggest spreading the possibilities of distribution of trees as well as the addition of a new directive TREE to the language HPF: this directive specifies restricted usage which is made notations pointers and derived types, limitations producing effectively a treelike structure of data and in known depth during the compilation.

These limitations as well as additions for the distribution allow an effective implementation of derived types by the compiler as well as optimizations during the compilation based on regular frame analyses of dependences and of locality. These analyses are not practicable with irregular standard formats.

Inspection-Execution

In spite of this improvement of analyses in the compilation, it remains unknown information during this one. The parallel properties of these applications are not, as a consequence, exploitable during the compilation. It is necessary to use paradigm of inspection-execution.

The compiler generates a code containing two phases: first phase, inspection, scans these information only known in the execution to produce, during the second phase, the execution, the realization of calculations with plans optimized as at the level of the communications as of the distribution of calculations. We used this paradigm for various properties which were not studied in the previous works until this day. We were notably interested in the cases of loops with partial inter-iterations dependences (BDIIP (french) - LPIID).
It should be noted that our inspector - executor is very different from classic approaches in the sense where, in our study, inspection must be made profitable with only one phase of execution, contrary to the other approaches which are used in an iterative context, that is to say with several identical phases of execution in terms of access and communications.

Irregular sets of processors
Irregular indexes of loops
Communications inter-and extra - iterations (LPIID)
Scheduling of the DAG associated to a LPIID
Progressive Irregular Prefixe Operations

We present quickly each of these inspections - executions :

Nature	HPF 2.0	Example
Irregular sets of processors	Not supported	DO I= 1, N < /FONT> !HPF$ ON HOME (A(J), ANY (INDEX(I,1:NB(I)))=J .or. J=I), BEGIN !HPF$ INDEPENDENT, REDUCTION(A(I)) DO J1= 1, NB(I) < /FONT> J=INDEX(I,J1) A(I) = A(I) + B(J) < /FONT > END DO A(I)= f(A(I)) < /FONT> !HPF$ END ON END DO Extended directive ONE HOME (containing an algorithmic expression of the type FORALL) allows to declare an irregular set of processors: the reduction operation will therefore only use this subset, reducing implied synchronizations.
Irregular Loop Indexes	Not specified	DO I= 1, N !HPF$ INDEPENDENT, REDUCTION(A(I)) FORALL(J=1:I, ANY(INDEX(I,1:NB(I)) = J) A(I) = A(I) + B(J) END FORALL A(I) = f(A(I)) END DO The inspector of the indexes of loops precomputes valid indexes for every processor (here, based on the expression FORALL). It allows in certain cases to reduce algorithmic costs implied by the evaluation of guards at every iteration.
Communications inter-and extra - iterations (LPIID)	Not supported	This inspector allows to optimize the extractable communications of the global loop, or to send inter-iteration communications as soon as possible. Communications are split into two steps: emission and reception (contrary to the standard generation of code in HPF).
Scheduling of the DAG associated to a LPIID	Not supported	!HPF$ SCHEDULE (J = 1:I-1, ANY(INDEX(I,1:NB(I)) = J) DO I = 1, N ... END DO Gives: Call Schedule(task, 1, N) Subroutine task(I) !HPF$ ON HOME (A(J), ANY (INDEX(I,NB(I)))= J .or. J= I), BEGIN !HPF$ INDEPENDENT, REDUCTION(A(I)) FORALL(J = 1:I, ANY(INDEX(I,1:NB(I)) = J) A(I) = A(I) + B(J) END FORALL A(I) = f(A(I)) !HPF$ END ON END Subroutine This new directive SCHEDULE allows the consideration of LPIID and of task graph associated (every iteration is associated to a task, the associated DAG is extracted from the algorithmics expression of the directive clarifying the dependences of the iteration current I). The code of the loop is then rewritten in a sub-routine which will be executed according to the result of the scheduler.
Progressive Irregular Prefixe Operations	Not supported	!HPF$ PREFIX(A) DO I= 1, N !HPF$ ON HOME (A(J), ANY (INDEX(I,NB(I)))= J .or. J= I), BEGIN !HPF$ INDEPENDENT, PREFIX(A) FORALL(J = 1:I, ANY(INDEX(I,1:NB(I)) = J) A(I) = A(I) + A(J) END FORALL A(I)= f(A(I)) !HPF$ END ON END DO Defined directive PREFIX present data-parallel operation in the loop as being an irregular progressive prefix operation. This operation is progressive because every index is calculated from the final values of the array. It is not about a simple reducing, because the array in entry is also the array in exit. The management of this operation uses asynchronous communications.

Experimental Results

This is the results of a Cholesky factorization of sparse matrices in block-columns 1D version (distribution of blocks-columns). Every versions include more and more optimizations.

The last version corresponds to a Team's Cholesky factorization with important performances [A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization, EuroPar'99, 1999, P. Henon, P. Ramet and J. Roman].

Factorization is placed in an iterative frame with a constant sparse matrix structure (only values evolve), as in Karmarkar's algorithm which solves AX system = B and which minimizes a function objectivizes CTX.

First iteration is obtained after the static scheduling (if any) and the time enclosed the inspection ones, the second with the dynamic scheduling (the time of the next iterations in case of the PREFIX directive) without the time of the inspection.

Version	Time First iteration	Time repetition
Version 0 : TREE notation only (TriDenT)	190s	190s
Irregular sets of processors (TriDenT + CoLUMBO)	-	-
+ Loop Indexes and Communications	8.56s	6.83s
+ DAG (SCHEDULE)	6.42s	3.36s
+ DAG and thread-based execution system (SCHEDULE)	6.18s	3.31s
+ PREFIX operation	4.65s	2.04s
PasTiX : high optimized version with MPI (en C, source LABRI)	1.80s	1.80s

Articles

Propositions de Gestion de l’Irregularité avec HPF2, Frédéric Brégier, Conférence RenPar’10, Prix IEEE Jeune Chercheur (meilleure présentation en parallélisme) , Strasbourg, 9-12 Juin 1998.
Contribution to Better Handling of Irregular Problems in HPF2 , Thomas Brandes (1), Frederic Bregier (2), Marie Christine Counilh (2), Jean Roman (2) , (1) GMD/SCAI, (2) LaBRI, ENSERB, Euro-Par’98, Southampton, LNCS Vol. 1470, pp 639-649, Springer-Verlag Ed., 1-4 September 1998.
Asynchronous Progressive Irregular Prefix Operation in HPF2 , Frederic Bregier, Marie Christine Counilh, Jean Roman, PDP’2000 (IEEE), pp 275-282, 2000.
Scheduling Loops with Partial Loop-Carried Dependencies in HPF2 , Frederic Bregier, Marie Christine Counilh, Jean Roman, Parallel Computing for Irregular Applications, 2000.
Extensions d’OpenMP pour les Architectures a Hierarchies Memoire Multiple , Frederic Bregier, Barbara Chapman, A. Patil et A. Prabhakar RenPar’2001
Achieving Performance under OpenMP on ccNUMA and Software Distributed Shared Memory Systems , Frederic Bregier, Barbara Chapman, A. Patil et A. Prabhakar accepté à Concurrency: Practice and Experiment 2001

Interventions

RenPar’10, Strasbourg, 9-12 Juin 1998 (Rencontres Francophones du Parallélisme des Architectures et des Systèmes) .
HUG’98, Porto, 25-26 June 1998 (HPF User Group meeting).
Euro-Par’98, Southampton, 1-4 September 1998.
COCA’98, Baie de Somme, 15-17 Septembre 1998 (Journées sur le Calcul Out of core & Compilation Adaptative).
Séminaire du PRiSM, Versailles Saint Quentin, Novembre 1998.
Séminaire du LaMI , Evry, Juin 1999.
PDP’2000 conference, Rhodos, Greese, January 2000

Thèse