
TriDenT CoLUMBO
Trees in HPF2
My research team was at the University of Bordeaux I, in the LaBRI laboratory (Laboratoire Bordelais de Recherche en Informatique, the Bordeaux computer science research laboratory). Our work began with the study of a new way to describe irregular data structures in HPF. The method is based on a tree-like description of the data structures, using the derived types of Fortran 90/95 and the approved extensions of HPF 2.0. These notations remove the indirect accesses that are necessary when writing classic irregular data structures such as CCS (Compressed Column Storage). Indirect accesses prevent the compiler from generating efficient parallel code, because the values of the indirect indices are unknown at compile time.
The following example shows a sparse matrix with column-directed accesses, its representation as a tree, and its HPF 2.0 declaration and distribution.
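To make the problem concrete, here is a small sketch, in Python and purely illustrative (the names are not from TriDenT), of the classic CCS format and of the indirect accesses it forces:

```python
def ccs_from_dense(m):
    """Build the CCS arrays (values, row_index, col_ptr) of a dense matrix."""
    n_rows, n_cols = len(m), len(m[0])
    values, row_index, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if m[i][j] != 0:
                values.append(m[i][j])
                row_index.append(i)
        col_ptr.append(len(values))
    return values, row_index, col_ptr

def column_update(values, row_index, col_ptr, j, x, y):
    """y += A(:, j) * x(j): every write to y goes through row_index,
    an indirection whose values are unknown at compile time."""
    for k in range(col_ptr[j], col_ptr[j + 1]):
        y[row_index[k]] += values[k] * x[j]   # indirect access: y[row_index[k]]
    return y
```

The write `y[row_index[k]]` is the indirection the compiler cannot analyse statically; the tree-like HPF 2.0 notation replaces it with a structure about which the compiler can reason.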
We also propose extending the distribution possibilities to trees, and adding a new TREE directive to the HPF language: this directive restricts the use of pointer notations and derived types, so that they effectively produce a tree-like data structure whose depth is known at compile time.
These restrictions, together with the distribution extensions, allow an efficient implementation of derived types by the compiler, as well as compile-time optimizations based on regular dependence and locality analyses. Such analyses are not feasible with the standard irregular formats.
Inspection-Execution
Despite these improved compile-time analyses, some information remains unknown at compilation. The parallel properties of these applications therefore cannot be exploited at compile time, and the inspection-execution paradigm must be used.
The compiler generates code containing two phases. The first phase, inspection, scans the information that is only known at run time and produces optimized plans, both for the communications and for the distribution of computations; the second phase, execution, carries out the computations using these plans. We applied this paradigm to several properties that had not been studied in previous work. In particular, we studied loops with partial inter-iteration dependences (LPIID; BDIIP in French). Note that our inspector-executor differs markedly from classic approaches: in our study, the inspection must pay off with a single execution phase, whereas the other approaches are used in an iterative context, that is, with several execution phases that are identical in terms of accesses and communications.
Irregular sets of processors
Irregular loop indexes
Inter- and extra-iteration communications (LPIID)
Scheduling of the DAG associated with an LPIID

Progressive Irregular Prefix Operations
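The common structure behind all of these can be sketched as follows (a hypothetical Python simulation, not the generated HPF code; `owner` maps an array element to its processor and is an assumption of the sketch):

```python
def inspect(index, nb, n_procs, owner):
    """Inspection: run once at run time, scan the indirection arrays
    INDEX/NB and build a plan, here the (i, j) pairs each processor owns."""
    plan = {p: [] for p in range(n_procs)}
    for i in range(len(nb)):
        for j in index[i][:nb[i]]:
            plan[owner(i)].append((i, j))
    return plan

def execute(plan, a, b):
    """Execution: each processor replays only its own precomputed accesses,
    with no guard evaluation and no unknown indirection left."""
    for p, pairs in plan.items():
        for i, j in pairs:
            a[i] += b[j]
    return a
```

The point of the single-execution constraint mentioned above is that `inspect` must be cheap enough to be amortized by one call to `execute`, not by many.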
We briefly present each of these inspector-executors:
Nature: Irregular sets of processors
HPF 2.0: Not supported
Example:
DO I = 1, N
!HPF$ ON HOME (A(J), ANY(INDEX(I,1:NB(I)) = J) .OR. J = I), BEGIN
!HPF$ INDEPENDENT, REDUCTION(A(I))
  DO J1 = 1, NB(I)
    J = INDEX(I,J1)
    A(I) = A(I) + B(J)
  END DO
  A(I) = f(A(I))
!HPF$ END ON
END DO
The extended ON HOME directive (containing an algorithmic expression of FORALL type) allows the declaration of an irregular set of processors: the reduction operation then involves only this subset, reducing the synchronizations it implies.
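The processor subset declared by this ON HOME expression can be sketched in Python (hypothetical names; `owner` maps an array element to its processor and is an assumption of the sketch):

```python
def processor_set(i, index, nb, owner):
    """Processors involved in the reduction of iteration i: the owners of
    the A(j) with j in INDEX(i, 1:NB(i)), plus the owner of A(i) itself.
    Only these processors need to synchronize, not the whole machine."""
    procs = {owner(i)}
    for j in index[i][:nb[i]]:
        procs.add(owner(j))
    return procs
```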
Nature: Irregular loop indexes
HPF 2.0: Not specified
Example:
DO I= 1, N
!HPF$ INDEPENDENT, REDUCTION(A(I))
  FORALL (J = 1:I, ANY(INDEX(I,1:NB(I)) = J))
A(I) = A(I) + B(J)
END FORALL
A(I) = f(A(I))
END DO
The loop-index inspector precomputes the valid indexes for each processor (here, based on the FORALL expression). In some cases this reduces the algorithmic cost of evaluating the guards at every iteration.
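A minimal sketch of this inspector, as a hypothetical Python simulation of the loop above (0-based indices instead of Fortran's 1-based ones):

```python
def inspect_indexes(index, nb):
    """Inspection: instead of evaluating the guard ANY(INDEX(I,1:NB(I)) = J)
    for every J in 1..I, precompute once the list of valid J for each I."""
    return [sorted(j for j in set(index[i][:nb[i]]) if j < i)
            for i in range(len(nb))]

def execute_reduction(valid, a, b, f):
    """Execution: A(I) = f(A(I) + sum of B(J) over the precomputed valid J)."""
    for i, js in enumerate(valid):
        a[i] = f(a[i] + sum(b[j] for j in js))
    return a
```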

Nature: Inter- and extra-iteration communications (LPIID)
HPF 2.0: Not supported
This inspector optimizes the communications that can be hoisted out of the global loop, and sends inter-iteration communications as early as possible. Communications are split into two steps, emission and reception (contrary to the standard HPF code generation).
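The emission/reception split can be illustrated with a tiny, hypothetical Python sketch; a real implementation uses asynchronous message passing between processors, and a thread-safe queue stands in for the network here:

```python
import queue

# In practice there is one channel per (sender, receiver) pair;
# a single queue is enough to show the decoupling.
channel = queue.Queue()

def emit(value):
    """Emission: posted as soon as the value is produced, without waiting."""
    channel.put(value)

def receive():
    """Reception: performed only by the iteration that needs the value,
    blocking only if it has not arrived yet."""
    return channel.get()
```

Because emission does not wait for the matching reception, a value produced at iteration i travels while later iterations keep computing.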
Nature: Scheduling of the DAG associated with an LPIID
HPF 2.0: Not supported
Example:
!HPF$ SCHEDULE (J = 1:I-1, ANY(INDEX(I,1:NB(I)) = J))
DO I = 1, N
...
END DO
Gives:
Call Schedule(task, 1, N)
Subroutine task(I)
!HPF$ ON HOME (A(J), ANY(INDEX(I,1:NB(I)) = J) .OR. J = I), BEGIN
!HPF$ INDEPENDENT, REDUCTION(A(I))
  FORALL (J = 1:I, ANY(INDEX(I,1:NB(I)) = J))
A(I) = A(I) + B(J)
END FORALL
A(I) = f(A(I))
!HPF$ END ON
END Subroutine
This new SCHEDULE directive allows LPIIDs and their associated task graph to be taken into account (each iteration is associated with a task; the DAG is extracted from the algorithmic expression of the directive, which makes explicit the dependences of the current iteration I). The loop body is rewritten as a subroutine, which is executed according to the result of the scheduler.
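The DAG extraction and a simple list scheduling of it can be sketched in Python (hypothetical names; iteration i depends on every j < i appearing in INDEX(i, 1:NB(i)), 0-based here):

```python
def build_dag(index, nb):
    """deps[i] = set of earlier iterations that iteration i reads from."""
    return [set(j for j in index[i][:nb[i]] if j < i) for i in range(len(nb))]

def schedule(deps, task):
    """Run each iteration once all of its predecessors have run."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [i for i in range(len(deps))
                 if i not in done and deps[i] <= done]
        for i in ready:          # on a real machine these run in parallel
            task(i)
            done.add(i)
            order.append(i)
    return order
```

Each inner `ready` batch is a set of independent tasks: this is the parallelism the SCHEDULE directive exposes that a sequential DO loop hides.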

Nature: Progressive Irregular Prefix Operations
HPF 2.0: Not supported
Example:
!HPF$ PREFIX(A)
DO I= 1, N
!HPF$ ON HOME (A(J), ANY(INDEX(I,1:NB(I)) = J) .OR. J = I), BEGIN
!HPF$ INDEPENDENT, PREFIX(A)
  FORALL (J = 1:I, ANY(INDEX(I,1:NB(I)) = J))
A(I) = A(I) + A(J)
END FORALL
A(I)= f(A(I))
!HPF$ END ON
END DO
The proposed PREFIX directive declares the data-parallel operation in the loop to be a progressive irregular prefix operation. The operation is progressive because every element is computed from the final values of the array. It is not a simple reduction, because the input array is also the output array. This operation is managed with asynchronous communications.
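The sequential semantics of this operation can be sketched in Python (hypothetical names, 0-based indices): when iteration i runs, every a[j] with j < i already holds its final value, which is exactly what distinguishes a progressive prefix from a plain reduction.

```python
def progressive_prefix(a, index, nb, f):
    """A(I) = f(A(I) + sum of the *final* values A(J), for the valid J < I).
    The array a is both input and output: earlier results feed later ones."""
    for i in range(len(a)):
        s = sum(a[j] for j in set(index[i][:nb[i]]) if j < i)
        a[i] = f(a[i] + s)
    return a
```

The parallel version replaces this sequential sweep with asynchronous sends of each final a[i] toward the iterations that will consume it.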

Experimental Results
These are the results of a Cholesky factorization of sparse matrices, in a 1D block-column version (distribution by block-columns). Each version adds more optimizations than the previous one.
The last version corresponds to the team's high-performance Cholesky factorization [A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization, EuroPar'99, 1999, P. Henon, P. Ramet and J. Roman].
The factorization is placed in an iterative setting with a constant sparse matrix structure (only the values evolve), as in Karmarkar's algorithm, which solves the system AX = B while minimizing an objective function C^T X.
The first-iteration time is measured after the static scheduling (if any) and includes the inspection time; the repeat-iteration time uses the dynamic scheduling (and, in the case of the PREFIX directive, is the time of the subsequent iterations) and excludes the inspection time.
Version                                              First iteration   Repeat iteration
Version 0: TREE notation only (TriDenT)              190 s             190 s
Irregular sets of processors (TriDenT + CoLUMBO)     -                 -
+ Loop indexes and communications                    8.56 s            6.83 s
+ DAG (SCHEDULE)                                     6.42 s            3.36 s
+ DAG and thread-based execution system (SCHEDULE)   6.18 s            3.31 s
+ PREFIX operation                                   4.65 s            2.04 s
PasTiX: highly optimized MPI version (in C, LaBRI)   1.80 s            1.80 s
Articles
Propositions de Gestion de l'Irrégularité avec HPF2, Frédéric Brégier, RenPar'10, IEEE Young Researcher Award (best presentation in parallelism), Strasbourg, 9-12 June 1998.
Contribution to Better Handling of Irregular Problems in HPF2, Thomas Brandes (1), Frédéric Brégier (2), Marie-Christine Counilh (2), Jean Roman (2), (1) GMD/SCAI, (2) LaBRI, ENSERB, EuroPar'98, Southampton, LNCS Vol. 1470, pp 639-649, Springer-Verlag, September 1998.
Asynchronous Progressive Irregular Prefix Operation in HPF2, Frédéric Brégier, Marie-Christine Counilh, Jean Roman, PDP'2000 (IEEE), pp 275-282, 2000.
Scheduling Loops with Partial Loop-Carried Dependencies in HPF2, Frédéric Brégier, Marie-Christine Counilh, Jean Roman, Parallel Computing for Irregular Applications, 2000.
Extensions d'OpenMP pour les Architectures à Hiérarchies Mémoire Multiples, Frédéric Brégier, Barbara Chapman, A. Patil and A. Prabhakar, RenPar'2001.

Achieving Performance under OpenMP on ccNUMA and Software Distributed Shared Memory Systems, Frédéric Brégier, Barbara Chapman, A. Patil and A. Prabhakar, accepted in Concurrency: Practice and Experience, 2001.
Talks
RenPar'10, Strasbourg, 9-12 June 1998 (Rencontres Francophones du Parallélisme des Architectures et des Systèmes).
HUG'98, Porto, 25-26 June 1998 (HPF User Group meeting).
EuroPar'98, Southampton, September 1998.
COCA'98, Baie de Somme, 15-17 September 1998 (workshop on out-of-core computation and adaptive compilation).
PRiSM seminar, Versailles Saint-Quentin, November 1998.
LaMI seminar, Évry, June 1999.

PDP'2000 conference, Rhodes, Greece, January 2000.
Thesis
