Hussaini, M. The purpose here is to assess the state of the art in the areas of numerical analysis that are particularly relevant to computational fluid dynamics (CFD), to identify promising new developments in various areas of numerical analysis that will impact CFD, and to establish a long-term perspective focusing on opportunities and needs.

Adaptive mesh refinement algorithms (see Section 2) can generate load imbalance in the parallelization.

This can be mitigated by migrating elements between neighboring subdomains [15]. However, at some point, it may be more efficient to evaluate a new partition and migrate the simulation results to it. In order to minimize the cost of this migration, we aim to maximize the intersection between the old and new subdomain for each parallel process, and this can be better controlled with geometric partitioning approaches.

In the finite element method, the assembly consists of a loop over the elements of the mesh, while it consists of a loop over cells or faces in the case of the finite volume method.


We study the parallelization of this assembly process for distributed and shared memory parallelism, based on the MPI and OpenMP programming models, respectively. Then, we briefly introduce some HPC optimizations. In the following, to respect tradition, we refer to elements in the FE context and to cells in the FV context.

The choice between the two local matrix formats, partial row and full row, depends on the discretization method used and on the parallel data structures required by the algebraic solvers (Section 5).

Figure: Finite element and cell-centered finite volume matrix assembly techniques.

Partial row matrix. Local matrices can be made of partial rows (square matrices) or full rows (rectangular matrices). The first option is natural in the finite element context, where partitioning consists in dividing the mesh into disjoint element sets for each MPI process, and where only interface nodes are duplicated. In the next section, we show how one can take advantage of this format to perform the main operation of iterative solvers, namely the sparse matrix vector product (SpMV).
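As a rough illustration of the partial row format, the sketch below assembles a 1D chain of four linear elements on two simulated "processes" with disjoint element sets, so that only interface node 2 is duplicated. The mesh, the element matrix `KE`, and the helper names are assumptions made for this example. A plain sum into a shared vector stands in for the MPI exchange of interface values, which is one common way to complete an SpMV with partial rows and is not necessarily the scheme of the next section.

```python
# Element matrix of a 1D linear element of unit length (illustrative choice).
KE = [[1.0, -1.0], [-1.0, 1.0]]

def assemble(elements, nodes):
    """Assemble a dense square matrix over `nodes` from the given elements."""
    index = {n: k for k, n in enumerate(nodes)}
    A = [[0.0] * len(nodes) for _ in nodes]
    for a, b in elements:
        for i, p in enumerate((a, b)):
            for j, q in enumerate((a, b)):
                A[index[p]][index[q]] += KE[i][j]
    return A

def matvec(A, x):
    """Dense matrix vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

elements = [(0, 1), (1, 2), (2, 3), (3, 4)]
x = [1.0, 4.0, 9.0, 16.0, 25.0]

# Reference: the global matrix held by a single process.
y_ref = matvec(assemble(elements, range(5)), x)

# Two simulated processes with disjoint element sets; node 2 is duplicated.
parts = {0: (elements[:2], [0, 1, 2]), 1: (elements[2:], [2, 3, 4])}
y = [0.0] * 5
for rank, (elems, nodes) in parts.items():
    # Each process only holds partial rows for its interface nodes.
    y_loc = matvec(assemble(elems, nodes), [x[n] for n in nodes])
    # Summing into one vector stands in for the interface communication.
    for n, v in zip(nodes, y_loc):
        y[n] += v
```

Each local product is incomplete on node 2 (5.0 on the first process, -7.0 on the second); only their sum matches the global result, which is why a communication is needed after the local product.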

Full row matrix. The full row matrix consists in assigning rows exclusively to one MPI process.


## Algorithmic Trends in Computational Fluid Dynamics | M.Y. Hussaini | Springer

The first option involves additional communications, while the second option duplicates the element integration on halo elements. The relative performance of partial and full row matrices depends on the size of the halos, which involve more memory and extra computation, compared to the cost of additional MPI communications. Note that open-source algebraic solvers e.

In cell-centered FV methods, the unknowns are located in the cells. This is the option selected in practice in FV codes, although a communication could be used to obtain the full row format without introducing halos on both sides (only one side would be enough).

In fact, let us imagine that subdomain 1 does not hold the halo cell 3. To obtain the full row for cell 2, a communication could be used to pass coefficient $A_{23}$ from subdomain 2 to subdomain 1.

Load balance. As far as load balance is concerned, the partial row method is the one which a priori enables one to control the load balance of the assembly, as elements are not duplicated. On the other hand, in the full row method, the number of halo elements depends greatly upon the partition. In addition, the work on these elements is duplicated and thus limits the scalability of the assembly: for a given mesh, the relative number of halo elements with respect to interior elements increases with the number of subdomains.
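The growth of the halo fraction with the number of subdomains can be checked on a simple model. The sketch below assumes a structured mesh of n x n cells split into p x p square blocks, with a single layer of face-neighbor halo cells. The function name and the mesh are hypothetical and only meant to show the trend, not the behavior of a real partitioner.

```python
def halo_ratio(n_cells_side, n_parts_side):
    """Total halo cells divided by interior cells for an n x n cell mesh
    split into p x p square blocks (one halo layer, face neighbors only)."""
    block = n_cells_side // n_parts_side       # cells per block side (exact division assumed)
    interior = n_cells_side ** 2               # every cell is interior to exactly one block
    # Number of pairs of adjacent blocks in a p x p arrangement.
    internal_interfaces = 2 * n_parts_side * (n_parts_side - 1)
    # Each interface adds `block` halo cells on each of its two sides.
    halo = 2 * block * internal_interfaces
    return halo / interior

# Same 64 x 64 mesh, increasing number of subdomains (1, 4, 16, 64).
ratios = [halo_ratio(64, p) for p in (1, 2, 4, 8)]
```

For this model the ratio rises from 0 to 0.0625, 0.1875, and 0.4375, so with 64 subdomains the duplicated work already amounts to more than 40% of the interior work.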

Hybrid meshes. In the FE context, should the work load per element be perfectly predicted, the load balance would only depend on the partitioner efficiency (see Section 3). However, to obtain such a prediction of the work load, one should know the exact relative cost of assembling each and every type of element of the hybrid mesh (hexahedra, pyramids, prisms, and tetrahedra). This is a priori impossible, as this cost not only depends on the number of operations, but also on the memory access patterns, which are unpredictable.


High-order methods. When considering high-order approximations, the situations of the FE and FV methods differ. In the first case, the additional degrees of freedom (DOF) appearing in the matrix are confined to the elements. Thus, only the number of interface nodes increases with respect to the same number of elements with a first-order approximation.

In the case of the FV method, high-order methods are generally obtained by introducing successive layers of halos, thus reducing the scalability of the method.

Sparse matrix vector product. As mentioned earlier, the main operation of Krylov-based iterative solvers is the SpMV. We will see in the next section that the partial row and full row matrices lead to different communication orders and patterns.

The following is explained in the FE context but can be translated straightforwardly to the FV context. Finite element assembly consists in computing element matrices and right-hand sides $A^{(e)}$ and $b^{(e)}$ for each element $e$, and assembling them into the local matrices and RHS of each MPI process $i$, namely $A^{(i)}$ and $b^{(i)}$. From the point of view of each MPI process, the assembly can thus be idealized as in Algorithm 1.
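The element loop described above can be sketched as follows. This is only an illustration of the gather, compute, and scatter pattern on a hypothetical 1D Laplacian with a unit source term. It is not a transcription of Algorithm 1, and the function name, mesh, and dense storage are assumptions made for brevity.

```python
def assemble_local(coords, elements):
    """Loop over the local elements: gather, compute element arrays, scatter."""
    n = len(coords)
    A = [[0.0] * n for _ in range(n)]   # local matrix (dense for simplicity)
    b = [0.0] * n                       # local right-hand side
    for nodes in elements:
        # Gather: fetch the nodal coordinates of this element.
        x0, x1 = (coords[k] for k in nodes)
        h = x1 - x0
        # Compute the element matrix and RHS (1D Laplacian, unit source term).
        Ae = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]
        be = [h / 2.0, h / 2.0]
        # Scatter: add the element contributions to the local matrix and RHS.
        for i, p in enumerate(nodes):
            b[p] += be[i]
            for j, q in enumerate(nodes):
                A[p][q] += Ae[i][j]
    return A, b

coords = [0.0, 0.5, 1.0, 1.5]
elements = [(0, 1), (1, 2), (2, 3)]
A, b = assemble_local(coords, elements)
```

Nodes 1 and 2 receive contributions from two elements each, which is exactly where concurrent updates would collide if the loop iterations were run by different threads.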

OpenMP pragmas can be used to parallelize Algorithm 1 quite straightforwardly, as we will see in a moment. So why has shared memory parallelism had so little success in CFD codes? When using MPI, most of the computational kernels are parallel by construction, as they consist of loops over local mesh entities such as elements, nodes, and faces, even though scalability is obviously limited by communications.

One example of a possible sequential kernel is the coarse grain solver described in Section 5. As an example, Alya code [24] has more than element loops. However, the situation is changing, for two main reasons. First, nowadays, supercomputers offer a great variety of architectures, with many cores on nodes e. Thus, shared memory parallelism is gaining more and more attention as OpenMP offers more flexibility to parallel programming. In fact, sequential kernels can be parallelized at the shared memory level using OpenMP: one example is once more the coarse solve of iterative solvers; another example is the possibility of using dynamic load balance on shared memory nodes, as explained in [25] and introduced in Section 4.

As mentioned earlier, the parallelization of the assembly has traditionally been based on loop parallelism using OpenMP. Two main characteristics of this loop have led to different algorithms in the literature.

## Editorial: Algorithmic Aspects of High-Performance Computing for Mechanics and Physics

On the one hand, there exists a race condition. The race condition comes from the fact that different OpenMP threads can access the same degree of freedom coefficient when performing the scatter of the element matrix and RHS, in step 4 of Algorithm 1. On the other hand, spatial locality must be taken care of in order to obtain an efficient algorithm.

Figure: Shared memory parallelism techniques using OpenMP.

The first method consists in protecting the scatter with the ATOMIC pragma. The cost of the ATOMIC pragma comes from the fact that we do not know a priori when conflicts occur, and thus this pragma must be used at each loop iteration.

This lowers the IPC defined in Section 1.

Loop parallelism using element coloring. The second method consists in coloring [26] the elements of the mesh such that elements of the same color do not share nodes [27], or such that cells of the same color do not share faces in the FV context. The main drawback is that spatial locality is reduced by construction of the coloring. In [28], a comprehensive comparison of this technique and the previous one is presented.

Loop parallelism using element partitioning. In order to preserve spatial locality while getting rid of the ATOMIC pragma, another technique consists in partitioning the local mesh of each MPI process into disjoint sets of elements e.
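To make the coloring approach described above concrete, here is a minimal greedy sketch in which two elements never receive the same color if they share a node. The greedy strategy, the function name, and the small strip of quadrilaterals are assumptions made for illustration; the cited references may rely on different coloring algorithms.

```python
def color_elements(elements):
    """Greedy coloring: elements with the same color share no node."""
    node_colors = {}   # node -> colors already taken by elements touching it
    colors = []
    for nodes in elements:
        # Colors forbidden for this element: any color used around its nodes.
        used = set()
        for n in nodes:
            used |= node_colors.get(n, set())
        # Pick the smallest color that is still free.
        c = 0
        while c in used:
            c += 1
        colors.append(c)
        for n in nodes:
            node_colors.setdefault(n, set()).add(c)
    return colors

# Strip of four quadrilaterals: bottom nodes 0..4, top nodes 5..9.
elements = [(0, 1, 6, 5), (1, 2, 7, 6), (2, 3, 8, 7), (3, 4, 9, 8)]
colors = color_elements(elements)

# Group elements by color: each group can be assembled by concurrent
# threads without ATOMIC, and the groups are processed one after another.
batches = {}
for e, c in enumerate(colors):
    batches.setdefault(c, []).append(e)
```

In this example the two resulting groups contain elements 0 and 2, then 1 and 3. Consecutive elements in memory end up in different groups, which illustrates the loss of spatial locality mentioned above.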

Then, one defines separators as the layers of elements which connect neighboring subdomains. By doing this, elements of different subdomains do not share nodes.

Task parallelism using multidependences. Task parallelism could be used instead of loop parallelism, but the three algorithms presented previously would not change [30, 31, 32]. Two new features implemented in OmpSs (a forerunner of OpenMP), and not yet included in the standard, can help: multidependences and commutative.


These would allow us to express incompatibilities between subdomains. The mesh of each MPI process is partitioned into disjoint sets of elements, and by prescribing the neighboring information in the OpenMP pragma, the runtime will take care of not executing neighboring subdomains at the same time [33].

As explained in Section 1. The x-axis is time, while the y-axis is the MPI process number, and the dark grey color represents the element loop assembly (Algorithm 1).

After the assembly, the next operation is a reduction involving MPI (the initial residual norm of the iterative solver, which is quite common in practice). Therefore, this is a synchronization point where MPI processes are stuck until all have reached it. We can observe in the figure that one of the cores takes almost twice as long to perform this operation, resulting in a load imbalance.
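The effect of such a slow process can be quantified with a small worked example. The sketch below uses the common definition of load balance as the average time divided by the maximum time; whether Section 1 uses exactly this definition is an assumption, and the timings are hypothetical values chosen to mimic one process taking twice as long as the others.

```python
def load_balance(times):
    """Average over maximum per-process time; 1.0 means perfect balance."""
    return sum(times) / len(times) / max(times)

# Hypothetical assembly times (seconds) for four MPI processes,
# one of which takes twice as long as the others.
times = [1.0, 1.0, 1.0, 2.0]
lb = load_balance(times)

# Time each process spends waiting at the reduction for the slowest one.
idle = [max(times) - t for t in times]
```

With these numbers the load balance is 0.625: three of the four processes wait one second at the reduction, so more than a third of the allocated core time is lost.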

Load imbalance has many causes: mesh adaptation as described in Section 2. The example presented in the figure is due to wrong element weights given to the METIS partitioner for the partition of a hybrid mesh [28].


There are several works in the literature that deal with load imbalance at runtime. We can classify them into two main groups: the ones implemented by the application (possibly using external tools), and the ones provided by runtime libraries and transparent to the application code. In the first group, one approach would be to perform local element redistribution from neighbors to neighbors.

Thus, only limited point-to-point communications are necessary, but this technique also provides only limited control over the global load balance. Another option consists in repartitioning the mesh to achieve a better load distribution. In order for this to be efficient, a parallel partitioner e.

In addition, this method is expensive, so the imbalance must be high for it to be an interesting option.

In general, these libraries will detect the load imbalance and migrate objects or specific data structures between processes.

They usually require the use of a specific programming language, programming model, or data structures, and thus a high level of code rewriting in the application. Finally, the approach that has been used by the authors is called DLB [25] and has been extensively studied in [28, 33, 36] in the CFD context.

Figure: Principles of dynamic load balance with DLB [25], via resource sharing at the shared memory level. Threads running on cores 3 and 4 are clearly responsible for the load imbalance. When using DLB, threads running on cores 1 and 2 lend their resources as soon as they enter the synchronization point, for example, an MPI reduction (represented by the orange bar).