\section{Parallel Computation}
\label{sec:parallelcomputation}
Multicore processors are standard nowadays and parallel programming is the key to gaining
performance from modern computers. This section explains how \Dumux can be used
on multicore / multinode systems, ranging from the user's desktop computer or laptop to
high-performance computing clusters.
There are different concepts and methods for parallel programming, which are
often grouped into \textit{shared-memory} and \textit{distributed-memory}
approaches. The parallelization in \Dumux follows the model supported by \Dune,
which is currently based on the \textit{Message Passing Interface} (MPI),
a distributed-memory approach. MPI parallelization allows the user to run
\Dumux applications in parallel, provided that the chosen \Dumux model supports
parallel computations. This is the case for most \Dumux applications, with the
exception of the multidomain and free-flow models.

The main idea behind the MPI parallelization is the concept of \textit{domain
decomposition}. For parallel simulations, the computational domain is split into
subdomains and one process (\textit{rank}) solves the local problem on each
subdomain. During the global solution process, some data exchange between the
ranks/subdomains is needed. MPI is used to send data to and receive data from
other ranks. The domain decomposition in \Dune is handled by the grid managers:
the grid is partitioned and distributed over several nodes. Most grid managers
provide their own domain decomposition methods to split the computational
domain into subdomains. Some grid managers also support external
tools like METIS, ParMETIS, PTScotch or ZOLTAN for partitioning.
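To illustrate the concept, the following minimal sketch shows domain decomposition at the
\Dune level, assuming a structured \texttt{Dune::YaspGrid} created directly in the main
function (in a typical \Dumux application, the grid manager takes care of these steps):
\begin{lstlisting}[style=DumuxCode]
#include <iostream>
#include <dune/common/parallel/mpihelper.hh>
#include <dune/grid/yaspgrid.hh>

int main(int argc, char** argv)
{
    // initialize MPI (a no-op for sequential runs)
    const auto& mpiHelper = Dune::MPIHelper::instance(argc, argv);

    // structured 2D grid on the unit square with 100x100 cells
    using Grid = Dune::YaspGrid<2>;
    Grid grid({1.0, 1.0}, {100, 100});

    // distribute/re-balance the grid among the available ranks;
    // YaspGrid is already distributed at construction, so this call
    // mainly matters for grids that are read in on a single rank
    grid.loadBalance();

    // each rank now only sees its own subdomain
    std::cout << "rank " << mpiHelper.rank() << " holds "
              << grid.leafGridView().size(0) << " elements" << std::endl;
    return 0;
}
\end{lstlisting}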
Linear algebra types such as matrices and vectors, on the other hand, are not
aware that they are used in a parallel environment. Communication is instead handled by the
components of the parallel solvers. Currently, the only parallel solver backend is \texttt{Dumux::AMGBackend},
a parallel AMG-preconditioned BiCGSTAB solver.
In order for \Dumux simulations to run in parallel, an
MPI library implementation (e.g. OpenMPI, MPICH or IntelMPI)
must be installed on the system and all \Dune modules and \Dumux must be
recompiled against it.
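For example, after the MPI library has been installed, the \Dune modules and \Dumux can be
reconfigured and rebuilt with \texttt{dunecontrol}; the options file name below is only a
placeholder and depends on your installation:
\begin{lstlisting}[style=Bash]
# check that the MPI compiler wrappers are available
mpicc --version

# reconfigure and rebuild all Dune modules and Dumux
./dune-common/bin/dunecontrol --opts=<options_file>.opts all
\end{lstlisting}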
\subsection{Prepare a Parallel Application}
Not all parts of \Dumux can be used in parallel. One example are the linear solvers
of the sequential backend. However, with \texttt{Dumux::AMGBackend}, \Dumux provides
a parallel solver backend based on Algebraic Multigrid (AMG).
If an application does not already use this backend, the user must switch to it
in order to run the application in parallel.
First, the header file for the parallel AMG backend must be included
\begin{lstlisting}[style=DumuxCode]
#include <dumux/linear/amgbackend.hh>
\end{lstlisting}
so that the backend can be used. The header file of the sequential backend
\begin{lstlisting}[style=DumuxCode]
#include <dumux/linear/seqsolverbackend.hh>
\end{lstlisting}
can be removed.
Second, the linear solver must be switched to the AMG backend
\begin{lstlisting}[style=DumuxCode]
using LinearSolver = Dumux::AMGBackend<TypeTag>;
\end{lstlisting}
The parallel \texttt{Dumux::AMGBackend} instance has to be
constructed with a \texttt{Dune::GridView} object and a mapper, in order to build the
parallel index set needed for communication:
\begin{lstlisting}[style=DumuxCode]
auto linearSolver = std::make_shared<LinearSolver>(leafGridView, fvGridGeometry->dofMapper());
\end{lstlisting}
Afterwards, the application must be recompiled.
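For orientation, the following sketch shows how the parallel backend is then typically
passed on to the nonlinear solver in the main file; the \texttt{assembler} and the solution
vector \texttt{x} are assumed to be set up as usual and are not specific to the parallel case:
\begin{lstlisting}[style=DumuxCode]
#include <dumux/nonlinear/newtonsolver.hh>

// the parallel linear solver is simply handed to the Newton solver
using NewtonSolver = Dumux::NewtonSolver<Assembler, LinearSolver>;
NewtonSolver nonLinearSolver(assembler, linearSolver);
nonLinearSolver.solve(x);
\end{lstlisting}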
\subsection{Run a Parallel Application}
The starting procedure for parallel simulations depends on the chosen MPI library.
Most MPI implementations use the \textbf{mpirun} command
\begin{lstlisting}[style=Bash]
mpirun -np <n_cores> <executable_name>
\end{lstlisting}
where \textit{-np} sets the number of cores (\texttt{n\_cores}) that should be used for the
computation. On a cluster you usually have to use a queuing system (e.g. slurm) to
submit a job. Check with your cluster administrator how to run parallel applications on the cluster.
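As a rough illustration, a job script for the slurm queuing system could look like the
following; the resource requests and the executable name are placeholders and the exact
directives depend on the cluster:
\begin{lstlisting}[style=Bash]
#!/bin/bash
#SBATCH --job-name=dumux-parallel   # placeholder job name
#SBATCH --ntasks=32                 # number of MPI ranks
#SBATCH --time=02:00:00             # wall-clock time limit

# start one MPI process per requested task
mpirun -np $SLURM_NTASKS ./<executable_name>
\end{lstlisting}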
\subsection{Handling Parallel Results}
For most models, the results should not differ between parallel and serial
runs. However, parallel computations are not naturally deterministic.
A typical case where one cannot assume deterministic behavior are models in which
small differences in the solution can cause large differences in the results
(e.g. for some turbulent flow problems). Nevertheless, it is reasonable to expect that
the simulation results do not depend on the number of cores, so you should double-check
whether the model is really non-deterministic. Typical reasons for wrong non-deterministic
behavior are errors in the parallel computation of boundary conditions or missing/reduced
data exchange in higher-order gradient approximations. Also keep in mind that
for iterative solvers, differences in the solution can occur because the iteration is only
converged up to a prescribed error threshold.
For serial computations, \Dumux produces single vtu files as the default output format.
During a simulation, one vtu file is written for every output step.
In the parallel case, one vtu file for each step and processor is created.
For parallel computations, an additional variable \texttt{"process rank"} is written
into the file. The process rank allows the user to inspect the subdomains
after the computation. The parallel vtu files are combined in a single pvd file
like in sequential simulations, which can be opened with e.g. ParaView.
\subsection{MPI scaling}
For parallel computations, the number of cores must be chosen
carefully. Using too many cores will not always lead to more performance, but
can instead lead to inefficiency. One reason is that for small subdomains, the
communication between the subdomains becomes the limiting factor for parallel computations.
The user should test the MPI scaling (the relation between the number of cores and the computation time)
for each specific application to ensure fast and efficient use of the given resources.
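A simple way to get a first impression of the scaling behavior is to time the same run
with an increasing number of cores, for example (the executable name is a placeholder):
\begin{lstlisting}[style=Bash]
# run the same simulation with 1, 2, 4 and 8 cores and record the runtime
for n in 1 2 4 8; do
    echo "running with $n cores"
    time mpirun -np $n ./<executable_name>
done
\end{lstlisting}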