The new generation of P3DFFT library (dubbed P3DFFT++ or P3DFFT v.3) is a generalization of the concept of P3DFFT for more arbitrary data formats and transform functions.

P3DFFT++ is written in C++ and contains wrappers providing easy interfaces with C and Fortran.

For C++ users all P3DFFT++ objects are defined within the p3dfft namespace, in order to avoid confusion with user-defined objects. For example, to initialize P3DFFT++ it is necessary to call the function p3dfft::setup(), and to exit P3DFFT++ one should call p3dfft::cleanup() (alternatively, one can use namespace p3dfft and call setup() and cleanup()). From here on in this document we will omit the implicit p3dfft:: prefix from all C++ names.

In C and Fortran these functions become p3dfft_setup and p3dfft_cleanup. While C++ users can directly access P3DFFT objects such as grid class, C and Fortran users will access these through handles provided by corresponding wrappers (see more details below).

P3DFFT++ currently operates on four main data types:

- float (single precision floating point)
- double (double precision floating point)
- mycomplex (single precision complex number) (equivalent to complex<float>)
- complex_double (double precision complex number) (equivalent to complex<double>)

While P3DFFT had the assumption of predetermined 2D pencils in X and in Z dimensions as the primary data storage, P3DFFT++ relaxes this assumption to include more general formats, such as arbitrary-shape and memory order 2D pencils as well as 3D blocks. Below is the technical description of how to specify the data layout formats.

A basic P3DFFT++ descriptor is the "grid" construct. It defines all necessary information about decomposition of a grid among parallel tasks/processors. In C++ it is defined as a class, while in C and in Fortran it is defined through handles to a C++ object through inter-language wrappers. Below is the technical description of the definition for each language.

The following is the main constructor call for the grid class:

grid(int gdims[3],int pgrid[3],int proc_order[3],int mem_order[3],MPI_Comm mpicomm);

Here:

*gdims*is the three global dimensions of the grid to be decomposed*pgrid*is the processor grid to be used in decomposition. For example, a 2D pencil with the first dimension local (X-pencil) would be described as having pgrid={1,P1,P2}, where P1 and P2 are the dimensions of 2D decomposition such that P1 x P2 = P, the total number of tasks. Of course a 2D grid could be defined as a Y-pencil (pgrid={P1,1,P2}) or a Z pencil (P1,P2,1). 1D decomposition (slabs) would be defined as (1,1,P), or (1,P,1) or (P,1,1), depending on the orientation of the slabs. 3D decomposition is also possible where each of the three values of pgrid is more than 1.*proc_order*is the ordering of the processor grid dimensions with respect to the default layout of MPI tasks. For example, the simplest ordering where proc_order={0,1,2} and pgrid={1,P1,P2} corresponds to a grid with the second dimension decomposed among P1 MPI tasks adjacent to each other in the default MPI topology, such as tasks on the same node and/or neighboring nodes on the network, while the third dimension would be decomposed among P2 tasks on non-neighboring nodes (with stride equal to P1). On the other hand, proc_order={0,2,1} with the same pgrid would correspond to the third dimension of the grid being split over the P2 tasks adjacent in the MPI/network neighborhood, while the second dimension would be split over P1 tasks distant in the topology with stride equal to P2.*mem_order*is the relative ordering of the three dimensions in memory within the local portion of the grid. The simplest ordering {0,1,2} corresponds to the first logical dimension being stored with the fastest changing index (memory stride=1), followed by the second (stride=L0) and the third dimension (stride=L0*L1). As another example, ordering {2,0,1} means that the second dimension is stored with the fastest-changing index (memory stride=1), the third dimension dimension with the medium stride =L1, and the first dimension is stored with the slowest index, stride=L1*L2. Here L0,L1 and L2 are dimensions of the local portion of the grid for a given task/processor.*mpicomm*is the initial communicator for all subsequent library operations. It is recommended that the users define or duplicate a communicator for P3DFFT++ to be different from the one(s) used in their code, in order to avoid interference.

For example:

main() {

...

int gdims[3],pgrid[3],proc_order[3], mem_order[3];

MPI_Comm mpicomm;

...

grid mygrid(gdims, pgrid, proc_order, mem_order, mpicomm);

}

Upon construction the grid object defines several useful parameters, available by accessing the following public class members:

*int ldims[3] *: dimensions of the local portion of the grid (ldims[0]=gdims[0]/pgrid[0] etc)

*int nd* : number of dimensions of the processor grid (1, 2 or 3).

*int L[3]*: 0 to 3 local dimensions (i.e. not split).

*int D[3]*: 0 to 3 split dimensions

*int glob_start[3]*: coordinates of the lowest element of the local grid within the global array

and other useful information. The grid class also provides a copy constructor.

For C users grid initialization is accomplished by a call to p3dfft_init_grid, returning a pointer to an object of type Grid. This type is a C structure containing a large part of the C++ class grid. Calling p3dfft_init_grid initializes the C++ grid object and also copies the information into a Grid object accessible from C, returning its pointer. For example:

int xdim;

Grid *grid1;

grid1 = p3dfft_init_grid(gdims, pgrid, proc_order, mem_order, mpicomm);

xdim = grid1->ldims[0]; /* Size of zero logical dimension of the local portion of the grid for a given processor */

For Fortran users the grid object is represented as a handle of type *integer(C_INT)*. For example:

integer(C_INT) grid1

integer ldims(3),glob_start(3),gdims(3),pgrid(3),proc_order(3),mem_order(3),mpicomm

grid1 = p3dfft_init_grid(ldims, glob_start, gdims, pgrid, proc_order, mem_order, mpicomm)

This call initializes a C++ grid object as a global variable and assigns an integer ID, returned in this example as *grid1*. In addition this call also returns the dimensions of the local portion of the grid (*ldims*) and the position of this portion within the global array (*glob_start*).

Other elements of the C++ grid object can be accessed through respective functions, such as p3dfft_grid_get_...

P3DFFT++ aims to provide a versatile toolkit of algorithms/transforms in frequent use for solving multiscale problems. To give the user maximum flexibility there is a range of algorithms from top-level algorithms operating on the entire 3D array, to 1D algorithms which can function as building blocks the user can arrange to suit his/her needs. In addition, inter-processor exchanges/transposes are provided, so as to enable the user to rearrange the data from one orientation of pencils to another, as well as other types of exchanges. In P3DFFT++ the one-dimensional transforms are assumed to be expensive in terms of memory bandwidth, and therefore such transforms are performed on local data (i.e. in the dimension that is not distributed across processor grid). Transforms in three dimensions consist of three transforms in one dimension, interspersed by inter-processor interchange as needed to rearrange the data. The 3D transforms is a convenient tool saving the user work in arranging the 1D transforms and transposes, as well as often providing superior performance. We recommend to use 3D transforms if they fit within the user's program.

As mentioned above, three-dimensional (3D) transforms consist of three one-dimensional transforms in sequence (one for each dimension), interspersed by inter-processor transposes. In order to specify a 3D transform, three main things are needed:

- Initial grid (as described above, grid here defines all of the specifics of grid dimensions, memory ordering and distribution among processors).
- Final grid.
- The type of 3D transform.

In order to define the 3D transform type one needs to know three 1D transform types comprising the 3D transform. (See next section for possible 1D types).

Usage of 3D transforms is different depending on the language used.

In C++ 3D transform type is interfaced through a class trans_type3D, which is constructed as follows:

trans_type3D name_type3D(int types1D[3]);

Here types1D is the array of three 1D transform types. Copy constructor is also provided for this class.

3D transforms are provided as the class template:

template<class Type1,class Type2> class transform3D;

Here Type1 and Type2 are initial and final data types. Most of the times these will be the same, however some transforms have different types on input and output, for example real-to-complex FFT. In all cases the floating point precision (single/double) of the initial and final types should match.

The constructor of transform3D takes the following arguments:

transform3D<Type1,Type2> my_transform_name(grid1,grid2,type,inplace);

Here type is a 3D transform type, inplace is a bool variable indicating whether this is an in-place transform (if the input can be rewritten), and grid1 and grid2 are initial and final grid objects. Calling a transform3D constructor creates a detailed step-by-step plan for execution of the 3D transform.

Once a 3D transform has been defined and planned, execution of a 3D transform can be done by calling

my_transform_name.exec(Type1 *in,Type2 *out, int OW);

Here in and out are initial and final data arrays of appropriate types. These are assumed to be one-dimensional contiguous arrays containing the three-dimensional grid for input and output, according to the dimensions and memory ordering specified in the grid1 and grid2 objects, respectively.