void p3dfft_setup()

*Function:* called once at the start of use to initialize P3DFFT++.

void p3dfft_cleanup()

*Function:* called once after all use of P3DFFT++ and before exit, to free up P3DFFT++ structures.

Grid *p3dfft_init_grid(int gdims[3],int pgrid[3],int proc_order[3],int mem_order[3],MPI_Comm mpicomm);

*Function:* Initializes a new grid with specified parameters.

*Arguments:*

*gdims:* three global grid dimensions (logical order - X, Y, Z)

*pgrid:* up to three dimensions of the processor grid, decomposing the global grid array. A value of 1 means the grid is not decomposed but is local in that logical dimension.

*proc_order:* a permutation of the three integers 0, 1 and 2. Specifies the topology of the processor grid on the interconnect. A lower value means the MPI tasks in that dimension are closer in rank: value 0 means the ranks are adjacent (stride=1), value 1 means they are spread out with a stride equal to the pgrid value of the stride-1 dimension, and so on.

*mem_order:* a permutation of the three integers 0, 1 and 2. Specifies the mapping between logical dimensions and memory storage dimensions in the local memory of each MPI task. mem_order[i0] = 0 means that logical dimension i0 is stored with stride=1 in memory. Similarly, mem_order[i1] = 1 means that logical dimension i1 is stored with stride=ldims[i0], and so on.

*mpicomm:* the MPI communicator in which this grid lives

*Return value:* a pointer to the newly initialized Grid structure that can later be used for grid operations and to get information about the grid.
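
As an illustration of the *proc_order* rule above, the following minimal sketch computes the rank stride of each processor-grid dimension: the stride of a dimension is taken to be the product of the pgrid sizes of all dimensions with a lower proc_order value. The helpers rank_strides and rank_of are hypothetical and not part of the P3DFFT++ API. The library performs its own task mapping internally, so this is only one reading of the rule as stated.

```c
#include <assert.h>

/* Hypothetical helper (not a P3DFFT++ function): rank stride of each
   processor-grid dimension. A dimension's stride is the product of
   pgrid[j] over all dimensions j with a lower proc_order value. */
void rank_strides(const int pgrid[3], const int proc_order[3], int stride[3])
{
    for (int i = 0; i < 3; i++) {
        assert(proc_order[i] >= 0 && proc_order[i] <= 2);
        stride[i] = 1;
        for (int j = 0; j < 3; j++) {
            if (proc_order[j] < proc_order[i])
                stride[i] *= pgrid[j];
        }
    }
}

/* Hypothetical helper: rank of the task at position coords[] in the
   processor grid, under the stride layout computed above. */
int rank_of(const int coords[3], const int stride[3])
{
    return coords[0] * stride[0] + coords[1] * stride[1] + coords[2] * stride[2];
}
```

For pgrid = {1, 4, 2} and proc_order = {0, 1, 2}, this gives strides {1, 1, 4}: tasks adjacent in Y have adjacent ranks, and tasks adjacent in Z are 4 ranks apart.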

The Grid structure is defined as follows:

struct Grid_struct {

int taskid,numtasks;

int nd; //number of dimensions the volume is split over

int gdims[3]; //Global dimensions

int mem_order[3]; //Memory ordering inside the data volume

int ldims[3]; //Local dimensions on THIS processor

int pgrid[3]; //Processor grid

int proc_order[3]; //Ordering of tasks in processor grid, e.g. (0,1,2): first dimension - adjacent tasks, then second, then third dimension

int P[3]; //Processor grid size (in inverse order of split dimensions, i.e. rows first, then columns etc.)

int D[3]; //Ranks of Dimensions of physical grid split over rows and columns correspondingly

int L[3]; //Rank of Local dimension (p=1)

int grid_id[3]; //Position of this pencil/cube in the processor grid

int grid_id_cart[3];

int glob_start[3]; // Starting coords of this cube in the global grid

MPI_Comm mpi_comm_glob; // Global MPI communicator we are starting from

MPI_Comm mpi_comm_cart;

MPI_Comm mpicomm[3]; //MPI communicators for each dimension

} ;

typedef struct Grid_struct Grid;
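
To illustrate how the mem_order and ldims members relate to the layout of the local array, here is a minimal sketch that converts a logical (X, Y, Z) index into a linear offset, counted in array elements. It assumes that ldims is given in logical order, so that the extent at storage position mem_order[i] is ldims[i]. The helper local_offset is hypothetical and not part of the P3DFFT++ API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a P3DFFT++ function): linear offset of the
   element with logical index idx[] in a local array whose logical
   extents are ldims[] and whose storage order is mem_order[]. */
size_t local_offset(const int ldims[3], const int mem_order[3], const int idx[3])
{
    int sdims[3]; /* local extents in storage order (position 0 = stride-1) */
    int sidx[3];  /* the same index, rearranged into storage order */

    for (int i = 0; i < 3; i++) {
        assert(mem_order[i] >= 0 && mem_order[i] <= 2);
        assert(idx[i] >= 0 && idx[i] < ldims[i]);
        sdims[mem_order[i]] = ldims[i];
        sidx[mem_order[i]] = idx[i];
    }
    return (size_t)sidx[0]
         + (size_t)sdims[0] * ((size_t)sidx[1] + (size_t)sdims[1] * (size_t)sidx[2]);
}
```

With mem_order = {2, 1, 0}, for instance, the Z index varies fastest in memory and the X index slowest.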

void p3dfft_free_grid(Grid *gr)

*Function:* frees up the grid.

*Arguments:*

*gr:* pointer to Grid structure.

1D transforms can be done with or without data exchange and/or memory reordering. In general, combining a transform with an exchange/reordering can be beneficial for performance due to cache reuse, compared to two separate calls to a transform and an exchange.

The following predefined 1D transforms are available:

P3DFFT_EMPTY_TYPE - empty transform

P3DFFT_R2CFFT_S, P3DFFT_R2CFFT_D - real-to-complex forward FFT (as defined in FFTW manual), in single and double precision respectively

P3DFFT_C2RFFT_S, P3DFFT_C2RFFT_D - complex-to-real backward FFT (as defined in FFTW manual), in single and double precision respectively

P3DFFT_CFFT_FORWARD_S, P3DFFT_CFFT_FORWARD_D - complex forward FFT (as defined in FFTW manual), in single and double precision respectively

P3DFFT_CFFT_BACKWARD_S, P3DFFT_CFFT_BACKWARD_D - complex backward FFT (as defined in FFTW manual), in single and double precision respectively

P3DFFT_DCT<x>_REAL_S, P3DFFT_DCT<x>_REAL_D - cosine transform for real-valued data, in single and double precision respectively, where <x> is the variant of the cosine transform (1, 2, 3 or 4, giving DCT1, DCT2, DCT3 or DCT4)

P3DFFT_DST<x>_REAL_S, P3DFFT_DST<x>_REAL_D - sine transform for real-valued data, in single and double precision respectively, where <x> is the variant of the sine transform (1, 2, 3 or 4, giving DST1, DST2, DST3 or DST4)

P3DFFT_DCT<x>_COMPLEX_S, P3DFFT_DCT<x>_COMPLEX_D - cosine transform for complex-valued data, in single and double precision respectively, where <x> is the variant of the cosine transform (1, 2, 3 or 4, giving DCT1, DCT2, DCT3 or DCT4)

P3DFFT_DST<x>_COMPLEX_S, P3DFFT_DST<x>_COMPLEX_D - sine transform for complex-valued data, in single and double precision respectively, where <x> is the variant of the sine transform (1, 2, 3 or 4, giving DST1, DST2, DST3 or DST4)

P3DFFT_CHEB_REAL_S, P3DFFT_CHEB_REAL_D - Chebyshev transform for real-valued data, in single and double precision respectively

P3DFFT_CHEB_COMPLEX_S, P3DFFT_CHEB_COMPLEX_D - Chebyshev transform for complex-numbered data, in single and double precision

int p3dfft_plan_1Dtrans(Grid *gridIn, Grid *gridOut, int type1D, int dim, int inplace)

*Function:* defines and plans a 1D transform of a 3D array

*Arguments:*

*gridIn* and *gridOut* are pointers to the C equivalent of the P3DFFT++ grid object (initial and final)

*type1D* is the 1D transform type, i.e. one of the predefined 1D transforms listed above

*dim* is the logical dimension of the transform (0, 1 or 2). Note that this is the logical dimension rank (0 for X, 1 for Y, 2 for Z), and may not be the same as the storage dimension, which depends on the mem_order member of gridIn and gridOut. The transform dimension of the grid is assumed to be MPI task-local.

*inplace* is an integer indicating an in-place transform if it is non-zero, out-of-place otherwise.

*Return value:* The function returns a handle for the transform that can be used in other function calls.

*Notes:*

This initialization/planning needs to be done once per transform type.
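
Because the transform dimension is assumed to be MPI task-local, it can be useful to verify this before planning. The sketch below checks that the processor grid is not decomposed along the chosen logical dimension (a pgrid value of 1) and reports the storage position given by mem_order. The function transform_dim_storage_rank is hypothetical and not part of the P3DFFT++ API. It operates on plain arrays such as the pgrid and mem_order members of the Grid structure, and would be called once for gridIn and once for gridOut.

```c
#include <assert.h>

/* Hypothetical helper (not a P3DFFT++ function).
   Returns mem_order[dim] (0 means the transform dimension is stride-1
   in memory), or -1 if dim is out of range or the grid is decomposed
   along dim, i.e. the dimension is not local to the MPI task. */
int transform_dim_storage_rank(const int pgrid[3], const int mem_order[3], int dim)
{
    if (dim < 0 || dim > 2)
        return -1;
    if (pgrid[dim] != 1)
        return -1;
    assert(mem_order[dim] >= 0 && mem_order[dim] <= 2);
    return mem_order[dim];
}
```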

void p3dfft_exec_1Dtrans_double(int mytrans, double *IN, double *OUT)

void p3dfft_exec_1Dtrans_single(int mytrans, float *IN, float *OUT)

*Function:* executes a 1D transform of a 3D array, in double or single precision respectively

*Arguments:*

*mytrans* is the handle for the 1D transform

*IN* and *OUT* are pointers to one-dimensional input and output arrays containing the 3D grid stored contiguously in memory based on the local grid dimensions and storage order of *gridIn* and *gridOut*.

*Notes:*

1) The execution can be performed many times with the same handle and same or different input and output arrays.

2) For an out-of-place transform, the input and output arrays must not overlap.

3) Both input and output arrays must be local in the dimension of the transform.

int p3dfft_init_3Dtype(int type_ids[3])

*Function:* Defines a 3D transform type

*Arguments:*

*type_ids:* an array of three 1D transform types.

*Return value:* a handle for 3D transform type.

*Example:*

*int type_rcc, type_ids[3];*

*type_ids[0] = P3DFFT_R2CFFT_D;*

*type_ids[1] = P3DFFT_CFFT_FORWARD_D;*

*type_ids[2] = P3DFFT_CFFT_FORWARD_D;*

*type_rcc = p3dfft_init_3Dtype(type_ids);*

In this example *type_rcc* will describe the real-to-complex (R2C) 3D transform (R2C in 1D followed by two complex 1D transforms).

int p3dfft_plan_3Dtrans(Grid *gridIn, Grid *gridOut, int type3D, int inplace)

*Function:* plans a 3D transform

*Arguments:*

*gridIn* and *gridOut* are pointers to the initial and final grid objects

*type3D* is the 3D transform type, as returned by p3dfft_init_3Dtype

*inplace* is an integer indicating an in-place transform if it is non-zero, out-of-place otherwise.

*Return value:* The function returns an integer handle to the 3D transform, which can be passed to an execute function any number of times.

*Notes:*

1) This initialization/planning needs to be done once per transform type.

2) The final grid may or may not be the same as the initial grid. First, in real-to-complex and complex-to-real transforms the global grid dimensions change, for example from (n0,n1,n2) to (n0/2+1,n1,n2), since most applications save memory by exploiting the conjugate symmetry of the Fourier transform of real data. Second, the final grid may have a different processor distribution and memory ordering, since many applications, for example those computing convolutions or solving partial differential equations, do not need the initial grid configuration in Fourier space. The flow of these applications is typically: (a) transform from physical to Fourier space, (b) apply a convolution or derivative calculation in Fourier space, and (c) inverse FFT back to physical space. Since the last step of the forward FFT is a 1D FFT in the third dimension, it is more efficient to leave this dimension local and stride-1. Since the inverse FFT begins with the 1D FFT in the third dimension, this layout fits the algorithm naturally and saves considerable time by eliminating several extra transposes.
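
The change of global dimensions in a real-to-complex transform can be sketched as follows. This minimal example assumes, as in the (n0,n1,n2) example above, that the real-to-complex 1D transform acts on the first logical dimension. The helper r2c_output_gdims is hypothetical and not part of the P3DFFT++ API.

```c
#include <assert.h>

/* Hypothetical helper (not a P3DFFT++ function): global dimensions of
   the complex output grid of a real-to-complex transform whose R2C
   step acts on the first logical dimension. */
void r2c_output_gdims(const int gdims_in[3], int gdims_out[3])
{
    assert(gdims_in[0] > 0);
    gdims_out[0] = gdims_in[0] / 2 + 1; /* integer division: n0/2+1 */
    gdims_out[1] = gdims_in[1];
    gdims_out[2] = gdims_in[2];
}
```

For a 128 x 64 x 32 real input grid this yields a 65 x 64 x 32 complex output grid. These would presumably be the global dimensions used when initializing the final grid.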

void p3dfft_exec_3Dtrans_single(int mytrans, float *In, float *Out, int overwrite)

void p3dfft_exec_3Dtrans_double(int mytrans, double *In, double *Out, int overwrite)

*Function:* executes a 3D transform in single or double precision, respectively

*Arguments:*

*mytrans* is the handle of the 3D transform, as returned by p3dfft_plan_3Dtrans

*In* and *Out* are pointers to input and output arrays, assumed to be the local portion of the 3D grid array

*overwrite* indicates whether it is permitted to overwrite the input array (this is relevant only for out-of-place transforms).

*Notes:*

1) Unless mytrans was planned as an in-place transform, In and Out must not overlap.

2) These functions can be used multiple times after the 3D transform has been defined and planned.
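
The non-overlap requirement in note 1 can be checked by the caller before executing an out-of-place transform. The following sketch is one possible check. The helper buffers_overlap is hypothetical and not part of the P3DFFT++ API, and it assumes a flat address space in which pointers converted to uintptr_t compare meaningfully, which holds on common platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a P3DFFT++ function): returns 1 if the byte
   ranges [a, a + a_bytes) and [b, b + b_bytes) overlap, 0 otherwise. */
int buffers_overlap(const void *a, size_t a_bytes, const void *b, size_t b_bytes)
{
    assert(a != NULL && b != NULL);
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;
    return a0 < b0 + b_bytes && b0 < a0 + a_bytes;
}
```

The byte counts would be the sizes of the local input and output arrays. Note that the two may differ, for example in a real-to-complex transform.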