MPI - Understanding Block and Block-Cyclic Matrix Distributions
In working with parallel decompositions of matrices, I'm familiar with block distribution, where we have (say) 4 processes, each with its own subregion of the matrix:

So for instance here we have the number of processes in a row (procrows) equal to 2, the number of processes in a column (proccols) equal to two, and the sub-matrices A_local would have size N/2 x M/2 if the original matrix is of size N x M.
I am now reading through an example that uses a "block-cyclic" distribution, and in this part:
/* Begin BLACS context */
/* We assume that we have 4 processes and place them in a 2-by-2 grid */
int ctxt, myid, myrow, mycol, numproc;
int procrows = 2, proccols = 2;
Cblacs_pinfo(&myid, &numproc);
Cblacs_get(0, 0, &ctxt);
Cblacs_gridinit(&ctxt, "Row-major", procrows, proccols);
they have procrows and proccols hardcoded; fine. But for the matrix that's read in, there's a header:

Nb and Mb will be the number of rows and columns of the blocks [of the matrix]

I don't understand this; aren't Nb and Mb completely determined by N, M, procrows, and proccols?
EDIT:

From running the example I can see that the submatrix on process 0 has all the elements of the upper-left corner of the matrix, just like in the picture above, something that comes in contradiction with Jonathan's answer. However, it works fine with ScaLAPACK's Cholesky.
Block decompositions of matrices as you've described in your question are a perfectly valid way of distributing a matrix, but they are not the only way to do it.
In particular, block data distributions (breaking the matrix up into procrows x proccols sub-matrices) are a little inflexible. If the matrix size is not divisible by the number of processes in a row or column - and you generally have no control over the size of the matrix, and only some flexibility with procrows/proccols - you can end up with serious load balancing issues; a sketch of this follows below. Also, it's sometimes very handy to be able to "over decompose" the problem; that is, to break it up into more pieces than you have tasks. In particular, for MPI, since each task is a process, it's useful to be able to have multiple sub-matrices for each process to operate on, so that you can tackle this additional level of parallelism with threading (which is built in to most single-process linear algebra libraries).
The way to get the most flexibility for load-balancing, and to have the highest degree of inter-process parallelism available, is a purely cyclic distribution. In a 1d cyclic distribution, say dividing 15 items up between 4 processors, processor 1 would get item 1, processor 2 item 2, processor 3 item 3, processor 4 item 4, and then processor 1 would get item 5, and so on; you round-robin the items across the processors.

On the other hand, in a 1d block decomposition, processor 1 would get items 1-4, processor 2 items 5-8, and so on.
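To make the two 1d mappings concrete, here is a small sketch (my own illustration, with made-up names) that prints which processor owns each item under a cyclic and under a block distribution:

#include <stdio.h>

/* Sketch: owner of item i (0-based) under a 1d cyclic vs a 1d block
 * distribution of N items over P processors. */
static int owner_cyclic(int i, int P)       { return i % P; }
static int owner_block(int i, int N, int P) { return i / ((N + P - 1) / P); }

int main(void) {
    const int N = 15, P = 4;
    for (int i = 0; i < N; i++)
        printf("item %2d: cyclic -> proc %d, block -> proc %d\n",
               i + 1, owner_cyclic(i, P), owner_block(i, N, P));
    return 0;
}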
A useful figure from the LLNL parallel computing tutorial follows, with each colour labelling which processor got which region of the data:
So the cyclic decomposition is maximally good for parallelism and load-balancing, but it's terrible for data access; every neighbouring piece of data you'd want to be able to access to do linear algebra operations is off-processor. On the other hand, the block decomposition is maximally good for data access; you have as large a contiguous chunk of data as possible, so you can do matrix operations on nice big sub-matrices; but it's inflexible for parallelism and can cost in terms of load-balancing.
Block-cyclic is an interpolation between the two; you over-decompose the matrix into blocks, and cyclicly distribute those blocks across the processes. This lets you tune the tradeoff between data access contiguity and flexibility. If the block-cyclic block sizes are 1, you have a cyclic distribution; if they are N/procrows or N/proccols you have a block distribution; but you can also have anything in between.
Note that in 2d you can in principle choose different decompositions along the rows and the columns, and that's sometimes useful if your matrix is only going to be used in one sort of computation; but the more usual case is that the decomposition is the same along all dimensions, so that when people say "block decomposition" or "block-cyclic decomposition" they mean that along all dimensions.
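In the 2d case the same mapping is simply applied independently in each dimension; a sketch (again with illustrative names, assuming row/column block sizes MB and NB and a procrows x proccols grid) of which process owns element (i, j):

#include <stdio.h>

/* Sketch: owner of global matrix element (i, j), 0-based indices,
 * in a 2d block-cyclic distribution over a procrows x proccols grid
 * with block sizes MB (rows) and NB (columns). */
int main(void) {
    const int M = 6, N = 6, MB = 2, NB = 2, procrows = 2, proccols = 2;
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int prow = (i / MB) % procrows;
            int pcol = (j / NB) % proccols;
            printf("(%d,%d)->P(%d,%d) ", i, j, prow, pcol);
        }
        printf("\n");
    }
    return 0;
}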
A good description of this is on the ScaLAPACK pages at netlib.