CPSC 521 assignment 7

Back to assignment list.

Exploring MPI

In this assignment, since I was somewhat disappointed with the scalability of assignment 2, the matrix version, I decided to see whether I could come up with a more efficient version using other MPI features.

This assignment does matrix multiplication (not just powers of the same matrix), splitting up each matrix into equal-sized blocks as in assignment 2. The number of processes must be a square number, and the dimensions of the matrix should be divisible by the square root of the number of processes.

Communication Algorithm

I wanted to use MPI_Comm_split to set up a communicator per row and per column of the matrix. My ideas for communication structure were as follows:

Here is the process described in more detail. First, if we are computing the matrix product LaTeX, the left-hand matrix and the right-hand matrix are split into blocks, and the process LaTeX owns block LaTeX from both LaTeX and LaTeX. Now this process needs to multiply its block from LaTeX by the LaTeX'th block on its row from LaTeX, and similarly for any process in the same row; so whichever process in the row owns the LaTeX'th block broadcasts it to its peers (see image, left).

Now each process can compute the first part of the product which will appear at its location. Then, each process performs a ring rotation to shift its data from LaTeX upwards in the column. Now a new iteration begins, where everyone in the row needs subblock LaTeX of LaTeX, so whichever process in the row owns that, broadcasts it. And so on.

It is worth noting that this communication mechanism only sends data that will be useful to the recipient, unlike the code in assignment 2, and every process has some computation to do in each round.

Other communication details

Of course there are other details: for example, how is the ring communication performed, in a way that works even if MPI_Send is blocking? I use the method that might be called an even/odd staggered send/receive: here odd-numbered processes send then receive, and even-numbered processes do the opposite. This scheme will work no matter how much data there is and whether MPI_Send blocks or not.

The broadcast communication is done with MPI_Bcast, from a single source (whoever owns the appropriate block of LaTeX) to the row. This is somewhat inefficient, but it seems necessary unless a systolic computation is done (which will leave processes idle since the program is set up to multiply a series of matrices, and that cannot be begun until the previous product is known). Also, the broadcasts are only performed within a row, which will cut down on network traffic. This is a very useful way of restricting communication to only those processes which need the data: create an MPI communicator with the interested parties, and do broadcasts.

In the current setup, process 0 reads the input data and before each multiplication, splits it up and sends it to the other processes. This is implemented in two ways: sequentially, using send/recv; and using the MPI collectives MPI_Scatter and MPI_Gather. Which version is used depends on whether USE_SCATTER_GATHER in config.h is defined as 0 or 1. As expected, the collectives version was slightly faster so it is used by default. The difference is not large since not much time is spent distributing the matrix (unless there are a lot of really small matrices). The collectives version only works with square matrices but other parts of the code probably also have this restriction, although I tried to avoid this when possible.

Performance

Just for fun, I ran this program versus the matpow program from assignment 2 to see the performance differences. Unfortunately, it is definitely not a fair comparison, since the matpow program knows it is computing powers of a single matrix. The matcomm program on the other hand (to specify a matrix-power problem) must parse repeated copies of the same matrix from its input and use twice as much memory keeping both matrices around.

This disadvantage is so significant, in fact, that for one process run directly (without MPI), the new program takes 11 seconds to compute the first product of a 1024x1024 matrix while the old one takes 0.8 seconds. And profiling shows that the new program is spending 98% of its runtime in the muladd routine (i.e. actually performing the multiplication). It is simply much easier to perform powers of a single matrix. This is probably because of the memory access patterns of the algorithms, combined with compiler optimizations (I used -O2 for more consistent results). Certainly it's much more friendly to the cache to use the same memory locations for one of the sources and the destination (as opposed to a different blocks of memory for each).

Probably there is a better way to do the basic multiplication, but keep in mind this factor-of-ten difference when looking at the timing results below. We were interested in coming up with a more sophisticated communication structure, not necessarily duplicating exactly the functionality of the first program. Regardless, here are the timing results:

Matrix sizeRoundsProcessesmatpow timematcomm time
32x32101 0.0080.057
32x32104 0.0100.044
32x3210160.1280.149
32x3210640.1830.188
32x321001 0.0420.462
32x321004 0.0410.287
32x32100160.2020.407
32x32100640.3570.614
32x3210001 0.0453.407
32x3210004 0.0351.603
32x321000160.3871.669
32x321000641.9793.698
32x3220001 0.0866.566
32x3220004 0.0613.033
32x322000160.5682.915
32x322000643.2256.726

Although the new program is still slower, for the reasons I mentioned, it is far more scalable. In the 1000 rounds case, the older matpow program slows down by a factor of forty times when going from 1 to 64 processes, while the new matcomm program slows down by a factor of 1.08. (A slowdown of 40x is rather extreme but the old program usually shows slowdowns of at least 10x, even with optimizations disabled.) This is because the new program is only sending the information that is useful over the network, while the old one had an quadratic increase in the amount of network traffic (and a linear increase in how much of that traffic was useful). Better scalability was the main goal for this new system and it seems that we have achieved this.

MPI features used

Just to tally up the other MPI features used in this program: extensive use of MPI communicators, broadcast combined with ring communication, and scatter and gather collectives. It would have been interesting to explore MPI's Cartesian topology support, but that would have precluded the interesting communicator setup.

Downloads

Page generated on Tue Oct 24 00:35:18 2017