CPSC 521 assignment 7
In this assignment, since I was somewhat disappointed with the scalability of assignment 2, the matrix version, I decided to see whether I could come up with a more efficient version using other MPI features.
This assignment does matrix multiplication (not just powers of the same matrix), splitting up each matrix into equal-sized blocks as in assignment 2. The number of processes must be a square number, and the dimensions of the matrix should be divisible by the square root of the number of processes.
I wanted to use
MPI_Comm_split to set up a communicator per row and per column of the matrix. My ideas for communication structure were as follows:
- My initial idea was to split up the left-hand and right-hand matrices and give each thread its corresponding block, then have ring communication going along each row and column. But that does not result in blocks arriving at the right places and at the right times.
- One way to fix this would be to have each process place the ring-communicated data in a queue, to be delayed until the corresponding block arrives. But this comes at a high storage cost (if several threads are running on one machine, there might be many copies of the matrix), and not all processes are computing in each round. In fact, twice as many rounds would be neccessary to finish the computation. Essentially the data is being delayed so that a systolic matrix multiplication can be performed (which is wasteful for just one multiplication).
- The final idea was this. As right-hand blocks are being shifted in ring communication around each column, there is a certain pattern of left-hand blocks that need to be matched -- and these are all in the same row. In fact every process in a row needs the row's first block, then the second block, etc. So I have each process own a block in the row and, when it is needed, broadcast the block to the whole row.
Here is the process described in more detail. First, if we are computing the matrix product , the left-hand matrix and the right-hand matrix are split into blocks, and the process owns block from both and . Now this process needs to multiply its block from by the 'th block on its row from , and similarly for any process in the same row; so whichever process in the row owns the 'th block broadcasts it to its peers (see image, left).
Now each process can compute the first part of the product which will appear at its location. Then, each process performs a ring rotation to shift its data from upwards in the column. Now a new iteration begins, where everyone in the row needs subblock of , so whichever process in the row owns that, broadcasts it. And so on.
It is worth noting that this communication mechanism only sends data that will be useful to the recipient, unlike the code in assignment 2, and every process has some computation to do in each round.
Other communication details
Of course there are other details: for example, how is the ring communication performed, in a way that works even if
MPI_Send is blocking? I use the method that might be called an even/odd staggered send/receive: here odd-numbered processes send then receive, and even-numbered processes do the opposite. This scheme will work no matter how much data there is and whether
MPI_Send blocks or not.
The broadcast communication is done with
MPI_Bcast, from a single source (whoever owns the appropriate block of ) to the row. This is somewhat inefficient, but it seems necessary unless a systolic computation is done (which will leave processes idle since the program is set up to multiply a series of matrices, and that cannot be begun until the previous product is known). Also, the broadcasts are only performed within a row, which will cut down on network traffic. This is a very useful way of restricting communication to only those processes which need the data: create an MPI communicator with the interested parties, and do broadcasts.
In the current setup, process 0 reads the input data and before each multiplication, splits it up and sends it to the other processes. This is implemented in two ways: sequentially, using send/recv; and using the MPI collectives
MPI_Gather. Which version is used depends on whether
USE_SCATTER_GATHER in config.h is defined as 0 or 1. As expected, the collectives version was slightly faster so it is used by default. The difference is not large since not much time is spent distributing the matrix (unless there are a lot of really small matrices). The collectives version only works with square matrices but other parts of the code probably also have this restriction, although I tried to avoid this when possible.
Just for fun, I ran this program versus the matpow program from assignment 2 to see the performance differences. Unfortunately, it is definitely not a fair comparison, since the matpow program knows it is computing powers of a single matrix. The matcomm program on the other hand (to specify a matrix-power problem) must parse repeated copies of the same matrix from its input and use twice as much memory keeping both matrices around.
This disadvantage is so significant, in fact, that for one process run directly (without MPI), the new program takes 11 seconds to compute the first product of a 1024x1024 matrix while the old one takes 0.8 seconds. And profiling shows that the new program is spending 98% of its runtime in the muladd routine (i.e. actually performing the multiplication). It is simply much easier to perform powers of a single matrix. This is probably because of the memory access patterns of the algorithms, combined with compiler optimizations (I used -O2 for more consistent results). Certainly it's much more friendly to the cache to use the same memory locations for one of the sources and the destination (as opposed to a different blocks of memory for each).
Probably there is a better way to do the basic multiplication, but keep in mind this factor-of-ten difference when looking at the timing results below. We were interested in coming up with a more sophisticated communication structure, not necessarily duplicating exactly the functionality of the first program. Regardless, here are the timing results:
|Matrix size||Rounds||Processes||matpow time||matcomm time|
Although the new program is still slower, for the reasons I mentioned, it is far more scalable. In the 1000 rounds case, the older matpow program slows down by a factor of forty times when going from 1 to 64 processes, while the new matcomm program slows down by a factor of 1.08. (A slowdown of 40x is rather extreme but the old program usually shows slowdowns of at least 10x, even with optimizations disabled.) This is because the new program is only sending the information that is useful over the network, while the old one had an quadratic increase in the amount of network traffic (and a linear increase in how much of that traffic was useful). Better scalability was the main goal for this new system and it seems that we have achieved this.
MPI features used
Just to tally up the other MPI features used in this program: extensive use of MPI communicators, broadcast combined with ring communication, and scatter and gather collectives. It would have been interesting to explore MPI's Cartesian topology support, but that would have precluded the interesting communicator setup.
- matcomm-final.tar.gz (2.2 MB): 776-line C++ source code for matcomm, along with testing utility to compare with matpow