CPSC 521 assignment 4 (MPE visualizations)
In this assignment, I took the n-body program from assignment 2 and profiled it using MPE, visualizing the results with Jumpshot. I then manually added MPE events to the code to produce more useful visualizations. The results are presented below.
Scaling up the amount of work to approach 100% efficiency
These first two visualizations show assignment 2 running on four threads, ten iterations, granularity 10 (hence 40 bodies overall). The image on the left uses the automatic MPE visualization, and the image on the right uses hand-coded events (some initialization is cut off so that the time scale matches the image on the left). Red and orange are initialization and finalization; green and blue are computation; and yellow is communication.
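The hand-coded events were added with MPE's manual logging API. The sketch below shows the general shape of that instrumentation; the state name and color are illustrative (the exact events added to the assignment 2 code are in assign2-mpe-ins-diff.txt), and it must be linked against MPE (e.g. with -lmpe):

```c
/* Sketch of manual MPE state logging. State names/colors here are
 * illustrative, not the exact ones used in the assignment. */
#include "mpi.h"
#include "mpe.h"

int main(int argc, char **argv) {
    int ev_start, ev_end;  /* start/end event numbers defining one state */

    MPI_Init(&argc, &argv);
    MPE_Init_log();

    /* Allocate a pair of event numbers and describe them as a named,
     * colored state so Jumpshot can draw it. */
    ev_start = MPE_Log_get_event_number();
    ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_start, ev_end, "local gravity", "green");

    MPE_Log_event(ev_start, 0, NULL);  /* state begins */
    /* ... the work being timed, e.g. the gravity computation ... */
    MPE_Log_event(ev_end, 0, NULL);    /* state ends */

    MPE_Finish_log("nbody");           /* writes nbody.clog2 for Jumpshot */
    MPI_Finalize();
    return 0;
}
```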
Notice how chaotic it is and how hard it is to tell what's going on. With only 10 bodies per thread, each thread doesn't have enough work to do between communications. Here is the same configuration but with granularity 100 (hence 400 bodies overall):
Blue indicates the computation that combines remote bodies with local bodies, and green is the local-local gravity computation. This distinction is useful because the green stage is the last part done in each round, which lets us see where the rounds end.
This run shows very high efficiency: there is enough computational work to fill each round. Zooming in, we see very little time devoted to communication or blocking (yellow).
Taking this a step further by using 32 threads (on four machines), ten iterations, and granularity 2048 (hence 65536 bodies overall!), we see that virtually all time is spent computing. (See image below left.) Zooming in on this (image on right) we see no yellow.
Hence, when there is enough work for each thread, the n-body program achieves very good efficiency.
Examining the communication structure
The trace of the n-body program with 32 threads (on 4 machines, 10 iterations and granularity 32) is quite complex, and it's difficult to see the structure of communication.
Zooming in, however, we can see many interesting details. The image below and to the left is from the same run as the image above (granularity 32); the image on the right is very similar, but comes from a run with granularity 2048.
Notice how the ring communication mechanism allows the rounds to start overlapping (on the left), since each process begins working the moment it receives its data. In the second case this shifting still occurs, but much less: computation takes comparatively longer, so a "faster" process is less likely to be fast enough to jump ahead.
Finally, we look at the same example (granularity 2048) near initialization (left) and finalization (right). Initialization doesn't take long, comparatively speaking, when there is a fair amount of computation to do. Finalization doesn't take long either, especially considering that all processes are communicating their final data to process 0 for printing.
One interesting aspect of these images is how the ends of the rounds coincide at first and then drift apart as time progresses. Near the end they are offset more, which is bound to happen with ring communication; but they are also clearly offset into four groups, presumably corresponding to the higher communication latency between the four machines that the threads were running on!
- assign2-mpe-ins-diff.txt (5.1 KB): Source code diff showing the changes required to add manual MPE events
- assign2-mpe-ins.tar.gz (252 KB): A2 source modified for manual MPE events, plus input files
- assign2-mpe-clogs.tar.gz (514 KB): The generated clog2 logs that were used for the jumpshot visualizations