CPSC 521 assignment 4 (MPE visualizations)


In this assignment, I profiled the n-body program from assignment 2 using MPE and visualized the results with Jumpshot. I then manually added MPE events to the code to get more useful visualizations. The results follow.


Scaling up amount of work to approach 100% efficiency

These first two visualizations show assignment 2 running on four processes for ten iterations with granularity 10 (hence 40 bodies). The image on the left uses the automatic MPE visualization, and the image on the right uses hand-coded events (some initialization is cut off so that the scale matches the image on the left). Red and orange are init and finalize; green and blue are computation; and yellow is communication.

Generated from auto-n4-g10.clog2 and ins-n4-g10.clog2

Notice how chaotic the trace is; it is hard to figure out what's going on. This is because with only 10 bodies each, the processes don't have enough work to do. Here are the same settings but with granularity 100 (hence 400 bodies overall):

Generated from auto-n4-g100.clog2 and ins-n4-g100.clog2

Blue indicates computation involved in combining remote bodies with local bodies, and green is the local-local gravity computation. This is useful because the green stage is the last part done in each round, which allows us to see where the rounds end.

This run now shows very high efficiency: there is enough computational work to do in each round. Zooming in, we see very little time devoted to communication/blocking (yellow).

Taking this a step further with 32 processes (on four machines), ten iterations, and granularity 2048 (hence 65536 bodies overall!), we see that virtually all time is spent computing (see image below left). Zooming in (image on right), we see no yellow.

Generated from ins-n32-g2048.clog2

Hence, when each process has enough work to do, the n-body program has very good efficiency.
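
The trend across these runs can be illustrated with a rough cost model. The sketch below is not taken from the assignment code: it assumes that each process does work proportional to the square of the granularity per round (all pairs between two blocks of bodies), while communication costs a fixed latency plus a per-body transfer cost. The function names and the constants t_pair, t_latency, and t_per_body are hypothetical placeholders in arbitrary time units, not measurements.

```python
def total_bodies(num_procs, granularity):
    """Each process owns `granularity` bodies."""
    return num_procs * granularity


def round_efficiency(granularity, t_pair=1.0, t_latency=500.0, t_per_body=1.0):
    """Fraction of one round spent computing under the assumed cost model.

    All three constants are made-up placeholders (arbitrary time units).
    """
    # Assumed: pairwise force work between two blocks of `granularity` bodies.
    compute = t_pair * granularity ** 2
    # Assumed: one message per round, fixed latency plus a per-body cost.
    comm = t_latency + t_per_body * granularity
    return compute / (compute + comm)


if __name__ == "__main__":
    for procs, g in [(4, 10), (4, 100), (32, 2048)]:
        print(f"procs={procs:2d} granularity={g:5d} "
              f"bodies={total_bodies(procs, g):6d} "
              f"modelled efficiency={round_efficiency(g):.3f}")
```

Because computation grows quadratically with granularity and communication only linearly in this model, the modelled efficiency rises toward 1 as granularity increases, which is the same direction as the traces above. The actual numbers depend entirely on the placeholder constants.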

Examining the communication structure

The trace of the n-body program with 32 processes (on four machines, ten iterations, and granularity 32) is quite complex, and it's difficult to see the structure of communication.

Zooming in, however, we can see many interesting details. The image below and to the left is from the same run as the image above (granularity 32); the image on the right is a very similar view, but from a run with granularity 2048.

Generated from ins-n32-g32.clog2 and ins-n32-g2048.clog2

Notice how the ring communication mechanism allows the rounds to start overlapping (on the left), as processes begin working the moment they receive their data. In the second case this shifting does occur, but much less, since computation takes comparatively longer and there is less likelihood of "faster" processes being fast enough to jump ahead.

Finally, we look at the same example (granularity 2048) near initialization (left) and finalization (right). Initialization doesn't take long, comparatively speaking, when there is a fair amount of computation to do. Finalization doesn't take long either, especially considering that all processes are communicating their final data to process 0 for printing.

Generated from ins-n32-g2048.clog2

One interesting feature of these images is how the ends of rounds start out coinciding and then shift as time progresses. Near the end they are offset more, which is bound to happen in ring communication; but they are also clearly offset in four groups, presumably corresponding to the higher communication times between the four machines that the processes were running on!
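
The grouping effect can be reproduced with a toy model of ring communication. This is only a sketch under stated assumptions, not the assignment's code: every process is assumed to take the same compute time per step, to wait for a buffer from its left neighbour before each step after the first, and to pay a higher latency when that neighbour is on a different machine. The names simulate_ring and link_latency, and all timing values, are hypothetical.

```python
def link_latency(src, dst, procs_per_machine, lat_local, lat_remote):
    """Assumed latency: cheaper when both ranks sit on the same machine."""
    same_machine = src // procs_per_machine == dst // procs_per_machine
    return lat_local if same_machine else lat_remote


def simulate_ring(num_procs, procs_per_machine, steps,
                  compute, lat_local, lat_remote):
    """Return finish times per step; history[s][p] is when rank p ends step s."""
    # Step 0 uses only local data, so every rank finishes together.
    finish = [compute] * num_procs
    history = [finish[:]]
    for _ in range(1, steps):
        nxt = []
        for p in range(num_procs):
            left = (p - 1) % num_procs
            # The buffer leaves the left neighbour when it finishes its step.
            arrival = finish[left] + link_latency(
                left, p, procs_per_machine, lat_local, lat_remote)
            # A rank starts as soon as it is free and its data has arrived.
            start = max(finish[p], arrival)
            nxt.append(start + compute)
        finish = nxt
        history.append(finish[:])
    return history


if __name__ == "__main__":
    # 32 ranks on 4 machines; time units and values are arbitrary.
    hist = simulate_ring(32, 8, 4, compute=100, lat_local=1, lat_remote=10)
    last = hist[-1]
    for machine in range(4):
        print(f"machine {machine}:", last[machine * 8:(machine + 1) * 8])
```

In this model the round ends coincide at first and then drift apart, and the drift pattern repeats once per machine, because the only slow links are the four that cross a machine boundary. That is consistent with the four groups seen in the trace, though the real program's timing is certainly more complicated than this.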


Page generated on Tue Oct 24 00:35:30 2017