CPSC 521 assignment 2
Assignment description
This is assignment 1, generalized so that each process can handle multiple physics bodies. This page contains some timing information and a visualization program as well. The matrix multiplication part of the assignment is available here.
Assignment tweaks
This assignment retains several features from assignment 1, including SCALAR and SCALAR_MPI_TYPE for switching between single- and double-precision floating point, as well as VISUALIZE_OUTPUT for use with the visualizer and DEBUG_OUTPUT for debugging purposes. It also has a DO_TIMING feature, which enables printing the wall-clock time to standard error.
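These are compile-time switches; a minimal sketch of how they might be wired together follows. The build command, the typedef, and where the timing calls sit are assumptions made for illustration, not the actual structure of nbody.c.

```c
#include <stdio.h>
#include <mpi.h>

/* Hypothetical wiring of the compile-time switches (illustrative only).
 * Build with something like:
 *   mpicc -DSCALAR=float -DSCALAR_MPI_TYPE=MPI_FLOAT -DDO_TIMING nbody.c */
#ifndef SCALAR
#define SCALAR double              /* default: double precision           */
#define SCALAR_MPI_TYPE MPI_DOUBLE /* matching MPI datatype for messages  */
#endif
typedef SCALAR scalar;

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#ifdef DO_TIMING
    double t_start = MPI_Wtime(); /* wall-clock start                     */
#endif

    /* ... the simulation rounds would run here, exchanging body data
     *     with SCALAR_MPI_TYPE and reporting under VISUALIZE_OUTPUT ...  */

#ifdef DO_TIMING
    MPI_Barrier(MPI_COMM_WORLD);  /* measure the same span on every rank  */
    if (rank == 0)
        fprintf(stderr, "%f\n", MPI_Wtime() - t_start);
#endif

    MPI_Finalize();
    return 0;
}
```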
Visualization
Just for fun, I implemented a visualization program for this assignment. If VISUALIZE_OUTPUT is enabled, the assignment reports the positions of the bodies every so often. This output can be fed into the visualizer (directly or through ssh), and you will see the state of the system updating in real time. A configuration might look something like this:
See Assignment 4 for more visualizations.
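The reporting itself can be as simple as a periodic dump of positions to standard output, which the visualizer reads from its stdin (locally, or at the far end of an ssh pipe). The sketch below is purely illustrative: the function name, field layout, and text format are invented, not the actual protocol used by nbody.c and the visualizer.

```c
#include <stdio.h>

/* Hypothetical position report, emitted every REPORT_INTERVAL rounds when
 * the program is built with VISUALIZE_OUTPUT.  The "round x y" text format
 * is an assumption; the real nbody.c / visualizer protocol may differ. */
#define REPORT_INTERVAL 10

static void report_positions(int round, int n_local,
                             const double *x, const double *y)
{
    if (round % REPORT_INTERVAL != 0)
        return;
    for (int i = 0; i < n_local; i++)
        printf("%d %g %g\n", round, x[i], y[i]);
    fflush(stdout);  /* keep the stream flowing so the viewer updates live */
}
```

Flushing after each report matters when the output is piped over ssh, since block buffering would otherwise delay the on-screen updates.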
Performance measurements
(Note: all times in this section are the average of three runs on non-head node(s).)
Parallelism and efficiency
My original implementation used an inefficient method of computing forces amongst process-local bodies, which involved n(n-1) gravity calculations. With this version, timing information went like this (with 1024 bodies and 1000 rounds):
Cores*Machines | Time (seconds) | Speedup factor | Efficiency |
---|---|---|---|
1*1 | 54.306445 | 1.000 | 100.000% |
2*1 | 26.915745 | 2.018 | 100.900% |
4*1 | 13.418258 | 4.047 | 101.175% |
8*1 | 6.804002 | 7.982 | 99.775% |
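For reference, the speedup and efficiency columns are just ratios against the single-core run; taking the 8-core row as an example:

```latex
S_8 = \frac{T_1}{T_8} = \frac{54.306445}{6.804002} \approx 7.982,
\qquad
E_8 = \frac{S_8}{8} \approx 0.998 \;(\text{the } 99.775\% \text{ in the table}).
```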
The efficiency in this case is practically 100% in every case. This is because the system does so much computation locally that the parallel communication cost is negligible. (Notice two cases of an efficiency greater than 100%: these are probably artifacts of caching or random measurement error speeding up the execution.) An optimized version which performs approximately n(n-1)/2 local calculations (a code sketch of this optimization appears below) runs much faster:
Cores*Machines | Time (seconds) | Speedup factor | Efficiency |
---|---|---|---|
1*1 = 1 | 33.240994 | 1.000 | 100.000% |
2*1 = 2 | 21.651886 | 1.535 | 76.750% |
4*1 = 4 | 12.110241 | 2.745 | 68.625% |
8*1 = 8 | 6.414370 | 5.182 | 64.775% |
8*2 = 16 | 3.486782 | 9.533 | 59.581% |
8*4 = 32 | 1.943801 | 17.101 | 53.441% |
Here there is more parallel work and less time wasted locally, and the efficiency decreases as Amdahl's Law dictates; the same information is also plotted as a graph.
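The optimization behind the second table is the usual pairwise-symmetry trick: visit each unordered pair of local bodies once and apply the computed force to both bodies with opposite signs, giving roughly n(n-1)/2 gravity evaluations instead of n(n-1). A minimal sketch follows, assuming a simple 2-D body struct; the type and field names are not taken from the actual nbody.c.

```c
#include <math.h>

/* Illustrative body type -- the real nbody.c layout may differ. */
typedef struct { double x, y, fx, fy, mass; } body;

#define G 6.674e-11

/* Accumulate pairwise gravitational forces among n local bodies.
 * Each unordered pair (i, j) is visited once, and Newton's third law
 * gives the reaction force for free: roughly n*(n-1)/2 evaluations
 * instead of n*(n-1). */
static void local_forces(body *b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double r2 = dx * dx + dy * dy;
            double r  = sqrt(r2);
            double f  = G * b[i].mass * b[j].mass / r2; /* magnitude */
            b[i].fx += f * dx / r;  /* force on i, directed toward j */
            b[i].fy += f * dy / r;
            b[j].fx -= f * dx / r;  /* equal and opposite force on j */
            b[j].fy -= f * dy / r;
        }
    }
}
```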
Changing the number of iterations
This is the system running on 8 cores of one machine, with different numbers of iterations.
Iterations | Time (seconds) |
---|---|
256 | 1.666184 |
512 | 3.305188 |
1024 | 6.567711 |
2048 | 13.103644 |
4096 | 26.313928 |
This is the same system running on 16 cores over two machines.
Iterations | Time (seconds) |
---|---|
256 | 0.910340 |
512 | 1.801941 |
1024 | 3.561604 |
2048 | 7.097233 |
4096 | 14.195201 |
This is pretty much linear growth. Here's a graph.

The growth here is linear because the startup and shutdown times are negligible: most of the execution time is spent in the main simulation phase. The amount of work in that phase scales linearly with the number of iterations, so the total time taken to solve the problem grows linearly as well.
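As a rough check, a linear model with negligible fixed cost fits the 8-core numbers well, assuming the time per iteration stays constant:

```latex
t_{\text{iter}} \approx \frac{6.567711\ \text{s}}{1024} \approx 6.41\ \text{ms},
\qquad
2048 \times 6.41\ \text{ms} \approx 13.1\ \text{s} \quad (\text{measured: } 13.103644\ \text{s}).
```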
Downloads
- nbody.c as HTML: web page with source code for assignment 2
- assign2.tar.gz (281 KB): Assignment 2, a chunking n-body simulation written with MPI
- assign2-visualize.tar.gz (15 KB): Simple OpenGL visualization program for assignments 1 and 2