CPSC 521 assignment 2


Assignment description

This is assignment 1, generalized so that each process can handle multiple physics bodies. This page contains some timing information and a visualization program as well. The matrix multiplication part of the assignment is available here.
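As a rough illustration of what "multiple bodies per process" means, here is a minimal sketch of a block distribution of bodies across MPI ranks. The variable names (n_bodies, local_count, first) and the hard-coded body count are illustrative assumptions, not the assignment's actual identifiers; the real program reads its bodies from an input file.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int n_bodies = 1024;               /* total bodies in the system          */
        int base = n_bodies / size;        /* bodies every rank gets              */
        int rem  = n_bodies % size;        /* leftover bodies go to the low ranks */
        int local_count = base + (rank < rem ? 1 : 0);
        int first = rank * base + (rank < rem ? rank : rem);  /* global index of this rank's first body */

        printf("rank %d owns bodies %d..%d\n", rank, first, first + local_count - 1);

        /* ... read bodies [first, first + local_count) and run the simulation ... */

        MPI_Finalize();
        return 0;
    }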

Assignment tweaks

This assignment keeps several features from assignment 1, including the SCALAR and SCALAR_MPI_TYPE switches for choosing between single- and double-precision floating point, as well as VISUALIZE_OUTPUT for use with the visualizer and DEBUG_OUTPUT for debugging. It also adds a DO_TIMING flag, which prints the wall-clock time to standard error.
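A minimal sketch of how these switches might be wired up; the macro names come from the list above, but the surrounding main() is only illustrative and the real code uses them differently in places.

    #include <mpi.h>
    #include <stdio.h>

    #ifndef SCALAR                        /* default: double precision; build with      */
    #define SCALAR double                 /* -DSCALAR=float -DSCALAR_MPI_TYPE=MPI_FLOAT */
    #define SCALAR_MPI_TYPE MPI_DOUBLE    /* to get single precision instead            */
    #endif

    static void simulate(void) { /* the main n-body loop would go here */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
    #ifdef DO_TIMING
        double t_start = MPI_Wtime();     /* wall-clock start */
    #endif
        simulate();
    #ifdef DO_TIMING
        fprintf(stderr, "wall-clock time: %f seconds\n", MPI_Wtime() - t_start);
    #endif
        MPI_Finalize();
        return 0;
    }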

Visualization

Just for fun, I implemented a visualization program for this assignment. If VISUALIZE_OUTPUT is enabled, the simulation reports the positions of the bodies every so often. This output can be fed into the visualizer (directly or through ssh), and you will see the state of the system updating in real time. A configuration might look something like this:
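The reporting side is nothing fancy; it amounts to something along these lines. The body_t struct, the function name, and the "index x y" output format are illustrative assumptions here, not necessarily what my visualizer actually expects.

    #include <stdio.h>

    typedef struct { double x, y; } body_t;   /* hypothetical body layout */

    /* Called every few rounds when VISUALIZE_OUTPUT is enabled: print one
     * "index x y" line per local body so the visualizer, reading the stream
     * directly or over ssh, can update its display in real time. */
    void report_positions(const body_t *bodies, int local_count, int first)
    {
        for (int i = 0; i < local_count; i++)
            printf("%d %f %f\n", first + i, bodies[i].x, bodies[i].y);
        fflush(stdout);   /* flush so the visualizer sees updates promptly */
    }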

See Assignment 4 for more visualizations.

Performance measurements

(Note: all times in this section are the average of three runs on non-head node(s).)

Parallelism and efficiency

My original implementation used an inefficient method of computing forces amongst process-local bodies, which involved n(n-1) gravity calculations. With this version, timing information went like this (with 1024 bodies and 1000 rounds):

Cores*Machines   Time (seconds)   Speedup factor   Efficiency
1*1              54.306445        1.000            100.000%
2*1              26.915745        2.018            100.900%
4*1              13.418258        4.047            101.175%
8*1               6.804002        7.982             99.775%

The efficiency in this case is practically 100% across the board, because each process does so much local computation that the parallel communication cost is negligible. (Two rows even show efficiency above 100%; this superlinear speedup is probably an artifact of caching or of random measurement error.) An optimized version, which exploits the symmetry of pairwise forces to perform only about n(n-1)/2 local gravity calculations (both force loops are sketched after the table below), runs much faster:

Cores*Machines   Time (seconds)   Speedup factor   Efficiency
1*1 =  1         33.240994         1.000           100.000%
2*1 =  2         21.651886         1.535            76.750%
4*1 =  4         12.110241         2.745            68.625%
8*1 =  8          6.414370         5.182            64.775%
8*2 = 16          3.486782         9.533            59.581%
8*4 = 32          1.943801        17.101            53.441%

Here there is more parallel work and less time wasted on redundant local computation, so communication accounts for a larger share of the run time and the efficiency decreases as Amdahl's Law dictates. The same information is also available in graph form.
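For reference, the difference between the two local force passes can be sketched like this. A 2-D body struct and a bare inverse-square force are assumed; the assignment's actual data layout and gravity routine differ in the details, and both loops assume the force accumulators were zeroed beforehand.

    #include <math.h>

    typedef struct { double x, y, m, fx, fy; } body_t;   /* hypothetical layout */

    /* Naive local pass: every ordered pair (i, j) with i != j is evaluated,
     * i.e. n*(n-1) gravity calculations. */
    void forces_naive(body_t *b, int n, double G)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
                double r2 = dx * dx + dy * dy, r = sqrt(r2);
                double f = G * b[i].m * b[j].m / r2;
                b[i].fx += f * dx / r;
                b[i].fy += f * dy / r;
            }
    }

    /* Symmetric local pass: each unordered pair is evaluated once and the
     * force applied to both bodies, i.e. n*(n-1)/2 gravity calculations. */
    void forces_symmetric(body_t *b, int n, double G)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
                double r2 = dx * dx + dy * dy, r = sqrt(r2);
                double f = G * b[i].m * b[j].m / r2;
                b[i].fx += f * dx / r;  b[i].fy += f * dy / r;
                b[j].fx -= f * dx / r;  b[j].fy -= f * dy / r;
            }
    }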

Changing the number of iterations

This is the system running on 8 cores of one machine, with different numbers of iterations.

Iterations   Time (seconds)
  256         1.666184
  512         3.305188
 1024         6.567711
 2048        13.103644
 4096        26.313928

This is the same system running on 16 cores over two machines.

Iterations   Time (seconds)
  256         0.910340
  512         1.801941
 1024         3.561604
 2048         7.097233
 4096        14.195201

This is pretty much linear growth. Here's a graph.

The growth here is linear because the startup and shutdown times are negligible: almost all of the execution time is spent in the main simulation phase, and the amount of work in that phase scales linearly with the number of iterations, so the total running time grows linearly as well.
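As a sanity check, modelling the run time as a fixed overhead plus a per-iteration cost fits the measurements above well; using the 8-core, 1024-iteration run to estimate the per-iteration cost (and treating the fixed overhead as roughly zero):

    T(k) ≈ T_fixed + k * t_iter
    t_iter ≈ 6.567711 s / 1024 ≈ 6.4 ms per iteration (8 cores)
    T(4096) ≈ 4096 * 6.4 ms ≈ 26.3 s   (measured: 26.313928 s)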

Downloads
