CPSC 521 assignment 3

Introduction

In this assignment I ran the n-body program from assignment 2 with a number of different parameter values, and I've come up with a model to predict the program's runtime.

Gathering data

Firstly, I gathered data using two Perl scripts: ./tests-model.pl runs the program on the Cartesian product of the sets of valid parameters and writes the data into an "out/" directory, and values.pl (run as "cd out ; ../values.pl") averages the runs and displays the data in a nice awk-able list.
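
To give an idea of what the driver does, here is a minimal sketch of its core loop. The mpirun invocation, program name, and output file naming here are placeholders of mine, not the real script:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @procs  = (1, 2, 4, 8, 16);
    my @rounds = (256, 512, 1024, 2048, 4096);
    my @bodies = (16, 64, 256);

    # Run every parameter combination three times, saving each run's
    # timing output under out/ for later averaging by values.pl.
    for my $p (@procs) {
        for my $r (@rounds) {
            for my $b (@bodies) {
                for my $run (1 .. 3) {
                    system("mpirun -np $p ./nbody $r $b > out/$p-$r-$b.$run") == 0
                        or die "run $p/$r/$b failed: $?";
                }
            }
        }
    }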

The results are tabulated below. Note that the first Perl script runs each configuration three times, so all times listed are the average of three runs. Also, all times are in seconds.

Processes  Rounds  Bodies  SCATTER   SIMULATE  GATHER    Total time
1          256     16      0.000001  0.003847  0.000078  0.004184
1          256     64      0.000003  0.046733  0.000423  0.047514
1          256     256     0.000006  0.541846  0.000702  0.543214
1          512     16      0.000001  0.007457  0.000079  0.007770
1          512     64      0.000004  0.077379  0.000188  0.077926
1          512     256     0.000006  1.073659  0.000645  1.074997
1          1024    16      0.000001  0.014569  0.000082  0.014884
1          1024    64      0.000003  0.146476  0.000169  0.146988
1          1024    256     0.000006  2.136055  0.000645  2.137334
1          2048    16      0.000001  0.027520  0.000055  0.027803
1          2048    64      0.000005  0.278536  0.000168  0.279063
1          2048    256     0.000006  4.233381  0.000661  4.234804
1          4096    16      0.000001  0.042325  0.000042  0.042564
1          4096    64      0.000003  0.544819  0.000167  0.545286
1          4096    256     0.000007  8.480394  0.000651  8.481804
2          256     16      0.000015  0.002573  0.000083  0.002889
2          256     64      0.000018  0.033942  0.000541  0.034875
2          256     256     0.000025  0.370255  0.000706  0.371784
2          512     16      0.000019  0.005924  0.000083  0.006267
2          512     64      0.000017  0.060229  0.000454  0.061060
2          512     256     0.000022  0.727282  0.001369  0.729381
2          1024    16      0.000017  0.011709  0.000090  0.012072
2          1024    64      0.000019  0.111724  0.000203  0.112302
2          1024    256     0.000018  1.424505  0.000732  1.425866
2          2048    16      0.000017  0.022952  0.000077  0.023278
2          2048    64      0.000019  0.204080  0.000521  0.204984
2          2048    256     0.000019  2.829395  0.001377  2.831503
2          4096    16      0.000016  0.035831  0.000043  0.036115
2          4096    64      0.000017  0.368115  0.000496  0.368979
2          4096    256     0.000021  5.636312  0.002274  5.639312
4          256     16      0.000018  0.002974  0.000094  0.003310
4          256     64      0.000023  0.022177  0.001013  0.023615
4          256     256     0.000024  0.211041  0.001379  0.213251
4          512     16      0.000014  0.005538  0.000087  0.005827
4          512     64      0.000022  0.038867  0.000184  0.039432
4          512     256     0.000017  0.413747  0.000714  0.415056
4          1024    16      0.000017  0.010960  0.000095  0.011287
4          1024    64      0.000020  0.073717  0.000259  0.074335
4          1024    256     0.000024  0.808692  0.002510  0.812047
4          2048    16      0.000017  0.020531  0.000075  0.020837
4          2048    64      0.000019  0.117885  0.000355  0.118612
4          2048    256     0.000023  1.607435  0.001277  1.609456
4          4096    16      0.000015  0.033451  0.000131  0.033814
4          4096    64      0.000019  0.234882  0.000367  0.235619
4          4096    256     0.000020  3.199621  0.001926  3.202188
8          256     16      0.000020  0.014608  0.000149  0.015021
8          256     64      0.000022  0.016415  0.000598  0.017420
8          256     256     0.000040  0.135312  0.001627  0.137815
8          512     16      0.000017  0.031176  0.000163  0.031578
8          512     64      0.000020  0.044516  0.000572  0.045450
8          512     256     0.000039  0.241656  0.002078  0.244588
8          1024    16      0.000019  0.035425  0.000235  0.035949
8          1024    64      0.000021  0.059025  0.000695  0.060128
8          1024    256     0.000038  0.453383  0.002560  0.456737
8          2048    16      0.000018  0.057906  0.000066  0.058220
8          2048    64      0.000022  0.085636  0.000499  0.086522
8          2048    256     0.000031  0.889108  0.002080  0.891838
8          4096    16      0.000019  0.073826  0.000075  0.074157
8          4096    64      0.000021  0.159661  0.000213  0.160245
8          4096    256     0.000040  1.736864  0.001207  1.738913
16         256     16      0.000023  0.042682  0.000161  0.043117
16         256     64      0.000025  0.049997  0.000227  0.050616
16         256     256     0.000030  0.100953  0.000733  0.102539
16         512     16      0.000025  0.070969  0.000190  0.071444
16         512     64      0.000024  0.096955  0.000387  0.097745
16         512     256     0.000029  0.174517  0.001707  0.177094
16         1024    16      0.000026  0.135784  0.000177  0.136278
16         1024    64      0.000025  0.162999  0.000276  0.163701
16         1024    256     0.000029  0.316240  0.002883  0.320007
16         2048    16      0.000026  0.261881  0.000238  0.262437
16         2048    64      0.000025  0.296954  0.000289  0.297650
16         2048    256     0.000028  0.603585  0.000784  0.605218
16         4096    16      0.000024  0.473794  0.000127  0.474218
16         4096    64      0.000028  0.565335  0.000481  0.566238
16         4096    256     0.000029  1.177287  0.000776  1.178917

Now we just have to make sense of all this data...

Theoretical model

A theoretical model for predicting the runtime of this program might look something like LogP: it would have parameters describing the speed of messages based on latency and bandwidth constraints, and it would also have estimates of how much computation is required and the asymptotic growth of that computation. For this particular system, however -- as shown by the MPE tests in assignment 4 -- computation time dwarfs communication time, even for systems of modest size.

For systems with very large latency or very small bandwidth it would be worth characterizing communication time, but for the purposes of this assignment I think it is better to concentrate on evaluating the computation time. It is clear from the problem that the work each process will have to do grows quadratically in the number of bodies it is responsible for, and linearly in the number of other processes (though this decreases the number of bodies the process is in charge of).
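
To make that scaling explicit: if each of the $p$ processes holds $b/p$ of the $b$ bodies and, in every round, computes pairwise interactions against each of the other processes' blocks, then the computation time should behave like

\[
T_{\mathrm{comp}}(r, p, b) \;\propto\; r \cdot (p - 1) \cdot \left(\frac{b}{p}\right)^2 \;\approx\; \frac{r\,b^2}{p}.
\]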

Empirical equation

I will derive a model from the measurements above. Firstly, it makes sense to attribute the runtime to the three stages SCATTER, SIMULATE, and GATHER. It will turn out that we can represent the time for SCATTER as a single constant $c$; SIMULATE as a function $f(r, p, b)$ depending on the number of rounds $r$, processes $p$, and bodies $b$; and GATHER as a function $g(b)$. So the total time will be described by the formula $T(r, p, b) = c + f(r, p, b) + g(b)$.

Now we will figure out the constant $c$ and the functions $f$ and $g$ using our empirical data. Again, just from the timing information we cannot distinguish between communication time and computation time, but from the MPE graphs we can see it will be primarily the latter.

Determining parameters of the model

SCATTER: The initialization time appears to increase if the number of processes $p$ is greater than one (which makes sense), and it also increases slightly based on the number of bodies $b$, but perhaps only at a rate of $O(\log b)$. However, the total SCATTER time is so small that I'm going to represent this as a constant, namely the average of all SCATTER times: $c = 0.000018$.

GATHER: Surprisingly, the termination time does not seem to depend much on the number of processes. (Example: with $r = 2048$ and $b = 256$, the times bounce up and down as $p$ increases: 0.000661, 0.001377, 0.001277, 0.002080, 0.000784.) But the GATHER time clearly depends on the number of bodies, which makes sense since the GATHER time includes communicating the last values back to process 0 for printing. Presumably this overshadows the actual time taken to shut down the processes. Now, at 16 bodies the average GATHER value is 0.000111; at 64, 0.000390; at 256, 0.001360. The average increase is by a factor of 3.5 for every 4-fold increase in $b$, so a good approximation of the GATHER time is $g(b) = 0.000111 \cdot 3.5^{\log_4(b/16)}$.
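
As a quick sanity check of this formula at $b = 64$:

\[
g(64) = 0.000111 \cdot 3.5^{\log_4(64/16)} = 0.000111 \cdot 3.5 \approx 0.000389,
\]

which is within half a percent of the measured average of 0.000390.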

SIMULATE: This is the most complex term, of course. First of all, the SIMULATE time seems to grow at precisely the same rate as the number of rounds $r$, which makes perfect sense. And as the number of processes doubles, the time decreases by an average factor of 1.65 (in fact the factor decreases as $p$ gets higher, but we won't try to model that here). This makes sense because as $p$ increases, each process has less work to do.

And of course, the SIMULATE time is strongly affected by the number of bodies. From the way the physics simulation works, we know that the runtime should increase quadratically as $b$ increases. Hence we'll model the SIMULATE time as $f(r, p, b) = 8.480394 \cdot (r/4096) \cdot (b/256)^2 / 1.65^{\log_2 p}$ (using the data point for one process, 4096 rounds, and 256 bodies, whose SIMULATE time is 8.480394).
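
As a check that the linear dependence on $r$ holds, evaluate the model at the single-process run with 2048 rounds and 256 bodies:

\[
f(2048, 1, 256) = 8.480394 \cdot \frac{2048}{4096} \cdot \left(\frac{256}{256}\right)^2 = 4.240197,
\]

versus a measured SIMULATE time of 4.233381 -- an error of less than 0.2%.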

The full empirical model

In case you actually wanted to see it, here is the full model just for reference:

$T(r, p, b) = 0.000018 + 8.480394 \cdot (r/4096) \cdot (b/256)^2 / 1.65^{\log_2 p} + 0.000111 \cdot 3.5^{\log_4(b/16)}$
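
For convenience, here is the same model as a short Perl subroutine (my own transcription of the formula above; note that Perl's log is the natural logarithm, hence the change-of-base divisions):

    use strict;
    use warnings;

    # T(r, p, b): predicted total runtime in seconds.
    sub predict {
        my ($r, $p, $b) = @_;
        my $scatter  = 0.000018;
        my $simulate = 8.480394 * ($r / 4096) * ($b / 256)**2
                     / 1.65 ** (log($p) / log(2));
        my $gather   = 0.000111 * 3.5 ** (log($b / 16) / log(4));
        return $scatter + $simulate + $gather;
    }

    # Reproduces the first predicted value in the table below:
    printf "%.9f\n", predict(256, 1, 1024);   # prints 8.485171125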

Using the model for prediction

Extrapolating the data outwards, I ran all the same tests as above except with 1024 bodies (which took 30 minutes!). Here are the results alongside the values predicted by the model. (Note: I recorded the SCATTER, SIMULATE, and GATHER times as well, but showing them here would be tedious; for details, see the raw data in Downloads.)

Processes  Rounds  Bodies  Total time  Predicted time    Real/predicted ratio
1          256     1024    8.516013    8.485171125       1.00363479705308
1          512     1024    17.002409   16.965565125      1.00217168568972
1          1024    1024    33.954689   33.926353125      1.0008352172394
1          2048    1024    67.228559   67.847929125      0.990871200742783
1          4096    1024    135.671593  135.691081125     0.999856378732939
2          256     1024    5.624537    5.14440985227273  1.09332987874501
2          512     1024    11.232056   10.2840425795455  1.09218295365094
2          1024    1024    22.442961   20.5633080340909  1.09140810237307
2          2048    1024    44.835804   41.1218389431818  1.09031612282587
2          4096    1024    89.461825   82.2389007613637  1.0878285601068
4          256     1024    3.157897    3.11970605061984  1.0122418422635
4          512     1024    6.305613    6.23463497623967  1.01138447142949
4          1024    1024    12.585573   12.4644928274793  1.00971400715589
4          2048    1024    25.148746   24.9242085299587  1.00900881044112
4          4096    1024    50.232564   49.8436399349174  1.00780288248592
8          256     1024    1.686749    1.89261283749687  0.891227707316441
8          512     1024    3.374380    3.78044854999374  0.892587203707768
8          1024    1024    6.724527    7.55611997498748  0.889944445331698
8          2048    1024    13.340032   15.107462824975   0.883009420876869
8          4096    1024    26.609197   30.2101485249499  0.880803249875586
16         256     1024    0.935629    1.14891998105871  0.814355233980554
16         512     1024    1.847917    2.29306283711742  0.805872813465065
16         1024    1024    3.603410    4.58134854923484  0.786539151360101
16         2048    1024    7.142472    9.15791997346967  0.779922954196107
16         4096    1024    14.338893   18.3110628219393  0.783072677945266

The prediction is quite good considering it is extrapolating out to a problem size four times larger than before; however, it tends to over-estimate for larger numbers of processes. This comes down to the parameter 1.65 inside $f(r, p, b)$: if we use 1.7 or 1.75, the prediction is much better for 1024 bodies. Still, I will leave the parameter at 1.65 because this predicts the smaller cases better. What I really need is a way to express a geometric series whose multiplication factor decreases continuously; perhaps something of the form $\prod_{i=1}^{\log_2 p} (a - \epsilon i)$ in place of $1.65^{\log_2 p}$ would work. But this model already seems complicated enough, so I'm going to leave it as it is!

Exactly how much of this time is due to communication and how much to computation would require a somewhat different kind of analysis: measuring the performance of the hardware we have access to in order to determine the parameters (or running on a different system with higher latencies). Still, as a predictor for the runtime on machines similar to this one, this model seems reasonable.

Downloads
