CPSC 521 assignment 2 (matrix)

Back to assignment list.

Assignment description

In this assignment I just took assignment 2, which solves the n-body problem, and changed the code to instead calculate powers of a matrix. The structure of the communication is the same, it's just what is done for computation that changed.

The most significant change was to enforce that the number of processes must be a power of two. This allowed each process to take an equal-sized square block of the matrix. It seemed like a good idea at the time, though I now realize that assigning a row (or column) to each process would be more efficient (see the next section).

Efficiency and ring communication challenges

The other main change was the following: it turns out that my ring communication from assignment 2 was too simplistic and had a bug. Every process would send on their own data, then receive data from the previous ring node. This hangs when MPI_Send() is blocking, as happens on our system when the amount of data to send is larger than 64KB. It was quite the mystery until I noticed that 90x90 blocks (memory size LaTeX bytes) would work while 91x91 blocks (memory size LaTeX bytes) did not; these numbers, of course, are on either side of 64K.

My solution was this: I just make process 0 do a recv before a send. This ensures that there will always be at least one process which can make progress, so no deadlock will occur. However, it also unnecessarily linearizes the communication. If there is enough work to do then the processes will begin to execute in a systolic, shifted fashion, and efficiency may be reasonably high but the system will waste time at the beginning, staggering itself.

I have since discovered that there are more efficient patterns to avoid this deadlock (e.g. even/odd staggered send/receive). But even making the communication nonblocking with MPI_Isend only improves the runtime by roughly 15%; there are other considerations that slow the program down so working on this was not a priority. UPDATE: this is implemented in assignment 7!

Case in point: although the program is improved by moderate parallelism, with too many processes it slows down dramatically.

$ mpiexec -n 1 ./matpow 1 32 in-medium-1 >/dev/null
Initialization time  0.000009
Simulation time      0.000433
Termination time     0.003950
Total walltime       0.004742
$ mpiexec -n 4 ./matpow 1 16 in-medium-1 >/dev/null
Initialization time  0.000014
Simulation time      0.000181
Termination time     0.002744
Total walltime       0.003290
$ mpiexec -n 16 ./matpow 1 8 in-medium-1 >/dev/null
Initialization time  0.000018
Simulation time      0.001058
Termination time     0.003826
Total walltime       0.005266
$ mpiexec -n 64 ./matpow 1 4 in-medium-1 >/dev/null
Initialization time  0.000041
Simulation time      1.401705
Termination time     2.500136
Total walltime       3.902264
$ 

This is entirely due to the communication structure of hooking together every block into one ring. Each block (owned by each process) will get sent to all other processes in the course of the ring communciation, which is way more communciation than is actually necessary. A process really only cares about the data which is in the same row or column.

The right way to do this would be to make each process own an entire row (or column) and send only the portions of each row that are needed to the other processes. This is complex non-ring communciation however. (My program is essentially the same as sending the whole row along, knowing that not all of it will be used, with the additional complication that there are LaTeX processes rather than LaTeX.) Another way of looking at it would be to establish independent MPI communicators per row and per column and establish a separate communication ring per row and per column. However, all of these schemes seemed quite complex so I left my solution as it is.

Verification of correctness

I created a 32-by-32 matrix of numbers from 0 to 9 (named LaTeX, below). Then I computed LaTeX using my matpow program in several ways, e.g.

$ mpiexec -n 1 ./matpow 1 32 in-medium-1  # one process, with the entire 32x32 matrix
$ mpiexec -n 4 ./matpow 1 16 in-medium-1  # 2x2 processes, each with a 16x16 block
$ mpiexec -n 16 ./matpow 1 8 in-medium-1  # 4x4 processes, each with an 8x8 block
$ mpiexec -n 64 ./matpow 1 4 in-medium-1  # 8x8 processes, each with a 4x4 block

Fortunately all the results were then same. Then I put the results (LaTeX) into an Octave script (doublecheck.m), which also computes the matrix power itself, and shows that the difference between the Octave-computed matrix and my computed matrix is a zero matrix.

$ cat doublecheck.m
A = [
    8 9 0 3 0 6 0 2 6 1 5 5 4 6 7 7 1 6 0 0 5 4 2 7 1 1 7 2 9 4 9 5 
    9 8 3 4 2 5 5 8 0 1 0 5 7 4 6 7 0 2 3 5 3 0 8 7 9 7 0 6 6 4 7 8 
    0 0 3 8 7 3 2 4 5 6 7 2 4 8 9 6 0 2 2 9 5 0 6 1 8 9 7 6 0 0 5 2 
    2 6 8 1 7 2 3 4 6 8 1 6 8 2 2 0 8 9 6 5 5 2 0 9 4 0 2 4 4 8 0 6 
    0 0 3 8 2 9 9 6 6 4 6 5 1 9 7 3 2 0 9 4 5 2 8 5 8 6 1 5 6 8 0 7 
    8 1 3 1 3 1 5 1 7 2 0 9 3 2 9 9 6 6 4 8 9 1 4 6 5 4 5 1 7 6 4 4 
    7 1 6 0 7 0 0 2 6 3 0 2 1 1 1 9 6 4 9 3 4 3 5 2 1 9 9 9 9 7 4 1 
    8 3 2 9 8 6 3 9 5 4 9 9 8 1 2 6 9 8 5 7 4 4 4 2 8 1 2 8 3 2 9 7 
    4 1 6 5 2 6 9 1 1 6 7 8 5 4 7 8 8 1 3 3 1 0 2 7 7 3 5 1 0 1 5 3 
    4 7 7 3 8 3 0 5 9 3 0 5 2 4 3 7 2 6 7 2 4 5 2 1 6 0 0 7 4 7 5 2 
    9 7 4 7 4 1 9 0 4 4 6 5 2 8 1 5 6 1 0 7 1 0 6 1 8 5 5 1 4 9 8 7 
    3 0 2 4 9 2 2 4 9 4 5 0 7 3 0 2 9 4 1 4 5 5 3 0 3 7 0 7 8 3 0 0 
    3 7 3 1 2 9 8 3 8 9 4 0 0 6 1 1 9 1 0 6 4 4 9 4 1 3 5 3 0 0 6 9 
    4 9 4 4 3 5 1 2 6 4 3 8 7 6 9 1 8 6 8 6 6 9 1 8 0 7 3 7 7 7 0 5 
    7 4 4 9 0 2 7 2 9 5 7 1 3 3 8 1 8 4 8 2 7 5 8 3 3 0 5 3 0 0 3 2 
    2 5 6 8 3 2 1 6 4 8 1 7 2 3 2 4 4 6 3 5 9 9 2 1 6 4 0 0 8 3 9 4 
    9 9 3 6 8 7 6 3 0 7 5 7 7 5 9 6 4 1 2 7 5 6 9 6 5 4 3 6 1 8 6 7 
    2 6 7 4 3 2 2 2 4 2 9 7 4 1 6 4 2 1 4 8 3 8 4 3 0 5 3 5 7 2 8 1 
    8 5 1 5 2 1 2 6 4 8 0 3 3 1 3 9 9 0 8 7 5 7 8 4 9 9 6 9 1 1 0 6 
    6 5 6 6 0 8 8 7 9 9 4 3 8 1 9 1 9 3 7 4 6 3 1 8 3 4 2 0 8 8 8 5 
    0 6 9 5 6 3 0 2 2 4 0 5 3 9 4 5 2 4 4 3 9 4 4 3 4 0 8 6 9 5 1 0 
    9 5 4 7 0 1 8 9 5 1 9 9 8 2 3 1 1 5 6 2 6 0 8 9 0 3 7 9 8 4 4 2 
    9 0 3 5 9 1 0 7 6 3 3 3 5 2 2 1 3 1 2 3 1 3 8 2 3 7 3 5 4 3 2 1 
    6 9 0 3 3 8 4 5 3 0 6 8 7 4 9 7 8 4 1 4 7 5 9 2 6 5 0 5 4 6 4 3 
    4 4 1 8 3 4 8 0 7 0 5 3 4 9 1 4 7 8 7 8 8 5 2 2 5 8 0 5 4 9 8 7 
    1 5 7 8 4 5 9 0 9 2 6 8 6 1 2 9 9 9 2 3 8 5 8 8 0 1 7 9 1 6 6 4 
    9 0 4 6 5 0 1 6 5 6 0 7 6 6 4 8 1 9 4 7 4 1 9 8 7 5 3 1 9 1 5 7 
    6 1 4 7 3 7 4 3 5 4 8 9 0 3 3 2 0 3 2 3 9 0 5 6 7 1 0 5 6 4 4 1 
    1 5 4 2 7 4 3 7 2 7 8 4 9 0 7 9 2 0 7 0 7 6 4 4 3 4 8 6 5 6 5 7 
    2 7 1 7 1 6 3 6 6 6 6 4 2 5 6 1 7 9 5 6 9 4 6 0 0 6 5 1 1 4 1 8 
    6 9 0 9 4 5 3 1 1 8 9 3 5 1 9 2 2 6 9 0 5 0 2 6 8 7 0 0 1 6 9 0 
    7 5 3 1 8 8 6 8 3 1 1 1 2 1 0 3 4 8 0 0 8 4 9 4 3 6 2 3 6 4 3 3 
];
B = [
    705 715 463 640 491 529 499 529 571 556 600 691 640 477 701 621 578 585 520 508 737 472 661 607 572 551 449 483 672 551 669 547
    766 725 484 746 617 684 580 626 789 597 650 683 665 505 661 662 769 709 606 594 891 574 753 615 624 667 472 665 701 669 754 628
    636 571 605 789 546 547 628 471 807 648 570 690 581 560 606 544 753 647 659 618 755 482 629 634 627 531 416 558 615 672 624 560
    630 668 497 660 582 681 581 610 666 598 626 651 584 575 707 662 669 492 558 668 752 525 779 517 630 648 456 669 587 589 601 608
    784 647 598 742 695 571 619 600 846 658 598 762 674 538 660 721 924 715 728 701 903 595 762 614 641 709 597 751 686 753 570 637
    660 673 557 751 576 602 540 565 732 674 635 647 655 543 726 671 738 598 615 574 844 617 711 569 565 622 539 593 710 609 673 545
    593 557 490 768 518 560 487 575 617 652 626 720 573 488 657 689 530 524 552 549 766 500 720 595 616 564 502 588 589 525 613 563
    920 826 633 901 755 775 771 690 871 804 820 862 757 663 759 735 867 720 712 787 933 600 836 756 827 744 522 730 834 850 886 753
    692 611 469 645 592 480 508 437 715 584 490 585 540 533 594 622 708 590 545 621 722 502 659 487 595 621 419 558 601 641 627 526
    580 567 462 738 476 570 533 564 603 590 637 674 558 522 659 572 577 503 605 596 733 455 615 546 647 608 412 586 575 529 583 542
    741 736 523 715 628 621 562 556 733 642 571 651 659 581 682 635 702 703 639 600 780 511 671 647 586 729 486 583 712 775 700 654
    496 525 507 606 471 565 590 434 561 551 601 658 505 476 553 543 552 393 486 500 616 399 590 558 480 412 467 552 485 583 540 542
    807 595 516 595 615 534 514 530 590 497 457 700 587 457 665 664 560 555 539 524 653 380 684 624 604 587 435 568 645 657 623 475
    839 825 626 807 608 723 740 706 821 712 749 843 783 588 838 755 836 634 660 710 942 607 941 768 664 669 623 807 732 691 675 701
    790 634 546 645 569 454 503 489 623 599 517 662 591 525 620 633 640 507 570 586 630 432 693 614 618 550 538 614 588 578 587 516
    597 689 569 721 631 544 538 598 661 620 644 674 707 486 643 595 625 612 664 559 771 496 607 601 595 569 474 679 642 658 644 520
    953 852 628 892 697 761 722 753 970 727 712 824 740 694 822 713 846 800 749 721 1009 601 942 756 742 765 568 760 876 844 786 734
    670 587 490 723 508 503 596 549 647 634 645 609 626 455 627 558 607 477 550 524 681 367 678 583 615 569 477 560 574 575 625 511
    842 699 585 810 634 642 615 675 785 627 581 808 688 546 600 715 721 706 595 628 870 574 800 707 680 576 471 736 729 691 737 639
    883 875 618 798 720 715 694 656 816 785 736 804 764 593 875 880 900 717 749 717 925 594 882 699 753 699 681 757 651 748 807 724
    541 549 523 679 553 498 437 584 636 611 536 661 613 564 672 564 536 518 613 618 756 482 649 558 597 588 451 618 652 548 490 549
    833 699 554 722 714 617 528 631 760 693 707 750 705 556 650 758 781 612 557 662 777 516 817 619 703 665 591 719 739 634 716 639
    536 460 392 606 456 525 495 458 606 485 569 582 503 443 502 490 544 461 438 448 563 360 571 518 480 415 398 504 469 497 509 504
    841 745 570 809 620 635 674 604 789 650 686 757 689 587 710 648 784 639 609 676 878 567 821 652 622 653 545 676 740 711 739 655
    752 835 649 772 606 682 680 570 750 730 681 803 705 557 760 686 835 690 725 680 914 599 753 764 590 728 613 706 748 806 695 652
    841 720 692 821 755 650 634 663 762 781 710 862 703 643 779 741 751 646 630 773 890 537 900 695 764 771 569 703 819 727 741 606
    685 740 580 758 681 683 561 658 820 678 712 724 740 542 713 674 789 662 649 615 843 679 715 631 616 661 491 685 743 678 743 611
    576 550 477 607 568 468 457 424 650 511 550 610 555 521 579 580 636 531 525 540 705 424 525 463 570 535 444 540 632 640 534 458
    788 678 564 833 648 613 643 691 772 702 631 736 588 612 611 711 713 663 637 664 873 516 901 585 763 645 531 678 683 630 696 676
    769 680 635 681 633 531 574 586 698 602 535 736 665 487 639 653 718 617 584 642 734 515 743 655 583 555 529 657 679 632 615 572
    690 779 492 698 500 515 579 442 709 567 549 630 559 542 627 647 727 603 635 611 735 499 692 552 639 577 452 606 537 679 641 605
    617 535 476 625 558 522 465 525 584 447 579 700 556 466 607 651 512 515 513 538 693 436 680 530 516 538 500 611 651 578 606 500
];
max(max(abs(A*A - B)))
$ octave -q doublecheck.m
ans = 0
$ 

As you can see, the difference matrix is entirely zero, so my matpow program computes the correct answer. I tried a similar setup with higher powers of the matrix, but the numbers become quite large so the details are not reproduced here.

Powers of a 1000-node matrix

In this part, I used data from a group at Stanford [J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012.]. They have collected, amongst other things, anonymized data on friends-relationships on Facebook. This has approximately 1000 nodes so I grabbed the information in the file 1912.edges and turned it into an undirected connectivity matrix, 1 for an edge and 0 for no edge.

Here is a simple example of what we can do with this. The original 8x8 matrix below (after 0 rounds) is a binary connectivity matrix; it shows when two nodes are connected directly with an edge. After one round, the matrix has been multiplied by itself once, and this shows the number of paths of length 2 from one node to another.

$ ./matpow 0 8 in-01
Initialization time  0.000001
Simulation time      0.000000
0 0 0 1 0 0 1 1
0 1 1 0 0 1 1 0
0 1 0 0 0 0 1 0
1 0 0 0 1 1 1 1
0 0 0 0 0 0 0 1
0 1 0 1 1 1 0 0
1 0 1 1 0 1 1 1
0 0 0 1 0 0 1 1
Termination time     0.000631
Total walltime       0.000711
$ ./matpow 1 8 in-01
Initialization time  0.000000
Simulation time      0.000013
2 0 1 2 1 2 3 3
1 3 2 2 1 3 3 1
1 1 2 1 0 2 2 1
1 1 1 4 1 2 3 4
0 0 0 1 0 0 1 1
1 2 1 1 2 3 2 2
2 2 1 4 2 3 5 4
2 0 1 2 1 2 3 3
Termination time     0.000685
Total walltime       0.000784
$ 

For example, node 0 isn't connected to node 0, but after the first power we see the number 2 in position LaTeX. This is because there are two paths connecting 0 to 0 of length 2: 0->3->0 and 0->6->0.

Now, as for the original graph itself, it is quite complicated. This is the attempt by neato (from Graphviz) to visualize it:

Notice however that there is one bunch of three nodes in the corner that is disconnected. We can examine what happens to (e.g.) node 1 as we raise powers of the matrix. Here sequences of at least three zeros are replaced by "0...0" so you can see what's going on.

$ mpiexec -n 16 ./matpow 0 186 graph1 2>/dev/null | head -n 2 | tail -n 1 | perl -pe 's/(0 ){2,}0/0...0/g'
0 0 1 0...0
$ mpiexec -n 16 ./matpow 1 186 graph1 2>/dev/null | head -n 2 | tail -n 1 | perl -pe 's/(0 ){2,}0/0...0/g'
0 1 0...0 1 0...0
$ mpiexec -n 16 ./matpow 2 186 graph1 2>/dev/null | head -n 2 | tail -n 1 | perl -pe 's/(0 ){2,}0/0...0/g'
0 0 2 0...0
$

At first, node 1 is connected to one other node (node 2); after one iteration, node 1 is connected to itself and to node 60, meaning there are paths of length 2 from 1->2->1 and 1->2->60. After two iterations, node 1 is only connected to node 2 again, and there are two such paths of length 3: 1->2->1->2, and 1->2->60->2. And so on.

The other edge weights of the graph explode combinatorially because the graph is so connected. For node zero after one iteration, the weights to other nodes are

109 0 0 47 74 60 8 21 0 32 1 1 19 60 17 0 13 0 1 60 23 19 0 1 14 50
0 0 17 0 0 1 26 49 56 1 1 4 69 1 47 54 0 3 27 3 1 48 58 0 16 14 49
0 1 1 0 20 2 0 0 15 1 17 1 33 28 0 0 21 0 0 1 22 16 1 2 1 11 0 15
1 1 0 23 0 1 32 29 0 1 27 19 18 29 34 11 25 1 20 65 0 26 38 18 11 5
3 0 12 0 30 1 1 1 22 0 25 7 45 24 49 3 1 13 70 1 0 0 3 1 57 55 1 0
47 1 0 1 1 65 4 0 14 0 51 52 64 2 1 19 1 1 0 17 20 1 1 5 1 2 68 2 0
64 61 1 0 12 0 0 1 0 68 1 1 0 1 2 72 1 0 1 0 2 1 0 1 4 14 1 6 0 26
65 1 0 0 0 1 1 1 9 75 1 0 0 1 30 52 1 0 0 0 0 1 1 55 2 61 34 0 0 2
30 62 44 49 1 22 65 1 1 3 4 48 51 7 0 2 48 61 1 46 0 65 0 8 0 25 0
21 0 31 0 11 0 1 1 0 0 55 0 29 1 0 64 2 41 2 0 0 63 0 0 63 2 4 2 76
1 55 1 45 17 0 48 52 0 28 67 0 1 19 56 0 2 1 0 3 8 2 9 1 0 4 30 1 0
2 0 1 4 10 60 70 16 9 0 1 1 1 1 34 4 2 19 0 1 0 19 2 61 6 19 1 0 42
47 0 11 52 0 2 1 55 0 1 1 0 27 1 1 0 0 30 0 6 43 82 0 1 3 39 8 1 1
1 4 63 24 0 59 69 23 27 0 1 1 64 1 3 60 25 53 9 6 14 2 0 1 0 59 2
21 6 1 0 0 2 5 0 7 3 0 37 0 1 4 65 0 0 5 1 3 20 0 51 63 1 35 1 46
59 3 0 67 21 55 0 1 0 1 41 41 19 16 55 1 0 0 52 0 1 1 13 1 0 1 0 0
1 3 1 28 25 6 21 38 1 1 0 54 0 1 0 0 39 29 15 0 1 0 1 74 27 1 3 0 9
11 1 1 2 48 2 24 0 56 28 10 1 0 0 0 0 5 1 1 2 1 0 0 8 2 9 5 44 0 46
41 0 0 1 0 4 1 2 2 5 1 0 3 1 27 16 61 0 16 0 2 0 0 1 0 46 1 25 1 2
0 55 1 0 0 0 0 52 25 0 55 3 70 1 30 0 1 17 1 63 38 24 2 41 1 0 2 13
3 0 1 0 0 1 0 0 0 2 1 65 1 0 19 1 14 0 40 0 0 18 21 0 0 0 1 1 59 52
66 64 46 0 1 0 40 1 14 15 0 1 4 0 0 5 4 0 0 16 8 9 0 54 0 0 1 41 26
0 0 0 3 62 16 0 1 26 0 0 1 2 0 0 0 65 0 4 0 1 3 1 0 0 1 0 2 59 2 0
14 5 1 0 1 0 0 1 1 0 4 52 24 1 12 1 0 46 52 1 0 56 1 42 0 57 74 7 2
2 1 17 2 6 1 1 43 21 10 2 5 0 0 1 52 54 1 1 21 0 0 3 1 1 9 15 1 57
1 1 0 15 0 10 19 4 1 24 24 2 48 51 0 27 3 21 24 63 0 0 12 47 1 0 0 20

There are clearly a lot of paths of length 2 in this graph.

Last note: the program takes less than eight seconds to compute this large matrix with zero, one, two, or three iterations (using 16 threads and granularity 186), which isn't too bad. Considering such an inefficient communication mechanism is being used.

Downloads

Page generated on Tue Oct 24 00:37:25 2017