LAB 2: MPI (C++)
Submission due 8/31/20 – 11:55 pm
NOTE: Skeletons are provided for Part B only. Data files will be provided for all problems.
NOTE: Use MPI_Scatter to distribute the data.
Part A (Emulate a distributed system using MPI messaging):
You are given a grayscale image with 8-bit pixels (as matrix D). Your goal is to compute an “intensity” histogram of D in a distributed way. Use the BFS algorithm (see below) to visit every node and establish parent-child relationships. A parent node collects the partial histograms of its children, adds its own, and sends the combined partial histogram to its parent. A leaf node just returns its own histogram.
Your program will emulate a distributed system by sending messages only between neighboring nodes. You are given a matrix A describing the connectivity of the distributed system (an adjacency bit matrix). Row i of A lists the nodes that node i is connected to (a 1 in position A[i,j] means node j is a neighbor of node i). A node can send messages only to its neighbors (per the adjacency matrix). You can use MPI_Recv with the source and tag specified as MPI_ANY_SOURCE and MPI_ANY_TAG. These are wildcards, used when the source is unknown (as with the first message from what will become a node's parent). After the blocking Recv returns, you can find the actual sender and tag in the “status” object returned by Recv. You can also use the non-blocking MPI_Irecv together with MPI_Iprobe. Send/Recv routines are the only MPI routines you may use after initialization.
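For example, a minimal sketch of receiving from an unknown neighbor and inspecting the status (the buffer size and variable names here are assumptions for illustration, not part of the assignment):

    MPI_Status status;
    int partial[256];                        // hypothetical receive buffer
    // Block until any neighbor sends a message, with any tag.
    MPI_Recv(partial, 256, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    int sender = status.MPI_SOURCE;          // who actually sent the message
    int tag    = status.MPI_TAG;             // which kind of message it is
    // ... dispatch on 'tag' and remember 'sender' if it becomes the parent ...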
An “intensity” histogram counts, for every pixel value between 0 and 255, the number of pixels in the image that have that intensity (i.e., that pixel value).
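The local histogram of a node's share of D can be computed serially, for example like this (a sketch; 'pixels' and 'blockSize' are hypothetical names):

    long hist[256] = {0};                    // one bin per intensity 0..255
    for (int k = 0; k < blockSize; ++k)
        hist[pixels[k]]++;                   // pixels[k] is an 8-bit value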
The computation is initiated by node X, where X is read from the input file.
Your implementation will take four command line arguments (in this order): the grayscale PGM image, the path to the text file with the adjacency matrix, the initiator rank, and the output file name.
Use the grayscale PGM image from Lab 1. You can use the code from the lab to read an image into the matrix D.
A sample text file with an adjacency matrix for 4 nodes looks as follows:
0 1 0 1
1 0 1 0
0 1 0 1
1 0 1 0
It shows that Node 1 is connected to Nodes 2 and 4, Node 2 to Nodes 1 and 3, Node 3 to Nodes 2 and 4, and Node 4 to Nodes 1 and 3 (nodes are numbered from 1 in this description; note that MPI ranks are 0-based).
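One possible way to load the matrix (a sketch, assuming whitespace-separated 0/1 entries and a known node count n):

    #include <fstream>
    #include <vector>

    // Reads an n x n adjacency bit matrix from a text file.
    std::vector<std::vector<int>> readAdjacency(const char* path, int n) {
        std::ifstream in(path);
        std::vector<std::vector<int>> A(n, std::vector<int>(n, 0));
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                in >> A[i][j];               // 1 means node j neighbors node i
        return A;
    }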
Print the histogram to the output file, one line per intensity value: line i contains the number of pixels with intensity i-1. Therefore, the output file MUST contain exactly 256 lines.
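Writing the file is then a simple loop (a sketch; 'hist' and 'outName' are hypothetical names):

    #include <fstream>

    std::ofstream out(outName);
    for (int v = 0; v < 256; ++v)
        out << hist[v] << "\n";              // one count per intensity value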
The BFS algorithm is shown at the end of this document.
Part B1 (Word Frequency: Reduction):
Implement a simple parallel version of an algorithm that counts word frequencies using MPI. To aggregate the results in the Master process you must use MPI_Reduce, whose signature is:
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
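For example, summing per-process word counts into the Master (rank 0) could look as follows (a sketch; 'localCounts', 'globalCounts', and 'numWords' are hypothetical names):

    // Each rank holds localCounts[i] = frequency of word i in its chunk.
    // MPI_Reduce sums the arrays element-wise into rank 0's globalCounts.
    MPI_Reduce(localCounts, globalCounts, numWords, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD);
    // Only rank 0's globalCounts is valid after the call.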
Part B2 (Word Frequency: Ring topology):
Use the same algorithm as in Part B1, but instead of calling MPI_Reduce, pass the partial results between tasks as messages. Use point-to-point communication between “adjacent” tasks, i.e., tasks i and i-1 (see the sketch after the next note).
Look at the provided skeleton for the command-line arguments and for choosing between Parts B1 and B2.
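A sketch of one possible ring-style pass toward the Master (rank 0); all names are hypothetical, and the direction (from rank size-1 down to rank 0) is an assumption:

    // Rank size-1 seeds the running total; every other rank receives it
    // from rank+1, folds in its own counts, and forwards it to rank-1.
    std::vector<int> running(numWords, 0);
    if (rank == size - 1) {
        running = localCounts;
    } else {
        MPI_Recv(running.data(), numWords, MPI_INT, rank + 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < numWords; ++i)
            running[i] += localCounts[i];
    }
    if (rank > 0)
        MPI_Send(running.data(), numWords, MPI_INT, rank - 1, 0,
                 MPI_COMM_WORLD);
    // Rank 0's 'running' now holds the global word counts.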
Timing (Part B1 and B2):
Report the timings for Parts B1 and B2 with 2, 4, 8, and 16 processes. Be careful when timing your code: time only the processing, not the reading of the input. Only the Master process should time the execution, not the worker processes. Create a graph from the timing data and explain what it shows.
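One way to time only the processing phase (a sketch using MPI_Wtime; 'rank' is assumed to come from MPI_Comm_rank):

    MPI_Barrier(MPI_COMM_WORLD);        // ensure every rank finished reading
    double t0 = MPI_Wtime();
    // ... distribute the text, count words, aggregate (B1 or B2) ...
    double t1 = MPI_Wtime();
    if (rank == 0)                      // only the Master reports the time
        std::cout << "processing time: " << (t1 - t0) << " s\n";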
Also explain:
1. What topology does MPI use to implement its Reduce?
2. How does the ring topology affect the runtime?
3. Is there a case where the ring topology may be faster?
4. How many messages are passed between machines when the program is run with 2, 4, 8, and 16 processes?
See below for the submission instructions. How to time the implementation will be covered in the discussion section.
NOTE: See the MPI How-To guide on the assignment web page before running your implementations on openlab servers.
Submit via Canvas: YOU MUST USE the file names below
1. Implementation of your C++ programs: ImplementationA.cpp and ImplementationB.cpp (the latter includes both B1 and B2)
2. Analysis of the timing for Part B1 and B2: Timing.pdf
Point Breakdown:
Part A: Implementation -> 40pts
Part B1 and B2: Implementation -> 50pts (25pts each), Timing -> 10pts
USEFUL LINKS:
NOTE: The TA cannot solve the exercises for you, but he can answer your questions!
The BFS Algorithm:
Assumption: An undirected network of processes.
The algorithm performs a complete traversal of the network, establishing the parent-child relationships described in Part A.
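Since the full listing does not appear here, the following is only an illustrative sketch of one common flooding-style traversal that matches the Part A description; the tags, message layout, and handling of duplicate SEARCH messages are assumptions, not the official algorithm:

    #include <mpi.h>
    #include <vector>

    enum { TAG_SEARCH = 1, TAG_RESULT = 2 };    // hypothetical message tags

    // One flooding-style traversal: builds a spanning tree rooted at the
    // initiator and converges 256-bin partial histograms toward the root.
    void traverse(int rank, int initiator, const std::vector<int>& neighbors,
                  std::vector<long>& hist /* in: local, out: combined */) {
        int parent = -1, dummy = 0;

        if (rank != initiator) {
            // The first SEARCH message received determines our parent.
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_SEARCH,
                     MPI_COMM_WORLD, &st);
            parent = st.MPI_SOURCE;
        }

        // Forward SEARCH to every neighbor except the parent.
        int pending = 0;
        for (int nb : neighbors)
            if (nb != parent) {
                MPI_Send(&dummy, 1, MPI_INT, nb, TAG_SEARCH, MPI_COMM_WORLD);
                ++pending;
            }

        // Each non-parent neighbor answers exactly once with a RESULT:
        // its subtree histogram if we are its parent, all zeros otherwise.
        std::vector<long> zeros(256, 0), partial(256);
        while (pending > 0) {
            int flag = 0;
            MPI_Status st;
            while (!flag)                       // per the handout: Iprobe
                MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                           &flag, &st);
            if (st.MPI_TAG == TAG_SEARCH) {
                // Duplicate SEARCH: we already have a parent, so answer
                // with a zero histogram so the sender can make progress.
                MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_SEARCH,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(zeros.data(), 256, MPI_LONG, st.MPI_SOURCE,
                         TAG_RESULT, MPI_COMM_WORLD);
            } else {
                MPI_Recv(partial.data(), 256, MPI_LONG, st.MPI_SOURCE,
                         TAG_RESULT, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                for (int v = 0; v < 256; ++v) hist[v] += partial[v];
                --pending;
            }
        }

        if (parent != -1)                       // report upward; the
            MPI_Send(hist.data(), 256, MPI_LONG, parent, TAG_RESULT,
                     MPI_COMM_WORLD);           // initiator keeps the total
    }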