Lab 7 - Map Reduce


Expand for the usual lab directions
  • Submit your completed lab online to CourSys (not Canvas, not GitHub).
    • Labs are marked on completion, not correctness, so complete each part to the best of your ability and learn.
    • Lab is due Sunday and may be submitted up to 3 days late. Further extensions possible only in exceptional cases.
  • It is recommended that students attend in-person lab sections for lots of support!
    • You are invited to come to any (or more than one) lab section.
    • There is no need to attend the section you are enrolled in.
    • There is no need to attend any lab sections: there is no attendance taken.
    • You can complete the lab on your own time or own computer if you like.
    • Time suggestions are given here to guide students who are working on the lab during lab times. However, the entire lab must be completed by everyone for marks.
  • While completing these labs, you are encouraged to help your classmates and receive as much help as you like. Assignments, however, are individual work and you must not work with another person on assignments.
    • I recommend coming to the lab and working with someone.
    • Failing that, I recommend getting together with someone in the class outside of lab time to work through the lab together.
    • Failing that, you are welcome to complete the lab exercise without any collaboration.
    • Each student must submit their lab to get marks.
    • Academic honesty expectations for labs are that each student types their own lab solution. It is OK to share ideas and help each others with specific coding issues and design. It is not permitted to submit another student's file for credit.
  • If using CSIL lab PC:
    • [Only CSIL PC] If in Windows, reboot to Linux; while booting, select Ubuntu from boot menu.
    • [Only CSIL PC] Don't setup GitHub tokens in the cmpt201 container because it may be shared with other users.
    • [Only CSIL PC] Delete any possible previous docker container before starting the lab:
      docker rm cmpt201
    • [Only CSIL PC] When done your lab, copy your solution out of the container, then execute the above docker command to delete your docker container to leave it clean for the next user.
  • You do not need to use record.


Prerequisite Skills for This Lab


  • Understand MapReduce
    • Read through assignment A10 where you will have to implement a MapReduce system.
    • Understand map.
    • Understand reduce.
    • Read at least the abstract of MapReduce.
  • C Skills
    • Able to use structs and arrays.
    • Able to use pointers: dereference.
    • Able to compile, run, and debug C programs.
    • Understand in-out parameters to a function.


Task 1: Debugging with the Debugger


  1. Create a folder for Lab 7 in your cmpt201 docker container. Copy in the following files:
  2. Understand MapReduce
    • MapReduce is a common programming model to process a large amount of data.
    • The data we are working with are key-value pairs, such as [(1, 4), (2, 35), (3, 5)].
    • The map() function processes a list of key-value pair to generate a list of intermediate key-value pairs.
    • The groupByKey() function processes a list of intermediate key-value pairs and groups them together. Usually the group function would group all occurrences of the same key together; however, in this example we're going to do a similar task of grouping together values.
    • The reduce() function processes the intermediate key-value pairs to generate the final output.
    • This MapReduce approach is well suited to big-data processing systems which perform operations on data in parallel to speed up the system.
    • In this lab you'll implement a small version of map, group, and reduce.
       
  3. Understand Lab 7's code:
    • The program's main() reads some input values from the user. It stores these values along with a key (the line number, 1, 2, ...) in an array of structs.
    • The program will then map each of these input values to an intermediate value by doubling the user's input value. It preserves the key (the line number).
    • Next it groups together the key-value pairs by which ones have the same (doubled) values.
    • Finally, it reduces each grouped key-value pair by printing it to the screen.
    • Here is the process in a diagram:
      This is a diagram of MapReduce; if you are seeing it, the image likely didn't load
       
  4. Implement map()
    • main() calls your map function for each key-value input pair: void map(Input *input, IntermediateInput *intermediate_input)
    • Look at where it calls your function and study the arguments it passes you:
      • The input struct contains:
        • line_number: the key for our pair, corresponding to the order the user typed in the value: 1, 2, 3, ...
        • value: the value the user entered.
      • You must write code in the map() function to fully populate the intermediate_input struct:
        • *Set its value to be the input's value times two (i.e. ` 2`).**
        • Preserve the line number (key) from the input into the intermediate_input struct.
    • Hint - # LinesYour solution will be about 2 lines of code!
    • Hint - PointersWatch out for pointers in all of this lab.
    • Hint - Fill the structJust copy the fields from the input to the intermediate_input structs, but make sure to multiply the value by two.
    • Hint - Pointer to structIn C, if you have a pointer-to-a-struct named input, you can access its line_number field using: input->line_number.

       

  5. Implement groupByKey()
    • main() calls this function once for each intermediate input struct.
    • Look at where it calls your function and study the arguments it passes you:
      • input: A single intermediate input which contains the line number and the (doubled) value.
      • output: Array of Output structs which main() has previously filled in using your map() function.
      • result_count: In-out parameter. It records the number of structs in the output array. Change this if you use another struct in the output array.
    • Grouping Algorithm
      • If the doubled_value in the input already has an entry in the output array, then add its line number to the line_numbers array in the correct struct in the output array.
      • If the doubled_value in the input does not already have an entry in the output array, then add a new entry to the output array by recording the doubled_value and the line_number.
    • Hints
      • Hint - Track countsWhen you add an element to an array, you need to increase the variable that tracks how many
      • Hint - No overflowNotice that main() preallocates each array for us to be at least as big as we will need. You don't need to worry about overflows. You don't need to dynamically allocate any memory.
      • Hint - SearchThis function will be given one input struct, and need to search through the array of output structs to determine which of the existing output structs (of which we know there are *result_count of them filled in) has the same doubled value as the input struct.
      • Hint - New structIf the input does not match any existing output struct, then you'll need to use the next "unused" output struct in the array. Since the array is pre-allocated to be large, we can just use the next one, but we have to increment *result_count too.

         

  6. Implement reduce()
    • main() calls your reduce function for each of the grouped key-value pairs:
      void reduce(Output *output)
    • See the Output struct in the provided code for the fields in the struct you are given.
    • You must print out the doubled value, along with all the line numbers, in the format matching:
      (10, [1, 2, 4])
      This format is the modified value followed by a comma-separated list of line numbers.
    • Hint - Just printThere is no processing going on in this function, you are just displaying one Output struct nicely. The calling code will hand you Output structs one at a time.
    • Hint - CommasWatch the placement of commas.
    • Hint - BracketsEnsure you get your brackets and linefeeds correct in the output.

       

  7. Test and debug
    • Run your program with the inputs: 5 5 2 5 3 2 (one line per) and check your output matches the figure above.
    • Try other values, such as just a single number, numerous copies of just a single number, etc.


3. Reviewing (10 mins)


  • During the last 10 minutes of the in-person lab, TAs will walk through the solution.
  • Discussion of Code
    • Show each function and explain how it implements its goal.
    • Discuss the use of pointers and arrays, highlighting them in the solution.
  • Discussion of MapReduce
    • Explain the big picture of MapReduce as a tool for solving problems.
    • Show the calling code in main() and explain that it does not depend on the operations being done in map(), groupByKey(), or reduce(), so it is a general framework for manipulating data.
  • Now that you have seen the solution:
    • Finish your solution: don't just copy the solution, but it is OK if you use the ideas you saw to help you finish.


4. [Optional] Challenges


  1. Optional: Assert the size
    • In your code, every time you are going to add an element to an array, assert that it will fit in the array.
    • Check how the code declares the size of the arrays.
    • Test your code by reducing the size of the arrays.
  2. Optional: Refactor code
    • Look at your code, and the code in main, and improve it.
    • Suggestion - ScopeFor each variable ensure it has the minimum scope. Check out the value variable in main().
    • Suggestion - SimplifyLook at your code to simplify it. For example, it's easy to get complicated code in the groupByKey() function. How do you use the array of structs? Would the code be simpler to make a pointer to the current struct in the array?
    • Suggestion - Search functionIs the code in groupByKey() simpler if you make a function that searches an array of Output structs for a matching value and returns a pointer to that? Can you make it so that you handle the "add to existing" and the "create new" cases with one piece of code?


Submission


Submit your C code to CourSys by Sunday midnight. The file name must be an exact match to what CourSys is expecting, otherwise it won't accept it. Remember you can submit up to 3 days late with no penalty if needed (but please plan to get it done on time to stay up to date on the course!)

Submissions will be marked for completion. You do not need to complete any optional steps.