4. Partitioning and Sorting

Get Started. It's Free
or sign up with your email address
Rocket clouds
4. Partitioning and Sorting by Mind Map: 4. Partitioning and Sorting

1. Partitioners

1.1. Overview

1.1.1. Every MR job has

1.1.2. Determine which Reducer gets records

1.1.3. Attempt even distribution of map output

1.1.3.1. Although like keys go to the same Reducer

1.1.4. Default is HashPartitioner

1.1.4.1. Uses hashCode method of Object along with modulus

1.1.5. Data Skew: small number of Reducers processing large number of records.

1.1.5.1. Understand what data looks like to avoid

1.2. How it works

1.2.1. Default Partitioner

1.2.1.1. org.apache.hadoop.mapreduce.Partitioner parent of HashPartitioner.

1.2.1.2. Same-key records go to same partition

1.2.1.3. getPartition() invoked for each <K, V> output from Mapper

1.2.1.3.1. Both Key and Value can be used in partitioning logic

1.2.1.3.2. Many times value is ignored

1.2.1.4. int return must be 0...numReduceTasks

1.2.1.4.1. Done with modulus

1.2.1.5. Define custom partitioner when HashCode uneven distribution

1.2.2. Custom Partitioner

1.2.2.1. Extends Partitioner

1.2.2.2. Implement the getPartition method

1.2.2.2.1. return int 0...numReduceTasks

1.2.2.2.2. Input parm types match class generics

1.2.2.3. Use Job's setPartitionClass() to configure

1.2.2.4. Configure in run method

1.2.3. TotalOrderPartitioner

1.2.3.1. org.apache.hadoop.mapreduce.lib.partition package

1.2.3.2. Output from all Reducers sorted

1.2.3.2.1. Partition file defines how keys split across partitions

1.2.3.2.2. Partition file generated using InputSampler class.

2. Sorting

2.1. 2 Key tasks

2.1.1. Keys sorted in natural order

2.1.1.1. Key type is WriteableComparable

2.1.1.2. Forces compareTo to be defined

2.1.2. Equal keys are grouped together.

2.1.2.1. Configurable component Grouping Comparator decides key equality

2.2. Secondary sort

2.2.1. Move part of value into key

2.2.2. Sorting work done during Shuffle/Sort

2.2.2.1. 1. Write custom key class that contains secondary key

2.2.2.1.1. CustomKey

2.2.2.2. 2. Write a custom grouping comparator

2.2.2.2.1. Custom grouping comparator

2.2.2.3. 3. Write a custom partitioner that ensures grouped keys are sent to the same reducer.