Partitioning
The aim of most partitioning operations is to end up
with a set of partitions that are as near equal in size as possible, ensuring an
even load across your processors.
When performing some operations, however, you will need
to take control of partitioning to ensure that you get consistent results. A
good example of this is using an Aggregator stage to
summarize your data. To get the answers you want (and need), you must ensure
that related data is grouped together in the same partition before the summary
operation is performed on that partition.
Round robin partitioner
The first record goes to the first processing node,
the second to the second processing node, and so on. When WebSphere DataStage
reaches the last processing node in the system, it starts over. This method is
useful for resizing partitions of an input data set that are not equal in size.
The round robin method always creates approximately equal-sized partitions.
This method is the one normally used when WebSphere DataStage initially
partitions data.
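To make the cycling behaviour concrete, here is a minimal Python sketch; the helper name and the list-of-lists layout are invented for illustration and are not DataStage constructs.

    # Minimal sketch of round robin assignment (illustrative only).
    def round_robin_partition(records, num_partitions):
        partitions = [[] for _ in range(num_partitions)]
        for i, record in enumerate(records):
            # Record i goes to partition i mod num_partitions, cycling
            # back to the first partition after the last one.
            partitions[i % num_partitions].append(record)
        return partitions

    # Example: 10 records over 3 partitions -> sizes 4, 3, 3.
    print([len(p) for p in round_robin_partition(range(10), 3)])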
Random partitioner
Records are randomly distributed across all processing
nodes. Like round robin, random partitioning can rebalance the partitions of
an input data set to guarantee that each processing node receives an
approximately equal-sized partition. Random partitioning has a slightly
higher overhead than round robin because of the extra processing required to
calculate a random value for each record.
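A comparable sketch, again with a hypothetical helper name, shows where that extra cost comes from: a random number is drawn per record instead of incrementing a counter.

    import random

    # Illustrative sketch only; not the DataStage implementation.
    def random_partition(records, num_partitions, seed=None):
        rng = random.Random(seed)
        partitions = [[] for _ in range(num_partitions)]
        for record in records:
            # One random draw per record is the extra work compared
            # with round robin's simple counter.
            partitions[rng.randrange(num_partitions)].append(record)
        return partitions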
Entire partitioner
Every
instance of a stage on every processing node receives the complete data set as
input.
It is useful when you want the benefits of parallel execution, but every
instance of the operator needs access to the entire input data set. You are
most likely to use this partitioning method with stages that create lookup
tables from their input.
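In sketch form (the helper name is hypothetical), entire partitioning simply gives every partition a full copy of the input.

    # Illustrative sketch only; the helper name is hypothetical.
    def entire_partition(records, num_partitions):
        # Every partition receives the complete data set, so each stage
        # instance (for example, one building a lookup table) sees
        # every record.
        data = list(records)
        return [list(data) for _ in range(num_partitions)]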
Same partitioner
The
stage using the data set as input performs no repartitioning and takes as input
the partitions output by the preceding stage. With this partitioning method,
records stay on the same processing node; that is, they are not redistributed. Same
is the fastest partitioning method. This is normally the method WebSphere
DataStage uses when passing data between stages in your job.
Hash partitioner
Partitioning is based on a function of one or more
columns (the hash partitioning keys) in each record. The hash partitioner
examines these hash key fields in each input record; records
with the same values for all hash key fields are assigned to the same
processing node.
This method is useful for ensuring that related
records are in the same partition, which may be a prerequisite for a processing
operation.
For example, for a remove duplicates operation, you can hash partition records
so that records with the same partitioning key values are on the same node. You
can then sort the records on each node using the hash key fields as sorting key
fields, then remove duplicates, again using the same keys. Although the data is
distributed across partitions, the hash partitioner ensures that records with
identical keys are in the same partition, allowing duplicates to be found.
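To make the grouping behaviour concrete, here is a minimal Python sketch; the helper name is hypothetical and Python's built-in hash stands in for the engine's hash function.

    # Illustrative sketch only; not the DataStage hash function.
    def hash_partition(records, key_fields, num_partitions):
        partitions = [[] for _ in range(num_partitions)]
        for record in records:
            # Records with identical values in all hash key fields produce
            # the same hash, so they always land in the same partition.
            key = tuple(record[f] for f in key_fields)
            partitions[hash(key) % num_partitions].append(record)
        return partitions

    rows = [{"cust": "A", "amt": 10}, {"cust": "B", "amt": 5}, {"cust": "A", "amt": 7}]
    parts = hash_partition(rows, ["cust"], 2)
    # Both "A" rows are in the same partition, so duplicates or aggregates
    # keyed on "cust" can be handled locally on that node.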
Hash partitioning does not necessarily result in an
even distribution of data between partitions. For example, if you hash
partition a data set based on a zip code field, where a large percentage of
your records are from one or two zip codes, you can end up with a few partitions
containing most of your records. This behavior can lead to
bottlenecks because some nodes are required to process more records than other
nodes.
Modulus partitioner
Partitioning is based on a key column modulo the
number of partitions. This method is similar to hash by field, but involves
simpler computation.
In data mining, data is often arranged in buckets,
that is, each record has a tag containing its bucket number. You can use the
modulus partitioner to partition the records according to this number. The
modulus partitioner assigns each record of an input data set to a partition of
its output data set as determined by a specified key field in the input data
set. This field can be the tag field.
The partition number of each record is calculated as
follows:
partition_number = fieldname mod number_of_partitions
where fieldname is a numeric field of the
input data set and number_of_partitions is the number of processing
nodes on which the partitioner executes. If a partitioner is executed on three
processing nodes, it has three partitions.
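Expressed as a small Python sketch (the helper name and the sample bucket values are invented for illustration):

    # Illustrative sketch only; the helper name is hypothetical.
    def modulus_partition(records, key_field, num_partitions):
        partitions = [[] for _ in range(num_partitions)]
        for record in records:
            # partition_number = fieldname mod number_of_partitions
            partitions[record[key_field] % num_partitions].append(record)
        return partitions

    # With three partitions, bucket 4 goes to partition 1 (4 mod 3) and
    # bucket 7 also goes to partition 1 (7 mod 3).
    rows = [{"bucket": 0}, {"bucket": 1}, {"bucket": 4}, {"bucket": 7}]
    parts = modulus_partition(rows, "bucket", 3)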
Range partitioner
Divides a data set into approximately equal-sized
partitions, each of which contains records with key columns within a specified
range.
This method is also useful for ensuring that related records are in the same
partition. A range partitioner divides a data set into approximately
equal-sized partitions based on one or more partitioning keys.
In order to use a range partitioner, you have to make
a range map. You can do this using the Write Range Map stage. The
range partitioner guarantees that all records with the same partitioning key
values are assigned to the same partition and that the partitions are
approximately equal in size so all nodes perform an equal amount of work when
processing the data set.
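As a sketch of how a range map is consulted for each record: the helper name and the boundary values below are invented, whereas in DataStage the boundaries come from the range map built by the Write Range Map stage.

    from bisect import bisect_right

    # Illustrative sketch only; not the DataStage range map format.
    def range_partition(records, key_field, boundaries):
        # boundaries holds the upper key bound of each partition except
        # the last, so 3 boundary values describe 4 partitions.
        partitions = [[] for _ in range(len(boundaries) + 1)]
        for record in records:
            # Binary-search the range map for the partition whose key
            # range contains this record's key value.
            partitions[bisect_right(boundaries, record[key_field])].append(record)
        return partitions

    # Invented range map: keys <= 100, 101-200, 201-300, and > 300.
    parts = range_partition([{"key": 42}, {"key": 250}], "key", [100, 200, 300])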
Range partitioning is not the only partitioning method
that guarantees equivalent-sized partitions. The random and round robin
partitioning methods also guarantee that the partitions of a data set are
equivalent in size. However, these partitioning methods are keyless; that is,
they do not allow you to control how records of a data set are grouped together
within a partition.
DB2 partitioner
Partitions
an input data set in the same way that DB2® would
partition it. For example, if you use this method to partition an input data
set containing update information for an existing DB2 table, records are
assigned to the processing node containing the corresponding DB2 record. Then,
during the execution of the parallel operator, both the input record and the
DB2 table record are local to the processing node. Any reads and writes of the
DB2 table would entail no network activity.
Auto partitioner
The
most common method you will see on the WebSphere DataStage stages is Auto. This
simply means that you are leaving it to WebSphere DataStage to determine the best
partitioning method to use, depending on the type of stage and what the
previous stage in the job has done. Typically, WebSphere DataStage uses
round robin when initially partitioning data and Same for the intermediate
stages of a job.
Collecting
Collecting is the process of joining the multiple
partitions of a single data set back together again into a single partition.
There may be a stage in your job that you want to run sequentially rather than
in parallel, in which case you will need to collect all your partitioned data
at this stage to make sure it is operating on the whole data set.
Note that collecting methods are mostly
non-deterministic. That is, if you run the same job twice with the same data,
you are unlikely to get data collected in the same order each time. If order
matters, you need to use the sorted merge collection method.
Round robin collector
Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, starts
over. After reaching the final record in any partition, skips that partition in
the remaining rounds.
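A sketch of that collection order, with a hypothetical generator name:

    # Illustrative sketch only; the generator name is hypothetical.
    def round_robin_collect(partitions):
        # Take one record from each partition in turn, dropping
        # partitions as they are exhausted.
        iters = [iter(p) for p in partitions]
        while iters:
            remaining = []
            for it in iters:
                try:
                    yield next(it)
                    remaining.append(it)
                except StopIteration:
                    pass
            iters = remaining

    print(list(round_robin_collect([[1, 4], [2, 5, 6], [3]])))
    # -> [1, 2, 3, 4, 5, 6]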
Ordered collector
Reads all records from the first partition, then all
records from the second partition, and so on. This collection method preserves
the order of totally sorted input data sets. In a totally sorted data set, both
the records in each partition and the partitions themselves are ordered. This
may be useful as a preprocessing action before exporting a sorted data set to a
single data file.
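In sketch form (the helper name is hypothetical), this is simply concatenation of the partitions in order:

    from itertools import chain

    # Illustrative sketch only; the helper name is hypothetical.
    def ordered_collect(partitions):
        # Read partition 0 to the end, then partition 1, and so on, which
        # preserves the order of a totally sorted data set.
        return list(chain.from_iterable(partitions))

    print(ordered_collect([[1, 2], [3, 4], [5]]))  # -> [1, 2, 3, 4, 5]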
Sorted merge collector
Reads records in an order based on one or more columns
of the record. The columns used to define record order are called collecting
keys. Typically, you use the sorted merge collector with a partition-sorted
data set (as created by a sort stage). In this case, you specify as the
collecting key fields those fields you specified as sorting key fields to the
sort stage.
The data type of a collecting key can be any type
except raw, subrec, tagged, or vector.
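A minimal sketch of the idea, assuming each partition is already sorted on the collecting key; the helper name is hypothetical, and Python's heapq.merge stands in for the engine's merge of sorted streams.

    from heapq import merge
    from operator import itemgetter

    # Illustrative sketch only; not the DataStage implementation.
    def sorted_merge_collect(partitions, *key_fields):
        # Each partition is assumed to be already sorted on the collecting
        # key fields, so a streaming merge keeps the output in key order.
        return list(merge(*partitions, key=itemgetter(*key_fields)))

    p0 = [{"id": 1}, {"id": 4}]
    p1 = [{"id": 2}, {"id": 3}]
    print(sorted_merge_collect([p0, p1], "id"))  # ids come out as 1, 2, 3, 4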
Auto collector
The most common method you will see on the parallel
stages is Auto. This normally means that WebSphere DataStage eagerly
reads any row from any input partition as it becomes available, which is the
fastest collecting method. If, however, it detects that the data needs, for
example, to be sorted as it is collected, it will do that instead.