Parallel processing means executing your application on multiple CPUs at the same time.
Parallel processing environments
The environment in which you run your parallel jobs is
defined by your system’s architecture and hardware resources. All parallel
processing environments are categorized as one of:
1. SMP (symmetric multiprocessing), in which some hardware resources may be
shared among processors. The processors communicate via shared memory and have
a single operating system.
2. Cluster or MPP (massively parallel processing), also known as
shared-nothing, in which each processor has exclusive access to hardware
resources. MPP systems are physically housed in the same box, whereas cluster
systems can be physically dispersed. The processors each have their own
operating system and communicate via a high-speed network.
Pipeline Parallelism
1. Extract, Transform and Load processes execute simultaneously.
2. The downstream process starts while the upstream process is still running,
like a conveyor belt moving rows from process to process.
3. Advantages: reduces disk usage for staging areas and keeps processors busy.
4. Scalability is still limited, since the degree of parallelism cannot exceed
the number of stages in the job.
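The conveyor-belt behaviour can be sketched in Python. This is only an illustration under stated assumptions, not how any particular ETL engine is implemented: the stage functions (extract, transform, load), the SENTINEL marker and run_pipeline are hypothetical names, and threads with small in-memory queues stand in for separate processes. CPython threads show the row flow but do not by themselves spread work across CPUs.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def extract(out_q):
    # Upstream stage: emit rows one at a time instead of staging them on disk.
    for row in range(5):
        out_q.put(row)
    out_q.put(SENTINEL)


def transform(in_q, out_q):
    # Middle stage: starts working as soon as the first row arrives.
    while True:
        row = in_q.get()
        if row is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(row * 10)


def load(in_q, target):
    # Downstream stage: consumes rows while the upstream stages still run.
    while True:
        row = in_q.get()
        if row is SENTINEL:
            break
        target.append(row)


def run_pipeline():
    # Small bounded queues play the role of the conveyor belt between stages.
    q1 = queue.Queue(maxsize=2)
    q2 = queue.Queue(maxsize=2)
    target = []
    stages = [
        threading.Thread(target=extract, args=(q1,)),
        threading.Thread(target=transform, args=(q1, q2)),
        threading.Thread(target=load, args=(q2, target)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return target


if __name__ == "__main__":
    print(run_pipeline())  # [0, 10, 20, 30, 40]
```

Only three stages exist here, so at most three workers can be busy at once, no matter how many rows flow through.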
Partition Parallelism
1. Divide the incoming stream of data into subsets, known as partitions, to be
processed separately.
2. Each partition is processed in the same way.
3. Facilitates near-linear scalability. However, the data needs to be evenly
distributed across the partitions; otherwise the benefits of partitioning are
reduced.
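A minimal Python sketch of partitioning, and of what happens when the data is not evenly distributed. The function names (partition_by_key, process_partition, run_partitioned) and the modulus-based key scheme are assumptions made for illustration; a thread pool is used for portability, whereas a real parallel engine would typically run partitions in separate processes or on separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, num_partitions, key):
    # Assign each row to a partition from its key (simple modulus scheme).
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[key(row) % num_partitions].append(row)
    return partitions


def process_partition(partition):
    # The same logic is applied to every partition.
    return sum(partition)


def run_partitioned(rows, num_partitions, key):
    partitions = partition_by_key(rows, num_partitions, key)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(process_partition, partitions))
    sizes = [len(p) for p in partitions]
    return results, sizes


if __name__ == "__main__":
    rows = list(range(12))

    # A well-chosen key spreads the rows evenly over four partitions.
    print(run_partitioned(rows, 4, key=lambda r: r))
    # ([12, 15, 18, 21], [3, 3, 3, 3])

    # A poorly chosen key sends every row to one partition (data skew).
    print(run_partitioned(rows, 4, key=lambda r: 0))
    # ([66, 0, 0, 0], [12, 0, 0, 0])
```

In the skewed run one worker handles all twelve rows while the other three sit idle, so the job takes about as long as it would without partitioning.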
Within parallel jobs, pipelining, partitioning and repartitioning are
automatic. The job developer only specifies:
1. Sequential or Parallel mode (by stage)
2. Partitioning Method
3. Collection Method
4. Configuration file
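To make the difference between a partitioning method and a collection method concrete, here is a hedged Python sketch. A partitioning method decides how rows are split across partitions; a collection method decides how the partitions are merged back into a single stream. The function names below (round_robin_partition, round_robin_collect, ordered_collect) are hypothetical and chosen for illustration, and are not taken from any product's API.

```python
from itertools import zip_longest

_MISSING = object()  # filler for partitions that run out of rows early


def round_robin_partition(rows, num_partitions):
    # Partitioning: deal rows out to the partitions in turn.
    return [rows[i::num_partitions] for i in range(num_partitions)]


def round_robin_collect(partitions):
    # Collection: read one row from each partition in turn.
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in group if row is not _MISSING)
    return collected


def ordered_collect(partitions):
    # Collection: read all of the first partition, then the second, and so on.
    return [row for partition in partitions for row in partition]


if __name__ == "__main__":
    rows = list("abcdefg")
    parts = round_robin_partition(rows, 3)
    print(parts)                       # [['a', 'd', 'g'], ['b', 'e'], ['c', 'f']]
    print(round_robin_collect(parts))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    print(ordered_collect(parts))      # ['a', 'd', 'g', 'b', 'e', 'c', 'f']
```

The same partitions yield a different row order depending on the collection method, which is why the two choices are made separately.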