Datastage - Parallel Processing Environment | Pipeline and Partitioning

Parallel processing = executing your application on multiple CPUs

 

Parallel processing environments

The environment in which you run your parallel jobs is defined by your system’s architecture and hardware resources. All parallel processing environments are categorized as one of:

1.       SMP (symmetric multiprocessing), in which some hardware resources may be shared among processors. The processors communicate via shared memory and have a single operating system.

2.       Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources. MPP systems are physically housed in the same box, whereas cluster systems can be physically dispersed. The processors each have their own operating system, and communicate via a high-speed network.

 

Pipeline Parallelism

1.       Extract, Transform and Load processes execute simultaneously

2.       The downstream process starts while the upstream process is still running, like a conveyor belt moving rows from process to process

3.       Advantages: reduces disk usage for staging areas and keeps processors busy

4.       Still has limits on scalability
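The conveyor-belt idea above can be sketched in plain Python (this is an illustrative toy, not DataStage itself): three stages run as threads connected by small bounded queues, so the transform stage starts processing rows while the extract stage is still producing them.

```python
import threading
import queue

SENTINEL = object()  # marks the end of the row stream

def extract(out_q):
    # Upstream stage: produce rows one at a time
    for i in range(5):
        out_q.put({"id": i})
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    # Middle stage: begins consuming as soon as rows arrive,
    # while extract is still running (pipeline parallelism)
    while (row := in_q.get()) is not SENTINEL:
        row["doubled"] = row["id"] * 2
        out_q.put(row)
    out_q.put(SENTINEL)

def load(in_q, results):
    # Downstream stage: collect the transformed rows
    while (row := in_q.get()) is not SENTINEL:
        results.append(row)

# Bounded queues act like the conveyor belt: only a few rows
# sit between stages, so no large staging area is needed
q1, q2, results = queue.Queue(maxsize=2), queue.Queue(maxsize=2), []
stages = [
    threading.Thread(target=extract, args=(q1,)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2, results)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
```

Because the queues are bounded, a slow downstream stage naturally applies backpressure to the upstream one, which is also why a single slow stage limits the scalability of a pure pipeline.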

 

Partition Parallelism

1.       Divide the incoming stream of data into subsets known as partitions to be processed separately

2.       Each partition is processed in the same way

3.       Facilitates near-linear scalability. However, the data must be evenly distributed across the partitions; otherwise, the benefits of partitioning are reduced
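A minimal Python sketch of the same idea (again illustrative, not DataStage): rows are hash-partitioned on a key, each partition is processed by the same function in its own worker, and the results are collected afterwards. The column names and the tax calculation are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4

def partition(rows, n):
    # Hash-partition rows on a key so each subset can be
    # processed independently; an even spread of keys gives
    # evenly sized partitions and near-linear scaling
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row["id"]) % n].append(row)
    return parts

def process(part):
    # Every partition runs exactly the same transformation
    return [{"id": r["id"], "tax": r["amount"] * 0.1} for r in part]

rows = [{"id": i, "amount": 100 * i} for i in range(8)]
parts = partition(rows, NUM_PARTITIONS)
with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    processed = list(pool.map(process, parts))
# Collect: flatten the per-partition results back into one stream
results = [row for part in processed for row in part]
```

If the hash key is skewed (say, most rows share one value of `id`), one worker does most of the work and the others sit idle, which is the uneven-distribution problem noted above.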

 

Within parallel jobs, pipelining, partitioning, and repartitioning are automatic. The job developer only identifies:

1.       Sequential or parallel mode (by stage)

2.       Partitioning Method

3.       Collection Method

4.       Configuration file
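For the last item, the parallel engine reads a configuration file that describes the processing nodes available to a job. A small two-node example is sketched below; the node names, `fastname` host, and disk paths are placeholders you would replace with your own environment's values.

```
{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/disk1" {pools ""}
    resource scratchdisk "/data/ds/scratch1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/disk2" {pools ""}
    resource scratchdisk "/data/ds/scratch2" {pools ""}
  }
}
```

Because the degree of parallelism comes from this file rather than from the job design, the same job can run with two nodes in development and more nodes in production without being modified.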
