One of the great strengths of the WebSphere DataStage parallel architecture is that, when designing parallel jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design.
WebSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.
The WebSphere DataStage Designer provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools → Configurations; the Configurations dialog box appears.
You specify which configuration file is used by setting the $APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it at the project level from the WebSphere DataStage Administrator or for individual jobs from the Job Properties dialog.
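For example, to point a particular run at an alternative configuration file from a UNIX shell, you could export the variable before invoking the job. The path below is purely illustrative; substitute the location of your own configuration file:
$ export APT_CONFIG_FILE=/home/dsadm/configurations/4node.apt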
Configuration files are text files containing string
data. The general form of a configuration file is as follows:
{
  node "n1" {
    fastname "s1"
    pools "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
}
Node names
Each node you define is followed by its name enclosed in quotation marks, for example:
node "orch0"
For a single CPU node or workstation, the node's name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a node's network name:
$ uname -n
On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.
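As a sketch, such a user might export the variable in the shell before starting the job; the value shown here is hypothetical and would be replaced by the appropriate fast name:
$ export APT_PM_CONDUCTOR_NODENAME=node1_css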
Fastname
Syntax: fastname "name"
This option takes as its quoted attribute the name of
the node as it is referred to on the fastest network in the system, such as an
IBM switch, FDDI, or BYNET. The fastname is the physical node name that
stages use to open connections for high volume data transfers. The
attribute of this option is often the network name. For an SMP, all CPUs share
a single connection to the network, and this setting is the same for all
parallel engine processing nodes defined for an SMP. Typically, this is the
principal node name, as returned by the UNIX command uname -n.
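For example, an SMP exposed to the parallel engine as two logical processing nodes might be described as follows; the node names, host name, and paths are hypothetical, but note that both logical nodes share the same fastname:
node "n1" {
  fastname "smp_host"
  pools "" "n1" "smp_host"
  resource disk "/orch/n1" {}
  resource scratchdisk "/scratch" {}
}
node "n2" {
  fastname "smp_host"
  pools "" "n2" "smp_host"
  resource disk "/orch/n2" {}
  resource scratchdisk "/scratch" {}
}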
Node pools and the default node pool
Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O. These nodes can be grouped into an I/O node pool.
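As a sketch of this kind of grouping (the node names, fast names, and paths are hypothetical), a configuration might place memory-rich nodes in a "compute" pool and nodes with fast storage connections in an "io" pool:
node "comp1" {
  fastname "comp1_css"
  pools "" "comp1" "compute"
  resource disk "/orch/comp1" {}
  resource scratchdisk "/scratch" {}
}
node "io1" {
  fastname "io1_css"
  pools "" "io1" "io"
  resource disk "/orch/io1" {}
  resource scratchdisk "/scratch" {}
}
A stage or job constrained to the "io" pool would then run only on io1.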
The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.
node "node1"
{
fastname "node1_css"
pools "" "node1" "node1_css" "pool4"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch" {}
}
A node belongs to the default pool unless you explicitly specify a pools list for it, and omit the default pool name ("") from the list.
Once you have defined a node pool, you can constrain a
parallel stage or parallel job to run only on that pool, that is, only on the
processing nodes belonging to it. If you constrain both a stage and a job, the
stage runs only on the nodes that appear in both pools.
Nodes or resources that name a pool declare their
membership in that pool.
We suggest that when you initially configure your system, you place each node in pools named after the node's name and fast name, and additionally include it in the default node pool, as in the following example:
node "n1"
{
fastname "nfast"
pools "" "n1" "nfast"
}
By default, the parallel engine executes a parallel
stage on all nodes defined in the default node pool. You can constrain the
processing nodes used by the parallel engine either by removing node
descriptions from the configuration file or by constraining a job or stage to a
particular node pool.
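For instance, to confine processing to a single node without changing any job design, you could point $APT_CONFIG_FILE at a configuration that defines only one node; the node name, fast name, and paths here are hypothetical:
{
  node "n1" {
    fastname "host1"
    pools "" "n1" "host1"
    resource disk "/orch/n1" {}
    resource scratchdisk "/scratch" {}
  }
}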
Disk and scratch disk pools and their defaults
When you define a processing node, you can specify the
options resource disk and resource scratchdisk. They indicate the directories
of file systems available to the node. You can also group disks and scratch
disks in pools. Pools reserve storage for a particular use, such as holding
very large data sets.
Pools defined by disk and scratchdisk are not
combined; therefore, two pools that have the same name and belong to both
resource disk and resource scratchdisk define two separate pools.
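As a minimal sketch of this point (the node name and paths are hypothetical), the name "bigdata" below identifies two distinct pools: a disk pool and a scratch disk pool:
node "n1" {
  fastname "n1_css"
  pools "" "n1" "n1_css"
  resource disk "/orch/d0" {pools "bigdata"}
  resource scratchdisk "/scratch0" {pools "bigdata"}
}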
A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and by { } (the empty pool list). For example, the following code configures the disks for node n1:
node "n1" {
resource disk "/orch/s0" {pools ""
"pool1"}
resource disk "/orch/s1" {pools ""
"pool1"}
resource disk "/orch/s2" { } /* empty pool
list */
resource disk "/orch/s3" {pools
"pool2"}
resource scratchdisk "/scratch" {pools
"" "scratch_pool1"}
}
In this example:
1. The first two disks are assigned to both the default pool and pool1.
2. The third disk is also assigned to the default pool, indicated by { }.
3. The fourth disk is assigned to pool2 and is not assigned to the default pool.
4. The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.
Buffer scratch disk pools
Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node. The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering.
Here is an example configuration file that defines a
buffer scratch disk pool:
{
  node "node1" {
    fastname "node1_css"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
  node "node2" {
    fastname "node2_css"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
}
In this example, each processing node has a single
scratch disk resource in the buffer pool, so buffering will use /scratch0 but
not /scratch1. However, if /scratch0 were not in the buffer pool, both
/scratch0 and /scratch1 would be used because both would then be in the default
pool.