One of the great strengths of the WebSphere DataStage parallel architecture is that, when designing parallel jobs, you don't have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don't necessarily have to change your job design.
WebSphere DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.
The WebSphere DataStage Designer provides a configuration file editor to help you define configuration files for the parallel engine. To use the editor, choose Tools → Configurations; the Configurations dialog box appears.
You specify which configuration file is used by setting the $APT_CONFIG_FILE environment variable. This is set on installation to point to the default configuration file, but you can set it at the project level from the WebSphere DataStage Administrator or for individual jobs from the Job Properties dialog.
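For example, to point a particular run at an alternative configuration file from a UNIX shell, you could export the variable before invoking the job. The path below is purely illustrative; substitute the location of your own configuration file:
$ export APT_CONFIG_FILE=/home/dsadm/configurations/4node.apt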
Configuration files are text files containing string
data. The general form of a configuration file is as follows:
{
  node "n1" {
    fastname "s1"
    pools "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
}
Node names
Each node you define is followed by its name enclosed in quotation marks, for example:
node "orch0"
For a single CPU node or workstation, the node's name is typically the network name of a processing node on a connection such as a high-speed switch or Ethernet. Issue the following UNIX command to learn a node's network name:
$ uname -n
On an SMP, if you are defining multiple logical nodes corresponding to the same physical node, you replace the network name with a logical node name. In this case, you need a fast name for each logical node. If you run an application from a node that is undefined in the corresponding configuration file, each user must set the environment variable APT_PM_CONDUCTOR_NODENAME to the fast name of the node invoking the parallel job.
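As a sketch, such a user might export the variable in the shell before starting the job; the value shown here is hypothetical and would be replaced by the appropriate fast name:
$ export APT_PM_CONDUCTOR_NODENAME=node1_css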
Fastname
Syntax: fastname "name"
This option takes as its quoted attribute the name of
the node as it is referred to on the fastest network in the system, such as an
IBM switch, FDDI, or BYNET. The fastname is the physical node name that
stages use to open connections for high volume data transfers. The
attribute of this option is often the network name. For an SMP, all CPUs share
a single connection to the network, and this setting is the same for all
parallel engine processing nodes defined for an SMP. Typically, this is the
principal node name, as returned by the UNIX command uname -n.
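For example, an SMP exposed to the parallel engine as two logical processing nodes might be described as follows; the node names, host name, and paths are hypothetical, but note that both logical nodes share the same fastname:
node "n1" {
  fastname "smp_host"
  pools "" "n1" "smp_host"
  resource disk "/orch/n1" {}
  resource scratchdisk "/scratch" {}
}
node "n2" {
  fastname "smp_host"
  pools "" "n2" "smp_host"
  resource disk "/orch/n2" {}
  resource scratchdisk "/scratch" {}
}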
Node pools and the default node pool
Node pools allow association of processing nodes based on their characteristics. For example, certain nodes can have large amounts of physical memory, and you can designate them as compute nodes. Others can connect directly to a mainframe or some form of high-speed I/O. These nodes can be grouped into an I/O node pool.
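As a sketch of this kind of grouping (the node names, fast names, and paths are hypothetical), a configuration might place memory-rich nodes in a "compute" pool and nodes with fast storage connections in an "io" pool:
node "comp1" {
  fastname "comp1_css"
  pools "" "comp1" "compute"
  resource disk "/orch/comp1" {}
  resource scratchdisk "/scratch" {}
}
node "io1" {
  fastname "io1_css"
  pools "" "io1" "io"
  resource disk "/orch/io1" {}
  resource scratchdisk "/scratch" {}
}
A stage or job constrained to the "io" pool would then run only on io1.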
The option pools is followed by the quoted names of the node pools to which the node belongs. A node can be assigned to multiple pools, as in the following example, where node1 is assigned to the default pool ("") as well as the pools node1, node1_css, and pool4.
node "node1"
{
fastname "node1_css"
pools "" "node1" "node1_css" "pool4"
resource disk "/orch/s0" {}
resource scratchdisk "/scratch" {}
}
A node belongs to the default pool unless you explicitly specify a pools list for it, and omit the default pool name ("") from the list.
Once you have defined a node pool, you can constrain a
parallel stage or parallel job to run only on that pool, that is, only on the
processing nodes belonging to it. If you constrain both a stage and a job, the
stage runs only on the nodes that appear in both pools.
Nodes or resources that name a pool declare their
membership in that pool.
We suggest that when you initially configure your system, you place each node in pools named after the node's name and fast name, and additionally include it in the default node pool, as in the following example:
node "n1"
{
fastname "nfast"
pools "" "n1" "nfast"
}
By default, the parallel engine executes a parallel
stage on all nodes defined in the default node pool. You can constrain the
processing nodes used by the parallel engine either by removing node
descriptions from the configuration file or by constraining a job or stage to a
particular node pool.
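For instance, to confine processing to a single node without changing any job design, you could point $APT_CONFIG_FILE at a configuration that defines only one node; the node name, fast name, and paths here are hypothetical:
{
  node "n1" {
    fastname "host1"
    pools "" "n1" "host1"
    resource disk "/orch/n1" {}
    resource scratchdisk "/scratch" {}
  }
}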
Disk and scratch disk pools and their defaults
When you define a processing node, you can specify the
options resource disk and resource scratchdisk. They indicate the directories
of file systems available to the node. You can also group disks and scratch
disks in pools. Pools reserve storage for a particular use, such as holding
very large data sets.
Pools defined by disk and scratchdisk are not
combined; therefore, two pools that have the same name and belong to both
resource disk and resource scratchdisk define two separate pools.
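As a minimal sketch of this point (the node name and paths are hypothetical), the name "bigdata" below identifies two distinct pools: a disk pool and a scratch disk pool:
node "n1" {
  fastname "n1_css"
  pools "" "n1" "n1_css"
  resource disk "/orch/d0" {pools "bigdata"}
  resource scratchdisk "/scratch0" {pools "bigdata"}
}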
A disk that does not specify a pool is assigned to the default pool. The default pool may also be identified by "" and by { } (the empty pool list). For example, the following code configures the disks for node n1:
node "n1" {
resource disk "/orch/s0" {pools ""
"pool1"}
resource disk "/orch/s1" {pools ""
"pool1"}
resource disk "/orch/s2" { } /* empty pool
list */
resource disk "/orch/s3" {pools
"pool2"}
resource scratchdisk "/scratch" {pools
"" "scratch_pool1"}
}
In this example:
1. The first two disks are assigned to both the default pool and pool1.
2. The third disk is also assigned to the default pool, indicated by { }.
3. The fourth disk is assigned to pool2 and is not assigned to the default pool.
4. The scratch disk is assigned to the default scratch disk pool and to scratch_pool1.
Buffer scratch disk pools
Under certain circumstances, the parallel engine uses both memory and disk storage to buffer virtual data set records. The amount of memory defaults to 3 MB per buffer per processing node. The amount of disk space for each processing node defaults to the amount of available disk space specified in the default scratchdisk setting for the node. The parallel engine uses the default scratch disk for temporary storage other than buffering. If you define a buffer scratch disk pool for a node in the configuration file, the parallel engine uses that scratch disk pool rather than the default scratch disk for buffering, and all other scratch disk pools defined are used for temporary storage other than buffering.
Here is an example configuration file that defines a
buffer scratch disk pool:
{
  node "node1" {
    fastname "node1_css"
    pools "" "node1" "node1_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
  node "node2" {
    fastname "node2_css"
    pools "" "node2" "node2_css"
    resource disk "/orch/s0" {}
    resource scratchdisk "/scratch0" {pools "buffer"}
    resource scratchdisk "/scratch1" {}
  }
}
In this example, each processing node has a single
scratch disk resource in the buffer pool, so buffering will use /scratch0 but
not /scratch1. However, if /scratch0 were not in the buffer pool, both
/scratch0 and /scratch1 would be used because both would then be in the default
pool.