The Data Set stage is a file stage. It
allows you to read data from or write data to a data set. The stage can have a
single input link or a single output link. It can be configured to execute in
parallel or sequential mode.
What is a data set? Parallel jobs use data
sets to manage data within a job. You can think of each link in a job as
carrying a data set. The Data Set stage allows you to store data being operated
on in a persistent form, which can then be used by other WebSphere DataStage
jobs. Data sets are operating system files, each referred to by a control file,
which by convention has the suffix .ds. Using data sets wisely can be key to
good performance in a set of linked jobs. You can also manage data sets
independently of a job using the Data Set Management utility, available from
the WebSphere DataStage Designer or Director.
A data set comprises a descriptor file and
a number of other files that are added as the data set grows. These files are
stored on multiple disks in your system.
The descriptor file for a data set
contains the following information:
1. Data set header information.
2. Creation time and date of the data set.
3. The schema (metadata) of the data set.
4. A copy of the configuration file used when the data set was created.
Data sets are the structured internal representation of data within the
Parallel Framework. A data set consists of:
- Framework schema (format = name, type, nullability)
- Data records (the data)
- Partitions (a subset of rows for each node)
Virtual data sets exist in memory and correspond to DataStage Designer links.
Persistent data sets are stored on disk and consist of:
- A descriptor file (metadata, configuration file, data file locations, flags)
- Multiple data files (one per node, stored in disk resource file systems):
    node1:/local/disk1/…
    node2:/local/disk2/…
There is no “DataSet” operator; the Designer GUI inserts a copy operator.
When to Use Persistent Data Sets
When writing intermediate results between DataStage EE jobs, always write to
persistent data sets (checkpoints):
- Stored in native internal format (no conversion overhead)
- Retain data partitioning and sort order (end-to-end parallelism across jobs)
- Maximum performance through parallel I/O
Why data sets are not intended for long-term or archive storage:
- Internal format is subject to change with new DataStage releases
- Requires access to named resources (node names, file system paths, etc.)
- Binary format is platform-specific
For fail-over scenarios, servers should be able to cross-mount file systems:
- You can read a data set as long as your current $APT_CONFIG_FILE defines the
  same node names (fastnames may differ)
- The orchadmin -x option lets you recover data from a data set if the node
  names are no longer available
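The node-name requirement above can be checked before a job tries to read a data set. The following is a minimal sketch in Python, assuming the usual configuration file layout in which each node is declared as node "name" { ... }. The helper names, host names, and paths in the sample are hypothetical and only illustrate the idea.

```python
import re

def node_names(config_text: str) -> set:
    """Return the logical node names declared in a configuration file.

    Assumes each node is declared as: node "name" { ... }
    """
    return set(re.findall(r'\bnode\s+"([^"]+)"', config_text))

def missing_nodes(required: set, config_text: str) -> set:
    """Return the node names a data set needs that the config does not define."""
    return required - node_names(config_text)

# Hypothetical two-node configuration (host names and paths are made up).
sample_config = '''
{
  node "node1" {
    fastname "hostA"
    pools ""
    resource disk "/data/ds_a" {pools ""}
    resource scratchdisk "/scratch/a" {pools ""}
  }
  node "node2" {
    fastname "hostB"
    pools ""
    resource disk "/data/ds_b" {pools ""}
    resource scratchdisk "/scratch/b" {pools ""}
  }
}
'''

# Node names the data set was created with (hypothetical).
print(missing_nodes({"node1", "node3"}, sample_config))
```

If the returned set is not empty, the current configuration cannot address every partition of the data set, which is the situation the -x option is meant for.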
Data Set Management
1. Viewing the schema
Click the Schema icon from the tool bar to
view the record schema of the current data set. This is presented in text form
in the Record Schema window.
2. Viewing the data
Click the Data icon from the tool bar to
view the data held by the current data set. This opens the Data Viewer
Options dialog box, which allows you to select a subset of the data to view.
Rows to display. Specify the number of
rows of data you want the data browser to display.
Skip count. Skip the specified number of
rows before viewing data.
Period. Display every Pth record, where P
is the period. You can start after records have been skipped by using the Skip
count property. P must be greater than or equal to 1.
Partitions. Choose between viewing the
data in All partitions or the data in the partition selected from the drop-down
list.
Click OK to view the selected data; the Data Viewer window appears.
3. Copying data sets
Click the Copy icon on the tool bar to
copy the selected data set. The Copy data set dialog box appears, allowing you
to specify a path where the new data set will be stored. The new data set will
have the same record schema, number of partitions and contents as the original
data set.
Note: You cannot use the UNIX cp command
to copy a data set because WebSphere DataStage represents a single data set
with multiple files.
4. Deleting data sets
Click the Delete icon on the tool bar to
delete the current data set. You will be asked to confirm the
deletion.
Note: You cannot use the UNIX rm command
to delete a data set because WebSphere DataStage represents a single data set
with multiple files. Using rm simply removes the descriptor file, leaving the
much larger data files behind.
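The Rows to display, Skip count and Period options described under "Viewing the data" combine in a way that is easier to see in code. The sketch below is an illustration in Python, not the Data Viewer's actual implementation. It assumes that the first record after the skipped ones is shown and that every Pth record is taken from there. The function name select_rows is hypothetical.

```python
from itertools import islice

def select_rows(records, rows_to_display, skip_count=0, period=1):
    """Model the Data Viewer options on one partition's records.

    Assumption: skipping happens first, then every `period`-th record
    is kept, and at most `rows_to_display` records are returned.
    """
    if period < 1:
        raise ValueError("period must be greater than or equal to 1")
    sampled = islice(records, skip_count, None, period)
    return list(islice(sampled, rows_to_display))

# Records numbered 1..20: skip 5, then take every 4th, show at most 3.
print(select_rows(range(1, 21), rows_to_display=3, skip_count=5, period=4))
```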
Orchadmin Commands
Orchadmin is a command-line utility
provided by DataStage to inspect and manage data sets.
The general calling format is:
$orchadmin <command> [options] [descriptor file]
Before using orchadmin, make sure
that either the working directory or $APT_ORCHHOME/etc contains the file "config.apt",
or that the environment variable $APT_CONFIG_FILE
is defined for your session.
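A quick pre-flight check can confirm that a configuration file is reachable before orchadmin is called. The sketch below is in Python and covers the three locations named above. The order in which they are tried ($APT_CONFIG_FILE first, then the working directory, then $APT_ORCHHOME/etc) is an assumption made for illustration, and find_config_file is a hypothetical helper, not part of DataStage.

```python
import os
from pathlib import Path

def find_config_file(env=None, cwd=None):
    """Return the configuration file orchadmin could use, or None.

    Assumed lookup order (illustrative only):
      1. $APT_CONFIG_FILE, if set
      2. config.apt in the working directory
      3. config.apt in $APT_ORCHHOME/etc
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)

    explicit = env.get("APT_CONFIG_FILE")
    if explicit:
        return Path(explicit)

    candidates = [cwd / "config.apt"]
    orchhome = env.get("APT_ORCHHOME")
    if orchhome:
        candidates.append(Path(orchhome) / "etc" / "config.apt")

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None

if __name__ == "__main__":
    found = find_config_file()
    print(found if found else "No configuration file found")
```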
The various commands available with orchadmin are:
1. CHECK: $orchadmin check
Validates the contents of the configuration file, such as the accessibility of all nodes defined
in the configuration file and the scratch disk definitions. Throws an error when the config file is not found or not defined
properly.
2. COPY: $orchadmin copy <source.ds> <destination.ds>
Makes a complete copy of the source data set under a new destination descriptor file name. Please note that:
a. You cannot use the UNIX cp command, as it
just copies the descriptor file to a new name. The data is not copied.
b. The new data set is laid out
according to the config file currently in use, not the old config
file that was in use when the source was created.
3. DELETE: $orchadmin <delete | rm> [options] descriptorfile.ds
The UNIX rm utility cannot be used to
delete data sets. Use the orchadmin delete or rm command to
delete one or more persistent data sets.
-f: Forces the delete. If some
nodes are not accessible, -f deletes the data set partitions on the accessible
nodes and leaves the partitions on inaccessible nodes as orphans.
-x: Uses the current config file
when deleting, rather than the one stored in the data set.
4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds
This is the single most important command.
Without any options, it lists the
number of partitions, number of segments, valid segments, and preserve-partitioning
flag of the persistent data set.
-c: Prints the configuration file that is
stored in the data set, if any.
-p: Lists the partition-level information.
-f: Lists the file-level information
in each partition.
-e: Lists the segment-level information.
-s: Lists the metadata schema of the data set.
-v: Lists all segments, valid or otherwise.
-l: Long listing. Equivalent to -f -p -s -v -e.
5. DUMP: $orchadmin dump [options] descriptorfile.ds
The dump command is used to dump (extract)
the records from the data set. Without any options, the dump command lists
all the records, starting from the first record in the first partition through the last record
in the last partition.
-delim '<string>': Uses the given
string as the delimiter for fields instead of a space.
-field <name>: Lists only the given
field instead of all fields.
-name: Lists all values preceded by
the field name and a colon.
-n numrecs: Lists only the given number of
records per partition.
-p period(N): Lists every Nth record from each partition, starting from
the first record.
-skip N: Skips the first N records from
each partition.
-x: Uses the current system configuration
file rather than the one stored in the data set.
6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds
Without options, deletes all the data (i.e.
segments) from the data set.
-f: Forces the truncate. Truncates
accessible segments and leaves the inaccessible ones.
-x: Uses the current system config file rather
than the default one stored in the data set.
-n N: Leaves the first N segments in each
partition and truncates the rest.
7. HELP: $orchadmin -help OR $orchadmin <command> -help
Displays help on the usage of orchadmin or of an individual orchadmin command.
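When orchadmin is called from a script, it can help to assemble the command line from named parameters rather than by string concatenation. The following is a minimal sketch in Python for the dump command, using only the options listed above. The function build_dump_command and the descriptor name customers.ds are hypothetical, and the sketch only builds and prints the command; it does not run orchadmin.

```python
import shlex

def build_dump_command(descriptor, delim=None, field=None, show_names=False,
                       numrecs=None, period=None, skip=None,
                       use_current_config=False):
    """Build an argument list for `orchadmin dump` from the listed options."""
    argv = ["orchadmin", "dump"]
    if delim is not None:
        argv += ["-delim", delim]        # field delimiter instead of a space
    if field is not None:
        argv += ["-field", field]        # list only this field
    if show_names:
        argv.append("-name")             # prefix values with the field name
    if numrecs is not None:
        argv += ["-n", str(numrecs)]     # records per partition
    if period is not None:
        argv += ["-p", str(period)]      # every Nth record per partition
    if skip is not None:
        argv += ["-skip", str(skip)]     # skip first N records per partition
    if use_current_config:
        argv.append("-x")                # use the current configuration file
    argv.append(descriptor)
    return argv

# Hypothetical data set name; prints a shell-quoted command line for review.
cmd = build_dump_command("customers.ds", delim="|", show_names=True,
                         numrecs=10, skip=5)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where orchadmin and a valid configuration file are available.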