Datastage - Dataset

The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.

What is a data set? Parallel jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other WebSphere DataStage jobs. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the WebSphere DataStage Designer or Director.

A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system.

The descriptor file for a data set contains the following information:

1. Data set header information.

2. Creation time and date of the data set.

3. The schema (metadata) of the data set.

4. A copy of the configuration file used when the data set was created.

Data sets are the structured internal representation of data within the Parallel Framework. A data set consists of:

1. A framework schema (format: name, type, nullability).

2. Data records (the data itself).

3. Partitions (the subset of rows for each node).

Virtual data sets exist in memory and correspond to links in DataStage Designer.

Persistent data sets are stored on disk and consist of:

1. A descriptor file (metadata, configuration file, data file locations, flags).

2. Multiple data files (one per node, stored in the disk resource file systems), for example:

node1:/local/disk1/…

node2:/local/disk2/…

There is no “DataSet” operator; the Designer GUI inserts a copy operator.
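To make the schema, records and partitions structure concrete, here is a toy in-memory model in Python. It is only a sketch: the class name, the node names and the round-robin distribution are assumptions chosen for illustration, and it says nothing about the real internal format of a data set.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataSet:
    """Toy model: a schema plus records split into one partition per node."""
    schema: list                                     # (name, type, nullable) tuples
    partitions: dict = field(default_factory=dict)   # node name -> list of records


def partition_round_robin(schema, records, nodes):
    """Distribute records across nodes in turn (illustrative partitioning only)."""
    ds = ToyDataSet(schema=schema, partitions={node: [] for node in nodes})
    for i, record in enumerate(records):
        ds.partitions[nodes[i % len(nodes)]].append(record)
    return ds


# Hypothetical schema and rows, used only to show the shape of the model.
schema = [("id", "int32", False), ("name", "string", True)]
records = [(1, "a"), (2, "b"), (3, None), (4, "d"), (5, "e")]

ds = partition_round_robin(schema, records, ["node1", "node2"])
for node, rows in ds.partitions.items():
    print(node, rows)
```

Each node ends up holding its own subset of the rows, which is the property that lets a persistent data set keep one data file per node.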

When to Use Persistent Data Sets

When writing intermediate results between DataStage EE jobs, always write to persistent Data Sets (checkpoints)

Stored in native internal format (no conversion overhead)

Retain data partitioning and sort order (end-to-end parallelism across jobs)

Maximum performance through parallel I/O

Why Data Sets are not intended for long-term or archive storage

Internal format is subject to change with new DataStage releases

Requires access to named resources (node names, file system paths, etc.)

Binary format is platform-specific

For fail-over scenarios, servers should be able to cross-mount file systems.

You can read a data set as long as your current $APT_CONFIG_FILE defines the same node names (the fastnames may differ).

The orchadmin -x option lets you recover data from a data set if the node names are no longer available.

Data Set Management

1. Viewing the schema

Click the Schema icon from the tool bar to view the record schema of the current data set. This is presented in text form in the Record Schema window.

2. Viewing the data

Click the Data icon from the tool bar to view the data held by the current data set. This opens the Data Viewer Options dialog box, which allows you to select a subset of the data to view.

Rows to display. Specify the number of rows of data you want the data browser to display.

Skip count. Skip the specified number of rows before viewing data.

Period. Display every Pth record, where P is the period. You can start after records have been skipped by using the Skip property. P must be greater than or equal to 1.

Partitions. Choose between viewing the data in All partitions or the data in the partition selected from the drop-down list.

Click OK to view the selected data; the Data Viewer window appears.
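The way the skip count, period and row limit interact can be sketched in a few lines of Python. This is a hedged illustration, not the viewer's actual implementation: the function name view_rows is hypothetical, and it assumes that rows are skipped first, then sampled every Pth row starting from the first remaining row, and that the row limit applies to each partition separately.

```python
from itertools import islice


def view_rows(partitions, rows_to_display, skip=0, period=1, partition=None):
    """Return the rows selected by the viewer options, keyed by partition."""
    if period < 1:
        raise ValueError("period must be greater than or equal to 1")

    # Either look at every partition or only at the one that was chosen.
    chosen = partitions if partition is None else {partition: partitions[partition]}

    result = {}
    for name, rows in chosen.items():
        # Drop the first `skip` rows, then keep every `period`-th row.
        sampled = islice(rows, skip, None, period)
        # Assumption: the row limit is applied per partition.
        result[name] = list(islice(sampled, rows_to_display))
    return result


# Two made-up partitions of ten rows each.
partitions = {0: list(range(10)), 1: list(range(100, 110))}
print(view_rows(partitions, rows_to_display=3, skip=2, period=3))
```

With a skip count of 2 and a period of 3, the rows kept from each partition are those at positions 2, 5 and 8.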

3. Copying data sets

Click the Copy icon on the tool bar to copy the selected data set. The Copy data set dialog box appears, allowing you to specify a path where the new data set will be stored. The new data set will have the same record schema, number of partitions and contents as the original data set.

Note: You cannot use the UNIX cp command to copy a data set because WebSphere DataStage represents a single data set with multiple files.

4. Deleting data sets

Click the Delete icon on the tool bar to delete the current data set. You will be asked to confirm the deletion.

Note: You cannot use the UNIX rm command to delete a data set because WebSphere DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.

Orchadmin Commands

Orchadmin is a command-line utility provided by DataStage for inspecting and managing data sets.

The general invocation format is: $orchadmin <command> [options] [descriptor file]

Before using orchadmin, make sure that either the working directory or $APT_ORCHHOME/etc contains the file config.apt, or that the environment variable $APT_CONFIG_FILE is defined for your session.
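A quick pre-flight check for this requirement could look like the Python sketch below. The function name find_config_file is hypothetical, and the lookup order (the environment variable first, then the working directory, then $APT_ORCHHOME/etc) is an assumption made for the example; the text above only says that one of these must be in place.

```python
import os
from pathlib import Path


def find_config_file(env=None, cwd=None):
    """Return the path of a configuration file orchadmin could use, or None."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)

    # Assumption: an explicitly defined $APT_CONFIG_FILE is checked first.
    explicit = env.get("APT_CONFIG_FILE")
    if explicit and Path(explicit).is_file():
        return Path(explicit)

    # Otherwise look for config.apt in the working directory and in
    # $APT_ORCHHOME/etc (the latter only if the variable is set).
    candidates = [cwd / "config.apt"]
    orch_home = env.get("APT_ORCHHOME")
    if orch_home:
        candidates.append(Path(orch_home) / "etc" / "config.apt")

    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


found = find_config_file()
print(found if found else "No configuration file found; define $APT_CONFIG_FILE")
```

Running a check like this before calling orchadmin from a script gives a clearer message than a failure from the tool itself.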

The various commands available with orchadmin are

1. CHECK: $orchadmin check

Validates the contents of the configuration file, such as the accessibility of all nodes defined in it and the scratch disk definitions. Throws an error when the configuration file is not found or is not defined properly.

2. COPY : $orchadmin copy <source.ds> <destination.ds>

Makes a complete copy of the source data set under the new destination descriptor file name. Please note that:

a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name. The data is not copied.

b. The new data set is laid out according to the configuration file currently in use, not the configuration file that was in use when the source data set was created.

3. DELETE : $orchadmin < delete | del | rm > [-f | -x] descriptorfiles….

The UNIX rm utility cannot be used to delete data sets. Use the orchadmin delete (or del, or rm) command to delete one or more persistent data sets.

-f : Forces the delete. If some nodes are not accessible, -f deletes the data set partitions on the accessible nodes and leaves the partitions on the inaccessible nodes as orphans.

-x : Uses the current configuration file when deleting, rather than the one stored in the data set.

4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds

This is the single most important command.

Without any options, it lists the number of partitions, the number of segments, the valid segments, and the preserve-partitioning flag of the persistent data set.

-c : Prints the configuration file stored in the data set, if any.

-p : Lists partition-level information.

-f : Lists file-level information for each partition.

-e : Lists segment-level information.

-s : Lists the metadata schema of the data set.

-v : Lists all segments, valid or otherwise.

-l : Long listing. Equivalent to -f -p -s -v -e.

5. DUMP: $orchadmin dump [options] descriptorfile.ds

The dump command extracts the records from the data set. Without any options, it lists all records, from the first record of the first partition to the last record of the last partition.

-delim '<string>' : Uses the given string as the field delimiter instead of a space.

-field <name> : Lists only the given field instead of all fields.

-name : Lists each value preceded by its field name and a colon.

-n numrecs : Lists only the given number of records per partition.

-p period(N) : Lists every Nth record from each partition, starting from the first record.

-skip N : Skips the first N records from each partition.

-x : Uses the current system configuration file rather than the one stored in the data set.
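When dump is called from a script, it can help to assemble the argument list programmatically instead of concatenating strings. The Python sketch below does that using only the options listed above. The function name build_dump_command and the descriptor name customers.ds are hypothetical, and the flags have not been checked against any particular DataStage release.

```python
import shlex


def build_dump_command(descriptor, delim=None, field=None, show_names=False,
                       numrecs=None, period=None, skip=None,
                       use_current_config=False):
    """Build an argument list for `orchadmin dump` from the documented options."""
    args = ["orchadmin", "dump"]
    if delim is not None:
        args += ["-delim", delim]         # field delimiter instead of a space
    if field is not None:
        args += ["-field", field]         # list only this field
    if show_names:
        args.append("-name")              # prefix each value with its field name
    if numrecs is not None:
        args += ["-n", str(numrecs)]      # records per partition
    if period is not None:
        args += ["-p", str(period)]       # every Nth record per partition
    if skip is not None:
        args += ["-skip", str(skip)]      # skip the first N records per partition
    if use_current_config:
        args.append("-x")                 # use the current configuration file
    args.append(descriptor)
    return args


# customers.ds is a made-up descriptor file name.
cmd = build_dump_command("customers.ds", delim="|", numrecs=5, skip=10)
print(shlex.join(cmd))

# On a machine with DataStage installed, the list could be passed to
# subprocess.run(cmd, check=True) to execute it.
```

Keeping the command as a list avoids shell quoting problems with delimiters such as the pipe character.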

6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds

Without options, deletes all the data (i.e. the segments) from the data set.

-f : Forces the truncate. Truncates the accessible segments and leaves the inaccessible ones.

-x : Uses the current system configuration file rather than the one stored in the data set.

-n N : Leaves the first N segments in each partition and truncates the rest.

7. HELP: $orchadmin -help OR $orchadmin <command> -help

Displays help on the usage of orchadmin or of an individual orchadmin command.
