The Sequential File stage is a persistent storage file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link.
The stage executes in parallel mode
if reading multiple files but executes sequentially if it is only reading one
file. By default a complete file will be read
by a single node (although each node might read more than one file). For
fixed-width files, however, you can configure the stage to behave differently:
1. You can specify that a single file can be read by multiple nodes. This can improve performance on cluster systems.
2. You
can specify that a number of readers run on a single node. This means,
for example, that a single file can be partitioned as it is read (even though
the stage is constrained to running sequentially on the conductor node).
(These two options are mutually
exclusive.)
File
This property defines the flat file that
data will be read from. You can type in a pathname, or browse for a file. You
can specify multiple files by repeating the File property.
File pattern
Specifies a group of files to import. Specify a file containing a list of files, or a job parameter representing the file. The file can also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.
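For example, a file pattern file might contain a Bourne shell expression such as the following; the directory and naming convention are hypothetical:

    # Hypothetical example: expand to all daily customer extract files.
    ls /staging/extracts/customer_*.dat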
Read method
This property specifies whether you are reading from a specific file or files, or using a file pattern to select files (e.g., *.txt).
Missing file mode
Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the action is Error unless the file has a node name prefix of *:, in which case it is OK. The default is Depends.
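For example, under Depends, a File property given as the hypothetical path below would be skipped if the file is missing, because of the *: node name prefix:

    *:/staging/extracts/optional_feed.dat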
Keep file partitions
Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.
Reject mode
Allows you to specify behavior if a read record does not match the expected schema (that is, the record does not match the metadata defined in the column definitions). Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.
Report progress
Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.
Number of readers per node
This is an optional property that applies only to files containing fixed-length records; it is mutually exclusive with the Read from multiple nodes property. It specifies the number of instances of the file read operator on a processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file.
This provides a way of partitioning
the data contained in a single file. Each node reads a single file, but the
file can be divided according to the number of readers per node, and written to
separate partitions. This method can result in better I/O performance on an SMP
system.
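As a rough sketch of the arithmetic (the file size, record length, and reader count below are hypothetical, and this is an illustration rather than the operator's actual implementation), each reader is given a contiguous range of records:

    # Hypothetical illustration: four readers on one node dividing a
    # fixed-length file into contiguous record ranges.
    FILE_SIZE=400000000        # total file size in bytes
    RECORD_LEN=100             # fixed record length in bytes
    READERS=4                  # value of Number of readers per node
    RECORDS=$((FILE_SIZE / RECORD_LEN))
    PER_READER=$((RECORDS / READERS))
    echo "each reader handles roughly $PER_READER of $RECORDS records"
    # e.g. reader 0 reads records 0-999999, reader 1 reads 1000000-1999999, and so on.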
Read from multiple nodes
This is an optional property that applies only to files containing fixed-length records; it is mutually exclusive with the Number of readers per node property. Set this to Yes to allow individual files to be read by several nodes. This can improve performance on a cluster system. WebSphere DataStage knows the number of nodes available and, using the fixed record length and the actual size of the file to be read, allocates each node's reader a separate region within the file to process. The regions will be of roughly equal size.
Note that sequential row order cannot be maintained when reading a file in parallel.
File update mode
This property defines how the specified file or files are updated. The same method applies to all files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime. By default this property is set to Overwrite.
Using RCP With Sequential Stages
Runtime column propagation (RCP) allows WebSphere DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can define just the columns you are interested in using in a job, and ask WebSphere DataStage to propagate the other columns through the various stages, so that such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.
Sequential files, unlike most other data sources, do not have inherent column definitions, and so WebSphere DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that require a schema file are:
1. Sequential File
2. File Set
3. External Source
4. External Target
5. Column Import
6. Column Export
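For illustration, a schema file describing all the columns of a flat file might look like the sketch below; the column names and types are hypothetical and should be replaced with those of the actual file:

    record
      (cust_id: int32;
       cust_name: string[30];
       balance: decimal[10,2];
       last_order: nullable date;)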
Improving Sequential File Performance
If the source file is fixed width, the Readers Per
Node option can be used to read a single input file in parallel at
evenly-spaced offsets. Note that in this manner, input row order is not
maintained.
If the input sequential file cannot be read in
parallel, performance can still be improved by separating the file I/O from the
column parsing operation. To accomplish this, define a single large string
column for the non-parallel Sequential File read, and then pass this to a
Column Import stage to parse the file in parallel. The formatting and column
properties of the Column Import stage match those of the Sequential File stage.
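For example, the non-parallel Sequential File read could be defined with a single catch-all column (the column name below is hypothetical), which the downstream Column Import stage then parses into the real columns in parallel:

    record
      (full_record: string;)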
On heavily-loaded file servers or some RAID/SAN array
configurations, the environment variables
$APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE
can be used to improve I/O performance.
These settings specify the size of the read (import) and write (export) buffers in kilobytes, with a default of 128 (128 KB). Increasing these values may improve performance.
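For example, the buffers could be enlarged as sketched below; the value of 1024 is purely illustrative, and in practice these variables are normally set in the DataStage Administrator or as job parameters rather than in a shell:

    # Raise the import/export buffers from the 128 KB default to 1 MB (illustrative value).
    export APT_IMPORT_BUFFER_SIZE=1024
    export APT_EXPORT_BUFFER_SIZE=1024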
Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations. Some disk arrays have read-ahead caches that are only effective when data is read repeatedly in like-sized chunks; setting APT_CONSISTENT_BUFFERIO_SIZE=N forces stages to read data in chunks which are size N or a multiple of N.
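A sketch of such a setting, assuming a hypothetical 1 MB array chunk size:

    # Align reads and writes to an assumed 1 MB disk-array cache chunk (value in bytes).
    export APT_CONSISTENT_BUFFERIO_SIZE=1048576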
Partitioning Sequential File Reads
Care must be taken to choose the appropriate partitioning method when reading from a Sequential File stage:
· Don’t read from Sequential File using SAME
partitioning! Unless more than one source file is specified, SAME will read
the entire file into a single partition, making the entire downstream flow run
sequentially (unless it is later repartitioned).
· When multiple files are read by a single
Sequential File stage (using multiple files, or by using a File Pattern), each
file’s data is read into a separate partition. It is important to use ROUND-ROBIN
partitioning (or other partitioning appropriate to downstream components) to
evenly distribute the data in the flow.
Sequential File (Export) Buffering
By default, the Sequential File (export operator)
stage buffers its writes to optimize performance. When a job completes
successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT
allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty associated with increased I/O.
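For example, a near-real-time job might flush after every row; the value below illustrates that trade-off and is not a general recommendation:

    # Flush the write buffer after every row so records reach the file immediately.
    export APT_EXPORT_FLUSH_COUNT=1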
Reading from and Writing to Fixed-Length
Files
Particular attention must be paid when processing fixed-length fields using the Sequential File stage:
· If the incoming columns are variable-length data types (e.g., Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column.
Double-click on the column number in the grid dialog
to set this column property.
· If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.