The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link.
When creating Lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.
When performing lookups, Lookup File Set stages are used with Lookup stages.
When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If a column of the same name exists in both the source and lookup data sets, the source data set column goes to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup.
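The column naming rule above can be illustrated with a small Python sketch. This is not DataStage code; the function lookup_join and the column names (cust_id, region, tier) are hypothetical and only model the "source column wins unless you drop it first" behaviour.

```python
def lookup_join(source_row, lookup_row):
    """Combine a source row with its matched lookup row.

    On a column-name collision the source value is kept, mirroring the
    rule described in the text.
    """
    merged = dict(lookup_row)
    merged.update(source_row)  # source columns overwrite same-named lookup columns
    return merged


source = {"cust_id": 7, "region": "from_source"}
lookup = {"cust_id": 7, "region": "from_lookup", "tier": "gold"}

# Default: the source 'region' reaches the output.
default_out = lookup_join(source, lookup)

# To take 'region' from the lookup data, drop it from the source first.
trimmed = {name: value for name, value in source.items() if name != "region"}
replaced_out = lookup_join(trimmed, lookup)

print(default_out["region"])   # from_source
print(replaced_out["region"])  # from_lookup
```

In a real job the drop would be done in a stage upstream of the Lookup stage; the dict comprehension here only stands in for that step.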
A Hashed File is only available in server jobs. It uses a hashing algorithm (without building an index) to determine the location of keys within its structure. It is not amenable to parallelism. The contents of a hashed file may be cached in memory when using the Hashed File stage to service a reference input link. New rows to be written to a hashed file may first be written to a memory cache, then flushed to disk. All writes to a hashed file using an existing key overwrite the previous row. Duplicate key values are not permitted.
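As a rough illustration of the write semantics described above (one row per key, last write wins, optional write cache), here is a toy Python model. The class HashedFileSketch is hypothetical and says nothing about how a real hashed file is laid out on disk; making cached rows invisible to readers until they are flushed is a simplifying assumption.

```python
class HashedFileSketch:
    """Toy key-addressed store: one row per key, last write wins."""

    def __init__(self):
        self._disk = {}          # stands in for the hashed file on disk
        self._write_cache = {}   # rows written but not yet flushed

    def write(self, key, row):
        # A write with an existing key replaces the earlier row.
        self._write_cache[key] = row

    def flush(self):
        # Move cached rows to "disk", overwriting rows with the same key.
        self._disk.update(self._write_cache)
        self._write_cache.clear()

    def read(self, key):
        # Simplification: only flushed rows are visible to readers.
        return self._disk.get(key)


hf = HashedFileSketch()
hf.write("K1", {"name": "first"})
hf.write("K1", {"name": "second"})   # same key: overwrites the row above
hf.write("K2", {"name": "other"})
hf.flush()

print(hf.read("K1"))  # {'name': 'second'}
```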
A Lookup File Set is only available in parallel jobs. It uses an index (based
on a hash table) to determine the location of keys within its structure. It is
a parallel structure; it has its records spread over the processing nodes
specified when it was created. The records in the Lookup File Set are loaded
into a virtual Data Set before use, and the index is also loaded into memory.
Duplicate key values are (optionally) permitted. If the option is not selected,
duplicates are rejected when writing to the Lookup File Set.
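The duplicate handling described above can be sketched in Python as follows. The function build_lookup_index is hypothetical, and keeping the first row for a key while rejecting later ones is an assumption made for the sketch, not a documented ordering guarantee.

```python
def build_lookup_index(rows, key, allow_duplicates=False):
    """Build an in-memory index of rows by key.

    Returns (index, rejected). With allow_duplicates=False, only the first
    row seen for each key is kept and later rows go to 'rejected'.
    """
    index = {}
    rejected = []
    for row in rows:
        bucket = index.setdefault(row[key], [])
        if bucket and not allow_duplicates:
            rejected.append(row)   # key already present: reject this row
        else:
            bucket.append(row)
    return index, rejected


rows = [
    {"id": 1, "value": "a"},
    {"id": 1, "value": "b"},   # duplicate key
    {"id": 2, "value": "c"},
]

unique_index, rejected = build_lookup_index(rows, "id")
dup_index, no_rejects = build_lookup_index(rows, "id", allow_duplicates=True)

print(len(rejected))      # 1
print(len(dup_index[1]))  # 2
```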
http://www.dsxchange.com/viewtopic.php?t=93287
I did testing on a Windows machine, processing 100,000 primary rows against 100,000 lookup rows with a 1 to 1 match. The rows had two key fields of char 255 and two non-key fields, also of char 255. I deliberately chose fat key fields. The dataset as a lookup took 2-3 minutes. The fileset as a lookup took about 40 seconds. I ran it a few times with the same results.
One interesting result was memory utilisation: the fileset was consistently lighter than the dataset, by as much as 30% of RAM. This may be due to the keep/drop key field option of the fileset stage. If you set keep to false, the key fields in the fileset are not loaded into memory, as they are not required on the output side of the lookup. I am guessing that the fileset version was moving and storing 510 fewer characters for each lookup than the dataset version. In a normal lookup these key fields travel up the reference link and back down it again; in a lookup fileset they only travel up.
When I switched the same job onto an AIX box with several GB of RAM, I got 7 seconds for the dataset and 4 for the fileset. With an increase to 500,000 rows I got 23 seconds for the dataset and 7 seconds for the fileset. This difference may not be so apparent if your key fields are shorter. The major drawback of a lookup fileset is that it doesn't have the Append option of a dataset; you can only overwrite it.
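The 510-character figure in the test above comes from the two char 255 key fields. A back-of-envelope Python calculation makes the scale explicit. It only counts characters; it ignores any per-row overhead and does not assume a particular number of bytes per character, so it is an estimate of the key data involved, not a measurement of memory use.

```python
KEY_FIELDS = 2        # key columns in the test
NON_KEY_FIELDS = 2    # non-key columns in the test
CHAR_WIDTH = 255      # every column is char 255

key_chars_per_row = KEY_FIELDS * CHAR_WIDTH                      # 510
total_chars_per_row = (KEY_FIELDS + NON_KEY_FIELDS) * CHAR_WIDTH  # 1020


def key_chars_for(rows):
    """Characters of key data across 'rows' lookup rows."""
    return rows * key_chars_per_row


print(key_chars_per_row)        # 510
print(key_chars_for(100_000))   # 51000000
print(key_chars_for(500_000))   # 255000000
```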
Creating a lookup file set
1. In the Input Link Properties Tab:
– Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to specify multiple key columns. You must specify the key when you create the file set; you cannot specify it when performing the lookup.
– Specify the name of the Lookup File Set.
– Specify a lookup range, or accept the default setting of No.
– Set Allow Duplicates, or accept the default setting of False.
2. Ensure column metadata has been specified for the lookup file set.
Looking up a lookup file set
1. In the Output Link Properties Tab, specify the name of the lookup file set being used in the lookup.
2. Ensure column metadata has been specified for the lookup file set.
By default the stage will write to the file set in entire mode. The complete data set is written to each partition. If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method.
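To make "entire mode" concrete, the following Python sketch copies the complete reference data to every partition, so a lookup running on any partition can resolve any key locally. The functions write_entire and lookup and the sample columns are hypothetical; this models the idea only and is not how the stage is implemented.

```python
def write_entire(rows, partition_count):
    """Entire partitioning: every partition receives a full copy of the rows."""
    return [list(rows) for _ in range(partition_count)]


def lookup(partition, key_column, key_value):
    """Return the rows in one partition whose key column matches key_value."""
    return [row for row in partition if row[key_column] == key_value]


reference = [
    {"code": "A", "label": "alpha"},
    {"code": "B", "label": "beta"},
]

partitions = write_entire(reference, partition_count=3)

# The same key resolves on every partition, because each holds all rows.
hits_per_partition = [lookup(p, "code", "B") for p in partitions]
print(hits_per_partition[0])  # [{'code': 'B', 'label': 'beta'}]
```

The trade-off is storage: with N partitions the reference data is held N times.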