Datastage - Lookup File Set Stage

 The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link.

When creating Lookup file sets, one file will be created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs.

When performing lookups, Lookup File Set stages are used with Lookup stages.

When you use a Lookup File Set stage as a source for lookup data, there are special considerations about column naming. If you have columns of the same name in both the source and lookup data sets, the source data set column will go to the output data. If you want this column to be replaced by the column from the lookup data source, you need to drop the source data column before you perform the lookup

A Hashed File is only available in server jobs. It uses a hashing algorithm (without building an index) to determine the location of keys within its structure. It is not amenable to parallelism. The contents of a hashed file may be cached in memory when using the Hashed File stage to service a reference input link. New rows to be written to a hashed file may first be written to a memory cache, then flushed to disk. All writes to a hashed file using an existing key overwrite the previous row. Duplicate key values are not permitted.

A Lookup File Set is only available in parallel jobs. It uses an index (based on a hash table) to determine the location of keys within its structure. It is a parallel structure; it has its records spread over the processing nodes specified when it was created. The records in the Lookup File Set are loaded into a virtual Data Set before use, and the index is also loaded into memory. Duplicate key values are (optionally) permitted. If the option is not selected, duplicates are rejected when writing to the Lookup File Set.

http://www.dsxchange.com/viewtopic.php?t=93287

I did testing on a Windows machine processing 100,000 primary rows against 100,000 lookup rows with a 1 to 1 match. Two key fields of char 255 and two non key fields also of char 255. I deliberately chose fat key fields. The dataset as a lookup took 2-3 minutes. The fileset as a lookup took about 40 seconds. Ran it a few times with the same results.

One interesting result was memory utilisation, the fileset was consistently lighter then the dataset, by as much as 30% on RAM memory. This may be due to the keep/drop key field option of the fileset stage. If you set keep to false the key fields in the fileset are not loaded into memory as they are not required on the output side of the lookup. I am guessing that the fileset version was moving and storing 510 char less for each lookup then the dataset version. In a normal lookup these key fields travel up the reference link and back down it again, in a lookup fileset they only travel up.

When I switch the same job onto an AIX box with several gig of RAM I get 7 seconds for the dataset and 4 for the fileset. With an increase to 500,000 rows I get 23 seconds for the dataset and 7 seconds for the fileset. This difference may not be so apparent if your key fields are shorter. The major drawback of a lookup fileset is that it doesn't have the Append option of a dataset, you can only overwrite it.

Creating a lookup file set

1.       In the Input Link Properties Tab:

– Specify the key that the lookup on this file set will ultimately be performed on. You can repeat this property to specify multiple key columns. You must specify the key when you create the file set, you cannot specify it when performing the lookup

– Specify the name of the Lookup File Set.

– Specify a lookup range, or accept the default setting of No.

– Set Allow Duplicates, or accept the default setting of False.

2.       Ensure column meta data has been specified for the lookup file set.

 

Looking up a lookup file set

1.       In the Output Link Properties Tab specify the name of the lookup file set being used in the lookup.

2.       Ensure column meta data has been specified for the lookup file set.

 

By default the stage will write to the file set in entire mode. The complete data set is written to each partition. If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method.

 

 


Post a Comment

Previous Post Next Post

Contact Form