IBM DataStage ETL development

Project target

ETL, short for Extract - Transform - Load, is the process of integrating business data that come from technically unrelated operational data stores. The target data model is an abstraction of all relevant interrelated business entities. The integration is a precondition for any meaningful inference - the purpose of business intelligence.

As a rule, the processing framework is part of the application arhitecture which includes a sequence of load phases where data are moved from one layer to the next. According to best practices each load phase is implemented as a separate module or job.

In this case the challange was finding balance between fast time-to-market and limited resources: a number of source tables to be loaded - resulting in several hundreds of jobs to be written.

Our contribution

The developer guidelines of the customer prescribes the use of a specific ETL software: IBM DataStage. It is one of the many visual tools for the design, implementation and control of ETL jobs. Using a graphical user interface is in many cases more performant than a command line tool. If the tool offers an object oriented approach then a single class can cover most requirements for an arbitrary number of jobs.

Unfortunately very few ETL tools support object oriented design. The likely reason is the anti-object-oriented design of the SQL - the native language of most databases. The tool at hand does not even have an API for crafting an ETL module. In that case a scripting language can be useful to avoid the horror of repeating a long sequence of mouse operations several hundred times.

Our choice was Jawk - an interpreter / compiler for the AWK programming language implemented in Java. It provides an extensible execution environment around the well-known scripting language. It is also a nice example of free software - thanks to the author and the community.

Jawk was used to decompose a suitable DataStage job template and to generate a whole bunch of different building blocks for subsequent assembly. The generation was controlled via several parameter lists that described essential features of the source. Jawk is not a replacement for the missing API but the unconventional approach helped saving valuable resources.