Extract, Transform, and Load (ETL) scenarios are sometimes perceived as mundane and, dare I say it, even boring. Yet I find them fascinating opportunities to apply out-of-the-box thinking. Data rarely presents in just the format expected by SQL*Loader or external table syntax. On a good day one builds a straightforward external table and the job is done. Or is that on a bad day? On a good day, on a truly good day, one has the opportunity to approach a load as if a child again playing with LEGO bricks. Taking the various features of Oracle Database, how do you snap them together to obtain a desired result?

The Problem

Recently I received a request for help from a reader seeking a way to load from a data file having one header record, one footer record, and many detail records sandwiched in between. Date and time values from the header record were to be applied to each of the detail rows. Following is an example data file, essentially in the same form as the one the reader was dealing with:

START;20121223;12:38:00
00001;Specialized Stumpjumper FSR;29;
00002;Salsa Big Mama;29;
00003;Gary Fisher Tassajara;26;
00004;Haro Beasley Singlespeed;27;
END;X

The desired result is to insert the data as if the date and time in the header row were included in each of the detail rows. For example:

INSERT INTO bike VALUES (00001, 'Specialized Stumpjumper FSR', 29,
    TO_TIMESTAMP('20121223' || '12:38:00', 'yyyymmddhh24:mi:ss'));

I enjoy this type of problem. There’s no built-in feature specifically for this scenario. The challenge of rummaging through the available feature set and snapping disparate pieces together into a well-crafted solution is stimulating. My first attempt at an “obvious” solution was a failure, but that served only to make the problem all the more interesting. The game was afoot!

Solution Approach

Figure 1 shows the solution I eventually settled upon. I’ll walk you through it in this article. It involves two external tables: one matching the header record, and the other matching the detail records. A preprocessing script on the first table ensures only the one header record is returned by any query against that table. With the external tables in place, the load is a perfect example of when PL/SQL can be used to good effect.

Solution Approach Flow Diagram

Figure 1. Solution flow diagram

Is a SQL*Loader solution possible? The answer is yes. I was able to conjure up not one, but actually two solutions based around SQL*Loader. Those solutions were fun hacks to invent, but I favor the more straightforward approach shown in Figure 1 for a production system.

Step 1: Create the Operating System Directories

First, create the following three operating system directories. The goal here is to keep the data files separate from the log files, and especially to keep control over the scripts directory.

[oracle@apex ~]$ ls -l
drwxr-xr-x 2 oracle oinstall 4096 Dec 25 22:00 etl_data
drwxr-xr-x 2 oracle oinstall 4096 Dec 23 22:55 etl_logs
drwxr-xr-x 2 oracle oinstall 4096 Dec 23 22:23 etl_scripts
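If you’re following along, one way to create them, assuming /home/oracle as the base directory as in this example, is:

[oracle@apex ~]$ mkdir etl_data etl_logs etl_scripts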

It’s vital that you as the database administrator (DBA) keep tight control over the scripts directory. I can’t emphasize that enough. Preprocessing scripts are executed by the Oracle server software as the Oracle software owner. You can well imagine the risk from allowing just anyone to create and execute a script as user oracle. Keep a tight lid on preprocessing scripts. Review each one.
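For example, you might remove group and world access so that only the oracle software owner can enter the directory or modify its contents:

[oracle@apex ~]$ chmod 700 etl_scripts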

Step 2: Create the Oracle Directory Objects

Next, create Oracle Database directory objects that map to the three operating system directories. These provide a level of operating-system independence in your ETL solutions. I logged in as SYSDBA to create these objects. For example:

CONNECT / AS SYSDBA
CREATE DIRECTORY bike_etl AS '/home/oracle/etl_data';
CREATE DIRECTORY etl_logs AS '/home/oracle/etl_logs';
CREATE DIRECTORY etl_scripts AS '/home/oracle/etl_scripts';

While you’re still logged in as SYSDBA, take the time to grant yourself execute access to the scripts directory. Here’s the command to run:

GRANT EXECUTE ON DIRECTORY etl_scripts TO GENNICK;
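Depending on your environment, you’ll likely also need grants of read access on the data directory and read and write access on the log directory, so that the external tables can read the data file and write their log files. For example:

GRANT READ ON DIRECTORY bike_etl TO GENNICK;
GRANT READ, WRITE ON DIRECTORY etl_logs TO GENNICK;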

The need to grant access on the scripts directory is due to the security impact of preprocessing scripts running as the Oracle software owner. Oracle Database does not by default allow execution of such scripts. You as the DBA must consciously choose to grant the needed access. Keep a tight lid on whom you grant it to.

From here on out, everything can be done while logged in as the user who’ll ultimately be running the load. I logged in as user GENNICK.

Step 3: Write the Preprocessing Script

Preprocessing scripts are new in Oracle Database 11g Release 2. Arup Nanda has written a nice article about them that you may also wish to read. Using the feature you can completely transform a data file prior to its being read through an external table.

For our scenario, create a script with the following contents to return just the first record of a file. The file’s name will be passed as the first parameter, indicated by the $1 marker; quoting the parameter guards against spaces in the file name. Name the script bike_inventory_head.sh. Place the script in your scripts directory.

#!/bin/sh
/usr/bin/head -n 1 "$1"

Issue a chmod command to make the script executable:

chmod +x bike_inventory_head.sh

Switch to your data directory and test the script as follows:

[oracle@apex etl_data]$ ../etl_scripts/bike_inventory_head.sh bike_inventory.dat
START;20121223;12:38:00
[oracle@apex etl_data]$

You should get back the very first line from the data file. It’s a simple script. Debug it now if needed. It’s easier to test and debug these scripts from the command line than through an external table operation.

Step 4: Create the Target Table

You’ll want to create a target table if you’re following along by working the example. Create the table bike, as follows.

CREATE TABLE bike (
    bike_id                    NUMBER,
    bike_description           VARCHAR2(30),
    bike_nominal_tire_size     NUMBER,
    bike_etl_extract_time      TIMESTAMP
);

Table bike is the target. The ETL procedure to follow reads from bike_inventory.dat, and loads the data into bike.

Step 5: Create the First External Table

Now we’re getting to the fun part. Create the following external table in the same schema as bike. The purpose of this first table is to give easy access to the header record. The one important detail to notice is the PREPROCESSOR directive specifying the script created earlier in Step 3.

CREATE TABLE bike_load_stamp (
    bike_row_flag            VARCHAR2(5),
    bike_etl_extract_date    VARCHAR2(8),
    bike_etl_extract_time    VARCHAR2(8)
)
ORGANIZATION EXTERNAL (
    TYPE oracle_loader
    DEFAULT DIRECTORY bike_etl
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR etl_scripts:'bike_inventory_head.sh'
        NOBADFILE NODISCARDFILE
        LOGFILE etl_logs:'bike_load_stamp.log'
        FIELDS TERMINATED BY ';'
    )
    LOCATION ('bike_inventory.dat')
);

Create the table, and then query it as in the following example. You should get one row back. That row should contain the header information.

SQL> SELECT * FROM bike_load_stamp;

BIKE_ BIKE_ETL BIKE_ETL
----- -------- --------
START 20121223 12:38:00

What happened to all the other records in the input file? The answer is they aren’t there. As far as Oracle Database is concerned, the one row returned by the preprocessor script is all there is. Figure 2 walks you through the process.

Reading Header Record

Figure 2. Reading the header record.

If you aren’t on Oracle Database 11g Release 2 yet, you can still apply the technique of a preprocessing script. You’ll just need to somehow trigger execution of the script from outside the database, and you’ll need to write the one header row to a temporary file. You’ll have to do more work for the same effect. Oracle’s preprocessing feature saves work and eliminates the need for a temporary file. It’s a nice feature.
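A minimal sketch of that pre-release-2 approach, run from the data directory before each load (the file name bike_header.dat is my own choice):

[oracle@apex etl_data]$ ../etl_scripts/bike_inventory_head.sh bike_inventory.dat > bike_header.dat

You would then define bike_load_stamp without the PREPROCESSOR directive and point its LOCATION clause at 'bike_header.dat' instead.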

Interestingly, I at first missed the need for a preprocessing script. I mistakenly thought that I could simply use a WHERE ROWNUM = 1 predicate to read only the first row from the input file. My reward was an error due to the reject limit being exceeded. For example:

SQL> SELECT * FROM bike_load_stamp WHERE ROWNUM = 1;
SELECT * FROM bike_load_stamp WHERE ROWNUM = 1
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-30653: reject limit reached

Removing the reject limit by adding REJECT LIMIT UNLIMITED to the end of the CREATE TABLE statement solved one problem only to create another: A look at my log file showed that the database engine was reading all the records from the input file. It offends my sense of elegance to read an entire data file just to get the first record.

Next I tried an OPEN, FETCH, CLOSE sequence using a cursor from PL/SQL. That generated the same results – more records read than I really wanted, errors thrown, my sense of elegance offended. Using the preprocessor script keeps everything neat and tidy. Only one record is read. Simple. Elegant.

Step 6: Create the Second External Table

One preliminary remains before performing the actual load: creating a second external table, this time the table through which to read the detail records. Following is the code. Notice in particular the LOAD WHEN clause.

CREATE TABLE bike_load (
    bike_id                     NUMBER,
    bike_description            VARCHAR2(30),
    bike_nominal_tire_size      NUMBER
)
ORGANIZATION EXTERNAL (
    TYPE oracle_loader
    DEFAULT DIRECTORY bike_etl
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        LOAD WHEN ((1:5) != 'START' AND (1:3) != 'END')
        BADFILE etl_logs:'bike_inventory'
        DISCARDFILE etl_logs:'bike_inventory'
        LOGFILE etl_logs:'bike_inventory'
        FIELDS TERMINATED BY ';'
    )
    LOCATION ('bike_inventory.dat')
);

The LOAD WHEN clause skips the header and footer records, which begin with “START” and “END” respectively. Query the table, and you’ll see the data from bike_inventory.dat. For example:

SQL> SELECT * FROM bike_load;

   BIKE_ID BIKE_DESCRIPTION               BIKE_NOMINAL_TIRE_SIZE
---------- ------------------------------ ----------------------
         1 Specialized Stumpjumper FSR                        29
         2 Salsa Big Mama                                     29
         3 Gary Fisher Tassajara                              26
         4 Haro Beasley Singlespeed                           27

All the pieces are in place. Now you can run the load.

Step 7: Run the Load

Remember the original requirement? It was to include the date and time from the header record with each new row inserted into the target table. You can implement the logic easily in PL/SQL. At a high level, the steps are simple:

  1. Query for the header record to get the date and time in text form.

  2. Convert the date and time into a value of type TIMESTAMP.

  3. Perform the load by executing an INSERT…SELECT statement.

  4. Include the TIMESTAMP value as one of the selected columns.

Following is a PL/SQL block implementing the above logic. I’ve kept the exception handling simple to focus on the core logic of the block. However, the exception handling is still sufficient for the task. You’ll see that I’ve specifically written a check to verify the assumption that the first record is a header record beginning with START. It’s a good habit to include code to verify assumptions.

DECLARE
    row_flag VARCHAR2(5);
    extract_date VARCHAR2(8);
    extract_time VARCHAR2(8);

    extract_timestamp TIMESTAMP;
BEGIN
     /* The following will throw an error if for some reason our 
         preprocessor script fails and more than one row is returned. */
    SELECT bike_row_flag, bike_etl_extract_date, bike_etl_extract_time
    INTO row_flag, extract_date, extract_time
    FROM bike_load_stamp;

    /* Throw an error if the row_flag is not 'START' as expected. */
    IF (row_flag IS NULL) OR (row_flag != 'START') THEN 
        RAISE_APPLICATION_ERROR(-20987, 'No START Row!');
    END IF;

    /* Convert the date and time from text into an Oracle timestamp.
        This conversion will throw an error if the text doesn't match
        the format mask. */
    extract_timestamp := TO_TIMESTAMP (
        extract_date || extract_time, 'yyyymmddhh24:mi:ss');

    /* Load the detail records. */
    INSERT INTO bike
    SELECT bike_id, bike_description, bike_nominal_tire_size
         , extract_timestamp
    FROM bike_load;
END;
/

Execute this PL/SQL block. Query the bike table. You should have data:

SQL> SELECT * FROM bike;

BIKE_ID BIKE_DESCRIPTION            TIRE SIZE BIKE_ETL_EXTRACT_TIME
------- --------------------------- --------- ----------------------------
      1 Specialized Stumpjumper FSR        29 23-DEC-12 12.38.00.000000 PM
      2 Salsa Big Mama                     29 23-DEC-12 12.38.00.000000 PM
      3 Gary Fisher Tassajara              26 23-DEC-12 12.38.00.000000 PM
      4 Haro Beasley Singlespeed           27 23-DEC-12 12.38.00.000000 PM

It may seem a bit odd at first to have two external tables essentially reading from the same operating-system file. Don’t let that dissuade you. The approach works well in this case. It’s an elegant solution to the original requirement.

Subsequent loads are trivial. Once the external tables are in place and the preprocessing script is written and debugged, loading again is as easy as copying in a new data file and executing the PL/SQL block. If you like, you can convert the block into a stored procedure.
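For example, here’s a minimal sketch of such a procedure (the name load_bike_inventory is my own choice; the body is the same logic as the anonymous block above):

CREATE OR REPLACE PROCEDURE load_bike_inventory AS
    row_flag VARCHAR2(5);
    extract_date VARCHAR2(8);
    extract_time VARCHAR2(8);
    extract_timestamp TIMESTAMP;
BEGIN
    /* Same logic as the anonymous block shown earlier. */
    SELECT bike_row_flag, bike_etl_extract_date, bike_etl_extract_time
    INTO row_flag, extract_date, extract_time
    FROM bike_load_stamp;

    IF (row_flag IS NULL) OR (row_flag != 'START') THEN
        RAISE_APPLICATION_ERROR(-20987, 'No START Row!');
    END IF;

    extract_timestamp := TO_TIMESTAMP (
        extract_date || extract_time, 'yyyymmddhh24:mi:ss');

    INSERT INTO bike
    SELECT bike_id, bike_description, bike_nominal_tire_size
         , extract_timestamp
    FROM bike_load;
END;
/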

A SQL*Loader Solution

What? You just absolutely must use SQL*Loader?

Trust me. I understand.

Perhaps your site won’t allow preprocessing scripts due to the security issues. Or maybe you need to get the load done today, and you can’t get on your DBA’s priority list until next week. External tables are wonderful, but using them tends to require some amount of DBA support. If you’re not the DBA and you need to get some data loaded, sometimes you just do what you’ve got to do. With that in mind, I’ll quickly present one possible solution built around SQL*Loader.

Begin by clearing the data you’ve just loaded using the external table solution:

DELETE FROM bike;

Then create the following table to hold the header row:

CREATE TABLE bike_load_header (
    bike_row_flag            VARCHAR2(5),
    bike_etl_extract_date    VARCHAR2(8),
    bike_etl_extract_time    VARCHAR2(8)
);

Now create the following PL/SQL package. The package caches the timestamp from the header record. SQL*Loader will invoke the package’s get_timestamp function for each detail row.

CREATE OR REPLACE PACKAGE bl03 AS
    extract_timestamp TIMESTAMP := NULL;
    FUNCTION get_timestamp RETURN TIMESTAMP;
END;
/

CREATE OR REPLACE PACKAGE BODY bl03 AS
    FUNCTION get_timestamp RETURN TIMESTAMP IS
    BEGIN
        RETURN extract_timestamp;
    END;
BEGIN
    SELECT TO_TIMESTAMP (
        bike_etl_extract_date || bike_etl_extract_time,
        'yyyymmddhh24:mi:ss')
    INTO extract_timestamp
    FROM bike_load_header;
END;
/

Create the following control file to tie everything together. The WHEN clauses in the control file direct the header record into the bike_load_header table, and all the detail records into the bike table. Name the control file bike_load.ctl.

LOAD DATA
INFILE 'bike_inventory.dat'
DISCARDFILE 'bike_inventory.dis'
INTO TABLE bike_load_header
REPLACE
WHEN (bike_row_flag = 'START')
(
    bike_row_flag           CHAR TERMINATED BY ';',
    bike_etl_extract_date   CHAR TERMINATED BY ';',
    bike_etl_extract_time   CHAR TERMINATED BY ';'
)
INTO TABLE bike
APPEND
WHEN (bike_id <> 'END') AND (bike_id <> 'START')
TRAILING NULLCOLS
(
    bike_id POSITION (1)    CHAR TERMINATED BY ';',
    bike_description        CHAR TERMINATED BY ';',
    bike_nominal_tire_size  CHAR TERMINATED BY ';',
    bike_etl_extract_time   CHAR TERMINATED BY ';'
        "bl03.get_timestamp"
)

Finally, execute the load:

[oracle@apex etl_data]$ sqlldr gennick control=bike_load.ctl
Password:

SQL*Loader: Release 11.2.0.2.0 - Production on Thu Dec 27 20:51:09 2012

Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.

Commit point reached - logical record count 6
[oracle@apex etl_data]$

Here’s a short summary of how the load progresses:

  1. SQL*Loader reads the first record, which is the header record.

  2. The header record matches the first WHEN clause in the SQL*Loader control file, so an insert is generated into the bike_load_header table.

  3. The REPLACE option causes SQL*Loader to delete any header data from an earlier load.

  4. The insert completes, saving the header data in bike_load_header.

  5. Subsequent, non-header records match the second WHEN clause, generating inserts into table bike.

  6. The value of the fourth column for table bike is derived from a call to the function bl03.get_timestamp.

  7. The first invocation of the function triggers execution of package bl03's initialization code, which queries table bike_load_header. The resulting timestamp is stored in a package variable and persists for the duration of the session.

  8. That timestamp is returned for each invocation of the function, and is thus supplied to each of the detail records in the input file.

  9. The final record is a footer record. The footer record fails both WHEN clauses, so it is effectively ignored.

The use of TRAILING NULLCOLS following the second WHEN clause prevents SQL*Loader from throwing an error because data records have only three fields, whereas the SQL*Loader control file describes four fields. Remove the TRAILING NULLCOLS option, and every one of the detail records will be perceived as being too short, and will throw an error.

The specification POSITION (1) in the second field listing is especially worth noticing. When you’re describing delimited fields, SQL*Loader moves left to right unless you specifically reset the position to the beginning. Without that use of POSITION (1), SQL*Loader would expect bike_id to be the fourth delimited column, bike_description to be the fifth delimited column, and so forth. Since the second field listing is a complete and alternate description of an input record, you must specify POSITION(1) so that SQL*Loader knows to back up and start over counting again from the beginning of the record.

Bottom Line

I prefer the external table solution. It’s simple and elegant. It’s a forward-looking solution making use of features that Oracle is driving towards rather than away from. Yet it requires some DBA assistance to put all the pieces in place, and the security considerations surrounding preprocessing scripts might be off-putting in some environments.

If preprocessing scripts aren’t an option – whether from site policy or from running an older release of Oracle Database – then the SQL*Loader solution is a good hack that can get the job done. Or perhaps you’re simply more comfortable using SQL*Loader.

An advantage in SQL*Loader’s favor is that it gives the option of loading to a remote database. Imagine, for example, that you’re running an Oracle Database instance through the Amazon Relational Database Service. You’ll have a listener endpoint through which to connect to the database, but you won’t have any server-level access at all. SQL*Loader can get the job done in that scenario.
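For example, given a net service name for the remote database in your tnsnames.ora (the name rdsdb here is hypothetical), the very same control file drives the load across the network:

[oracle@apex etl_data]$ sqlldr gennick@rdsdb control=bike_load.ctl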

An advantage of the external table approach is the ability to run a direct-path load. Simply add the APPEND hint to the final INSERT statement in the PL/SQL block, as follows:

    /* Load the detail records. */
    INSERT /*+ APPEND */ INTO bike
    SELECT bike_id, bike_description, bike_nominal_tire_size
         , extract_timestamp
    FROM bike_load;

Run the block. You’ll get a direct-path load. You can tell, because you’ll need to commit before querying the target table from the same session.
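For example, a query from the same session before committing fails along these lines (exact wording can vary by release):

SQL> SELECT * FROM bike;
SELECT * FROM bike
*
ERROR at line 1:
ORA-12838: cannot read/modify an object after modifying it in parallel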

A direct-path load isn’t possible in the SQL*Loader solution. That’s because SQL processing (i.e., the function call to get_timestamp) is involved during the load process.

More solutions are possible than I can show in this article. The SQL*Loader solution can be modified to use a trigger instead of a stored procedure. The external table solution can be rebuilt around a pipelined table function – an approach I plan to cover in a follow-up article. It really is like building with LEGO bricks. Put the pieces together in different ways. Explore your options. Ignore what others think. Choose a solution that works for you given your skill set and current working environment, and the amount and frequency of data to be loaded.

Acknowledgments: Credit and gratitude to Darl Kuhn, Tim Gorman, and Arup Nanda for reviewing an early draft of this article. Any errors I’ve managed to sneak by their careful scrutiny are entirely on my own head. Especial thanks to Vipul Jain for being the reader who set me down the path of exploring some fun solutions to an interesting and real-life problem.
