ETL / Data Integration

Consideration on Success of ETL Architecture-

1- A proper logging/Error strategy is the key to the success of any ETL architecture

2- A well designed process support auditing of row counts, measures and other metrics

3- Understand from where data originated from, when it was loaded, how it was transformed is essential for the integrity of the downstream data

4- Create reusable code in ETL process that reduces the total efforts required to maintained the ETL architecture

Following points should take care during in ETL process designing-

1- ETL tools support properties or configuration file which can be used to store Source and

Target Connection and Global Variables information for ETL process. Using properties

files is easy in working different project environment (test, uat, prod) by changing details in a

one place instead of changing in all jobs and components.

2- Create Reusable connection and Metadata in start of the ETL process

3- Load external library in start of process.

4- Create you Source and Target field mapping document in an easy to read and accessible

format

5- Staging Data base make life easier of developer. Query or Extract data from staging

Data base instead of querying directly from OLTP/DSS system for data analysis and

migration

6- Perform all types of join at source system and make a query or view and use those

Queries and Views in ETL process

7- Always filter data at Source system instead of filtering data in ETL process. Only extract

minimum data from Source system using WHERE clause

8- If you ETL process required Filter, Sorting and Joining then always Filter your data

before Sorting and Joining which optimize memory consumption and performance of the

process

9- Used optimized queries, Select only required columns used in ETL process instead of

Select * from table

10- Sometime it is suggested disable all triggers, backend running process, emailing

application or any process which degrade the performance during data load.

11- Process data in a Bulk load in destination system, instead of row by row processing

12- Build and Test ETL process with small set of data until your ETL process is completely

finished

13- Try to divide ETL process into Smaller and Integrated Parts

14- Specify Label name, component name, link name, process name, sub-process name

in a user friendly naming convention and make a note on Each process for better

understanding

15- Don’t Perform Cached Lookup in your ETL tools against large table. Avoid direct lookup

to the Data base table or Cloud data base, which reduce the performance of the ETL

process, Extract lookup data in flat file and use files as a Lookup table in a process.

16- If data Volume is high then it is suggested that DISABLE non-clustered indexes, leaving

clustered index intact then Data load and after completion of Data load Rebuild non

clustered indexes.

17- List table/objects need to migrate and create Child, Parent dependency diagram which

will help in running or scheduling the ETL process.

18- Assign memory to the ETL tool or specific process to overcome the issue of out of

memory and timeout problem

19- Scheduled ETL process and make sure you are utilizing parallel processing

20- Plan for process recovery, Plan for process failure or time out, Plan for Process may

require start multiple times in a day and Plan for restart jobs.

21- If Files are in Source system then always keep backup or archived in a directory for

future reference.

22- Always generate Success and Error file from ETL process and Send Email notification

for any error in between the process

23- Run ETL process during Night or business off peak hours, because chances of Query

timeout during peak hours

ETL / Data Integration

Monday 29 August 2016

ETL Concepts