Monday 29 August 2016

ETL Concepts


Consideration on Success of ETL Architecture-

     1- A proper logging/Error strategy is the key to the success of any ETL architecture

     2- A well designed process support auditing of row counts, measures and other metrics

     3- Understand from where data originated from, when it was loaded, how it was transformed is essential for the integrity of the downstream data

     4- Create reusable code in ETL process that reduces the total efforts required to maintained the ETL architecture



Following points should take care during in ETL process designing-

1- ETL tools support properties or configuration file which can be used to store Source and 
Target Connection and Global Variables information for ETL process. Using properties
files is easy in working different project environment (test, uat, prod) by changing details in a
one place instead of changing in all jobs and components.

2- Create Reusable connection and Metadata in start of the ETL process

3- Load external library in start of process.

4- Create you Source and Target field mapping document in an easy to read and accessible 
format

5- Staging Data base make life easier of developer. Query or Extract data from staging 
Data base instead of querying directly from OLTP/DSS system for data analysis and 
migration

6- Perform all types of join at source system and make a query or view and use those 
Queries and Views in ETL process

7- Always filter data at Source system instead of filtering data in ETL process. Only extract 
minimum data from Source system using WHERE clause

8- If you ETL process required Filter, Sorting and Joining then always Filter your data 
before Sorting and Joining which optimize memory consumption and performance of the
process

9- Used optimized queries, Select only required columns used in ETL process instead of 
Select * from table

10- Sometime it is suggested disable all triggers, backend running process, emailing 
application or any process which degrade the performance during data load.

11- Process data in a Bulk load in destination system, instead of row by row processing

12- Build and Test ETL process with small set of data until your ETL process is completely 
finished

13- Try to divide ETL process into Smaller and Integrated Parts

14-  Specify Label name, component name, link name, process name, sub-process name 
in a user friendly naming convention and make a note on Each process for better
understanding

15- Don’t Perform Cached Lookup in your ETL tools against large table. Avoid direct lookup
to the Data base table or Cloud data base, which reduce the performance of the ETL
process, Extract lookup data in flat file and use files as a Lookup table in a process.

16-  If data Volume is high then it is suggested that DISABLE non-clustered indexes, leaving
clustered index intact then Data load and after completion of Data load Rebuild non
clustered indexes.

17- List table/objects need to migrate and create Child, Parent dependency diagram which 
will help in running or scheduling the ETL process.

18- Assign memory to the ETL tool or specific process to overcome the issue of out of 
memory and timeout problem

19- Scheduled ETL process and make sure you are utilizing parallel processing

20- Plan for process recovery, Plan for process failure or time out, Plan for Process may 
require start multiple times in a day and Plan for restart jobs.

21- If Files are in Source system then always keep backup or archived in a directory for 
future reference.

22- Always generate Success and Error file from ETL process and Send Email notification 
for any error in between the process

23- Run ETL process during Night or business off peak hours, because chances of Query 
timeout during peak hours