Consideration on Success of ETL Architecture-
1- A proper logging/Error strategy is the key to the success of any ETL architecture
2- A well designed process support auditing of row counts, measures and other metrics
3- Understand from where data originated from, when it was loaded, how it was transformed is essential for the integrity of the downstream data
4- Create reusable code in ETL process that reduces the total efforts required to maintained the ETL architecture
Following points should take care during in ETL process designing-
1- ETL tools support properties or configuration file which can be used to store Source and
Target Connection and Global Variables information for ETL process. Using properties
files is easy in working different project environment (test, uat, prod) by changing details in a
one place instead of changing in all jobs and components.
2- Create Reusable connection and Metadata in start of the ETL process
3- Load external library in start of process.
4- Create you Source and Target field mapping document in an easy to read and accessible
format
5- Staging Data base make life easier of developer. Query or Extract data from staging
Data base instead of querying directly from OLTP/DSS system for data analysis and
migration
6- Perform all types of join at source system and make a query or view and use those
Queries and Views in ETL process
7- Always filter data at Source system instead of filtering data in ETL process. Only extract
minimum data from Source system using WHERE clause
8- If you ETL process required Filter, Sorting and Joining then always Filter your data
before Sorting and Joining which optimize memory consumption and performance of the
process
9- Used optimized queries, Select only required columns used in ETL process instead of
Select * from table
10- Sometime it is suggested disable all triggers, backend running process, emailing
application or any process which degrade the performance during data load.
11- Process data in a Bulk load in destination system, instead of row by row processing
12- Build and Test ETL process with small set of data until your ETL process is completely
finished
13- Try to divide ETL process into Smaller and Integrated Parts
14- Specify Label name, component name, link name, process name, sub-process name
in a user friendly naming convention and make a note on Each process for better
understanding
15- Don’t Perform Cached Lookup in your ETL tools against large table. Avoid direct lookup
to the Data base table or Cloud data base, which reduce the performance of the ETL
process, Extract lookup data in flat file and use files as a Lookup table in a process.
16- If data Volume is high then it is suggested that DISABLE non-clustered indexes, leaving
clustered index intact then Data load and after completion of Data load Rebuild non
clustered indexes.
17- List table/objects need to migrate and create Child, Parent dependency diagram which
will help in running or scheduling the ETL process.
18- Assign memory to the ETL tool or specific process to overcome the issue of out of
memory and timeout problem
19- Scheduled ETL process and make sure you are utilizing parallel processing
20- Plan for process recovery, Plan for process failure or time out, Plan for Process may
require start multiple times in a day and Plan for restart jobs.
21- If Files are in Source system then always keep backup or archived in a directory for
future reference.
22- Always generate Success and Error file from ETL process and Send Email notification
for any error in between the process
23- Run ETL process during Night or business off peak hours, because chances of Query
timeout during peak hours