SSIS – PART 3
SSIS – Performance Tuning
SSIS Interview Questions and Answers Part 3
Here we are publishing a series of posts on SSIS interview questions and answers for experienced professionals and freshers. Below is part 3 of the series.
Q. How to quickly load data into a SQL Server table?
Fast Load option: use the Fast Load option on the OLE DB Destination. It is not enabled by default; without it, rows are inserted one at a time and the load is very slow.
Q. What are the fast load options available in SSIS?
The OLE DB Destination provides more than one way to load data into the destination (five Data Access Modes). Use the Fast Load option while loading data into the destination.
- Data Access Mode – Defines the method used to load data into the destination. The fast load option uses a BULK INSERT statement instead of an INSERT statement. If the fast load option is not selected, INSERT is used by default.
- Keep Identity – If selected, the identity values from the source are preserved and loaded into the destination table. Otherwise, the destination table generates its own values for any identity column.
- Keep Nulls – If selected, NULL values from the source are preserved and loaded into the destination table. Otherwise, if a destination column has a default constraint and the source sends NULL for that column, the default value is inserted instead.
- Table Lock – If selected, a TABLOCK is acquired on the table during the load. This is the recommended option when no other application is using the table at the time of the load, as it removes the overhead of lock escalation.
- Check Constraints – If selected, check constraints on the destination table are validated for every row coming through the pipeline. It is preferable to uncheck this option if constraint checking is not required, as it reduces overhead in the pipeline engine.
- Rows per batch – Defines how many rows from the pipeline are treated as one batch by the destination. The default value is -1 (the field left blank), which treats all incoming rows as a single batch. You can specify a positive integer N so that the records are loaded in smaller segments or batches of N rows each; the value is passed as a hint to the bulk load operation.
- Maximum insert commit size – Specifies the batch size that the OLE DB Destination tries to commit during fast load operations; it splits the data into chunks as they are inserted into the destination. If you provide a value for this property, the destination commits rows in batches that are the smaller of (a) the Maximum insert commit size, or (b) the remaining rows in the buffer currently being processed.
- Network limitations: You can only transfer data as fast as your network supports, but you can tune SSIS to use the available bandwidth efficiently. Set the Packet Size property of the connection manager to an integer value that suits your environment; the maximum value is 32767.
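Conceptually, the fast load options map onto T-SQL bulk-load hints. A rough sketch, with a hypothetical table name and file path, of the kind of bulk load the Fast Load settings correspond to:

```sql
-- Hypothetical names: dbo.SalesStage and the file path are illustrative only.
-- Fast Load roughly corresponds to a bulk load with these hints:
BULK INSERT dbo.SalesStage
FROM 'C:\loads\sales.dat'
WITH (
    TABLOCK,                 -- Table Lock option
    KEEPIDENTITY,            -- Keep Identity option
    KEEPNULLS,               -- Keep Nulls option
    CHECK_CONSTRAINTS,       -- Check Constraints option (omit to skip checking)
    ROWS_PER_BATCH = 10000,  -- Rows per batch hint
    BATCHSIZE = 100000       -- commit unit, analogous to Maximum insert commit size
);
```

Without Fast Load, the destination issues a separate INSERT statement per row instead.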
Q. What are the lookup cache modes available and how to use them?
SSIS 2008 provides three different cache modes for the Lookup transformation.
- Full Cache – Default
- Partial Cache
- No Cache
Full Cache (default) – The database is queried once, during the pre-execute phase of the data flow, and the entire reference set is pulled into memory. This approach uses the most memory and adds startup time to your data flow. The lookup will not swap memory out to disk, so your data flow will fail if you run out of memory.
- When to use Full cache mode
- When you’re accessing a large portion of your reference set
- When you have a small reference table
- When your database is remote or under heavy load, and you want to reduce the number of queries sent to the server
Partial Cache – In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for matching values. If no match is found, it queries the database; if the database returns a match, the values are cached so they can be reused the next time a matching row comes in.
In 2008 there is a new Miss Cache feature that lets you allocate a percentage of your cache to remembering rows that had no match in the database. This is useful in many situations, as it prevents the transform from querying the database repeatedly for values that don't exist.
- When to use Partial cache mode
- When you’re processing a small number of rows and it’s not worth the time to populate the full cache
- When you have a large reference table
- When your data flow is adding new rows to your reference table
No Cache – As the name implies, in this mode the lookup transform doesn’t maintain a lookup cache (not quite true, actually: it keeps the last match around, since that memory has already been allocated). In most situations, this means you’ll hit the database for every row.
- When to use No cache mode
- When you’re processing a small number of rows
- When you have non-repeating lookup indexes
- When your reference table is changing (inserts, updates, deletes)
- When you have severe memory limitations
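Whatever the cache mode, the reference query should return only the columns the lookup actually needs; with Full Cache this directly reduces the memory footprint. A sketch of such a reference query (the table, columns, and filter are illustrative, not from the original post):

```sql
-- Instead of pointing the Lookup at the whole dimension table,
-- return only the join key and the columns to be retrieved:
SELECT CustomerKey, CustomerAlternateKey
FROM dbo.DimCustomer
WHERE IsCurrent = 1;   -- optionally filter to the rows the lookup can actually match
```

Every column dropped from this query is memory saved per cached row.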
Q. What are the different types of Transformations in SSIS?
Non-Blocking – Rows are processed and passed downstream as soon as they arrive; nothing is held back.
Partially Blocking – The downstream transformations wait for certain periods; rows are held until a condition is met (for example, a matching row arrives) and then released, so the flow repeatedly stops and starts.
Fully Blocking – The downstream transformations have to wait until all the data has been received and released by the upstream transformation.
Non-blocking transformations
- Cache Transform
- Character Map
- Conditional Split
- Copy Column
- Data Conversion
- Derived Column
- Export Column
- Import Column
- OLE DB Command
- Percentage Sampling
- Script Component
- Slowly Changing Dimension
Partial blocking transformations
- Data Mining
- Merge Join
- Term Lookup
Fully Blocking Transformations
- Aggregate
- Fuzzy Grouping
- Fuzzy Lookup
- Row Sampling
- Sort
- Term Extraction
Note that Sort is a fully blocking transformation, so it’s better to sort your data using a SQL command (ORDER BY) in the OLE DB Source instead of using the Sort transformation. Merge requires sorted inputs but Union All does not, so use Union All wherever possible.
Q. Consider a scenario: I am using a “Sort” transformation and, after the sort completes, I have to remove all duplicate records. Duplicates are defined by the sort columns. For example, if I sort on three columns, rows having the same values in those three columns are considered duplicates. Which transformation should be used to ignore these duplicate records?
We need not use any separate transformation to remove duplicates based on the sort columns; the “Sort” transformation itself has this feature. Check the “Remove rows with duplicate sort values” option at the bottom left corner of the Sort transformation editor.
Q. How to avoid the sort transformation in SSIS?
The input datasets must be in sorted order for the “Merge” and “Merge Join” transformations. To avoid the Sort transformation we can use a query with an ORDER BY clause directly at the data source, but only when the data source supports it (for example OLE DB or Excel). With a flat file we have no such choice, so we have to implement the sort in the best way possible.
For example, if an aggregate is required, apply it before the sort. If possible, load the flat file data into staging tables, sort at the database level, and then load the destination from the staging tables. That gives two data flows: one loading data from flat files into staging tables, and another loading the destination from the staging tables, so we can take advantage of parallelism.
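The idea above can be sketched in T-SQL. For an OLE DB source, the sort moves into the source query; when an aggregate is needed anyway, it is applied in the same query (table and column names are illustrative):

```sql
-- OLE DB source query: sort at the database instead of a Sort transformation
SELECT BusinessKey, Amount, LoadDate
FROM dbo.SalesStage
ORDER BY BusinessKey;   -- corresponds to SortKeyPosition = 1 on BusinessKey

-- Aggregate before sorting when an aggregate is required anyway:
SELECT BusinessKey, SUM(Amount) AS TotalAmount
FROM dbo.SalesStage
GROUP BY BusinessKey
ORDER BY BusinessKey;
```

Either query feeds a Merge or Merge Join directly, with no Sort transformation in the pipeline.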
Q. How an SSIS package or a data flow knows that the input dataset / source dataset is in sorted order?
If the source dataset is in sorted order or we are using a query with order by at source we have to explicitly mention this information at “OLEDB” source.
There are two properties that we need to change at OLEDB source.
1. Open the OLE DB source advanced editor
2. Go to the “Input and Output Properties” tab
3. Select “OLE DB Source Output”
4. In the properties, set the value of the “IsSorted” property to “True”
5. Expand the Output Columns list, select the first sort column, and set its “SortKeyPosition” value to 1
6. Repeat step 5 for all columns in the ORDER BY clause, assigning SortKeyPosition values according to their sort priority
Q. Which data providers are supported for the OLE DB connection manager when the cache option is used with the Lookup transformation?
Q. An architect on the customer side asked you for the following information: “I just wanted to know how many execution trees are created for the SSIS package that loads data on a daily basis.”
How do we get this information?
We can actually use custom log events to capture this information.
The log entry “PipelineExecutionTrees” helps us learn about the execution trees created at run time. It includes a lot of information, for example the number of rows stored in a buffer while executing a transformation.
Q. Do you know when an execution tree created and when it ends in a dataflow? Simply what is the scope of an execution tree?
The work to be done in a data flow task is divided by the data flow pipeline engine into multiple chunks, called execution units; each represents a group of transformations. An individual execution unit is called an execution tree, and each execution tree can be executed by a separate thread, in parallel with other execution trees. The memory structure, also called a data buffer, is created by the data flow pipeline engine and is scoped to an individual execution tree. An execution tree normally starts at either a source or an asynchronous transformation and ends at the first asynchronous transformation or a destination. During execution, the source reads data into a buffer, the transformations in the tree operate on that buffer, and the buffer is passed to the next execution tree in the path by passing pointers to the buffers.
Q. While running an SSIS package, it went into a hung state after 15 minutes of execution. How do you troubleshoot?
There are three common reasons that cause an SSIS execution to hang.
- Resource bottleneck: Memory / CPU / IO / Network
- Blocking / Deadlock: blocking at the database level, or contention while accessing a file or reading/writing variables from a Script Task.
- Poorly performing query: if SSIS stalls at an Execute SQL Task, look at the query used inside the task and tune it.
By checking the above aspects one can identify the issue and provide a resolution. If everything looks good but SSIS still hangs, check that the latest service pack is applied; if it is, collect a hang dump using ADPlus and contact the Microsoft support center.
Q. SSIS 2008 uses all available RAM, and after package completes Memory is not released?
This is not actually a problem. You have allowed SQL Server to use x amount of memory, so it does. SQL Server takes memory as required, up to the limit set, but it does not release it. It can respond to requests from the OS, but by default, once it has got hold of some memory it keeps it even if it is not currently using it. The simple reason is that finding and taking hold of memory is quite expensive, so once SQL Server has it, it keeps it, and any subsequent operations that need memory get it much faster. This makes perfect sense when you remember that SQL Server is a service application and more often than not runs on a dedicated machine.
Q. What is the property “RunInOptimizedMode”? How do we set it?
If this property is set to True, the SSIS engine ignores unused/unmapped columns, meaning it does not allocate memory to store data for those columns. During the compilation phase the engine identifies which source columns are used across the package; any columns that are neither used nor mapped to the destination are simply ignored. The property can be set on each Data Flow Task, and in BIDS it is also available at the project level for design-time runs.
Q. Does using the “RowCount” transformation affect package performance?
Not noticeably. Row Count is a synchronous, non-blocking transformation; it simply counts the rows passing through the pipeline and stores the total in a variable, so its overhead is negligible.
Q. An SSIS 2008 package has crashed due to low memory. How do we resolve low-memory issues with an SSIS package?
First confirm the memory pressure by monitoring the “Buffers spooled” performance counter (see the counters listed below). Then reduce the data flow’s memory footprint: select only the needed columns at the source, avoid fully blocking transformations where possible, use an appropriate Lookup cache mode, and tune DefaultBufferMaxRows / DefaultBufferSize. If spooling is unavoidable, point the data flow’s BufferTempStoragePath and BLOBTempStoragePath properties at a fast disk with enough free space, or run the package on a server with more available memory.
Q. How can a container continue to run even when a task inside it fails? Suppose we have an application where we loop through a log table based on IDs and load data into the destination. In this scenario some of the tasks in the Foreach Loop container may fail, but the requirement is that even if an inner task fails, we should still process the other sources available to us.
We can do this by setting the “Propagate” property of a task/container’s error event to “False”. The loop or sequence container then ignores the failure of an internal task.
Assume we have designed a Foreach Loop container with a Data Flow Task. As per the requirement, the DFT loads 100 files into the database; if it fails on the 47th file, it should skip the error and continue loading from the 48th file.
Steps to accomplish this are:
Select the Data Flow Task and go to the Event Handlers tab
Enable the OnError event handler
In the event handler, open the Variables window and click “Show System Variables”
Select the “Propagate” variable and change its value to “False”
This ensures that the parent control, i.e. the Foreach Loop, does not see the error in the child task.
If the Foreach Loop container holds more than one task, instead of setting the property on each task, add the tasks to a Sequence container and change the “Propagate” property on the Sequence container’s OnError handler.
Note: When this situation applies to a single task rather than a loop, we can instead set the task’s “ForceExecutionResult” property to “Success”. This means that irrespective of the actual execution result, the SSIS engine reports the task’s outcome as success. (The related properties “ForceExecutionValue” and “ForcedExecutionValue” force the task’s execution value rather than its result.)
ForceExecutionResult is a property of control flow elements in SSIS. If it is set on an element, the SSIS engine reports the execution result given by the property; in other words, we can use it to control the reported execution result of any control flow element.
Q. What are the general performance tuning tips for SSIS?
1- Utilize parallelism: It is easy to utilize parallelism in SSIS. All you need to do is recognize which Data Flow Tasks (DFTs) can start at the same time and set the precedence constraints of your package so that they all run simultaneously.
2- Synchronous vs. Asynchronous components: A synchronous transformation of SSIS takes a buffer, processes it, and passes the result through without waiting for the next buffer to come in. An asynchronous transformation, on the other hand, needs to process all of its input data before it can produce any output. This can cause serious performance issues when the input to the asynchronous transformation is too big to fit into memory and must be spooled to disk at multiple stages.
3- Execution tree: An execution tree starts where a buffer starts and ends where the same buffer ends. These execution trees specify how buffers and threads are allocated in the package. Each tree creates a new buffer and may execute on a different thread. When a new buffer is created such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; however, it is important to note that each new tree may also give you an additional worker thread.
4- OLE DB Command transformation: OLE DB Command is a row-by-row transformation, meaning it runs its command once for each input row. This makes it extremely slow as the number of rows grows. The solution for boosting performance is to stage the data into a temporary table and run a single set-based statement with an Execute SQL Task outside that DFT.
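The staging alternative to a row-by-row OLE DB Command can be sketched as one set-based statement run from an Execute SQL Task after the data flow (table and column names are illustrative, not from the original post):

```sql
-- One set-based UPDATE instead of one OLE DB Command call per row:
-- the data flow first loads the changed rows into dbo.UpdateStage.
UPDATE t
SET    t.Amount   = s.Amount,
       t.Modified = GETDATE()
FROM   dbo.TargetTable AS t
JOIN   dbo.UpdateStage AS s
       ON s.BusinessKey = t.BusinessKey;
```

One statement touching all rows lets SQL Server optimize the work, instead of paying a round trip per row.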
5- SQL Server Destination vs. OLE DB Destination: There are multiple reasons to prefer the OLE DB Destination over the SQL Server Destination:
- OLE DB Destination is mostly faster,
- OLE DB Destination is a lot clearer when it fails (The error message is more helpful),
- SQL Server Destination works only when SSIS is installed on the destination server.
6- Change Data Capture (CDC): Try to reduce the amount of data to be transferred to the maximum level you can, and do it as close to the source as you can. A Modified On column on the source table(s) helps a lot in this case.
7- Slowly Changing Dimension (SCD) transformation: There is only one piece of advice about SSIS’s Slowly Changing Dimension transformation, and that is: get rid of it! The reasons are:
- It doesn’t use any cached data, and goes to the data source every single time it is called,
- It uses many OLE DB Command transformations,
- Fast Data Load is off by default on its OLE DB Destination.
8. Choose the best division of work between SQL and SSIS when designing a data flow: remember that SSIS is good at row-by-row operations whereas SQL is not. So, depending on the situation, design the data flow using DFT components instead of executing a query via an “Execute SQL Task”.
9. Use queries to select data rather than selecting a table and checking off the columns you want. This reduces the initial record set before SSIS gets it, instead of pulling in fields that will only be ignored.
10. Handle your connections carefully. By default, a connection manager will connect to the database as many times as it wants to. You can set the RetainSameConnection property so it connects only once. This allows you to manage transactions using Execute SQL Tasks and BEGIN TRAN / COMMIT TRAN statements, avoiding the overhead of DTC.
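With RetainSameConnection set to True, the BEGIN TRAN and COMMIT can live in separate Execute SQL Tasks around the data flow, since all the tasks share one session. A sketch of the statements those tasks would run:

```sql
-- Execute SQL Task 1 (before the data flow):
BEGIN TRANSACTION;

-- ... the Data Flow Task then loads data on the same retained connection ...

-- Execute SQL Task 2 (on the success path):
COMMIT TRANSACTION;

-- Execute SQL Task 3 (on the failure path):
-- ROLLBACK TRANSACTION;
```

This only works because RetainSameConnection keeps every task on the same connection; with separate connections each task would run in its own transaction scope.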
11. While running the package within BIDS, ensure the package is set to run in optimized mode.
12. While loading data into destination tables it’s helpful to use the “Fast Load option”.
13. Wherever possible, consider aggregating and (un)pivoting in SQL Server instead of in the SSIS package; SQL Server outperforms Integration Services in these tasks.
14. Avoid manipulating large datasets using T-SQL statements. All T-SQL statements cause changed data to write out to the transaction log even if you use Simple Recovery Model.
15. For large datasets, do data sorts at the source if possible.
16. Use the SQL Server Destination if you know your package is going to run on the destination server, since it offers roughly 15% performance increase over OLE DB because it shares memory with SQL Server.
17. Increase the network packet size to 32767 on your database connection managers. This allows large volumes of data to move faster from the source servers.
18. If using Lookup transforms, experiment with cache sizes – between using a Cache connection or Full Cache mode for smaller lookup datasets, and Partial / No Cache for larger datasets. This can free up much needed RAM.
19. Make sure the Table Lock option is enabled while loading very large datasets; a minimally logged bulk insert happens only when certain conditions, including a table lock, are satisfied.
20. Experiment with the DefaultBufferSize and DefaultBufferMaxRows properties. You’ll need to monitor your package’s “Buffers Spooled” performance counter using Perfmon.exe, and adjust the buffer sizes upwards until you see buffers being spooled (paged to disk), then back off a little.
21. Do all set-based operations, aggregations, and sorts at the source or destination using T-SQL.
22. Where appropriate, use the NOLOCK hint at the source and a table lock (TABLOCK) at the destination.
23. When loading data warehouses, consider disabling the indexes during the load and rebuilding them afterwards.
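Tips 22 and 23 can be sketched as pre-load and post-load statements, typically run from Execute SQL Tasks around the data flow (the index, table, and column names are illustrative):

```sql
-- Before the load: disable nonclustered indexes on the destination
ALTER INDEX IX_FactSales_Customer ON dbo.FactSales DISABLE;

-- Source extraction with NOLOCK (accepts dirty reads in exchange for
-- taking no shared locks on the source table):
SELECT SalesKey, CustomerKey, Amount
FROM dbo.SourceSales WITH (NOLOCK);

-- After the load: rebuild the disabled indexes
ALTER INDEX IX_FactSales_Customer ON dbo.FactSales REBUILD;
```

Rebuilding once after the load is usually far cheaper than maintaining the index row by row during it.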
Rows per batch – The default value for this setting is -1, which treats all incoming rows as a single batch. You can change this default behaviour and break the incoming rows into multiple batches; the only allowed value is a positive integer, which specifies the maximum number of rows per batch.
Maximum insert commit size – The default value for this setting is 2147483647 (the largest value for a 4-byte integer), which commits all incoming rows once, on successful completion. You can specify a positive value to commit after that many records instead. You might wonder whether changing the default puts overhead on the data flow engine by committing several times. Yes, it does, but at the same time it relieves the pressure on the transaction log and tempdb, which would otherwise grow tremendously during high-volume data transfers.
The data flow task in SSIS (SQL Server Integration Services) sends data in a series of buffers. How much data does one buffer hold? This is bounded by DefaultBufferMaxRows and DefaultBufferMaxSize, two Data Flow properties. They have default values of 10,000 and 10,485,760 (10 MB), respectively. That means one buffer contains either 10,000 rows or 10 MB of data, whichever limit is reached first. For example, if a row is 1,200 bytes wide, a buffer holds min(10000, 10485760 / 1200) = 8,738 rows.
(Answer to the Lookup cache provider question above:) native .NET providers, or the .NET providers for OLE DB.
A great way to check if your packages are staying within memory is to review the SSIS performance counter Buffers spooled, which has an initial value of 0; above 0 is an indication that the engine has started swapping to disk.
Below are the performance counters which can help us in finding memory details.
Process / Private Bytes (DTEXEC.exe): The amount of memory currently in use by Integration Services.
Process / Working Set (DTEXEC.exe): The total amount of allocated memory by Integration Services.
SQL Server: Memory Manager / Total Server Memory: The total amount of memory allocated by SQL Server. Because SQL Server can also allocate memory using the AWE API, this counter is the best indicator of total memory used by SQL Server.
Memory / Page Reads / sec: Represents the total memory pressure on the system. If this value consistently goes above 500, the system is under memory pressure.
Q. Consider a scenario: we have a package that has to be opened and modified in BIDS / SSDT, but the machine where it is opened does not have permission to access the databases. All connection managers and other external references therefore fail in the validation phase, and validating all of these connections takes a lot of time. How can we control this validation phase?
Set the “DelayValidation” property to “True” on the connection managers and tasks so that validation is deferred until execution time, or use the “Work Offline” option on the SSIS menu in BIDS / SSDT to skip design-time validation entirely.