Information Storage and Retrieval: February 2010


Monday, February 22, 2010

'LIKE' doesn't like UNDERSCORE

Write a query to return those values from an Oracle table where the column myword contains an UNDERSCORE ( _ ) sign in its value, e.g. if the column myword contains the following values:

myword
----------
FAC_FAC
FACFAC
_FACE
FACE
FACE_

The output should be:

myword
----------
FAC_FAC
_FACE
FACE_

Solution:

Generally, most people will write the query as follows:

select myword from mytable where myword like '%_%'

However, this query will not give the expected results. Since the underscore is a single-character wildcard in LIKE, it will return all the values of the column myword.

The correct query will be as follows:

select myword from mytable where myword like '%\_%' escape '\'

We have to escape the '_' character to match it literally.
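
An alternative that avoids wildcard escaping altogether is to test for the underscore with INSTR; a minimal sketch against the same table:

select myword from mytable where instr(myword, '_') > 0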

Thursday, February 18, 2010

Generate a Pyramid

You have table pyramid_table with 2 columns CH and NUM. It contains 1 row. The column CH contains a single character and column NUM contains a number.

Write a SQL query to generate a pyramid of CH characters of height NUM, e.g. if CH is '*' and NUM is 5, the output should be:

    *
   ***
  *****
 *******
*********

Solution:

select lpad(' ', num - rownum, ' ') ||
       substr(lpad(ch, 2*num, ch), 1, 2*rownum - 1) ||
       lpad(' ', num - rownum, ' ') as pyramid
from pyramid_table, all_objects
where rownum <= num
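
The cross join with all_objects is only there to generate NUM rows. On Oracle 10g and later, a hierarchical query can generate the rows instead; a minimal sketch (left padding only, since trailing spaces are not visible):

select lpad(' ', num - level, ' ') || rpad(ch, 2*level - 1, ch) as pyramid
from pyramid_table
connect by level <= num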

Wednesday, February 17, 2010

Count of a Particular Character in a string

Write a query to count the occurrences of a particular character in a string, e.g. return the number of times 'N' appears in the string 'NITININ'. The result should be 3.

Solution:

select length('NITININ') - length(replace('NITININ','N','')) from dual
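
The same replace-and-compare idea extends to substrings longer than one character; the length difference just has to be divided by the length of the substring. For example, to count occurrences of 'IN' in the same string:

select (length('NITININ') - length(replace('NITININ','IN',''))) / length('IN') from dual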

Count of Individual Characters in a String

Write a query to display the count of individual characters in a given string, e.g. if the given string is 'mesopotamia', the output should be:

m  2
e  1
i  1
s  1
o  2
p  1
t  1
a  2

Solution:

select ind, count(ind)
from
(
  select substr(given_string, rownum, 1) ind
  from
  (
    select 'mesopotamia' given_string from all_objects
  )
  where rownum <= length(given_string)
)
group by ind
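
On Oracle 10g and later, the row generation can also be done with a hierarchical query instead of all_objects; a minimal sketch for the same string:

select substr('mesopotamia', level, 1) ind, count(*)
from dual
connect by level <= length('mesopotamia')
group by substr('mesopotamia', level, 1)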

Sunday, February 14, 2010

Informatica Scenario-based Questions and Solutions (Part-1)

Scenario 1:

We have a source table containing 3 columns: Col1, Col2 and Col3. There is only 1 row in the table as follows:

Col1  Col2  Col3
----------------
a     b     c

There is a target table containing only 1 column, Col. Design a mapping so that the target table contains 3 rows as follows:

Col
-----
a
b
c

Solution (without using a Normalizer transformation):

Create 3 Expression transformations exp_1, exp_2 and exp_3 with 1 port each. Connect Col1 from the Source Qualifier to the port in exp_1, Col2 from the Source Qualifier to the port in exp_2 and Col3 from the Source Qualifier to the port in exp_3. Make 3 instances of the target. Connect the port from exp_1 to target_1, the port from exp_2 to target_2 and the port from exp_3 to target_3.
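
For reference, the equivalent columns-to-rows logic can be sketched in SQL, assuming a hypothetical Oracle source table named src with columns Col1, Col2 and Col3:

select col1 as col from src
union all
select col2 from src
union all
select col3 from src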

Scenario 2:

There is a source table that contains duplicate rows. Design a mapping to load all the unique rows into 1 target and all the duplicate rows (only 1 occurrence of each) into another target.

Solution :

Bring all the columns from the Source Qualifier to an Aggregator transformation. Check group by on the key column. Create a new output port count_col in the Aggregator transformation with the expression count(key_column). Make a Router transformation with 2 groups: Dup and Non-Dup. Set the router condition count_col>1 for the Dup group and count_col=1 for the Non-Dup group. Load these 2 groups into different targets.
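
The same unique/duplicate split can be sketched in SQL with GROUP BY and HAVING (only the key column is shown, and the table name src is hypothetical):

-- unique rows (key occurs exactly once)
select key_column from src group by key_column having count(*) = 1
-- duplicate rows, 1 occurrence each (key occurs more than once)
select key_column from src group by key_column having count(*) > 1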

Scenario 3:

There is a source table containing 2 columns Col1 and Col2 with data as follows:

Col1  Col2
a     l
b     p
a     m
a     n
b     q
x     y

Design a mapping to load a target table with following values from the above mentioned source:

Col1  Col2
a     l,m,n
b     p,q
x     y

Solution:

Use a Sorter transformation after the Source Qualifier to sort the rows with Col1 as the key. Build an Expression transformation with the following ports (the order of the ports should also be the same):

1. Col1_prev: It will be a variable type port. Its expression should be the variable val
2. Col1: It will be an Input/Output port from the Sorter transformation
3. Col2: It will be an input port from the Sorter transformation
4. val: It will be a variable type port. Its expression should be Col1
5. Concatenated_value: It will be a variable type port. Its expression should be decode(Col1,Col1_prev,Concatenated_value||','||Col2,Col2)
6. Concatenated_Final: It will be an output port containing the value of Concatenated_value

After the Expression transformation, build an Aggregator transformation. Bring the ports Col1 and Concatenated_Final into the Aggregator and group by Col1. Do not add any aggregate expression; the Aggregator will then return the last row from each group.

Connect the ports Col1 and Concatenated_Final from aggregator to the target table.
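
For comparison, the same result can be produced directly in SQL on Oracle 11gR2 or later with LISTAGG (the source table name src is hypothetical):

select col1, listagg(col2, ',') within group (order by col2) as col2
from src
group by col1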

Scenario 4:

Design an Informatica mapping to load the first half of the records into 1 target and the other half into a separate target.

Solution:

You will have to assign a row number to each record. To achieve this, either use Oracle's pseudo-column rownum in the Source Qualifier query or use the NEXTVAL port of a Sequence Generator. Let's name this column rownumber.

From Source Qualifier, create 2 pipelines:

First Pipeline:
Carry the first port Col1 from the SQ transformation into an Aggregator transformation. Create a new output port "tot_rec" and give the expression as COUNT(Col1). Do not group by any port; this will give us the total number of records in the source table. Carry this port tot_rec to an Expression transformation. Add another port DUMMY in the Expression transformation with a default value of 1.

Second Pipeline:
From the SQ transformation, carry all the ports (including the additional port rownumber generated by rownum or the Sequence Generator) to an Expression transformation. Add another port DUMMY in the Expression transformation with a default value of 1.

Join these 2 pipelines with a Joiner transformation on the common port DUMMY. Carry all the source table ports and the 2 additional ports tot_rec and rownumber to a Router transformation. Add 2 groups in the Router: FIRST_HALF and SECOND_HALF. Give the condition rownumber<=tot_rec/2 in FIRST_HALF and rownumber>tot_rec/2 in SECOND_HALF. Connect the 2 groups to 2 different targets.
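
If the source is relational, the same first-half/second-half split can be sketched in SQL with an analytic count (the table name src is hypothetical):

select * from
(
  select s.*, rownum as rownumber, count(*) over () as tot_rec from src s
)
where rownumber <= tot_rec/2   -- use rownumber > tot_rec/2 for the second half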

Wednesday, February 3, 2010

Performance Tuning in Informatica

Performance tuning is done to improve session performance. To tune a session, performance bottlenecks are identified and eliminated. The bottlenecks should be looked for in the following places, in the given order:

1. Target
2. Source
3. Mapping
4. Session
5. System

General Techniques to identify bottlenecks:

1. Thread statistics should be analysed to identify source, target or transformation bottlenecks. By default, the Integration Service uses 1 reader thread, 1 transformation thread and 1 writer thread to process a session. The thread with the highest busy percentage identifies the bottleneck.

2. Add a Filter transformation after each Source Qualifier. Set the filter condition to false so that no data is passed from the filter. If the time to run the session remains about the same, there is a source bottleneck.

3. Add a Filter transformation before each target definition. Set the filter condition to false so that no data is loaded into the target. If the time taken to run the session is almost the same as before, there is a mapping bottleneck.

General Techniques to improve session performance:

1. If a target bottleneck is identified, consider dropping key constraints and indexes on the target tables before the session is run. Rebuild those constraints and indexes after the session run.

2. Use bulk loading in case of insertion of large amounts of data. When bulk loading, the Integration Service bypasses the database log, which speeds up performance. But there is a trade-off: without writing to the database log, the target database cannot perform a rollback.

3. If multiple source tables are joined in one source qualifier, optimize the query by using hints and/or adding indexes on the joined columns, GROUP BY or ORDER BY columns. Configure the database to run parallel queries to improve performance.

4. Reduce unnecessary transformations in the mapping and delete unnecessary links between transformations to optimize the mapping.

5. If the session reads from a flat file source, performance can be improved by setting the number of bytes the Integration Service reads per line. By default, the Integration Service reads 1024 bytes per line. If each line in the source file is shorter than the default setting, the 'line sequential buffer length' should be decreased in the session properties.

6. Use filters as early as possible in the mapping.

7. Eliminate unnecessary datatype conversions. Try to use integer values in place of other datatypes when performing comparisons using lookup and filter transformations.

8. The Integration Service reads expressions written with operators faster than expressions with functions. Wherever possible, operators should be used, e.g. use the || operator in place of the CONCAT function.

9. When the Aggregator transformation is used, the following guidelines should be considered:

  • The group by should be done on simple columns. If possible, number type columns should be preferred for the GROUP BY.

  • Use sorted input. When the sorted input option is used, the Integration Service assumes that all data is sorted by group. It does not need to wait for the whole data set to come into memory before performing the aggregation.

  • Use incremental aggregation if the changes in the source affect less than half the target. In this case, the Integration Service updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.

  • Filter data before aggregation. This will avoid unnecessary aggregation.

  • Unnecessary port connections should be avoided to reduce the amount of data the Aggregator transformation stores in the data cache.
10. When using the Joiner transformation, the following things should be considered:

  • In case of large data sets, configure joiner transformation to use sorted input. In this case, the Integration service improves performance by minimizing disk input/output. Also, designate the master source as the source with fewer duplicate key values. When Integration Service processes a sorted joiner transformation, it caches rows for 100 unique keys at a time. If the master source contains many rows with the same key value, the Integration Service needs to cache more rows and hence performance can be slowed.

  • Designate the master source as the source with fewer rows. This is because during a session run, the Joiner transformation compares each row of the detail source against the master source. The fewer rows in the master, the fewer iterations of the join comparison occur, which speeds up the join process.
  • Perform the join in the database or the Source Qualifier transformation if possible.
11. When using the Lookup transformation, the following things should be considered:

  • If you have privileges to modify the database containing a lookup table, you can improve lookup initialization time by adding an index to the lookup table.
  • You can improve performance by indexing the columns in the lookup ORDER BY.
  • By default, the Integration Service generates an ORDER BY clause for a cached lookup. The ORDER BY clause contains all lookup ports. To increase performance, you can suppress the default ORDER BY clause and enter an override ORDER BY with fewer columns. Place two dashes '--' as a comment notation after the ORDER BY clause to suppress the clause that the Integration Service generates (see the sketch after this list).
  • If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance:- Equal to (=),Less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=), Not equal to (!=)
  • Improve session performance by caching small lookup tables.
  • If the lookup table is on the same database as the source table in the mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.
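
As referenced in the ORDER BY point above, a minimal sketch of a lookup SQL override that suppresses the generated ORDER BY clause (the table and column names are hypothetical):

SELECT customer_id, customer_name
FROM lkp_customer
ORDER BY customer_id --
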
12. If possible, run sessions and workflows concurrently to improve performance.

13. Increase the commit interval. By doing so, the number of times the Integration Service commits data decreases and hence performance is gained. However, if the commit interval is too high, the Integration Service may fill the database log file and cause the session to fail.

14. If the session runs with the 'high precision' setting enabled, then disabling this property may result in a performance gain.

15. Set the proper error tracing level in the session, e.g. the Verbose Data setting should only be used for debugging mappings and should be avoided in a normal run.

16. Use pipeline partitioning to improve session performance. Increasing the number of partitions or partition points increases the number of threads. If the Integration Service nodes contain enough CPU bandwidth, this approach can significantly improve performance.