Information Storage and Retrieval


Monday, February 22, 2010

'LIKE' doesn't like UNDERSCORE

Write a query to return those values from an Oracle table where the column myword contains an UNDERSCORE ( _ ) character. e.g. if the column myword contains the following values:

myword
----------
FAC_FAC
FACFAC
_FACE
FACE
FACE_

The output should be:

myword
----------
FAC_FAC
_FACE
FACE_

Solution:

Generally, most people will write the query as follows:

select myword from mytable where myword like '%_%'

However, this query will not give the expected results: it returns every row, because '_' is a wildcard in LIKE that matches any single character, so '%_%' matches any non-empty value.

The correct query will be as follows:

select myword from mytable where myword like '%\_%' escape '\'

We have to escape the '_' character so that it is matched literally rather than treated as a wildcard.
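
An alternative, if you would rather avoid LIKE wildcards altogether, is the INSTR function, which returns the position of a substring (0 if it is absent) and treats '_' as an ordinary character:

select myword from mytable where instr(myword, '_') > 0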

Thursday, February 18, 2010

Generate a Pyramid

You have a table pyramid_table with 2 columns, CH and NUM. It contains 1 row: the column CH contains a single character and the column NUM contains a number.

Write a SQL query to generate a pyramid of CH characters of height NUM. e.g. if CH is '*' and NUM is 5, the output should be:

    *
   ***
  *****
 *******
*********

Solution:

select
lpad(' ', num - rownum, ' ') || substr(lpad(ch, 2*num, ch), 1, 2*rownum - 1) || lpad(' ', num - rownum, ' ') as pyramid
from
pyramid_table, all_objects
where
rownum <= num
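
As a variation, on Oracle 9i and later the cross join against all_objects can be replaced by a connect by level row generator, so the query no longer depends on a dictionary view having enough rows. A sketch of the same pyramid built this way:

select lpad(' ', num - level) || rpad(ch, 2*level - 1, ch) as pyramid
from pyramid_table
connect by level <= num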

Wednesday, February 17, 2010

Count of a Particular Character in a string

Write a query to count the occurrences of a particular character in a string. e.g. return the number of times 'N' appears in the string 'NITININ'. The result should be 3.

Solution:

select length('NITININ') - length(replace('NITININ','N','')) from dual
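
The trick works because replacing every 'N' with an empty string shrinks the string by exactly the number of occurrences. It also generalizes to substrings longer than one character if you divide by the substring length; a sketch:

select (length('ABCABCABC') - length(replace('ABCABCABC','BC',''))) / length('BC') as cnt
from dual

Here 'BC' occurs 3 times, so the length difference is 6 and dividing by length('BC') = 2 gives 3.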

Count of Individual Characters in a String

Write a query to display the count of each individual character in a given string. e.g. if the given string is "mesopotamia", the output should be:

m  2
e  1
i  1
s  1
o  2
p  1
t  1
a  2

Solution:

select ind, count(ind)
from
(
    select substr(given_string, rownum, 1) ind
    from
    (
        select 'mesopotamia' given_string from all_objects
    )
    where rownum <= length(given_string)
)
group by ind
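
As with the pyramid query, a connect by level row generator can stand in for all_objects here; a sketch of the same logic:

select ch, count(*) as cnt
from
(
    select substr(str, level, 1) ch
    from (select 'mesopotamia' str from dual)
    connect by level <= length(str)
)
group by ch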

Sunday, February 14, 2010

Informatica Scenario based Questions and Solutions (Part-1)

Scenario1:

We have a source table containing 3 columns: Col1, Col2 and Col3. There is only 1 row in the table, as follows:

Col1  Col2  Col3
----------------
a     b     c

There is a target table containing only 1 column, Col. Design a mapping so that the target table contains 3 rows, as follows:

Col
-----
a
b
c

Solution (without using a Normalizer transformation):

Create 3 expression transformations exp_1, exp_2 and exp_3 with 1 port each. Connect col1 from the Source Qualifier to the port in exp_1, col2 to the port in exp_2 and col3 to the port in exp_3. Make 3 instances of the target. Connect the port from exp_1 to target_1, the port from exp_2 to target_2 and the port from exp_3 to target_3.
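
For reference, the set logic this mapping implements is a simple unpivot. In plain SQL (only an illustration, with src as a placeholder name for the source table) it would be:

select col1 as col from src
union all
select col2 from src
union all
select col3 from src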

Scenario 2:

There is a source table that contains duplicate rows. Design a mapping to load all the unique rows into 1 target and all the duplicate rows (only 1 occurrence of each) into another target.

Solution :

Bring all the columns from the Source Qualifier to an Aggregator transformation. Check group by on the key column. Create a new output port count_col in the Aggregator transformation and give it the expression count(key_column). Make a Router transformation with 2 groups, Dup and Non-Dup, and set the router conditions count_col>1 in the Dup group and count_col=1 in the Non-Dup group. Load these 2 groups into the 2 different targets.
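
The equivalent logic in SQL terms (again just an illustration, with src and key_column as placeholders) would be:

-- unique rows
select key_column from src group by key_column having count(*) = 1

-- one occurrence of each duplicated row
select key_column from src group by key_column having count(*) > 1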

Scenario 3:

There is a source table containing 2 columns Col1 and Col2 with data as follows:

Col1  Col2
a     l
b     p
a     m
a     n
b     q
x     y

Design a mapping to load a target table with following values from the above mentioned source:

Col1  Col2
a     l,m,n
b     p,q
x     y

Solution:

Use a Sorter transformation after the Source Qualifier to sort the rows with Col1 as the key. Then build an Expression transformation with the following ports (the order of the ports matters and should be exactly as given):

1. Col1_prev : a variable port. Its expression is the variable val, so it always holds the previous row's Col1.
2. Col1 : an input/output port coming from the Sorter transformation.
3. Col2 : an input port coming from the Sorter transformation.
4. val : a variable port. Its expression is Col1.
5. Concatenated_value : a variable port. Its expression is decode(Col1,Col1_prev,Concatenated_value||','||Col2,Col2), i.e. append Col2 to the running value while the group continues, otherwise restart the value with Col2.
6. Concatenated_Final : an output port containing the value of Concatenated_value.

After the Expression, build an Aggregator transformation. Bring the ports Col1 and Concatenated_Final into the Aggregator. Group by Col1 and do not give any aggregate expression; with a group by and no aggregate function, the Aggregator effectively returns the last row of each group, which carries the complete concatenated value.

Connect the ports Col1 and Concatenated_Final from the Aggregator to the target table.
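
Incidentally, if the source is Oracle 11gR2 or later, the same result set could be produced directly in SQL with LISTAGG (shown only for comparison, with src as a placeholder):

select col1, listagg(col2, ',') within group (order by col2) as col2
from src
group by col1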

Scenario 4:

Design an Informatica mapping to load first half records to 1 target while other half records to a separate target.

Solution:

You will have to assign a row number to each record. To achieve this, either use Oracle's pseudocolumn rownum in the Source Qualifier query or use the NEXTVAL port of a Sequence Generator. Let's name this column rownumber.

From Source Qualifier, create 2 pipelines:

First Pipeline:
Carry the first port Col1 from the SQ transformation into an Aggregator transformation. Create a new output port "tot_rec" and give it the expression count(Col1). Do not group by any port; this gives us the total number of records in the source table. Carry the port tot_rec to an Expression transformation. Add another port DUMMY in the Expression transformation with default value 1.

Second Pipeline:
From the SQ transformation, carry all the ports (including the additional port rownumber generated by rownum or the Sequence Generator) to an Expression transformation. Add another port DUMMY in the Expression transformation with default value 1.

Join these 2 pipelines with a Joiner transformation on the common port DUMMY. Carry all the source table ports and the 2 additional ports tot_rec and rownumber to a Router transformation. Add 2 groups in the Router: FIRST_HALF and SECOND_HALF. Give the condition rownumber<=tot_rec/2 in FIRST_HALF and rownumber>tot_rec/2 in SECOND_HALF. Connect the 2 groups to 2 different targets.
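
The joiner-on-DUMMY trick simply pairs every row with the single-row total. In SQL terms, the same split would look like this (an illustration only, with src as a placeholder):

-- first half of the rows; use rownumber > tot_rec/2 for the second half
select t.*
from (select s.*, rownum rownumber from src s) t,
     (select count(*) tot_rec from src)
where t.rownumber <= tot_rec / 2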

Wednesday, February 3, 2010

Performance Tuning in Informatica

Performance tuning is done to improve session performance. To tune a session, performance bottlenecks are identified and eliminated. The bottlenecks should be looked for in the following places, in the given order:

1. Target
2. Source
3. Mapping
4. Session
5. System

General Techniques to identify bottlenecks:

1. Thread statistics should be analyzed to identify source, target or transformation bottlenecks. By default, the Integration Service uses 1 reader thread, 1 transformation thread and 1 writer thread to process a session. The thread with the highest busy percentage identifies the bottleneck.

2. Add a Filter transformation after each Source Qualifier. Set the filter condition to false so that no data is passed from the filter. If the time to run the session remains about the same, there is a source bottleneck.

3. Add a Filter transformation before each target definition. Set the filter condition to false so that no data is loaded into the target. If the time taken to run the session is almost the same as before, there is a mapping bottleneck.

General Techniques to improve session performance:

1. If a target bottleneck is identified, consider dropping key constraints and indexes on the target tables before the session is run. Rebuild those constraints and indexes after the session run.

2. Use bulk loading in case of insertion of large amounts of data. When bulk loading, the Integration Service bypasses the database log, which speeds up performance. But there is a trade-off: without writing to the database log, the target database cannot perform rollback.

3. If multiple source tables are joined in one source qualifier, optimize the query by using hints and/or adding indexes on the joined columns, GROUP BY or ORDER BY columns. Configure the database to run parallel queries to improve performance.

4. Reduce unnecessary transformations in the mapping and delete unnecessary links between transformations to optimize the mapping.

5. If the session reads from a flat file source, performance can be improved by setting the number of bytes the Integration Service reads per line. By default, the Integration Service reads 1024 bytes per line. If each line in the source file is less than the default setting, the 'line sequential buffer length' should be decreased in the session properties.

6. Use filters as early as possible in the mapping.

7. Eliminate unnecessary datatype conversions. Try to use integer values in place of other datatypes when performing comparisons using lookup and filter transformations.

8. The Integration Service reads expressions written with operators faster than expressions with functions. Wherever possible, operators should be used, e.g. the || operator in place of the CONCAT function.

9. When Aggregator Transformation is used, following guidelines should be used -

  • The group by should be done on simple columns. If possible, number type columns should be preferred for the GROUP BY.

  • Use sorted input. When the sorted input option is used, the Integration Service assumes that all data is sorted by group. It does not wait for the whole data set to come into memory before aggregating.

  • Use incremental aggregation if the changes in the source affect less than half the target. In this case, the Integration Service updates the target incrementally, rather than processing the entire source and redoing the same calculations every time you run the session.

  • Filter data before aggregation. This avoids unnecessary aggregation.

  • Unnecessary port connections should be avoided to reduce the amount of data the Aggregator transformation stores in the data cache.
10. When using Joiner Transformation, following things should be considered:

  • In the case of large data sets, configure the Joiner transformation to use sorted input. The Integration Service then improves performance by minimizing disk input/output. Also, designate the master source as the source with fewer duplicate key values: when the Integration Service processes a sorted Joiner transformation, it caches rows for 100 unique keys at a time, so if the master source contains many rows with the same key value, the Integration Service needs to cache more rows and performance can suffer.

  • Designate the master source as the source with fewer rows. During a session run, the Joiner transformation compares each row of the detail source against the master source; the fewer rows in the master, the fewer iterations of the join comparison occur, which speeds up the join process.
  • Perform the join in the database or in the Source Qualifier transformation if possible.
11. When using Lookup Transformation, following things should be considered:

  • If you have privileges to modify the database containing a lookup table, you can improve lookup initialization time by adding an index to the lookup table.
  • You can improve performance by indexing the columns in the lookup ORDER BY.
  • By default, the Integration Service generates an ORDER BY clause for a cached lookup. The ORDER BY clause contains all lookup ports. To increase performance, you can suppress the default ORDER BY clause and enter an override ORDER BY with fewer columns. Place two dashes '--' as a comment notation after the ORDER BY clause to suppress the ORDER BY clause that the Integration Service generates (see the sketch after this list).
  • If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance:- Equal to (=),Less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=), Not equal to (!=)
  • Improve session performance by caching small lookup tables.
  • If the lookup table is on the same database as the source table in the mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.
12. If possible, run sessions and workflows concurrently to improve performance.

13. Increase the commit interval. By doing so, the number of times the Integration Service commits data decreases, and hence performance is gained. However, if the commit interval is too high, the Integration Service may fill the database log file and cause the session to fail.

14. If the session runs with the 'high precision' setting enabled, disabling this property may result in a performance gain.

15. Set the proper error tracing level in the session. e.g. the Verbose Data setting should only be used for debugging mappings and should be avoided in normal runs.

16. Use pipeline partitioning to improve session performance. Increasing the number of partitions or partition points increases the number of threads. If the Integration Service nodes contain enough CPU bandwidth, this approach can significantly improve performance.
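
As referenced in point 11, the generated ORDER BY of a cached lookup can be suppressed with trailing comment notation. A sketch of such a lookup SQL override, with invented table and column names for illustration:

-- the trailing '--' comments out the ORDER BY that the Integration Service appends
select item_id, item_name, price
from item_lookup
order by item_id --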

Friday, January 29, 2010

Informatica Basics and Interview FAQs

Q1 What is Informatica Powercenter?
Ans PowerCenter is data integration software from Informatica Corporation which provides an environment that allows you to load data into a centralized location such as a data warehouse. Data can be extracted from multiple sources, transformed according to the business logic and loaded into files and relational targets. It has the following components:
PowerCenter Domain
PowerCenter Repository
Administration Console
PowerCenter Client
Repository Service
Integration Service
Web Services Hub
Data Analyzer
Metadata Manager
PowerCenter Repository Reports



Q2 What is Data Integration?
Ans Data Integration is the process of combining data residing at different sources and providing the user with a unified view of this data.


Q3 Explain PowerCenter Repository?
Ans The repository consists of database tables that store metadata. Metadata describes the different types of objects, such as mappings or transformations, that you can create using the PowerCenter Client tools. The Integration Service uses repository objects to extract, transform and load data. The repository also stores administrative information such as user names, passwords, permissions and privileges. When any task is performed through a PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets or creating workflows, metadata is added to the repository tables.
Q4. What is a Mapping?
Ans A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. When the Integration Service runs a session, it uses the instructions configured in the mapping to read, transform, and write data.
Q5. What is a mapplet?
Ans A mapplet is a reusable object that contains a set of transformations and enables to reuse that transformation logic in multiple mappings.
Q6. What is Transformation?
Ans A transformation is a repository object that generates, modifies or passes data. Transformations in a mapping represent the operations the Integration Service performs on the data. Data passes through transformation ports that are linked in a mapping or mapplet.

Q7. What are Mapping Parameters and Variables? What's the difference between them?

Mapping parameters and variables are used to make mappings more flexible.
A mapping parameter represents a constant value that can be defined before running a session. It retains the same value throughout the session. Using a parameter file, this value can be changed for subsequent sessions.

A mapping variable represents a value that can change across session runs. The Integration Service saves the value of a mapping variable to the repository at the end of each successful run and uses that value in the next run of the session.

Q8 What happens when you do not group values in an aggregator transformation?
When the values are not grouped in an Aggregator transformation, the Integration Service returns 1 row for all input rows. It typically returns the last row received, together with the result of the aggregation. However, if you specify a particular row to be returned (e.g. through the FIRST function), then that row is returned.

Q9 How does using sorted input improves the performance of Aggregator Transformation?
When sorted input is used, the Integration Service assumes that all data is sorted by group and performs aggregate calculations as it reads the rows of each group. It does not wait for the whole data set, which reduces the amount of data cached during the session and improves session performance. With unsorted input, in contrast, the Integration Service waits for the whole data set before performing the aggregation.

Sorted input should not be used when either of the following conditions is true:
  •  The aggregate expression uses nested aggregate functions
  •  The session uses incremental aggregation
Q10 How is a join in the Joiner transformation different from a normal SQL join?

The Joiner transformation can join heterogeneous sources, whereas a SQL join can be done only on database tables.

Q11 What are the criteria for deciding Master and Detail sources for a joiner transformation?
  • The master pipeline ends at the Joiner transformation while the detail pipeline continues to the target, so the sources should be designated as master or detail accordingly.
  • For optimal performance with an unsorted Joiner transformation, designate the source with fewer rows as the master source. During a session run, the Joiner transformation compares each row of the detail source against the master source.
  • For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master.
Q12  Explain Informatica Domain, node and services

The PowerCenter domain is the fundamental administrative unit in PowerCenter. The domain supports the administration of the distributed services. A domain is a collection of nodes and services that you can group in folders based on administration ownership. A domain can run a single version of application services or multiple versions of application services. A domain that can run multiple versions of application services is called a mixed-version domain.

A node is the logical representation of a machine in a domain. One node in the domain acts as a gateway to receive service requests from clients and route them to the appropriate service and node. Services and processes run on nodes in a domain. The availability of a service or process on a node depends on how you configure the service and the node. A node can be a gateway node or a worker node. A gateway node is any node you configure to serve as a gateway for the domain. One node acts as the gateway at any given time. That node is called the master gateway. A gateway node can run application services, and it can serve as a master gateway node. The master gateway node is the entry point to the domain. The Service Manager on the master gateway node performs all domain operations on the master gateway node. The Service Managers running on other gateway nodes perform limited domain operations on those nodes.


Services for the domain include the Service Manager and a set of application services:
  • Service Manager. A service that manages all domain operations. It runs the application services and performs domain functions on each node in the domain. Some domain functions include authentication, authorization, and logging. When you start Informatica Services, you start the Service Manager.
  • Application services. Services that represent PowerCenter server-based functionality, such as the Repository Service and the Integration Service. The application services that run on a node depend on the way you configure the services. In a mixed-version domain, multiple versions of application services can run in the same domain.

The Service Manager and application services control PowerCenter security. The Service Manager manages users and groups that can log in to PowerCenter applications and authenticates the users who log in to PowerCenter applications. The Service Manager and application services authorize user requests from PowerCenter applications. 

The PowerCenter Administration Console consolidates the administrative tasks for domain objects such as services, nodes, licenses, and grids and for users, groups, and roles. You manage the domain and the security of the domain through the Administration Console.



----- To be continued