Information Storage and Retrieval: January 2010


Friday, January 29, 2010

Informatica Basics and Interview FAQs

Q1 What is Informatica PowerCenter?
Ans PowerCenter is data integration software from Informatica Corporation. It provides an environment for loading data into a centralized location such as a data warehouse. Data can be extracted from multiple sources, transformed according to business logic, and loaded into files and relational targets. It has the following components:
PowerCenter Domain
PowerCenter Repository
Administration Console
PowerCenter Client
Repository Service
Integration service
Web Services Hub
Data Analyzer
Metadata Manager
PowerCenter Repository Reports



Q2 What is Data Integration?
Ans Data Integration is the process of combining data residing at different sources and providing the user with a unified view of this data.


Q3 Explain PowerCenter Repository?
Ans The repository consists of database tables that store metadata. Metadata describes the different types of objects, such as mappings or transformations, that you can create using the PowerCenter Client tools. The Integration Service uses repository objects to extract, transform, and load data. The repository also stores administrative information such as user names, passwords, permissions, and privileges. When any task is performed through a PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets, or creating workflows, metadata is added to the repository tables.
Q4. What is a Mapping?
Ans A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. When the Integration Service runs a session, it uses the instructions configured in the mapping to read, transform, and write data.
Q5. What is a mapplet?
Ans A mapplet is a reusable object that contains a set of transformations and enables you to reuse that transformation logic in multiple mappings.
Q6. What is Transformation?
Ans A transformation is a repository object that generates, modifies, or passes data. Transformations in a mapping represent the operations the Integration Service performs on the data. Data passes through transformation ports that are linked in a mapping or mapplet.

Q7. What are Mapping Parameters and Variables? What is the difference between them?

Mapping parameters and variables are used to make mappings more flexible.
A mapping parameter represents a constant value that can be defined before running a session. It retains the same value throughout the session. Using a parameter file, this value can be changed for subsequent sessions.

A mapping variable represents a value that can change between sessions. The Integration Service saves the value of a mapping variable to the repository at the end of each successful run and uses that value the next time the session runs.

Q8 What happens when you do not group values in an Aggregator transformation?
When values are not grouped in an Aggregator transformation, the Integration Service returns one row for all input rows. It typically returns the last row received, along with the result of the aggregation. However, if you specify a particular row to be returned (e.g., through the FIRST function), that row is returned instead.
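A minimal Python sketch (not Informatica code; column names are made up) of this behavior — the whole input acts as one group, and the last row received carries the aggregate:

```python
# Sketch: with no group-by ports, the aggregator treats all input rows as
# one group and emits a single output row -- the last input row's values
# alongside the aggregate result.
def aggregate_no_group(rows, agg_col):
    """rows: list of dicts; returns one row carrying the aggregate."""
    if not rows:
        return None
    result = dict(rows[-1])                      # last row received "wins"
    result[agg_col + "_sum"] = sum(r[agg_col] for r in rows)
    return result

rows = [{"dept": "A", "sal": 100}, {"dept": "B", "sal": 200}, {"dept": "C", "sal": 50}]
out = aggregate_no_group(rows, "sal")
# out == {"dept": "C", "sal": 50, "sal_sum": 350}
```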

Q9 How does using sorted input improve the performance of the Aggregator transformation?
When sorted input is used, the Integration Service assumes that all data is sorted by group, so it performs aggregate calculations as it reads the rows for each group. It does not wait for the entire data set, which reduces the amount of data cached during the session and improves session performance. With unsorted input, the Integration Service must read all of the data before performing the aggregation.

Sorted input should not be used when either of the following conditions is true:
  •  The aggregate expression uses nested aggregate functions
  •  The session uses incremental aggregation
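The caching difference can be sketched in Python (an illustration, not the actual engine): sorted input lets the service hold one group at a time, while unsorted input forces it to buffer every group until the end.

```python
from itertools import groupby

# Sketch: with sorted input, each group's aggregate can be emitted as soon
# as the group key changes, so only one group is cached at a time.
def aggregate_sorted(rows, key, val):
    out = []
    for k, grp in groupby(rows, key=lambda r: r[key]):  # relies on sort order
        out.append((k, sum(r[val] for r in grp)))       # one group in memory
    return out

# Unsorted input: every group must stay in the cache until all rows are read.
def aggregate_unsorted(rows, key, val):
    cache = {}
    for r in rows:                                      # buffer everything
        cache[r[key]] = cache.get(r[key], 0) + r[val]
    return sorted(cache.items())

rows = [{"g": 1, "v": 10}, {"g": 1, "v": 5}, {"g": 2, "v": 7}]
assert aggregate_sorted(rows, "g", "v") == aggregate_unsorted(rows, "g", "v")
```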
Q10 How is a join in the Joiner transformation different from a normal SQL join?

The Joiner transformation can join heterogeneous sources, whereas an SQL join can be done only on relational tables.

Q11 What are the criteria for deciding Master and Detail sources for a joiner transformation?
  • The master pipeline ends at the Joiner transformation, while the detail pipeline continues to the target, so the sources should be designated as master or detail accordingly.
  • For optimal performance, if an unsorted joiner transformation is used, then designate the source with fewer rows as the master source. During a session run, the joiner transformation compares each row of the master source against the detail source.
  • For a sorted joiner transformation, designate the source with fewer duplicate key values as master.
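Why the smaller source should be the master can be sketched in Python (hypothetical column names; this is a hash-join illustration, not Informatica internals) — the master is cached, and each detail row is streamed against that cache:

```python
# Sketch: an unsorted joiner caches the (smaller) master source, then
# streams each detail row against the cache -- fewer master rows means a
# smaller cache and fewer comparisons per detail row.
def joiner(master, detail, key):
    cache = {}
    for m in master:                      # build cache from master rows
        cache.setdefault(m[key], []).append(m)
    joined = []
    for d in detail:                      # stream the detail pipeline
        for m in cache.get(d[key], []):   # compare against cached master rows
            joined.append({**m, **d})
    return joined

master = [{"id": 1, "dept": "HR"}]
detail = [{"id": 1, "name": "Gaurav"}, {"id": 2, "name": "Manish"}]
result = joiner(master, detail, "id")
# result == [{"id": 1, "dept": "HR", "name": "Gaurav"}]
```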
Q12  Explain Informatica Domain, node and services

The PowerCenter domain is the fundamental administrative unit in PowerCenter. The domain supports the administration of the distributed services. A domain is a collection of nodes and services that you can group in folders based on administration ownership. A domain can run a single version of application services or multiple versions of application services. A domain that can run multiple versions of application services is called a mixed-version domain.

A node is the logical representation of a machine in a domain. One node in the domain acts as a gateway to receive service requests from clients and route them to the appropriate service and node. Services and processes run on nodes in a domain. The availability of a service or process on a node depends on how you configure the service and the node. A node can be a gateway node or a worker node. A gateway node is any node you configure to serve as a gateway for the domain. One node acts as the gateway at any given time. That node is called the master gateway. A gateway node can run application services, and it can serve as a master gateway node. The master gateway node is the entry point to the domain. The Service Manager on the master gateway node performs all domain operations on the master gateway node. The Service Managers running on other gateway nodes perform limited domain operations on those nodes.


Services for the domain include the Service Manager and a set of application services:
  • Service Manager. A service that manages all domain operations. It runs the application services and performs domain functions on each node in the domain. Some domain functions include authentication, authorization, and logging. When you start Informatica Services, you start the Service Manager.
  • Application services. Services that represent PowerCenter server-based functionality, such as the Repository Service and the Integration Service. The application services that run on a node depend on the way you configure the services. In a mixed-version domain, multiple versions of application services can run in the same domain.

The Service Manager and application services control PowerCenter security. The Service Manager manages users and groups that can log in to PowerCenter applications and authenticates the users who log in to PowerCenter applications. The Service Manager and application services authorize user requests from PowerCenter applications. 

The PowerCenter Administration Console consolidates the administrative tasks for domain objects such as services, nodes, licenses, and grids and for users, groups, and roles. You manage the domain and the security of the domain through the Administration Console.



----- To be continued











Thursday, January 28, 2010

Concatenate the rows

You have a table named SHOW_TABLE with two columns, SHOW_ID and EMP_NAME, containing the following data:

SHOW_ID           EMP_NAME
   1                      Gaurav
   2                      Kalpana
   1                      Ashok
   2                      Manish

Write a query to return data as follows:

SHOW_ID                EMP_NAME
     1                     Gaurav,Ashok
     2                     Kalpana, Manish


Solution:

select show_id, rn, ltrim(sys_connect_by_path(emp_name, ','), ',') concatenated
from
(
  select show_id,
         emp_name,
         row_number() over (partition by show_id order by show_id) rn,
         count(*) over (partition by show_id) cnt
  from (select distinct show_id, emp_name from show_table)
)
where rn = cnt                 -- keep only the row holding the complete chain
start with rn = 1
connect by prior show_id = show_id and prior rn = rn - 1
order by show_id, length(concatenated) desc
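What the hierarchical query computes can be sketched in plain Python: group the distinct (SHOW_ID, EMP_NAME) pairs and join the names per group. (On Oracle 11g Release 2 and later, the LISTAGG function does this directly.)

```python
# Sketch: build the comma-separated name list per show_id, preserving the
# order in which names are first seen within each group.
def concat_rows(pairs):
    out = {}
    for show_id, name in pairs:
        out.setdefault(show_id, []).append(name)
    return {k: ",".join(v) for k, v in out.items()}

data = [(1, "Gaurav"), (2, "Kalpana"), (1, "Ashok"), (2, "Manish")]
# concat_rows(data) == {1: "Gaurav,Ashok", 2: "Kalpana,Manish"}
```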

Friday, January 22, 2010

SQL, PL/SQL and DataWarehousing Interview Questions

Q1. What are collections? What are the advantages of using Collections?

PL/SQL collection types let you declare high-level datatypes similar to the arrays, sets, and hash tables found in other languages. In PL/SQL, array types are known as varrays (short for variable-size arrays), set types are known as nested tables, and hash table types are known as associative arrays. Each kind of collection is an ordered group of elements, all of the same type. Each element has a unique subscript that determines its position in the collection. When declaring collections, you use a TYPE definition. The advantage of collections is that they can be passed as parameters, so that subprograms can process arbitrary numbers of elements. You can also use collections to move data into and out of database tables using high-performance language features known as bulk SQL.

Q2. Tell some new features of Oracle 10g from a developer's point of view.


1. UNIX-style regular expressions  can be used  while performing queries and string manipulations. You use the REGEXP_LIKE operator in SQL queries, and the REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR functions anywhere you would use INSTR, REPLACE, and SUBSTR.

2. Flashback query functions were introduced. The functions SCN_TO_TIMESTAMP and TIMESTAMP_TO_SCN let you translate between a date and time and the system change number that represents the database state at a point in time.

3. Nested tables defined in PL/SQL have many more operations than previously. You can compare nested tables for equality, test whether an element is a member of a nested table, test whether one nested table is a subset of another, perform set operations such as union and intersection, and much more.

4. New datatypes BINARY_FLOAT and BINARY_DOUBLE represent floating-point numbers in IEEE 754 format. These types are useful for scientific computation where you exchange data with other programs and languages that use the IEEE 754 standard for floating-point.

5. Starting with Oracle 10g Release 1, the BINARY_INTEGER datatype was changed to be identical to PLS_INTEGER, so the datatypes can be used interchangeably.

Q3. What is high water mark?

Each table's data is stored in its own data segment. The high water mark is the boundary between used and unused space in a segment. When rows are deleted, the high water mark remains unchanged, but when a table is truncated, the high water mark drops.

Q4. What are mutating tables?

Mutating means changing. A mutating table is a table that is currently being modified by an UPDATE, DELETE, or INSERT statement. When a trigger tries to reference a table that is in a state of flux (being changed), the table is considered mutating and an error is raised, since Oracle should never return inconsistent data.


Q5 How does PL/SQL offer performance benefits over SQL?

Without PL/SQL, Oracle must process SQL statements one at a time. Programs that issue many SQL statements require multiple calls to the database, resulting in significant network and performance overhead. With PL/SQL, an entire block of statements can be sent to Oracle at one time. This can drastically reduce network traffic between the database and an application. You can use PL/SQL blocks and subprograms to group SQL statements before sending them to the database for execution.

Q6 Define %TYPE and %ROWTYPE


Ans The %TYPE attribute provides the datatype of a variable or database column. This is particularly useful when declaring variables that will hold database values. For example, assume there is a column named last_name in a table named employees. To declare a variable named v_last_name that has the same datatype as the last_name column, use dot notation and the %TYPE attribute, as follows:

v_last_name employees.last_name%TYPE;

Declaring v_last_name with %TYPE has two advantages. First, you need not know the exact datatype of last_name. Second, if you change the database definition of last_name, perhaps to make it a longer character string, the datatype of v_last_name changes accordingly at run time.

%ROWTYPE

In PL/SQL, records are used to group data. A record consists of a number of related fields in which data values can be stored. The %ROWTYPE attribute provides a record type that represents a row in a table. The record can store an entire row of data selected from the table or fetched from a cursor or cursor variable.

Q7. Define NOT NULL constraint

Ans: The NOT NULL constraint puts a constraint on the value of a variable: it cannot be NULL. If you try to assign NULL to such a variable, PL/SQL raises the predefined exception VALUE_ERROR.

Q8 What is the difference between the RANK() and DENSE_RANK() functions?
Ans RANK() leaves a gap in the ranking sequence when there are ties, while DENSE_RANK() does not. For example, with a tie at second place, RANK() returns 1, 2, 2, 4 whereas DENSE_RANK() returns 1, 2, 2, 3.
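The difference can be sketched in Python for an already-sorted list of values (an illustration of the two ranking rules, not Oracle internals):

```python
# Sketch: RANK leaves gaps after ties (counts skipped positions);
# DENSE_RANK increments by one on every new value.
def ranks(values):
    rank, dense = [], []
    for i, v in enumerate(values):
        if i and v == values[i - 1]:          # tie: repeat both ranks
            rank.append(rank[-1])
            dense.append(dense[-1])
        else:
            rank.append(i + 1)                # RANK counts skipped positions
            dense.append(dense[-1] + 1 if dense else 1)
    return rank, dense

r, d = ranks([100, 90, 90, 80])
# r == [1, 2, 2, 4]   (gap after the tie)
# d == [1, 2, 2, 3]   (no gap)
```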

Q9 What is the difference between CASE and DECODE?

DECODE works only with the equality operator. CASE can also handle inequality operators, e.g., <, >, <=, >=, !=.
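A Python sketch of the contrast (the `decode` helper mimics Oracle's DECODE argument list; `case_grade` is a made-up example of a searched CASE with range conditions):

```python
# Sketch: DECODE can only test equality against a list of search values,
# while CASE evaluates arbitrary boolean conditions (here, ranges).
def decode(value, *pairs_and_default):
    """Mimics DECODE(expr, s1, r1, s2, r2, ..., default)."""
    pairs, default = pairs_and_default, None
    if len(pairs) % 2:                     # odd count: last item is default
        pairs, default = pairs[:-1], pairs[-1]
    for search, result in zip(pairs[::2], pairs[1::2]):
        if value == search:                # equality only
            return result
    return default

def case_grade(score):                     # searched-CASE style conditions
    if score >= 90: return "A"
    elif score >= 75: return "B"
    else: return "C"

# decode("IN", "IN", "India", "US", "USA", "other") == "India"
# case_grade(80) == "B"
```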


Q10 Take us through the life-cycle of an SQL query when fired against the database.


These are the following steps during the execution of the SQL query:

  • A cursor is established to hold the context of the SQL statement. A cursor is a connection to a specific area in the Program Global Area (PGA) that contains information about the specific SQL statement.
  • The SQL statement is now parsed. Parsing is the process of checking an SQL statement to ensure that it is valid (syntax checking), as well as creating an optimal execution plan for the statement.
  • Preparation for the return of result set is done.
  • Bind variables are supplied with run time values.
  • The final statement is now executed using the execution plan that has been prepared.
  • Oracle returns the result set.

Q11 What is an Oracle Instance?


An Oracle instance is a software service that acts as an intermediary between application requests and the Oracle database. It consists of a set of background processes and shared memory. An Oracle instance is either started as part of the process of booting a server or can be started explicitly with commands. There are three steps in the start-up process:

  • Starting the instance process itself
  • Mounting the database, which consists of opening the control files for the instance
  • Opening the database which makes the database available for user requests.
An instance can mount and open only a single database. However, a database can be mounted and opened by one or more instances (using Real Application Clusters).



(I like to imagine an instance as a 'database armed and ready for action'.)


Q12 Define SGA and PGA


These are memory areas used by an individual Oracle instance.


SGA:
System Global Area is an area of memory that is accessible to all user processes of an Oracle instance. Three main areas used by the SGA are:

  • Redo log buffer which holds information used for recovery until the information can be written to the redo log
  • The shared pool which holds information that can be shared across user processes such as execution plans for SQL statements, compiled stored procedures and information retrieved from the Oracle data dictionary.
  • The database buffer pools which are used to hold data blocks.
PGA:
Program Global Area is an area of memory that is available only to a single server process. The PGA contains items like user variables and cursor information for an individual user's SQL statement, such as the number of rows retrieved so far. The PGA is also used for sorting data for an individual user process.

Q13 What is hard and soft parsing?

During parsing, after validating an SQL statement, the Oracle database computes a hash value for the statement to check whether a version of it already exists in the shared pool of the SGA. If Oracle cannot find the statement in the shared pool, the instance prepares an execution plan for it; this is called hard parsing. If Oracle finds the statement in the shared pool, it simply retrieves the existing execution plan; this is called soft parsing. Performance is improved by increasing the chances of soft parsing.

Q14 What are bind variables and how they improve performance of a SQL code?

A bind variable is a value in an SQL statement that is not known until the statement is run. In the SQL statement, bind variables act as placeholders for this subsequent data. In Oracle SQL syntax, a bind variable is indicated by a colon (:) before the variable name.

e.g. Select ename from emp where emp_id = :n_empid

The use of bind variables helps Oracle to do soft parsing of the SQL statements which enhances performance. 

Consider following 2 statements :

Select ename from emp where emp_id = 7

Select ename from emp where emp_id = 5

Both these statements should use identical optimizer plans. If Oracle has the execution plan for the first statement, it should reuse it for the second. However, it will not, because the hash value is computed from the characters of the SQL statement text, and the texts differ. Using bind variables resolves this problem.

Select ename from emp where emp_id = :n_empid

Using this syntax, the execution plan can be retrieved for n number of such statements.
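The text-hashing argument can be sketched in Python (SHA-1 stands in here for whatever hash the shared pool actually uses; the point is only that different literals produce different statement texts):

```python
import hashlib

# Sketch: cached plans are keyed by the SQL text. Literal values change the
# text (and its hash), so each variant hard-parses; a bind placeholder keeps
# the text -- and therefore the shared-pool lookup -- identical.
def sql_hash(text):
    return hashlib.sha1(text.encode()).hexdigest()

lit1 = sql_hash("select ename from emp where emp_id = 7")
lit2 = sql_hash("select ename from emp where emp_id = 5")
bound = sql_hash("select ename from emp where emp_id = :n_empid")

assert lit1 != lit2                 # different texts: two hard parses
assert bound == sql_hash("select ename from emp where emp_id = :n_empid")
```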

Q15 What is the difference between the DELETE and TRUNCATE commands?

DELETE is used to delete rows from a table. It records undo information for the deleted rows, so if ROLLBACK is issued, the rows are restored into the table. Only when COMMIT is executed after the DELETE are the rows permanently deleted.

TRUNCATE requires no COMMIT command to delete the rows. As soon as TRUNCATE is fired, the rows are permanently deleted; that is why it is classified as a DDL statement. TRUNCATE also resets the high water mark.

TRUNCATE is faster than DELETE since it does not record undo information for each deleted row.


Q16 What is a surrogate key? How it is different from a natural key? What are the advantages of a surrogate key?


A surrogate key (also known as an artificial or identity key) is a system-generated key with no business value. A surrogate key is a substitute for the natural primary key.
It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. Data warehouses typically use surrogate keys for the dimension tables' primary keys.

A natural key is a value that has business meaning and that people actually use, e.g., SSN or Customer Id.

Advantages of Surrogate Keys:
  • A natural primary key (e.g., Customer Number in a Customer table) can change, which makes updates more difficult. In many cases the natural primary key is a combination of more than one column; using a surrogate key simplifies the design.
  • Surrogate keys are numeric, so indexing on them is faster.
  • Surrogate keys allow proper tracking of slowly changing dimensions. For example, suppose employee E1 belongs to business unit BU1 on 1st June 2009 but migrates to business unit BU2 on 1st November 2009. If the natural key Employee_Id is used, every attribute of E1 is simply moved to BU2. With a surrogate key, a new record can be created for E1 to hold the data that belongs to BU2.
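The slowly-changing-dimension point can be sketched in Python (a type-2 SCD illustration with made-up field names, not any particular tool's implementation):

```python
# Sketch: with a surrogate key (sk), moving employee E1 to a new business
# unit adds a new dimension row instead of overwriting history.
dim = []
next_key = [1]

def add_version(emp_id, bu, effective):
    for row in dim:                        # close the currently open version
        if row["emp_id"] == emp_id and row["end"] is None:
            row["end"] = effective
    dim.append({"sk": next_key[0], "emp_id": emp_id,
                "bu": bu, "start": effective, "end": None})
    next_key[0] += 1

add_version("E1", "BU1", "2009-06-01")
add_version("E1", "BU2", "2009-11-01")
# dim now holds two rows for E1, each with its own surrogate key
```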
Q17 What are external tables?

External tables are a way to query data in a flat file as if the file were an Oracle table. Thus, they provide a convenient way to move data into and out of the database. No DML operations are permitted on external tables, and indexes cannot be created on them.

Internally, an external table uses the ORACLE_LOADER access driver to move data from a flat file into the database. It uses the ORACLE_DATAPUMP access driver to move data out of the database into a file.

Thursday, January 21, 2010

Incremental Data Extraction

Define logic to incrementally extract data through an Informatica mapping. On every session run, the mapping should start loading data from the row where it stopped in the last run. Every row in the table has a rowdatetime column.

Solution:

Create a user-defined mapping variable $$LastUpdateDateTime that saves the timestamp of the last row the Integration Service read in the previous session. There is a built-in variable $$$SessStartTime for the session start timestamp. Use the following filter in the source filter of the Source Qualifier to incrementally extract data:

Mytable.rowdatetime > TO_DATE('$$LastUpdateDateTime') and mytable.rowdatetime < TO_DATE('$$$SessStartTime')
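The watermark logic can be sketched in Python (assumed field names; this only illustrates the filter-and-save cycle, the real persistence is done by the repository):

```python
from datetime import datetime

# Sketch: each run picks rows with rowdatetime after the saved watermark and
# before the session start, then advances the watermark to the newest row
# read -- mirroring how $$LastUpdateDateTime is saved per successful run.
def extract_incremental(rows, last_update, sess_start):
    picked = [r for r in rows if last_update < r["rowdatetime"] < sess_start]
    new_watermark = max((r["rowdatetime"] for r in picked), default=last_update)
    return picked, new_watermark

rows = [{"id": 1, "rowdatetime": datetime(2010, 1, 1)},
        {"id": 2, "rowdatetime": datetime(2010, 1, 15)}]
picked, wm = extract_incremental(rows, datetime(2010, 1, 5), datetime(2010, 2, 1))
# picked contains only id 2; wm advances to 2010-01-15
```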

 

Constraint Based Loading and Target Load Order

Define Informatica's Constraint Based Loading and Target Load Order. What is the difference between the two?

Constraint-based loading is a session-level property. When this option is selected in the session properties, the Integration Service orders the target load on a row-by-row basis. For every row, the transformed row is loaded first to the primary key table and then to the foreign key table. The target tables in the mapping should have one active source (targets in separate pipelines do not qualify) and should have non-circular key relationships. Also, the session option 'Treat Source Rows As' should be set to Insert; updates cannot be used with constraint-based loading.

A Target Load Order Group is a collection of source qualifiers, transformations, and targets in a mapping, so a mapping containing more than one pipeline is eligible for Target Load Order. Target Load Order sets the order in which the Integration Service sends rows to targets in different target load order groups.

Differences:

1. Constraint-based loading requires targets to have only one active source, while Target Load Order is for targets with different active sources.

2. Constraint-based loading cannot be used to maintain referential integrity for updates, while Target Load Order can be used to maintain referential integrity when inserting, deleting, or updating tables that have primary key and foreign key constraints.

Tuesday, January 19, 2010

Query to retrieve every Nth row from an Oracle table

The query is as follows:

select * from my_table where (rowid,0) in (select rowid, mod(rownum,n) from my_table )
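What the (rowid, 0) IN (...) trick selects can be sketched in Python: the subquery pairs each rowid with mod(rownum, n), and only rows whose position is a multiple of n pair with 0.

```python
# Sketch: keep every row whose 1-based position is a multiple of n,
# i.e. the rows for which mod(rownum, n) == 0 in the Oracle query.
def every_nth(rows, n):
    return [r for i, r in enumerate(rows, start=1) if i % n == 0]

# every_nth(["a", "b", "c", "d", "e", "f"], 2) == ["b", "d", "f"]
```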

Query to retrieve Nth row from an Oracle table

The query is as follows:

select * from my_table where rownum <= n
MINUS
select * from my_table where rownum < n
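The MINUS approach in Python terms: take the first n rows, remove the first n-1, and the nth row is what remains. (Like MINUS, this assumes the nth row does not duplicate an earlier row.)

```python
# Sketch: "rownum <= n" keeps the first n rows, "rownum < n" the first n-1;
# subtracting the second set from the first leaves exactly the nth row.
def nth_row(rows, n):
    first_n = rows[:n]               # rownum <= n
    first_n_minus_1 = rows[:n - 1]   # rownum < n
    return [r for r in first_n if r not in first_n_minus_1]

# nth_row(["a", "b", "c", "d"], 3) == ["c"]
```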

How does one eliminate duplicate rows in an Oracle Table?

Method 1:

DELETE from table_name A
where rowid > (select min(rowid) from table_name B where A.key_values = B.key_values);

Method 2:

create table table_name2 as select distinct * from table_name1;
drop table table_name1;
rename table_name2 to table_name1;

In this method, all the indexes, constraints, triggers, etc. have to be re-created.

Method 3:

DELETE from table_name t1
where exists (select 'x' from table_name t2 where t1.key_value=t2.key_value
                      and t1.rowid > t2.rowid)

Method 4:

DELETE from table_name where rowid not in (select min(rowid) from table_name group by key_value)
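Method 4's keep-the-minimum-rowid rule can be sketched in Python, with list position standing in for rowid (an illustration of which rows survive, not of how the DELETE executes):

```python
# Sketch: keep the first physical occurrence (minimum "rowid", modeled by
# list position) of each key value and drop the rest.
def dedupe(rows, key):
    seen = set()
    kept = []
    for r in rows:                 # iterate in "rowid" order
        if r[key] not in seen:
            seen.add(r[key])
            kept.append(r)         # this is the min-rowid row for its key
    return kept

rows = [{"k": 1, "v": "x"}, {"k": 2, "v": "y"}, {"k": 1, "v": "z"}]
# dedupe(rows, "k") == [{"k": 1, "v": "x"}, {"k": 2, "v": "y"}]
```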