Information Storage and Retrieval: 2010


Tuesday, December 28, 2010

SIEBEL CLINICAL TRIAL MANAGEMENT SYSTEM and associated Glossary

Oracle's Siebel Clinical Trial Management System (CTMS) is a CRM software solution for managing clinical trials. Siebel Clinical allows pharmaceutical companies, biotechnology companies, and contract research organizations (CROs) to better manage the clinical trial process, maintain the quality of clinical trials, and manage investigator relationships. It provides a comprehensive set of tools for clinical research associates (CRAs), clinical investigators, and site coordinators, including a personalized Internet portal to conduct study activities more efficiently.

Generic terms related to Clinical Trials:

Clinical Trial: A systematic study of a test article (treatment, drug or device) in one or more human subjects. An investigation in human subjects intended to discover or verify the clinical, pharmacological and/or other pharmacodynamic effects of an investigational product(s), and/or to identify any adverse reactions to an investigational product(s), and/or to study absorption, distribution, metabolism, and excretion of an investigational product(s) with the object of ascertaining its safety and/or efficacy.

Subject/Trial subject: An individual who participates in a clinical trial, either as recipient of the investigational product(s) or as a control.

Application: An application made to a health authority to market or license a new product.

Visit: A clinical encounter for a subject in a trial. Visits are frequently referred to as occurring on Day X or during Week Y.

Sites: Locations where clinical trials are conducted. They are typically clinics or hospitals where investigators see subjects and perform study procedures, such as medical checks.

Product (Compound): A pharmaceutical form of an active ingredient or placebo being tested or used as a reference in a clinical trial. Products can also be devices, such as pacemakers.

Program: Groups of Clinical Studies or Clinical Trials for the same compound.

Projects: Groups of studies within a program (Oracle Clinical only).

Investigator: A person responsible for the conduct of the clinical trial at a trial site. If a trial is conducted by a team of individuals at a trial site, the investigator is the responsible leader of the team and may be called the principal investigator.

Sub-investigator:  Any individual member of the clinical trial team designated and supervised by the investigator at a trial site to perform critical trial-related procedures and/or to make important trial-related decisions (e.g., associates, residents, research fellows).

Case Report Form (CRF): A printed, optical, or electronic document designed to record all of the protocol-required information to be reported to the sponsor on each trial subject. The CRF is the means by which the clinical data for patients is collected.

Clinical Research Associate (CRA): Person employed by a sponsor, or by a contract
research organization acting on a sponsor’s behalf, who monitors the progress of investigator sites participating in a clinical study. At some sites (primarily in academic settings), clinical research coordinators are called CRAs.

Clinical Research Coordinator (CRC): Person who handles most of the administrative responsibilities of a clinical trial, acts as liaison between investigative site and sponsor, and reviews all data and records before a monitor’s visit. Synonyms: trial coordinator, study coordinator, research
coordinator, clinical coordinator, research nurse, protocol nurse.

Discrepancy: A problem found with the data reported on the CRF pages by investigators for specific patients.

Protocol/Study Protocol: The study protocol is the blueprint that all researchers will follow.
A study protocol is a document that describes, in detail, the plan for conducting the clinical study. The study protocol explains the purpose and function of the study as well as how to carry it out. Some specific things included in the protocol are the reason for the study, the number of participants, eligibility and exclusion criteria, details of the intervention or therapy the participants will receive (such as frequency and dosages), what data will be gathered, what demographic information about the participants will be gathered, steps for clinical caregivers to carry out, and the study endpoints. A single standard protocol must be used without deviation to ensure that the resulting data will be significant and reliable.

===========================================

SIEBEL CTMS:

Each clinical trial starts with a protocol for a specific compound (product). Each protocol is conducted at sites and managed by site personnel. A protocol can have many versions and multiple protocols can roll up to a single program. Protocols can also roll up to regions. Subjects are screened and enrolled at protocol sites for specific protocol versions. Protocol sites are paid, based on the activities they complete. Visits and activities are generated for subjects based on templates defined for the protocol. The Clinical Research Associates perform site initiation activities for protocol sites and submit periodic trip reports. A protocol can also be associated with one or more projects.

The entities and tables in Siebel CTMS are as shown below:

Entity                                             Tables
Account                                     S_ORG_EXT, S_PARTY
Activity                                      S_EVT_ACT
Address                                     S_ADDR_PER
Affiliation                                   S_PTCL_ST_CON_LS
Application                                 S_CL_PGM_APP_LS
Clinical Payment                         S_SRC_PAYMENT
Contact                                     S_CONTACT, S_PARTY
Design                                      S_CL_DSGN_LS
Position                                     S_POSTN
Product                                      S_PROD_INT
Program                                     S_CL_PGM_LS
Project                                        S_PROJ
Project Subcontractor                    S_PROG_ORG
Project Subcontractor Contact         S_PROJ_ORG_CON
Protocol                                      S_CL_PTCL_LS
Protocol Site                                 S_PTCL_SITE_LS
Subject                                        S_CL_SUBJ_LS
Subject Status                              S_CL_SUBJ_ST_LS
Subject Template                          S_SUBJ_TMPL_LS
Template Version                          S_SBJTMP_VER_LS
Template Visit                              S_TMPL_PLANITEM
Trip Report                                   S_EVT_ACT
Visit                                            S_EVT_ACT

A typical Clinical Trial Scenario:
The clinical director and the study manager, working for a clinical research organization, or pharmaceutical, biotech, or medical device company, have administrator responsibilities in Siebel Clinical to:
■ Set up a new treatment study program.
■ Create one or more protocols designed to assess the safety and efficacy of certain compounds in
the treatment of the disease.
■ Set up the geographic regions where the protocols are to be carried out.
■ Compile a list of documents that are critical to the study and implement tracking at the protocol,
region, and site levels, and for accounts and contacts.
■ Create a subject visit template to facilitate consistent application of the protocol across sites and
subjects. This template is used to set up subject visit schedules and activities according to the
guidelines laid out in the protocol.

When the program, protocol, and subject visit templates have been set up, the CRAs who are the
end users of the Siebel Clinical product do the following:
■ Enter data about the:
  ❏ Sites where the protocols are carried out.
  ❏ Members to be assigned to the teams at the site, region, and protocol levels.
  ❏ Accounts, institutions such as hospitals and clinics where the studies are conducted.
  ❏ Contacts, site personnel such as investigators, site coordinators, and nurse practitioners who carry out the protocols.
  ❏ Subjects recruited for the clinical trial.
■ Screen and enroll subjects and, if necessary, rescreen the subjects.
■ Use the subject visit template to set up detailed schedules for the subjects’ visits to the sites.
■ Track required documents at the protocol, region, or site level, or for accounts or contacts.

Some major steps to be performed in Siebel CTMS to achieve the above mentioned scenario are:

1. Create a new Clinical Program. The clinical program is the highest-level initiative in Siebel Clinical. Protocols, regions, sites, and subjects must be associated with a program.

2. Set up a Protocol

3. Set Up Regions

4. Define a Subject Visit template. Subject visit templates allow you to set up a template schedule based on the protocol. The template is then used to generate screening, rescreening, and enrollment schedules for each subject, according to the subject’s screening, rescreening, and enrollment dates.

5. Create an Account and Contacts. An account is the institution from which clinical trials are managed. Typically, it is the facility where the investigators conduct the trials. More than one site can be associated with an account, and one account can carry out multiple protocols. IRBs (institutional review boards), central labs, CROs (contract research organizations), and other subcontractors may also be tracked as accounts. Contacts is the term used for personnel working at clinical sites. This includes the investigators, typically medical professionals who are also researchers, and the site coordinators, who may be the practicing nurses administering the treatment plan according to the clinical protocol.

6. Create a Site. The site is the group at an account, headed by a principal investigator, who carries out a particular protocol. In Siebel Clinical, a separate site record must exist for each unique combination of protocol,
account, and principal investigator.

======================================

Hierarchical relationship of programs, protocols, regions, and sites

Monday, December 27, 2010

DAC Parameters for Informatica Workflows

The main location for the DAC parameters is the 2 files located in $DAC_HOME/Informatica/parameters/input:
  • parameterfileOLTP.txt is used for source systems/SDE mappings
  • parameterfileDW.txt is used for DW mappings/SIL and PLP mappings
Another place to define the parameters is the task's Parameters tab in the DAC console itself. The parameter values defined there override those set in the text files.

When DAC starts to execute a task, it creates the parameter file needed for each Informatica workflow on the fly, based on these definitions. DAC changes the format of the file to suit Informatica, including changing the command name into the actual Informatica session name [FolderName.SessionName]. It also adds common DAC parameters into each workflow parameter file.

Oracle Business Analytics Warehouse

The Oracle Business Analytics Warehouse is a unified data repository for all
customer-centric data. The purpose of the Oracle Business Analytics Warehouse is to
support the analytical requirements of Oracle E-Business Suite, Oracle's Siebel CRM,
and PeopleSoft Applications. It is composed of the following components:

1. A complete relational enterprise data warehouse data model with numerous
pre-built star schemas encompassing many conformed dimensions and several
hundred fact tables.

2. Prebuilt data extractors to incorporate data from external applications into the
Oracle Business Analytics Warehouse.

3. A set of ETL (extract-transform-load) processes (Informatica) that takes data from Oracle
E-Business Suite, Siebel CRM, PeopleSoft Enterprise and other transactional
systems (OLTP), and creates the Oracle Business Analytics Warehouse tables.

4. The Oracle Business Intelligence Data Warehouse Administration Console (DAC),
a centralized console for the set up, configuration, administration, loading, and
monitoring of the Oracle Business Analytics Warehouse.

Tuesday, December 21, 2010

Date Manipulations in OBIEE

Dates are manipulated in OBIEE using the TIMESTAMPADD function.

The syntax is as follows:

TIMESTAMPADD(SQL_TSI_interval, integer_expr, timestamp_expr)

e.g. if you want to add a month to a given date, it can be done as follows:

TIMESTAMPADD(SQL_TSI_MONTH, 1, given_date)

Other SQL_TSI intervals are:
  • SQL_TSI_SECOND
  • SQL_TSI_MINUTE
  • SQL_TSI_HOUR
  • SQL_TSI_DAY
  • SQL_TSI_WEEK
  • SQL_TSI_QUARTER
  • SQL_TSI_YEAR
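
As another example, a filter that keeps only the last 7 days could be written as follows (a minimal sketch; Time."Date" is an assumed logical column, not one from a specific repository):

Time."Date" >= TIMESTAMPADD(SQL_TSI_DAY, -7, CURRENT_DATE)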

Monday, November 29, 2010

How to find out the referenced table and column name if the foreign key constraint name is given

The following query will give the desired result:

select
ac.table_name child_tab,
acc1.column_name child_col,
ac.constraint_name foreign_key,
acc.constraint_name primary_key,
acc.table_name parent_tab,
acc.column_name parent_col
from
all_constraints ac,
all_cons_columns acc,
all_cons_columns acc1
where
ac.owner=acc.owner
and acc.constraint_name=ac.r_constraint_name
and ac.constraint_name=acc1.constraint_name
and ac.constraint_name= :foreign_key_constraint_name

Sunday, November 28, 2010

Constraints

In Oracle, constraints are a facility to enforce rules to make sure that only allowable data values are stored in the database (i.e. to ensure data integrity in the database). All constraints have a name. The developer who defines the table or adds a constraint can name it at that time. However, if the name is not supplied, Oracle will assign a system-generated name that uniquely identifies the constraint. System-generated names are prefixed with SYS_C (for "system constraint") followed by a 7 digit integer.

Types of Constraints:

1. NOT NULL: If, as per business logic, a column or a set of columns in a table cannot allow NULL values, then a NOT NULL constraint can be used to enforce this rule.

e.g

alter table SALES modify (cust_id NOT NULL);

2. UNIQUE: If, as per business logic, a column or a set of columns in a table needs to store unique values, then a UNIQUE constraint can be used to enforce this rule.

e.g

create table test (col1 number UNIQUE);

UNIQUE constraints allow NULL values to be stored.

3. PRIMARY KEY : A Primary Key constraint is a combination of the NOT NULL and UNIQUE constraints. The column or the set of columns on which the Primary Key is defined will allow only unique and not null values. There can be one (and only one) primary key in an Oracle table.

e.g

create table test (col1 number PRIMARY KEY);

4. FOREIGN KEY:  It is frequently required that data in one table should be validated by comparing it to data in another table (e.g. if you add a new order in your ORDERS table, you must cross-check that a valid product corresponding to this order is present in your PRODUCTS table). To achieve this kind of data integrity, a foreign key constraint is used. This type of validation is also known as referential integrity. A foreign key constraint always makes reference to a Primary key or a unique constraint of another table. The table that has the foreign key defined is called the referencing table. The table that has the Primary key or Unique constraint defined is called the referenced table.

e.g

create table orders
(
  order_no number primary key,
  customer_name varchar2(10),
  prod_no number,
  constraint cons_prod_fk foreign key(prod_no) references product(prod_no)
);

> If you define the foreign key constraint with the 'ON DELETE CASCADE' option, then when rows are deleted from the referenced table, the corresponding rows are also deleted from the referencing table.

> NULL values are allowed in Foreign Key columns.
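
For example, the same foreign key could instead be declared with cascading deletes. This is a minimal sketch reusing the assumed product table from the example above:

create table orders
(
  order_no number primary key,
  customer_name varchar2(10),
  prod_no number,
  constraint cons_prod_fk foreign key(prod_no) references product(prod_no) on delete cascade
);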

5. CHECK:  Check constraints are used to enforce one or more conditions to be checked for the data row.

e.g

alter table customers add constraint customer_credit_limit CHECK (credit_limit <= 1000)

Some additional information regarding Constraints:

  • Constraint names can be found in the ALL_CONSTRAINTS dictionary view. The column names on which constraints are defined can be found in ALL_CONS_COLUMNS.

  • Constraints can, at any time, be either enabled or disabled. When you create, disable, or enable a constraint, you may specify some other information regarding how the constraint behaves. An enabled constraint can have two options, VALIDATE and NOVALIDATE. VALIDATE will validate the existing data in the table while NOVALIDATE will not validate the existing data after the constraint is enabled.

  • When you create or enable a Primary key or Unique constraint, Oracle will create a unique index on the columns of that constraint. Foreign key constraints do not enforce an automatic creation of an index. However, it is worthwhile to build an index on the columns of each foreign key constraint. Without an index on the corresponding columns in the child table, Oracle is forced to take out a table lock on the child while it performs the DELETE on the parent. If an index exists, Oracle uses it to identify and lock just the necessary rows in the child while the parent row is deleted.
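
Following up on the last two points, here is a short sketch using the orders table from the foreign key example above (the constraint and index names are just illustrative):

-- enable the constraint without validating rows already in the table
alter table orders enable novalidate constraint cons_prod_fk;

-- disable it again when needed
alter table orders disable constraint cons_prod_fk;

-- index the foreign key column so parent deletes do not lock the whole child table
create index idx_orders_prod_no on orders(prod_no);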

Friday, September 3, 2010

OBIEE Training for Beginners

Here is a PPT Presentation to introduce Beginners to OBIEE

OBIEE Session

Thursday, September 2, 2010

Simple things about Lookup Transformation

Q Define lookup transformation?

A lookup transformation is used to lookup data in a ‘data-pool’. This data-pool may be a flat-file, relational table, view or a synonym. You can also create a lookup definition from a source qualifier. The Integration Service queries the lookup source based on the lookup ports in the transformation and a lookup condition. The Lookup transformation returns the result of the lookup to the target or another transformation.

Lookups are generally used to get a related value, to perform a calculation using the derived related value or to update a slowly changing dimension. 

When you configure a flat file Lookup transformation for sorted input, the condition columns must be grouped. If the condition columns are not grouped, the Integration Service cannot cache the lookup and fails the session. For optimal caching performance, sort the condition columns. The Integration Service always caches flat file and pipeline lookups. If you configure a Lookup transformation to use a dynamic cache, you can use only the equality operator (=) in the lookup condition.

Q What are the differences between connected and unconnected lookups?

1. Connected Lookup uses a dynamic or static cache while unconnected lookup uses only a static cache.
2. Connected lookup can return multiple columns from the same row or insert into the dynamic lookup cache while unconnected lookup returns one column from each row.
3. Connected lookup supports user-defined default values while unconnected lookup does not support user-defined default values.

Q How can you return multiple ports from an unconnected lookup transformation?

An unconnected lookup transformation returns only 1 port. To return multiple ports, concatenate those ports in the overridden lookup query and return the concatenated port, then separate out those columns in an expression transformation.

Q How can you optimize a lookup transformation?

1. If you have privileges to modify the database containing a lookup table, you can improve lookup initialization time by adding an index to the lookup table.
2. You can improve performance by indexing the columns in the lookup ORDER BY.
3. By default, the Integration Service generates an ORDER BY clause for a cached lookup. The ORDER BY clause contains all lookup ports. To increase performance, you can suppress the default ORDER BY clause and enter an override ORDER BY with fewer columns. Place two dashes ‘--’ as a comment notation after the ORDER BY clause to suppress the ORDER BY clause that the Integration Service generates (a sketch of this follows the list).
4. If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance:
- Equal to (=)
- Less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=)
- Not equal to (!=)
5. Improve session performance by caching small lookup tables.
6. If the lookup table is on the same database as the source table in the mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.
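
A hedged sketch of the ORDER BY suppression mentioned in point 3 (the table and column names are assumptions, not from a real mapping):

-- Lookup SQL override: order by the condition column only; the trailing '--' comments out
-- the ORDER BY clause that the Integration Service appends after the override.
SELECT product_id, product_name
FROM   products
ORDER BY product_id --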

Wednesday, May 12, 2010

Oracle BULK Loader

SQL*Loader (sqlldr) is a bulk loader utility to load data from external files into an Oracle database. To use this utility, a control file is required, which specifies how data should be loaded into the database, and a data file, which contains the data to be loaded. A sample control file is as follows:

LOAD DATA
INFILE "datafile"
APPEND INTO TABLE "tablename"
FIELDS TERMINATED BY "separator"
("list of all attribute names to be loaded")
 
A sample data file can be of following form:
 
1,'Gaurav'
2, 'ABC'
3,'PQR'
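
A target table matching this sample data file might look as follows (a sketch; the table and column names are assumptions):

create table emp_names
(
  id   number,
  name varchar2(30)
);

The load could then be run with something like sqlldr userid=scott/tiger control=loader.ctl, where the user credentials and control file name are placeholders.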



Wednesday, March 3, 2010

Informatica Scenario - 5

Previous Scenarios

Scenario: A source table contains emp_name and salary columns. Develop an Informatica mapping to load all records with the 5th highest salary into the target table.

Solution:

The mapping will contain following transformations after the Source Qualifier Transformation:

1. Sorter : It will contain 2 ports - emp_name and salary. The property 'Direction' will be selected as 'Descending' on key 'Salary'

2. Expression transformation: It will have 6 ports as follows -
     a> emp_name : It will be an I/O port directly connected from the previous sorter transformation
     b> salary_prev : It will be a variable type port. Give any variable name, e.g. val, in its Expression column
     c> salary : It will be an I/O port directly connected from the previous transformation
     d> val : It will be a variable port. The expression column of this port will contain 'salary'
     e> rank : It will be a variable type port. The expression column will contain decode(salary,salary_prev,rank,rank+1)
     f> rank_o : It will be an output port containing the value of 'rank'.

3. Filter Transformation : It will have 2 I/O ports emp_name and salary with a filter condition rank_o = 5

The ports emp_name and salary from the Filter Transformation will be connected to the target.
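
For reference, a hedged SQL cross-check of the same ranking logic (assuming the source table is called emp_src):

select emp_name, salary
from
(
  select emp_name,
         salary,
         dense_rank() over (order by salary desc) rnk
  from emp_src
)
where rnk = 5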

Divisions

A table contains certain columns. One of the columns is Division. The user enters a division name through a prompt. Write a query to satisfy below scenario:

If the value entered in prompt is 'a' or 'b', then all records should be displayed else, the records pertaining to that particular division should be displayed.

Solution:

The query would be written as follows:

select * from my_table
where
division = &division
OR
'^' = decode(&division,'a','^','b','^',&division)

Here any special character (like '^' in the above query) can be chosen to return all records when the division value entered in the prompt is either 'a' or 'b'.

Monday, February 22, 2010

'LIKE' doesn't like UNDERSCORE

Write a query to return those values from an Oracle table where the column myword contains an UNDERSCORE ( _ ) sign in its values. e.g if the column 'myword' contains following values:

myword
----------
FAC_FAC
FACFAC
_FACE
FACE
FACE_

The output should be:

myword
----------

FAC_FAC
_FACE
FACE_

Solution:

Generally, most people will write the query as follows:

select myword from mytable where myword like '%_%'.

However, this query will not give the expected results. It will return all the values of column myword, because '_' is a single-character wildcard in LIKE and therefore matches any character.

The correct query will be as follows:

select myword from mytable where myword like '%\_%' escape '\'

We will have to escape the '_' character to exactly match it.

Thursday, February 18, 2010

Generate a Pyramid

You have table pyramid_table with 2 columns CH and NUM. It contains 1 row. The column CH contains a single character and column NUM contains a number.

Write a SQL query to Generate a pyramid of CH characters of height NUM e.g if CH is '*' and NUM is 5, the output should be :

       *
     ***
   *****
 *******
*********

Solution:

SELECT
lpad(' ',num-rownum,' ')||substr(lpad(ch,2*num,ch),1,2*rownum-1)||lpad(' ',num-rownum,' ') as pyramid
from
pyramid_table,all_objects
where
rownum<=num

Wednesday, February 17, 2010

Count of a Particular Character in a string

Write a query to count a particular character in a string, e.g. return the number of times 'N' appears in the string 'NITININ'. The result should be 3.

Solution:

select length('NITININ') - length(replace('NITININ','N','')) from dual

Count of Individual Characters in a String

Write a query to display the count of individual characters in a given string. e.g if the given string is "mesopotamia" , the output should be :

m  2
e  1
i  1
s  1
o 2
p 1
t  1
a  2

Solution:

select ind, count(ind)
from
(
select substr(given_string,rownum,1) ind
from
(
select 'mesopotamia' given_string from all_objects
)
where
rownum<=length(given_string)
)
group by ind

Sunday, February 14, 2010

Informatica Scenario based Questions and Solutions (Part-1)

Scenario1:

We have a source table containing 3 columns : Col1, Col2 and Col3. There is only 1 row in the table as follows:

Col1 Col2 Col3
-----------------
  a       b       c

There is a target table containing only 1 column Col. Design a mapping so that the target table contains 3 rows as follows:

Col
-----
a
b
c

Solution (without using a Normalizer transformation):

Create 3 expression transformations exp_1, exp_2 and exp_3 with 1 port each. Connect col1 from the Source Qualifier to the port in exp_1. Connect col2 from the Source Qualifier to the port in exp_2. Connect col3 from the Source Qualifier to the port in exp_3. Make 3 instances of the target. Connect the port from exp_1 to target_1. Connect the port from exp_2 to target_2 and connect the port from exp_3 to target_3.

Scenario 2:

There is a source table that contains duplicate rows. Design a mapping to load all the unique rows into 1 target and all the duplicate rows (only 1 occurrence) into another target.

Solution :

Bring all the columns from the source qualifier to an Aggregator transformation. Check group by on the key column. Create a new output port count_col in the aggregator transformation and write the expression count(key_column). Make a router transformation with 2 groups: Dup and Non-Dup. Set the router condition count_col>1 in the Dup group and count_col=1 in the Non-Dup group. Load these 2 groups into different targets.
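
For reference, a hedged SQL equivalent of the aggregator logic (src_table and key_column are assumed names):

select key_column, count(*) cnt
from src_table
group by key_column
having count(*) = 1   -- unique keys; use count(*) > 1 to get one row per duplicated key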

Scenario 3:

There is a source table containing 2 columns Col1 and Col2 with data as follows:

Col1   Col2
 a          l
 b         p
 a         m
 a         n
 b         q
 x          y

Design a mapping to load a target table with following values from the above mentioned source:

Col1    Col2
  a        l,m,n
  b       p,q
  x        y

Solution:

Use a sorter transformation after the source qualifier to sort the values with col1 as the key. Build an expression transformation with the following ports (the order of the ports should also be the same):

1. Col1_prev : It will be a variable type port. Expression should contain a variable e.g val
2. Col1 : It will be Input/Output port from Sorter transformation
3. Col2 : It will be input port from sorter transformation
4. val : It will be a variable type port. Expression should contain Col1
5. Concatenated_value: It will be a variable type port. Expression should be decode(Col1,Col1_prev,Concatenated_value||','||Col2,Col1)
6. Concatenated_Final : It will be an output port containing the value of Concatenated_value

After the expression, build an Aggregator Transformation. Bring ports Col1 and Concatenated_Final into the aggregator. Group by Col1. Don't give any expression. This effectively returns the last row from each group.

Connect the ports Col1 and Concatenated_Final from aggregator to the target table.

Scenario 4:

Design an Informatica mapping to load first half records to 1 target while other half records to a separate target.

Solution:

You will have to assign a row number to each record. To achieve this, either use Oracle's pseudo column rownum in the Source Qualifier query or use the NEXTVAL port of a Sequence Generator. Let's name this column rownumber.

From Source Qualifier, create 2 pipelines:

First Pipeline:
Carry first port Col1 from SQ transformation into an aggregator transformation. Create a new output port "tot_rec" and give the expression as COUNT(Col1). Do not group by any port. This will give us the total number of records in Source Table. Carry this port tot_rec to an Expression Transformation. Add another port DUMMY in expression transformation with default value 1.

Second Pipeline:
From the SQ transformation, carry all the ports (including the additional port rownumber generated by rownum or the sequence generator) to an Expression Transformation. Add another port DUMMY in the expression transformation with default value 1.

Join these 2 pipelines with a Joiner Transformation on the common port DUMMY. Carry all the source table ports and the 2 additional ports tot_rec and rownumber to a Router transformation. Add 2 groups in the Router : FIRST_HALF and SECOND_HALF. Give the condition rownumber<=tot_rec/2 in FIRST_HALF and rownumber>tot_rec/2 in SECOND_HALF. Connect the 2 groups to 2 different targets.

Wednesday, February 3, 2010

Performance Tuning in Informatica

Performance Tuning is done to improve session performance. To tune a session, performance bottlenecks are identified and eliminated. The bottlenecks should be looked for in the following places, in the given order:

1. Target
2. Source
3. Mapping
4. Session
5. System

General Techniques to identify bottlenecks:

1. Thread statistics should be analysed to identify source, target, or transformation bottlenecks. By default, the Integration Service uses 1 reader thread, 1 transformation thread and 1 writer thread to process a session. The thread with the highest busy percentage identifies the bottleneck.

2. Add a filter transformation after each source qualifier. Set the filter condition to false so that no data is passed from the filter. If the time to run the session remains about the same, there is a source bottleneck.

3. Add a filter transformation before each target definition. Set the filter condition to false so that no data is loaded into the target. If the time taken to run the session is almost same as before, there is a mapping bottleneck.

 General Techniques to improve session performance:

1. If a target bottleneck is identified, consider dropping key constraints and indexes on the target tables before the session is run. Rebuild those constraints and indexes after the session run.

2. Use bulk loading in case of insertion of large amounts of data. When bulk loading, the Integration Service bypasses the database log, which speeds up performance. But there is a trade-off here: without writing to the database log, the target database cannot perform a rollback.

3. If multiple source tables are joined in one source qualifier, optimize the query by using hints and/or adding indexes on the joined columns, GROUP BY or ORDER BY columns. Configure the database to run parallel queries to improve performance.

4. Reduce unnecessary transformations in the mapping and delete unnecessary links between transformations to optimize the mapping.

5. If the session reads from a flat file source, the performance can be improved by setting the number of bytes the Integration Service reads per line. By default, the Integration Service reads 1024 bytes per line. If each line in the source file is less than the default setting, the 'line sequential buffer length' should be decreased in the session properties.

6. Use filters as early as possible in the mapping.

7. Eliminate unnecessary datatype conversions. Try to use integer values in place of other datatypes when performing comparisons using lookup and filter transformations.

8. The Integration Service reads expressions written with operators faster than expressions with functions. Wherever possible, operators should be used, e.g. in place of the CONCAT function use the || operator.

9. When Aggregator Transformation is used, following guidelines should be used -

  • The group by should be done on simple columns. If possible, number type columns should be preferred for GROUP BY.

  • Use sorted input. When the sorted input option is used, the Integration Service assumes that all data is sorted by group. It does not wait for the whole data set to come into memory for aggregation.

  • Use incremental aggregation if the changes in the source affect less than half the target. In this case, the Integration Service updates the target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.

  • Filter data before aggregation. This will avoid unnecessary aggregation.

  • Unnecessary port connections should be avoided to reduce the amount of data the aggregator transformation stores in the data cache.
10. When using Joiner Transformation, following things should be considered:

  • In case of large data sets, configure joiner transformation to use sorted input. In this case, the Integration service improves performance by minimizing disk input/output. Also, designate the master source as the source with fewer duplicate key values. When Integration Service processes a sorted joiner transformation, it caches rows for 100 unique keys at a time. If the master source contains many rows with the same key value, the Integration Service needs to cache more rows and hence performance can be slowed.

  • Designate the master source as the source with fewer rows. This is because during a session run, the joiner transformation compares the value of each row of the detail source against the master source. The fewer rows in the master, the fewer iterations of the join comparison occur, which speeds up the join process.
  • Perform the join in the database or in a source qualifier transformation if possible.
11. When using Lookup Transformation, following things should be considered:

  • If you have privileges to modify the database containing a lookup table, you can improve lookup initialization time by adding an index to the lookup table.
  • You can improve performance by indexing the columns in the lookup ORDER BY.
  • By default, the Integration Service generates an ORDER BY clause for a cached lookup. The ORDER BY clause contains all lookup ports. To increase performance, you can suppress the default ORDER BY clause and enter an override ORDER BY with fewer columns. Place two dashes ‘--’ as a comment notation after the ORDER BY clause to suppress the ORDER BY clause that the Integration Service generates.
  • If you include more than one lookup condition, place the conditions in the following order to optimize lookup performance:- Equal to (=),Less than (<), greater than (>), less than or equal to (<=), greater than or equal to (>=), Not equal to (!=)
  • Improve session performance by caching small lookup tables.
  • If the lookup table is on the same database as the source table in the mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation.
12. If possible, run sessions and workflows concurrently to improve performance.

13. Increase the commit interval. By doing so, the number of times the Integration Service commits data decreases, and hence performance is gained. However, if the commit interval is too high, the Integration Service will fill the database log file and cause the session to fail.

14. If the session runs with the 'high precision' setting enabled, then disabling this property may result in a performance gain.

15. Set the proper error tracing level in session. e.g Verbose Data setting should only be used for debugging mappings and should be avoided in the normal run.

16. Use pipeline partitioning to improve session performance. Increasing the number of partitions or partition points increases the number of threads. If the Integration Service nodes contain enough CPU bandwidth, this approach can significantly improve performance.

Friday, January 29, 2010

Informatica Basics and Interview FAQs

Q1 What is Informatica Powercenter?
Ans PowerCenter is data integration software from Informatica Corporation which provides an environment that allows you to load data into a centralized location, such as a data warehouse. Data can be extracted from multiple sources, transformed according to the business logic, and loaded into files and relational targets. It has the following components:
PowerCenter Domain
PowerCenter Repository
Administration Console
PowerCenter Client
Repository Service
Integration service
Web Services Hub
Data Analyzer
Metadata Manager
PowerCenter Repository Reports



Q2 What is Data Integration?
Ans Data Integration is the process of combining data residing at different sources and providing the user with a unified view of the data.


Q3 Explain PowerCenter Repository?
Ans The repository consists of database tables that store metadata. Metadata describes different types of objects, such as mappings or transformations, that you can create using the PowerCenter Client tools. The Integration Service uses repository objects to extract, transform and load data. The repository also stores administrative information such as user names, passwords, permissions and privileges. When any task is performed through a PowerCenter Client application, such as creating users, analyzing sources, developing mappings or mapplets, or creating workflows, metadata is added to the repository tables.
Q4. What is a Mapping?
Ans A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. When the Integration Service runs a session, it uses the instructions configured in the mapping to read, transform, and write data.
Q5. What is a mapplet?
Ans A mapplet is a reusable object that contains a set of transformations and enables to reuse that transformation logic in multiple mappings.
Q6. What is Transformation?
Ans Transformation is a repository object that generates, modifies, or passes data. Transformations in a mapping represent the operations the Integration Service performs on the data. Data passes through transformation ports that are linked in a mapping or mapplet.

Q7. What are Mapping Parameters and Variables? What's the difference between them?

Mapping parameters and variables are used to make mappings more flexible.
A mapping parameter represents a constant value that can be defined before running a session. It retains the same value throughout the session. Using a parameter file, this value can be changed for subsequent sessions.

A mapping variable represents a value that can change through sessions. The Integration Service saves the value of a mapping variable to the repository at the end of each successful run and uses that value in the next run of the session.

Q8 What happens when you do not group values in an aggregator transformation?
When the values are not grouped in an aggregator transformation, the Integration Service returns 1 row for all input rows, typically the last row received, with the result of the aggregation. However, if you specify a particular row to be returned (e.g. through the FIRST function), then that row is returned.

Q9 How does using sorted input improves the performance of Aggregator Transformation?
When sorted input is used, the Integration Service assumes that all data is sorted by group and it performs aggregate calculations as it reads rows for a group. It does not wait for the whole data set, which reduces the amount of data cached during the session and improves session performance. When unsorted input is used, the Integration Service waits for the whole data set and only then performs the aggregation.

Sorted Input should not be used when either of the following conditions are true:
  •  The aggregate expression uses nested aggregate functions
  •  The session uses incremental aggregation
Q10 How is a join in a Joiner transformation different from a normal SQL join?

The Joiner transformation join can be done on heterogeneous sources, but an SQL join can be done only on tables.

Q11 What are the criteria for deciding Master and Detail sources for a joiner transformation?
  • The master pipeline ends at the joiner transformation while the detail pipeline continues to the target. So, accordingly the sources should be decided as master or detail
  • For optimal performance, if an unsorted joiner transformation is used, then designate the source with fewer rows as the master source. During a session run, the joiner transformation compares each row of the master source against the detail source.
  • For a sorted joiner transformation, designate the source with fewer duplicate key values as master.
Q12  Explain Informatica Domain, node and services

The PowerCenter domain is the fundamental administrative unit in PowerCenter. The domain supports the administration of the distributed services. A domain is a collection of nodes and services that you can group in folders based on administration ownership. A domain can run a single version of application services or multiple versions of application services. A domain that can run multiple versions of application services is called a mixed-version domain.

A node is the logical representation of a machine in a domain. One node in the domain acts as a gateway to receive service requests from clients and route them to the appropriate service and node. Services and processes run on nodes in a domain. The availability of a service or process on a node depends on how you configure the service and the node. A node can be a gateway node or a worker node. A gateway node is any node you configure to serve as a gateway for the domain. One node acts as the gateway at any given time. That node is called the master gateway. A gateway node can run application services, and it can serve as a master gateway node. The master gateway node is the entry point to the domain. The Service Manager on the master gateway node performs all domain operations on the master gateway node. The Service Managers running on other gateway nodes perform limited domain operations on those nodes.


Services for the domain include the Service Manager and a set of application services:
  • Service Manager. A service that manages all domain operations. It runs the application services and performs domain functions on each node in the domain. Some domain functions include authentication, authorization, and logging. When you start Informatica Services, you start the Service Manager.
  • Application services. Services that represent PowerCenter server-based functionality, such as the Repository Service and the Integration Service. The application services that run on a node depend on the way you configure the services. In a mixed-version domain, multiple versions of application services can run in the same domain.

The Service Manager and application services control PowerCenter security. The Service Manager manages users and groups that can log in to PowerCenter applications and authenticates the users who log in to PowerCenter applications. The Service Manager and application services authorize user requests from PowerCenter applications. 

The PowerCenter Administration Console consolidates the administrative tasks for domain objects such as services, nodes, licenses, and grids and for users, groups, and roles. You manage the domain and the security of the domain through the Administration Console.



----- To be continued











Thursday, January 28, 2010

Concatenate the rows

You have table named SHOW_TABLE having 2 columns SHOW_ID and EMP_NAME having data as follows:

SHOW_ID           EMP_NAME
   1                      Gaurav
   2                      Kalpana
   1                      Ashok
   2                      Manish

Write a query to return data as follows:

SHOW_ID                EMP_NAME
     1                     Gaurav,Ashok
     2                     Kalpana, Manish


Solution:

select show_id,rn, ltrim(sys_connect_by_path (emp_name,','),',') concatenated
from
(
select
show_id,
emp_name,
row_number() over (partition by show_id order by show_id )rn,
count(*) over (partition by show_id) cnt from
(
select distinct show_id, emp_name from show_table
))
where
rn=cnt
start with rn=1
CONNECT BY prior show_id=show_id and PRIOR rn = rn -1
order by show_id,length(concatenated) desc

Friday, January 22, 2010

SQL, PL/SQL and DataWarehousing Interview Questions

Q1. What are collections? What are the advantages of using Collections?

PL/SQL collection types let you declare high-level datatypes similar to arrays, sets, and hash tables found in other languages. In PL/SQL, array types are known as varrays (short for variable-size arrays), set types are known as nested tables, and hash table types are known as associative arrays. Each kind of collection is an ordered group of elements, all of the same type. Each element has a unique subscript that determines its position in the collection. When declaring collections, you use a TYPE definition. The advantage of collections is that they can be passed as parameters, so that subprograms can process arbitrary numbers of elements. You can use collections to move data into and out of database tables using high-performance language features known as bulk SQL.
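
A minimal PL/SQL sketch of one collection type, a nested table (the names are illustrative):

DECLARE
  TYPE name_list IS TABLE OF VARCHAR2(30);
  v_names name_list := name_list('Scott', 'Adams', 'Jones');
BEGIN
  -- loop over the collection and print each element
  FOR i IN 1 .. v_names.COUNT LOOP
    DBMS_OUTPUT.PUT_LINE(v_names(i));
  END LOOP;
END;
/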

Q2. Tell some new features of Oracle 10g from a developer's point of view.


1. UNIX-style regular expressions can be used while performing queries and string manipulations. You use the REGEXP_LIKE operator in SQL queries, and the REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR functions anywhere you would use INSTR, REPLACE, and SUBSTR.

2. Flashback query functions were introduced. The functions SCN_TO_TIMESTAMP and TIMESTAMP_TO_SCN let you translate between a date and time and the system change number that represents the database state at a point in time.

3. Nested tables defined in PL/SQL have many more operations than previously. You can compare nested tables for equality, test whether an element is a member of a nested table, test whether one nested table is a subset of another, perform set operations such as union and intersection, and much more.

4. New datatypes BINARY_FLOAT and BINARY_DOUBLE represent floating-point numbers in IEEE 754 format. These types are useful for scientific computation where you exchange data with other programs and languages that use the IEEE 754 standard for floating-point.

5. Starting with Oracle 10g release 1, the BINARY_INTEGER datatype was changed to be identical to PLS_INTEGER, so the datatypes can be used interchangeably.

Q3. What is high water mark?

Each table's data is stored in its own data segment. The high water mark is the boundary between used and unused space in a segment. When rows are deleted from a table, the high water mark remains unchanged, but when a table is truncated, the high water mark drops.

Q4. What are mutating tables?

Mutating means changing. A mutating table is a table that is currently being modified by an update, delete, or insert statement. When a trigger tries to reference a table that is in a state of flux (being changed), it is considered mutating and raises an error, since Oracle should never return inconsistent data.


Q5 How does PL/SQL offer performance benefits over SQL?

Without PL/SQL, Oracle must process SQL statements one at a time. Programs that
issue many SQL statements require multiple calls to the database, resulting in significant network and performance overhead. With PL/SQL, an entire block of statements can be sent to Oracle at one time. This can drastically reduce network traffic between the database and an application. You can use PL/SQL blocks and subprograms to group SQL statements before sending them to the database for execution.

Q6 Define %TYPE and %ROWTYPE


Ans The %TYPE attribute provides the datatype of a variable or database column. This is particularly useful when declaring variables that will hold database values. For example, assume there is a column named last_name in a table named employees. To declare a variable named v_last_name that has the same datatype as the column last_name, use dot notation and the %TYPE attribute, as follows:

v_last_name employees.last_name%TYPE;

Declaring v_last_name with %TYPE has two advantages. First, you need not know the exact datatype of last_name. Second, if you change the database definition of last_name, perhaps to make it a longer character string, the datatype of v_last_name changes accordingly at run time.

%ROWTYPE

In PL/SQL, records are used to group data. A record consists of a number of related fields in which data values can be stored. The %ROWTYPE attribute provides a record type that represents a row in a table. The record can store an entire row of data selected from the table or fetched from a cursor or cursor variable.
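
A short sketch showing both attributes, assuming the employees table mentioned above also has an employee_id column:

DECLARE
  v_last_name employees.last_name%TYPE;   -- same datatype as the column
  v_emp_row   employees%ROWTYPE;          -- one field per column of the table
BEGIN
  SELECT * INTO v_emp_row FROM employees WHERE employee_id = 100;
  v_last_name := v_emp_row.last_name;
END;
/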

Q7. Define NOT NULL constraint

Ans: A NOT NULL constraint puts a constraint on the value of a variable: it can't be NULL. If NULL is assigned to such a variable, PL/SQL raises the predefined exception VALUE_ERROR.

Q8 Whats the difference between RANK() and DENSE_RANK() functions?
Ans RANK() will leave a gap in the ranking sequence when there are ties, e.g. rather than 1,2,2,3, RANK() will return 1,2,2,4. DENSE_RANK() leaves no gaps and would return 1,2,2,3.
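
A hedged illustration against the emp table used elsewhere in this post (the sal column is an assumption):

select ename,
       sal,
       rank()       over (order by sal desc) rnk,        -- gaps after ties
       dense_rank() over (order by sal desc) dense_rnk   -- no gaps
from emp;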

Q9 Whats the difference between CASE and DECODE?

DECODE works only with the equality operator. CASE can handle inequality operators as well, e.g. <, >, <=, >=, !=.
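
For example, a small sketch against the same assumed emp table (deptno and sal are assumed columns):

select ename,
       decode(deptno, 10, 'ACCOUNTING', 'OTHER') dept_decode,        -- DECODE: equality checks only
       case when sal >= 3000 then 'HIGH' else 'LOW' end sal_band     -- CASE: inequality is fine
from emp;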


Q10 Take us through the life-cycle of an SQL query when fired against the database.


These are the steps that occur during the execution of an SQL query:

  • A cursor is established to hold the context of the SQL statement. A cursor is a connection to a specific area in the Program Global Area (PGA) that contains information about the specific SQL statement.
  • The SQL statement is now parsed. Parsing is the process of checking an SQL statement to ensure that it is valid (syntax checking), as well as creating an optimal execution plan for the statement.
  • Preparation for the return of result set is done.
  • Bind variables are supplied with run time values.
  • The final statement is now executed using the execution plan that has been prepared.
  • Oracle returns the result set.

Q11 What is an Oracle Instance?


An Oracle instance is a software service that acts as an intermediary between application requests and the Oracle database. It is a set of background processes and shared memory. An Oracle instance is either started as part of the process of booting a server or can be started explicitly with commands. There are 3 steps in the start-up process:

  • Starting the instance process itself
  • Mounting the database, which consists of opening the control files for the instance
  • Opening the database which makes the database available for user requests.
An instance can mount and open only a single database ever. However, a database can be mounted and opened by one or more instances (using Real Application Clusters).



(I like to imagine an instance as a 'database armed and ready for action'.)


Q12 Define SGA and PGA


These are memory areas used by an Individual oracle instance.


SGA:
System Global Area is an area of memory that is accessible to all user processes of an Oracle instance. The 3 main areas used by the SGA are:

  • Redo log buffer which holds information used for recovery until the information can be written to the redo log
  • The shared pool which holds information that can be shared across user processes such as execution plans for SQL statements, compiled stored procedures and information retrieved from the Oracle data dictionary.
  • The database buffer pools which are used to hold data blocks.
PGA:
Program Global Area is an area of memory that is available only to a single server process. The PGA contains items like user variables and cursor information for an individual user's SQL statement, such as the number of rows that have been retrieved so far. The PGA is also used for sorting data for an individual user process.

Q13 What is hard and soft parsing?

During parsing, after validating an SQL statement, the Oracle database applies a hash algorithm to check whether a version of the statement exists in the shared pool of the SGA. If Oracle cannot find the statement in the shared pool, the instance prepares the execution plan for it. This is called hard parsing. If Oracle finds the statement in the shared pool, it simply retrieves the execution plan for the statement. This is called soft parsing. Performance is improved by increasing the chances of soft parsing.

Q14 What are bind variables and how they improve performance of a SQL code?

A bind variable is a value in an SQL statement that is not known until the statement is run. In an SQL statement, the bind variable acts as a placeholder for this subsequent data. In Oracle SQL syntax, a bind variable is indicated by a colon (:) before the variable name.

e.g Select ename from emp where emp_id = :n_empid

The use of bind variables helps Oracle to do soft parsing of the SQL statements which enhances performance. 

Consider following 2 statements :

Select ename from emp where emp_id = 7

Select ename from emp where emp_id = 5

Both these statements should use identical optimizer plans. If Oracle finds the execution plan for the first statement, it should use the same for the second. However, it will not, since the hash is computed over the literal text of the SQL statement and the two texts differ. Using bind variables, this problem can be resolved.

Select ename from emp where emp_id = :n_empid

Using this syntax, the execution plan can be retrieved for n number of such statements.

Q15 What is the difference between 'DELETE' and 'TRUNCATE' command?

DELETE is used to delete rows from a table. However, it retains the deleted rows as undo (rollback) information, and if ROLLBACK is issued, the rows are restored back into the table. Only when COMMIT is executed after the DELETE are the rows permanently deleted.

TRUNCATE requires no COMMIT command to delete the rows. As soon as TRUNCATE is fired, the rows are permanently deleted. That's why it is also classified as a DDL statement. TRUNCATE causes the movement of the High Water Mark.

TRUNCATE is faster than DELETE since TRUNCATE does not need to retain the deleted rows for rollback.
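
A minimal sketch of the difference, assuming a scratch copy of the emp table:

delete from emp where sal > 5000;
rollback;              -- the deleted rows are restored

truncate table emp;    -- DDL: no rollback is possible, and the high water mark is reset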


Q16 What is a surrogate key? How it is different from a natural key? What are the advantages of a surrogate key?


A surrogate key (also known as an artificial or identity key) is a system-generated key with no business value. A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. Data warehouses typically use surrogate keys for the dimension tables' primary keys.

A natural key is a value that has a business meaning and that people use, e.g. SSN, Customer Id.

Advantages of Surrogate Keys:
  • A natural primary key (i.e. Customer Number in a Customer table) can change, and this makes updates more difficult. And in many cases the natural primary key is a combination of more than one column. In such cases, the usage of a surrogate key simplifies the design.
  • Surrogate keys are numeric and hence indexing on them is faster.
  • Proper tracking of slowly changing dimensions, e.g. an employee E1 belongs to business unit BU1 on 1st June 2009 but migrates to business unit BU2 on 1st November 2009. If the natural key Employee_Id is used, every other attribute will now be moved to BU2. In this case, if a surrogate key is used, a new record can be created for E1 to show the new data that belongs to BU2.
Q17 What are external tables?

External Tables are a way to query data in a flat file as if the file were an Oracle table. Thus, they provide a convenient way to move data into/out of the database. No DML operations are permitted on External Tables (they can only be created and queried). Hence, we cannot create indexes on External Tables.

Internally, an External Table uses the ORACLE_LOADER access driver to move data from a flat file into the database. It uses the ORACLE_DATAPUMP access driver to move data out of the database into a file.
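
A minimal sketch of an external table definition (it assumes a directory object data_dir pointing to the folder that holds a file emp.csv):

create table emp_ext
(
  emp_id   number,
  emp_name varchar2(50)
)
organization external
(
  type oracle_loader
  default directory data_dir
  access parameters
  (
    records delimited by newline
    fields terminated by ','
  )
  location ('emp.csv')
);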