Information Storage and Retrieval: January 2014


Saturday, January 25, 2014

Statistics: Random Notes

Statistics:
Statistics is the science of collecting, organizing, summarizing, analyzing, and interpreting data.

Quantitative variables take numerical values whose "size" is meaningful. Quantitative variables answer questions such as "how many?" or "how much?" For example, it makes sense to add, to subtract, and to compare two persons' weights, or two families' incomes: These are quantitative variables. Quantitative variables typically have measurement units, such as pounds, dollars, years, volts, gallons, megabytes, inches, degrees, miles per hour, pounds per square inch, BTUs, and so on.

Some variables, such as social security numbers and zip codes, take numerical values but are not quantitative: they are qualitative or categorical variables. The sum of two zip codes or social security numbers is not meaningful, nor is the average of a list of zip codes. Qualitative and categorical variables typically do not have units. Qualitative or categorical variables—such as gender, hair color, or ethnicity—group individuals. They have neither a "size" nor, typically, a natural ordering to their values; they answer questions such as "which kind?" The values categorical and qualitative variables take are typically adjectives (for example, green, female, or tall). Arithmetic with qualitative variables usually does not make sense, even if the variables take numerical values. Categorical variables divide individuals into categories, such as gender, ethnicity, age group, or whether or not the individual finished high school.

Distribution: The pattern of values in the data, showing their frequency of occurrence relative to each other.

Model: A model is a formula in which one variable (the response) varies depending on one or more independent variables (covariates). One of the simplest models is a linear model, where we start with the assumption that the dependent variable varies linearly with the independent variable(s). Creating a linear model involves a technique known as linear regression.

Histogram:
A histogram is a figure that shows how a quantitative variable is distributed over all its values. It allows the variable to be "binned" into unequal intervals.
In a histogram, the area of a bar represents the percentage of data in the interval.
Height of bar = % in interval / (right endpoint – left endpoint)
Heights measure density, or "crowdedness," in the interval.
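A quick Python sketch of the density formula above, using made-up data and bin edges:

```python
# Sketch: histogram bar heights as densities for unequal bins.
# Data and bin edges are made-up illustrations.
data = [2, 5, 7, 8, 11, 14, 18, 25, 31, 44]
edges = [0, 10, 20, 50]  # unequal intervals: [0,10), [10,20), [20,50)

n = len(data)
for left, right in zip(edges, edges[1:]):
    count = sum(left <= x < right for x in data)
    percent = 100.0 * count / n
    height = percent / (right - left)  # height = % in interval / width
    print(f"[{left}, {right}): {percent:.0f}% of data, height {height:.2f}")
```

Note that the wide [20, 50) bin holds the same percentage of data as [10, 20) but gets a lower bar, because the percentage is spread over a wider interval.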

Measures of Location:
Median: the half-way point of the data. The median is the number that divides the (ordered) data in half—the smallest number that is at least as big as half the data. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median.
Mode: the value with the highest frequency.
Mean: the mean (more precisely, the arithmetic mean) is commonly called the average. It is the sum of the data divided by the number of data.

For qualitative and categorical data, the mode makes sense, but the mean and median do not.
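The three measures of location can be computed with Python's standard statistics module (one caveat: for even-length lists, Python's median averages the two middle values, which differs slightly from the "smallest number at least as big as half the data" definition above):

```python
# Sketch: mean, median, and mode for a small made-up list.
from statistics import mean, median, mode

data = [3, 1, 4, 1, 5, 9, 2, 6, 1]
print(mean(data))    # sum of the data divided by the number of data
print(median(data))  # half-way point of the sorted data -> 3
print(mode(data))    # value with the highest frequency -> 1
```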

The pth percentile of a list of numbers is the smallest number that is at least as large as p% of the list.
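The percentile definition above translates directly into code; a minimal sketch (the helper name is my own):

```python
# Sketch of the definition: the smallest entry that is at least as
# large as p% of the list.
import math

def percentile(values, p):
    s = sorted(values)
    # index of the smallest entry with at least p% of entries <= it
    k = max(1, math.ceil(p / 100.0 * len(s)))
    return s[k - 1]

data = [10, 20, 30, 40, 50]
print(percentile(data, 50))  # -> 30 (3 of 5 entries <= 30, i.e. 60% >= 50%)
print(percentile(data, 90))  # -> 50
```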
A bar chart displays two-dimensional data (a value for each category), while a histogram displays the distribution of a single (one-dimensional) variable.

An outlier is a data point that lies outside the general range of the data.
Markov’s inequality: if a list has only non-negative entries, then the proportion of entries that are at least as large as k times the average is at most 1/k.
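A quick numeric check of Markov's inequality on a made-up non-negative list:

```python
# Sketch: checking Markov's inequality on made-up non-negative data.
data = [0, 1, 1, 2, 3, 5, 8, 20]
avg = sum(data) / len(data)  # average = 5.0
k = 2
prop = sum(x >= k * avg for x in data) / len(data)
print(prop, "<=", 1 / k)  # proportion of entries >= 2*average is at most 1/2
```

Here only one entry (20) is at least 2 × 5.0 = 10, so the proportion is 1/8, comfortably below the 1/2 bound.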
Variance: the mean of the squared deviations from the average.
Standard deviation (SD): the square root of the variance. It measures roughly how far off the entries are from their average. Since it is a measure of spread, it cannot be negative.
When you add a constant to a list of values, the average increases by that constant, but the SD does not change. If you multiply by a positive constant, the new average and new SD are both multiplied by that constant.
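The shift-and-scale behavior is easy to verify numerically; a sketch with made-up values (population SD, i.e. dividing by n):

```python
# Sketch: adding a constant shifts the average but not the SD;
# multiplying by a positive constant scales both.
def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5  # population SD

data = [2, 4, 6, 8]
shifted = [x + 10 for x in data]
scaled = [3 * x for x in data]

print(mean(data), sd(data))        # average 5.0
print(mean(shifted), sd(shifted))  # average up by 10, SD unchanged
print(mean(scaled), sd(scaled))    # average and SD both tripled
```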
Correlation coefficient (r): a number between -1 and 1. It measures linear association, i.e., how tightly the points are clustered about a straight line.
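One common way to compute r is as the average of the products of the two variables in standard units; a sketch (assuming population SDs, with a made-up helper):

```python
# Sketch: correlation coefficient as the average product of values
# converted to standard units (population SDs assumed).
def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]      # perfectly linear, so r should be 1
print(corr(xs, ys))
print(corr(xs, ys[::-1]))  # reversed: perfectly linear downward, r = -1
```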



Sunday, January 19, 2014

What is the difference between an Oracle User and an Oracle Schema?

An Oracle schema is a collection of database objects like tables, functions, indexes, etc. A user "owns" a schema. When a user is created, it is given access to a schema with the same name. A user can also be given access to objects in other schemas.

For all intents and purposes you can consider a user to be a schema and a schema to be a user.

Monday, January 6, 2014

Nested Subquery and Correlated Subquery

A subquery is a query used in the WHERE clause of another query. A subquery is nested when it appears in the WHERE or HAVING clause of another subquery.

e.g

select Firstname, Lastname
from student
where studentid in (select studentid from studentcourse
                    where courseid in (select courseid from course
                                       where coursename = 'Oracle'))

In this case, the innermost query is executed first, and its results are passed to the outer query.

A correlated subquery is the reverse of a nested subquery. In a correlated subquery, the outer query is executed first, and for each row of the outer query, the inner query is executed.

select Coursename, Courseadminid,
       (select student_name from student
        where studentid = Course.courseadminid) as CourseAdminName
from course




Sunday, January 5, 2014

Cursor and Ref Cursor

A cursor is a name for a structure in memory called a private SQL area, which the server allocates at run time for each SQL statement. Whenever an SQL statement is executed, the Oracle database server automatically allocates a cursor that holds the context of that SQL statement (information such as the variables used, the memory addresses of those variables, the parsed version of the SQL statement, etc.). This is known as an implicit cursor.

Cursors can also be defined explicitly to handle more than one row in a program. They are declared, opened, fetched from, and closed in a PL/SQL program, and can be used for static SQL queries.

e.g
DECLARE
  emp_name varchar2(15);
  CURSOR my_cur IS SELECT name FROM emp;
BEGIN
  OPEN my_cur;
  LOOP
    FETCH my_cur INTO emp_name;
    EXIT WHEN my_cur%NOTFOUND;
    dbms_output.put_line('The employee name is: ' || emp_name);
  END LOOP;
  CLOSE my_cur;
END;
/

A REF CURSOR is used in the case of dynamic SQL queries. A REF CURSOR is not a cursor. Instead, it is a data type that, at runtime, holds a pointer to a place in memory where a cursor really lives. So basically, it is a pointer to a cursor, or a reference to a cursor. It is declared as follows:

DECLARE
  TYPE refcur IS REF CURSOR;
  my_ref refcur;
  emp_name varchar2(15);
BEGIN
  OPEN my_ref FOR 'select name from emp';
  LOOP
    FETCH my_ref INTO emp_name;
    EXIT WHEN my_ref%NOTFOUND;
    dbms_output.put_line('The employee name is: ' || emp_name);
  END LOOP;
  CLOSE my_ref;
END;
/

OBIEE11g Security

OBIEE11g security can be divided into 2 broad categories:

1. Controlling access to the components within the BI domain (resource access security)
2. Controlling access to data (data access security)

Oracle BI integrates with Fusion Middleware's security platform:

Weblogic Console - management of users and groups for the embedded LDAP server that serves as the default identity store

Enterprise Manager - Management of policy store application roles that grant permissions to users, groups and other application roles

BI Admin tool - management of permissions for Presentation layer objects and business model objects in the repository.

OBIEE11g default security model includes:

1. An embedded directory server functioning as an identity store designed to hold all user and group definitions required to control authentication.

2. A file-based policy store designed to hold the application role and permission grant mappings to users and groups required to control authorization. The policy store is a physical file that contains application roles, application policies, and their mappings to users and groups. The file name is system-jazn-data.xml, and the file exists under [MIDDLEWARE_HOME]\user_projects\bifoundation_domain\config\fmwconfig.


3. A file-based credential store designed to hold all user and system credentials required to control authentication or authorization.

Groups are logically organized sets of users. Users with similar needs are organized into groups for easier maintenance. Groups are then mapped to application roles to grant rights. By default, 3 groups are created: BIAdministrators, BIAuthors, and BIConsumers.

An application role defines the set of permissions granted to a user or a group. Application roles are defined in Fusion Middleware Control. The policy store contains the default application role hierarchy. The default application roles are:
BISystem: grants the permissions necessary to impersonate other users. This role is required by BI system components for inter-component communication.
BIAdministrator: grants the administrative permissions necessary to configure and manage an OBIEE installation. Any member of the BIAdministrators group is explicitly granted this role and implicitly granted the BIAuthor and BIConsumer roles.
BIAuthor: grants the permissions necessary to create and edit content. The BIAuthors group is explicitly mapped to this role.
BIConsumer: grants the permissions necessary for a user to use content created by other users.

Application roles live in the policy store, while groups live in the identity store.

Application Policies are the authorization policies that an application uses for controlling access to its resources. They are defined in Fusion Middleware Control.

Object-level permissions can be set in the RPD and at the Presentation Services (catalog) layer. In the RPD, navigate to Manage > Identity > BI Repository and click the Application Roles tab. Here you can double-click any application role and set permissions on a whole subject area, or on a particular object within a subject area. Alternatively, you can go to the Presentation layer, double-click any object, and set permissions for it.

At the Presentation Services layer, open the analytics link and go to Administration > "Manage Privileges" to set various permissions for different objects. You can also edit the properties of individual reports or dashboards and assign privileges there.

Row-level permissions mean that a group of users has access to the same reports, but each user sees a different data set depending on the permissions that user is assigned. To set up row-level security, the following steps need to be executed:

1. Identify the key information that needs to be filtered and assign it to a user login name. E.g., for clinical data, if we want to filter studies for different users, we have to identify the study numbers allowed for each user login name. Store this information in a database table (a security table) so that you know which study numbers belong to which user login.
2. Create a "row-wise initialized" session variable that will store the study numbers for each user that logs in. (We are assuming that one user can be assigned multiple study numbers; that is why we have to use a row-wise initialized variable.) The initialization block query could be of this form:

select 'STUDY_NUMBER', study_number from security_table where login = ':USER'

This will create a session variable named STUDY_NUMBER that holds the set of study numbers for the particular user that has logged in.

3. Go to Admin Tool > Manage > Identity and click the Application Roles tab. Click the application role where you want to apply the row-level filter. If you want to apply the filter for all users, select BIConsumer. Click Permissions and select the "Data Filters" tab. Click the Add button and select the table (from either the Presentation or the Business Model layer) on which you want to apply the filter. The value of the filter should be as follows:

"Table Name".Study Number = VALUEOF(NQ_SESSION.STUDY_NUMBER)

Alternatively, you could go to the Business Model layer, open the logical table source (Content tab) of the table you wish to apply the filter on, and give the filter expression above in the WHERE clause.



Friday, January 3, 2014

Interview questions for OBIEE11g



Q1 Define the components of Physical layer.

The physical layer consists of 3 objects:

1. Database Object : It is the highest level object in the Physical layer. It defines the data source to which Oracle BI Server submits queries. Importing a schema automatically creates a database object for the schema in the Physical layer.

2. Connection pool: For each data source there is at least one connection pool. The connection pool contains information about the connection between the Oracle BI Server and a data source: the data source name, the number of connections allowed, timeout information, and other connectivity-related administrative details. Connection pools allow multiple concurrent data source requests (queries) to share a single database connection, reducing the overhead of connecting to a database. You can create multiple connection pools to improve performance for groups of users.

It is recommended that you use a dedicated connection pool for initialization blocks. When a user logs in, the init blocks are fired to populate session variables. If the same connection pool is used for init blocks as well as data queries, the limit of maximum connections will be reached very quickly, and new users may be unable to log in.


3. Schema Folder: It contains tables and columns for a physical schema


Q2 Why are aliases used?

Aliases are used to avoid circular references. A common use of aliases is role-playing dimensions, where a single dimension simultaneously appears several times in the same fact table. E.g., an order date and a shipping date in a fact table may both point to the same date column in a physical dimension table, but each date is derived from a separate alias.

Q3 What is a logical table source?

A logical table source defines the mapping between a logical table and physical tables. A logical table source can have many physical table sources; in such a case, the logical table source acts like a database view querying more than one table. E.g., revenue information may come from different physical tables but be represented in a single logical table, with the logical table source mapped to these multiple tables.

There can also be more than one logical table source for a logical table. So, when data is not duplicated across physical tables, we can model multiple physical sources under one logical table source. When data is duplicated in physical sources (e.g., the same column is present in 2 physical tables), we can have more than one LTS under the same logical table. The BI Server will decide the most economical source while forming the query.

For each logical table source, we can choose the "Aggregation Content, Group By" value on the "Content" tab. This defines the grain, or the context, in which the LTS will be used. It can have 2 values: By Column or By Level. When a dimension hierarchy has not been created, we have only one option, "By Column", and we select the column that provides the context in the drop-down. When a dimension hierarchy has been created, we should select "By Level" and provide the logical level.

Q4. What is an implicit fact column and why it is used?

There may be requirements where a user needs to build a report containing columns only from dimension tables. However, the dimensions may be joined to multiple fact tables, so when the dimension-only query is run, the BI Server may choose the most economical fact table and display results, while the user may be expecting results from a join to a different fact table. For such situations, an implicit fact column is used. It is a column that is automatically added to dimension-only queries. It forces the BI Server to select a pre-determined fact table source even if it is not the most economical source, providing a default join path between dimension tables when there are several possible alternatives. The implicit fact column is set for a subject area in the Presentation layer.

Q5. Can we drag contents in a Subject Area from multiple business models?

No. A single subject area must be populated with content from a single business model; it cannot span multiple business models. Multiple subject areas can, however, reference the same business model.

Q6. What is Global Consistency Check?

It is a feature of the Administration tool that checks whether a repository meets certain requirements, such as:

a. All logical columns are mapped directly or indirectly to one or more physical columns.
b. All logical dimension tables have a logical key.
c. All logical tables have a logical join relationship to another logical table.
d. There are at least 2 logical tables in the business model: a logical fact table and a logical dimension table. Both can map to the same physical table.
e. There are no circular join relationships.
f. A subject area exists for the business model.


Q7 How is the logging level set in OBIEE?

The logging level is set in the repository file through the Admin tool. Open the repository online and go to Manage > Identity > BI Repository > Users. Select Action > Set Online User Filter and retrieve the username for which you want to set the logging level. Double-click the user name and set the logging level.

If you want to set the same logging level for all the users, then use the system session variable LOGLEVEL. You can set it through an initialization block.

Q8. What are the different types of logical dimensions?

There are 2 types: dimensions with level-based hierarchies (structure hierarchies) and dimensions with parent-child hierarchies (value hierarchies). Level-based hierarchies are those in which members of the same type occur only at a single level, while members of a parent-child hierarchy all have the same type.

Level-based hierarchies can in turn be of 3 types: unbalanced (ragged), skip-level, and time hierarchies.

An unbalanced hierarchy is one in which the leaves do not necessarily have the same depth. A skip-level hierarchy is one in which some members do not have a value for a particular ancestor level. E.g., in a Country-State-City hierarchy, the city Chandigarh does not belong to a state; it is a union territory. So you can drill down from the country level (India) to the city level (Chandigarh) directly. Time hierarchies are used for modeling time-series data.

Q9. What are the different types of aggregation that can be applied to measures?

You can apply either "Default Aggregation" rules or "Dimension-Specific" rules. The default aggregation rule is used for fully additive or non-additive measures.

There may be cases where a measure can be added along some dimensions but not others (semi-additive measures); e.g., bank balances might be averaged over time but summed over individual accounts.

In such cases, we specify the aggregation as "based on dimensions". You can select the dimension on which the aggregation will be applied, and you can also select "Others" to specify the aggregation rule for the remaining dimensions.

Q10 What is the use of setting "Number of Elements"?

The number of elements is used by the BI Server when picking aggregate sources. Setting the number of elements is necessary when there are 2 or more aggregate sources the BI Server could use. Aggregate sources are selected based on the combination of fields selected as well as the number of elements of the levels in the logical dimensions to which they map. The numbers do not need to be exact, but the ratios of numbers from one logical level to another should be accurate.

Q11 What is the use of Aggregate persistence Wizard?

OBIEE provides an aggregate persistence feature that automates the creation and loading of the aggregate tables and their corresponding OBIEE metadata mappings.

Q12  You have 2 physical tables in your schema: ORDERS_HISTORICAL and ORDERS_RECENT. Both tables contain the same columns. Historical table contains all data prior to a given Order Date. The Recent table contains all the data after the given Order date. How will you model these tables in BMM layer and how the BI server will choose when to select which table?

We will use the "Fragmentation Content" feature. In the BMM layer, we specify the fragment for both logical table sources. Let's say the given order date is 31-Dec-2013. We have a PERIODKEY column in the tables with the value 20131231. Assume that the business model name is SupplierSales and the logical fact table name is Fact-Sales.

So we will specify the fragments as :

For Historical LTS: "SupplierSales"."Fact-Sales"."Order Date" <= '20131231'
For Recent LTS: "SupplierSales"."Fact-Sales"."Order Date" > '20131231'

Also tick "This source should be combined with other sources at this level"

Now the BI Server will know which table to include in the final query. If no filter on Order Date is given, the BI Server will combine data from both sources automatically (it will do a UNION ALL).

Q13. Define the various kinds of variables in OBIEE

There are 4 classes of variables: Repository, Session, Presentation and Request Variables

A repository variable has a single value at any point in time. Repository variables can be used instead of constants. Their values persist from the time the Oracle BI Server is started until the time it is shut down. They are of 2 types: static and dynamic.

Static variables are those whose values are initialized once and remain constant.

Dynamic variables are those whose values are initialized through an initialization block using a SQL query. The value can be periodically refreshed by specifying a refresh schedule. A common use of these variables is to set column filters in reports; e.g., to filter a report on the current month, we can create a dynamic repository variable CurrentMonth and use it as a filter.

Session variables are created and assigned a value when each user logs on. Like dynamic repository variables, their values are assigned by initialization blocks, but unlike repository variables, this initialization is not scheduled. When a user begins a session, the Oracle BI Server creates new instances of the session variables and initializes them. There are as many instances of a session variable as there are active sessions on the BI Server. There are 2 types of session variables: system and non-system.

System session variables are pre-defined session variables used by the BI Server for specific purposes, such as authenticating users. They have reserved names, e.g., USER, which contains the value that the user entered as the logon name, and GROUP, which contains the group name that a user belongs to.

Non-system session variables are created for a specific application. A common use is setting filters; e.g., a non-system session variable can be created and initialized to the user's sales region. A security filter can then be set to filter rows for that user's region.

Presentation variables are created in Presentation Services using dashboard prompts. There are 2 types of dashboard prompts: column prompts and variable prompts. A presentation variable created as part of a column prompt is associated with a column, and the values it can take come from the column values.

A presentation variable defined with a variable prompt takes the value provided by the person who created the prompt. This value doesn't come from a table or a view; the variable prompt just holds a value and compares it to the value passed.

Request variables: A request variable lets you override the value of a session variable, but only for the duration of a database request initiated from a column prompt. You can create a request variable as part of the process of creating a column prompt.


A request variable that is created as part of a column prompt is associated with a column, and the values that it can take come from the column values.


Q14 What is the difference between default users - weblogic and BISystemUser ?

weblogic is the administrative user. After installation, a single administrative user is shared between Oracle BI and the WebLogic Server. It is a member of the BIAdministrators group, but it cannot impersonate other users.

BISystemUser is used by BI system components to establish connections to each other.

Q15 What is Usage Tracking and how is it enabled?

Usage tracking is a method of tracking and storing BI Server usage at the detailed query level. The statistics gathered can be used for database performance optimization, aggregation strategies, or billing users and departments based on the resources they consume. There are 2 methods for enabling usage tracking:

a. Direct Insertion: Oracle BI Server inserts statistics for every query directly into a relational database table

b. Log file: Oracle BI Server inserts statistics for every query into a log file

In the case of direct insertion, the database table (S_NQ_ACCT) is created using a SQL script, and NQSConfig.ini is edited to set ENABLE=YES under the [USAGE_TRACKING] section. The DIRECT_INSERT parameter is also set to YES.

Q16 Why it is recommended to use separate connection pool for initialization blocks?

When a user logs in, init blocks are run against the database to populate session variables, and each uses a connection. If the same connection pool is used for both data queries and init blocks, the limit of maximum connections will be used up very quickly, and new users may be unable to log in as init block queries queue up. Hence it is recommended to have a dedicated connection pool for initialization blocks.

Q17 What is the difference between Filters and Selection Steps?

Filters are applied before the query is aggregated, but selection steps are applied after the query is aggregated. So filters may change the resulting values of the aggregation, while selection steps only change which members or rows are displayed.

Q18 What is the significance of Content tab of LTS?


The Content tab is used to specify the grain at which information is stored in an LTS. For a fact table LTS, the content level is set for each dimension. This helps OBIEE understand at what level the fact is joined to each dimension, and it helps OBIEE join a fact with a non-conformed dimension. Setting the totals level on the Content tab of the LTS is meaningless and is identical to leaving it blank. The Content tab also helps OBIEE select the right LTS; e.g., it tells OBIEE that an LTS at the Month level is better than an LTS at the Day level.

Q19 What is the use of BI Java Host system component in OBIEE11g?

BI Java Host gives Presentation Services the ability to use functionality provided in Java libraries to support graph generation, styles and skins, charts, etc.

Q20 What are different types of authentications in OBIEE?

OBIEE supports 4 types of authentication:

a. LDAP authentication: An LDAP server is used to authenticate users in OBIEE. In the WebLogic console, an additional authentication provider can be set up using the LDAP details.

b. External table authentication: If the list of users and passwords is maintained in a database table, this table can be used for authentication.

c. Database authentication: The Oracle BI Server can authenticate users based on database logins. If a user has read permissions on a specific database, BI Presentation Services can authenticate that user.

d. Oracle BI Server user authentication: Using the Admin tool, a list of users and their passwords can be maintained in the BI repository. The BI Server will authenticate users against this list.

OBIEE11g Architecture


OBIEE11g is a business intelligence and analytics platform that delivers a full range of capabilities, such as interactive dashboards, ad hoc querying, and notifications and alerts.

An OBIEE11g installation results in 2 components: an Oracle BI domain and a BI schema (the BI schema is a repository schema that contains metadata about the 11g environment; 2 DB schemas are created: BIPLATFORM and MDS).




The overall system is called the "Oracle BI domain" and is made up of 2 types of components: Java components and non-Java (system) components.

Java components are Java EE applications hosted on WebLogic Server 11g on a managed server.

System components are C/C++ executables and are controlled by OPMN.

The description of various components is as follows:

Machine: A machine is a physical computer that hosts WebLogic server(s).
WebLogic domain: The basic administrative unit of a WebLogic Server installation. It is a logically related group of WebLogic Server resources that you manage as a unit.
Admin server: The admin server is the central point of control for a domain. It stores the configuration information and logs for a domain, and it also runs the WebLogic administration console. The domain configuration information is maintained in the file config.xml.
Managed server: Any server in the domain that is not the admin server. There can be multiple managed servers in a domain but only 1 admin server. Typically, in a production environment, managed servers run the business applications while the admin server is used to manage resources. The failure of the admin server for a domain does not affect the operation of managed servers in the domain: if the admin server becomes unavailable while the server instances it manages are running, the managed servers continue to run.
In a domain with only a single WebLogic Server instance, that one server functions as both admin server and managed server.
Cluster:
A WebLogic Server cluster consists of multiple WebLogic Server instances running simultaneously and working together to provide increased scalability and reliability. A cluster appears to clients as a single WebLogic Server instance. The instances that constitute a cluster can run on 1 machine or on different machines. Each cluster is managed by a single admin server.
Node Manager: A Java program that enables you to start and stop server instances (both admin and managed servers) remotely, and to monitor and automatically restart them after an unexpected failure. There is 1 Node Manager on each physical machine. Node Manager is usually implemented as a startup service so that if a machine restarts, Node Manager is active automatically. A Node Manager process is not associated with a specific WebLogic domain but with a machine; you can use the same Node Manager process to control server instances in any WebLogic Server domain, as long as the server instances reside on the same machine as the Node Manager process. Node Manager must run on each computer that hosts WebLogic Server instances—whether admin server or managed server—that you want to control with Node Manager.


Oracle Process Manager and Notification Server (OPMN) is installed and configured with every Oracle Application Server installation type and is essential for running Oracle Application Server. In OBIEE11g, OPMN is used for starting/stopping BI System components

Oracle BI Instance: A collection of system components on a single host.


Farm: All instances across the Oracle BI domain. Each EM Fusion Middleware installation manages a single farm, which can contain one or more instances and a single WebLogic domain.
WebLogic Admin Console: A browser-based GUI used to configure a domain. It enables you to monitor and manage the WebLogic domain.
Enterprise Manager: Also a browser-based tool, used to manage, monitor, and configure BI system components. Some of the activities done from EM are: starting/stopping BI system components, deploying the RPD, configuring defaults like cache size, and performing diagnostics and logging.

Wednesday, January 1, 2014

Dynamic Lookup Cache Scenario in Informatica

There may be cases where we need a dynamic lookup instead of a static lookup. The most frequent scenario is when the source data contains duplicate records, or when you are loading the same data from multiple sources in the same mapping. In such cases, if we use a static lookup, duplicate data would get inserted into the target. A dynamic lookup cache is updated as rows are processed, so the data loading will be correct.


Once the lookup is configured to use a dynamic cache, Informatica adds one extra port to the lookup: NewLookupRow. It can have 3 values: 0 (no change to the cache), 1 (row inserted into the cache), or 2 (row updated in the cache).


While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port or a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in the lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the lookup condition.