Google
Information Storage and Retrieval

Pages

Sunday, November 1, 2015

Understanding recursion

Recursion is the process of solving a problem, by reducing the problem to a smaller(or simpler) version of the same problem and keep it reducing until you reach a case where you can solve it directly. This simple case is called base case.

Factorial:

n! = n*(n-1)*(n-2)*.......*1

Base Case: If n=1 then return 1
Recursive Case: return n*(n-1)!

Python Code:

def factorial(n):
    if(n==1):
        return 1
    else:
        return n*factorial(n-1)

Towers of Hanoi:

We have 3 towers: FromTower, ToTower and SpareTower

Base case: If no of disks =1, move FromTower> ToTower
Recursive Case: Move (n-1) FromTower> SpareTower, then move 1 disk FromTower>ToTower, then move (n-1) disks SpareTower>ToTower

Python Code:

def move(from,to):
    print("Move from "+ str(from)+" to "+str(to)

def TowersofHanoi(n,from,to,spare):
    if(n==1):
        move(from,to)
    else:
        TowersofHanoi(n-1,from,spare,to)
        TowersofHanoi(1,from,to,spare)
        TowersofHanoi(n-1,spare,to,from)
        

Tuesday, August 19, 2014

Use of Staging Area in Datawarehouse

Reasons for Staging data before its loaded into a datawarehouse:

1. 'Free' the source systems: Once the data has been brought to staging tables, the source systems are free for this ETL. If the load process fails, we don't have to intrude again into source tables.

2. Backup: Sometimes, due to massive data volumes, datawarehouse may not be reliably backed-up. We can easily take the backup of staging files/tables and use them if we ever need to reload the datawarehouse.

3. Auditing: The process of auditing becomes easy and straightforward using Staging tables.

Saturday, August 16, 2014

Fastest way to know if a number is power of 2

Using Bitwise operators:

Check num & (num-1) == 0. If true, the number is a power of 2 else its not.

Sunday, June 15, 2014

DataWarehousing Terms

Dimensional Modeling:

A way to design datawarehouse. It employs Ralph Kimball's bottom up approach to design data marts using conformed Facts and Dimensions. It contrasts with ER Modeling in its purpose. The purpose of ER modeling is to store the data in such a way that redundancy is minimized and inserts and updates are quick. Dimensional modeling on the other hand stores the data in such a way that querying or retrieval is fast. 

Role Playing Dimension:

Its is a dimension that has multiple meanings and can be used for multiple purposes. A Date dimension is an example.

Junk Dimension:

A junk dimension is a convenient grouping of flags and indicators into a separate table. In this way, you don't need to store these fields into the fact table and hence you don't need to create separate dimensions for them.

Factless fact:

A table which captures an occurrence of an event but do not contain any actual measurements is known as factless fact table.  This table contains only the keys of dimensions. Typical example can be a fact table that captures the attendance of a student on a particular day.

Degenerate Dimension:

There are instances when an attribute is present in a fact table but it does not have any corresponding dimension. Such attributes are known as degenrate dimensions. For example, Order Numbers, Invoice Number, Transaction Dates, Sale Dates, Credit-Debit Indicator etc can all be classified as degenerate dimensions

Conformed Dimensions and Conformed Facts:

When there is need to analyze facts from two or more data marts, conformed dimensions are used. This is also known as "drilling-across". Time dimension is an example of conformed dimension.

Conforming the facts is the process where the names of measure columns in different fact tables are kept identical.  

Master Data Management:

MDM is the process of creating and managing data that an organization must have as a single copy. For example, data about customers or products can be used as a master copy across different units of the same organization. master data provides a single version of truth. MDM is similar to datawarehousing in terms of ETL process but there are following differences between the two:

a. MDM only applies to entities and not to transactional data while a datawarehouse includes both transactional and non-transactional data. MDM is about dimensional data. It does not concerns with data in Facts.

b. MDM is about reports which are concerned with data goverance, data quality and compliance. Datawarehousing is about analytical reports 

Friday, February 7, 2014

What is Oracle Wallet

There are often situations where we have to give database access credentials in a shell script. This can be a security issue as database connection details are explicitly written in these scripts. Oracle provides an option of creating an external password storage mechanism. 

Wallet is simply a directory in database server where passwords are written in an encrypted form. The wallet location is configured in SQLNET.ora file  and a stored password can be retrieved/used by referencing a TNS alias configured in TNSNAMES.ORA file. Now instead of using explicit db connection details, one can use this TNS alias in shell scripts to connect to database.

Saturday, January 25, 2014

Statistics: Random Notes

Statistics:
Statistics is the science of collecting,organizing, summarizing, analyzing and interpreting data.

Quantitative variables take numerical values whose "size" is meaningful. Quantitative variables answer questions such as "how many?" or "how much?" For example, it makes sense to add, to subtract, and to compare two persons' weights, or two families' incomes: These are quantitative variables. Quantitative variables typically have measurement units, such as pounds, dollars, years, volts, gallons, megabytes, inches, degrees, miles per hour, pounds per square inch, BTUs, and so on.

Some variables, such as social security numbers and zip codes, take numerical values, but are not quantitative: They are qualitative or categorical variables. The sum of two zip codes or social security numbers is not meaningful. The average of a list of zip codes is not meaningful. Qualitative and categorical variables typically do not have units. Qualitative or categorical variables—such as gender, hair color, or ethnicity—group individuals. Qualitative and categorical variables have neither a "size" nor, typically, a natural ordering to their values. They answer questions such as "which kind?" The values categorical and qualitative variables take are typically adjectives (for example, green, female, or tall). Arithmetic with qualitative variables usually does not make sense, even if the variables take numerical values. Categorical variables divide individuals into categories, such as gender, ethnicity, age group, or whether or not the individual finished high school

Distribution: The pattern of values in the data, showing their frequency of occurrence relative to each other.

ModeL: A model is a formula where one variable (response) varies depending on one or more independent variables(covariates). One of the simplest models we can create is a Linear Model where we start with the assumption that the dependent variable varies linearly with the independent variable(s). Creating a Linear Model involves a technique known as Linear Regression.

Histogram:
A histogram is a figure that shows how a quantitative variable is distributed over all its values. It allows for the variable to be “binned” into unequal intervals.
In a histogram, area of a bar means the percent in the interval.
Height of bar = % in interval/(right endpoint – left endpoint)
Heights measure density or “crowdedness” in the interval.

Measures of Location:
Median is the half-way point of data. The median is the number that divides the (ordered) data in half—the smallest number that is at least as big as half the data. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median
Mode: The value that has highest frequency
Mean: The mean (more precisely, the arithmetic mean) is commonly called the average. It is the sum of the data, divided by the number of data

For qualitative and categorical data, the mode makes sense, but the mean and median do not

The pth percentile of a list of numbers is the smallest number that is atleast as large as p% of the list.
A bar chart gives 2 D data while Histogram is 1 D data

An outlier is a data point that lies outside the general range of the data.
Markov’s Inequality: If a list has only non-negative entries, then the proportion of entries that are atleast as large as k times the average is atmost 1/k.
Variance: Mean Square of deviations from average.
Standard Deviation: Squareroot of variance. It measures roughly how far off the entries are from their average. Since its simply a measure, it can’t be negative.
When you add a constant to a list of values, the average also adds up by constant but the SD don’t change. If you multiply by a constant, the new average and new SD also get multiplied by that constant.
Corelation Coefficient (r) is a number between -1 and 1. It measures linear association i.e how tightly the points are clustered about a straight line.


Distribution of a variable is the pattern of values in the data for that variable, showing the frequency of occurrence of the values relative to each other.

Sunday, January 19, 2014

What is the difference between an Oracle User and an Oracle Schema?

An Oracle schema is a collection of database objects like tables, functions, indexes etc A user "owns" a schema. When a user is created, it is given access to a schema with same name. A user can be given access to objects of other schema.

For all intents and purposes you can consider a user to be a schema and a schema to be a user.