Information Storage and Retrieval: Statistics: Random Notes

Saturday, January 25, 2014

Statistics: Random Notes

Statistics:
Statistics is the science of collecting, organizing, summarizing, analyzing, and interpreting data.

Quantitative variables take numerical values whose "size" is meaningful. Quantitative variables answer questions such as "how many?" or "how much?" For example, it makes sense to add, to subtract, and to compare two persons' weights, or two families' incomes: These are quantitative variables. Quantitative variables typically have measurement units, such as pounds, dollars, years, volts, gallons, megabytes, inches, degrees, miles per hour, pounds per square inch, BTUs, and so on.

Some variables, such as social security numbers and zip codes, take numerical values but are not quantitative: they are qualitative or categorical variables. The sum of two zip codes or social security numbers is not meaningful, and neither is the average of a list of zip codes. Qualitative or categorical variables, such as gender, hair color, ethnicity, age group, or whether or not the individual finished high school, divide individuals into categories. They have neither a "size" nor, typically, a natural ordering to their values, and they typically do not have units. They answer questions such as "which kind?" The values they take are typically adjectives (for example, green, female, or tall). Arithmetic with qualitative variables usually does not make sense, even if the variables take numerical values.

Distribution: The pattern of values in the data, showing their frequency of occurrence relative to each other.

Model: A model is a formula in which one variable (the response, or dependent variable) varies depending on one or more independent variables (covariates). One of the simplest models is a linear model, which starts from the assumption that the dependent variable varies linearly with the independent variable(s). Fitting a linear model to data is done with a technique known as linear regression.
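As a rough illustration of fitting a linear model, here is a minimal least-squares sketch in plain Python. The function name `fit_line` and the data are made up for this example; the data were chosen to lie exactly on y = 1 + 2x so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (one covariate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: the response is exactly 1 + 2 * covariate.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With real data the points will not fall exactly on a line, and the fitted line is the one that minimizes the sum of squared vertical distances to the points.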

Histogram:
A histogram is a figure that shows how a quantitative variable is distributed over all its values. It allows for the variable to be “binned” into unequal intervals.
In a histogram, the area of a bar represents the percentage of the data that falls in the interval.
Height of bar = (% in interval) / (right endpoint - left endpoint)
Heights measure density or “crowdedness” in the interval.
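The height formula can be sketched in a few lines of Python. The helper `histogram_heights`, the data, and the bin edges are all made up for illustration, and the sketch assumes each bin includes its left endpoint but not its right one.

```python
def histogram_heights(data, edges):
    """Height (percent per unit) of each bin [edges[i], edges[i+1])."""
    n = len(data)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        percent = 100 * count / n
        # Height = percent in interval / width of interval.
        heights.append(percent / (right - left))
    return heights


# Hypothetical data with three bins of unequal width.
data = [1, 2, 2, 3, 5, 7, 8, 12, 15, 18]
edges = [0, 4, 10, 20]
heights = histogram_heights(data, edges)

# Area of each bar = height * width = percent of data in that bin.
areas = [h * (r - l) for h, l, r in zip(heights, edges, edges[1:])]
print(heights, areas)
```

Because the heights are densities, the bar areas add up to 100% no matter how the bins are chosen, as long as the bins cover all the data.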

Measures of Location:
Median: The median is the halfway point of the data. It is the number that divides the (ordered) data in half: the smallest number that is at least as big as half the data. At least half the data are equal to or smaller than the median, and at least half the data are equal to or greater than the median.
Mode: The value that has the highest frequency.
Mean: The mean (more precisely, the arithmetic mean) is commonly called the average. It is the sum of the data divided by the number of data points.

For qualitative and categorical data, the mode makes sense, but the mean and median do not.
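These measures can be tried out with Python's standard `statistics` module. The lists are made up for illustration. Note one assumption: `statistics.median_low` is used because, for a list with an even number of entries, it returns the lower of the two middle values, which matches the "smallest number that is at least as big as half the data" definition; the more common `statistics.median` would average the two middle values instead.

```python
import statistics

# Hypothetical quantitative data.
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)          # sum of data / number of data points
median_value = statistics.median_low(values)  # lower middle value of the sorted data
mode_value = statistics.mode(values)          # most frequent value

# Hypothetical categorical data: only the mode is meaningful here.
colors = ["green", "blue", "green", "red"]
mode_color = statistics.mode(colors)

print(mean_value, median_value, mode_value, mode_color)
```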

The pth percentile of a list of numbers is the smallest number that is at least as large as p% of the list.
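The percentile definition translates fairly directly into code. This is a sketch under the assumption that the percentile is taken from the list itself (other conventions interpolate between entries); the function name `percentile` and the scores are made up.

```python
import math


def percentile(data, p):
    """Smallest entry that is at least as large as p% of the entries."""
    ordered = sorted(data)
    n = len(ordered)
    # We need at least p% of n entries to be <= the returned value,
    # so take the k-th smallest entry with k = ceil(p * n / 100).
    k = max(1, math.ceil(p * n / 100))
    return ordered[k - 1]


# Hypothetical list of eight scores.
scores = [50, 10, 80, 30, 20, 70, 40, 60]
p25 = percentile(scores, 25)
p50 = percentile(scores, 50)
p100 = percentile(scores, 100)
print(p25, p50, p100)
```

Under this definition the 50th percentile coincides with the median as defined earlier in these notes.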
A bar chart relates two things (a category and a value for that category), while a histogram shows the distribution of a single quantitative variable.

An outlier is a data point that lies outside the general range of the data.
Markov’s Inequality: If a list has only non-negative entries, then for any k > 0 the proportion of entries that are at least as large as k times the average is at most 1/k.
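Markov's inequality is easy to check numerically. The sketch below uses a made-up non-negative list with one large entry and a hypothetical helper `fraction_at_least`; it compares the observed proportion with the 1/k bound for a few values of k.

```python
def fraction_at_least(data, threshold):
    """Proportion of entries that are >= threshold."""
    return sum(1 for x in data if x >= threshold) / len(data)


# Hypothetical non-negative list; the entries sum to 100, so the average is 10.
incomes = [0, 1, 1, 2, 2, 3, 5, 8, 13, 65]
average = sum(incomes) / len(incomes)

results = []
for k in (2, 3, 5):
    observed = fraction_at_least(incomes, k * average)
    bound = 1 / k
    results.append((k, observed, bound))
    print(k, observed, bound)
```

The bound is often loose, as it is here, because it uses nothing about the list except its average and the fact that no entry is negative.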
Variance: The mean of the squared deviations from the average.
Standard Deviation: The square root of the variance. It measures roughly how far off the entries are from their average. Because it is the non-negative square root of an average of squares, it can’t be negative.
When you add a constant to every value in a list, the average increases by that constant but the SD doesn’t change. If you multiply every value by a constant, the average is multiplied by that constant and the SD is multiplied by the absolute value of that constant.
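The effect of shifting and scaling can be checked with the standard `statistics` module. The list is made up (chosen so the mean is 5 and the SD is 2), and `statistics.pstdev` is used because it divides by the number of entries, matching the "mean of squared deviations" definition of variance. The negative multiplier is deliberate: it shows the SD scaling by the size of the constant while staying non-negative.

```python
import statistics


def mean_and_sd(data):
    """Average and population SD (divides by n, not n - 1)."""
    return statistics.mean(data), statistics.pstdev(data)


# Hypothetical list: mean 5, squared deviations sum to 32, variance 4, SD 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]

base_mean, base_sd = mean_and_sd(values)
shift_mean, shift_sd = mean_and_sd([x + 10 for x in values])   # add a constant
scale_mean, scale_sd = mean_and_sd([-3 * x for x in values])   # multiply by a constant

print(base_mean, base_sd)
print(shift_mean, shift_sd)
print(scale_mean, scale_sd)
```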
Correlation Coefficient (r) is a number between -1 and 1. It measures linear association, i.e., how tightly the points are clustered about a straight line.
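One common way to compute r is to convert both variables to standard units (subtract the average, divide by the SD) and average the products. The sketch below follows that recipe; the function name `correlation` and the data are made up, and it assumes neither list is constant (otherwise the SD is zero and r is undefined).

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r: average product of standard units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    products = (
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    )
    return sum(products) / n


# Hypothetical data sets.
xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [2, 4, 6, 8, 10])     # points exactly on a rising line
r_down = correlation(xs, [10, 8, 6, 4, 2])   # points exactly on a falling line
r_noisy = correlation(xs, [2, 1, 4, 3, 5])   # loosely clustered about a rising line

print(r_up, r_down, r_noisy)
```

Values of r near 1 or -1 mean the points hug a line closely; values near 0 mean there is little linear association, although there may still be a nonlinear one.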


