Statistics:
Statistics is the science of collecting, organizing,
summarizing, analyzing, and interpreting data.
Quantitative variables take numerical values whose "size" is meaningful.
Quantitative variables answer questions such as "how many?" or "how much?" For
example, it makes sense to add, to subtract, and to compare two persons'
weights, or two families' incomes: these are quantitative variables.
Quantitative variables typically have measurement units, such as pounds,
dollars, years, volts, gallons, megabytes, inches, degrees, miles per hour,
pounds per square inch, BTUs, and so on.
Some variables, such as social security numbers and zip codes, take numerical
values but are not quantitative: they are qualitative or categorical variables.
The sum of two zip codes or social security numbers is not meaningful. The
average of a list of zip codes is not meaningful.
Qualitative and categorical variables typically do not have units. Qualitative
or categorical variables, such as gender, hair color, or ethnicity, group
individuals. Qualitative and categorical variables have neither a "size" nor,
typically, a natural ordering to their values. They answer questions such as
"which kind?" The values categorical and qualitative variables take are
typically adjectives (for example, green, female, or tall). Arithmetic with
qualitative variables usually does not make sense, even if the variables take
numerical values.
Categorical variables divide individuals into categories, such as gender,
ethnicity, age group, or whether or not the individual finished high school.
Distribution:
The pattern of values in the data, showing their frequency of occurrence
relative to each other.
Model: A model is
a formula that describes how one variable (the response, or dependent variable)
varies with one or more independent variables (covariates). One of the simplest
models we can create is a Linear Model, where we start with the assumption that
the dependent variable varies linearly with the independent variable(s). Fitting
a Linear Model to data involves a technique known as Linear Regression.
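As an illustration of fitting a linear model, the sketch below uses ordinary least squares with a single covariate. The function name fit_line and the data are made up for this example; the notes do not say which fitting method is intended, so least squares is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (one covariate).

    fit_line is a hypothetical helper name, not from the notes.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With real data the points will not lie exactly on a line, and the fitted line is the one that minimizes the sum of squared vertical distances to the points.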
Histogram:
A histogram is a figure that shows how a quantitative variable is distributed
over all its values. It allows for the variable to be "binned" into unequal
intervals.
In a histogram, the area of a bar represents the percentage of the data that
falls in the interval, so:
Height of bar = (% in interval) / (right endpoint - left endpoint)
Heights measure density or "crowdedness" in the interval.
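The height formula can be checked with a small sketch. The function name histogram_heights, the data, and the bin edges are made up for illustration, and the convention that each bin includes its left endpoint but not its right one is an assumption, since the notes do not specify it.

```python
def histogram_heights(data, edges):
    """Density heights (% per unit) for bins [edges[i], edges[i+1]).

    Assumes left-inclusive, right-exclusive bins; hypothetical helper.
    """
    n = len(data)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        percent = 100.0 * count / n
        # Height = % in interval / width of interval
        heights.append(percent / (right - left))
    return heights


data = [1, 2, 2, 3, 4, 6, 8, 9, 15, 18]   # made-up values
edges = [0, 5, 10, 20]                    # unequal bins: widths 5, 5, 10
heights = histogram_heights(data, edges)

# Area of each bar (height * width) recovers the % in the interval
areas = [h * (r - l) for h, l, r in zip(heights, edges, edges[1:])]
print(heights, areas)
```

The last bin holds 20% of the data but is twice as wide as the others, so its bar is low: the data are less crowded there.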
Measures of Location:
Median: The median is the halfway point of the data. It is the number that
divides the (ordered) data in half: the smallest number that is at least as big
as half the data. At least half the data are equal to or smaller than the
median, and at least half the data are equal to or greater than the median.
Mode: The value that has the highest frequency.
Mean: The mean (more precisely, the arithmetic mean) is commonly called the
average. It is the sum of the data, divided by the number of data.
Percentile: The pth percentile of a list of numbers is the smallest number that
is at least as large as p% of the list.
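A minimal sketch of these measures, following the definitions above literally. The function names and the data are made up for illustration. Note that the median defined here is always one of the data values; many textbooks and libraries instead average the two middle values of an even-length list, so their answers can differ from this one.

```python
import math
from collections import Counter


def mean(values):
    return sum(values) / len(values)


def percentile(values, p):
    """Smallest entry that is at least as large as p% of the list (0 < p <= 100)."""
    ordered = sorted(values)
    # Number of entries that must be <= the answer
    k = math.ceil(p * len(ordered) / 100)
    return ordered[max(k, 1) - 1]


def median(values):
    # By the definition above, the median is the 50th percentile
    return percentile(values, 50)


def mode(values):
    # If several values tie for the highest frequency, this returns
    # the one that appears first in the list
    return Counter(values).most_common(1)[0][0]


data = [2, 3, 3, 5, 7, 10]   # made-up values
print(mean(data), median(data), mode(data), percentile(data, 90))
```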
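A bar chart involves two variables (a category and a number for each category),
while a histogram shows the distribution of a single quantitative variable.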
An outlier is a data point that lies outside the general
range of the data.
Markov's Inequality:
If a list has only non-negative entries, then for any k > 0 the proportion of
entries that are at least as large as k times the average is at most 1/k.
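The inequality can be checked numerically on a small list. This is only a sketch on made-up data, with hypothetical function names; it illustrates the bound and does not prove it.

```python
def proportion_at_least(values, threshold):
    """Fraction of entries that are >= threshold."""
    return sum(1 for v in values if v >= threshold) / len(values)


def markov_holds(values, k):
    """Check Markov's inequality for a list of non-negative entries and k > 0."""
    average = sum(values) / len(values)
    return proportion_at_least(values, k * average) <= 1 / k


data = [0, 1, 1, 2, 3, 5, 8, 20]   # made-up, non-negative; average is 5
for k in (1, 2, 4):
    average = sum(data) / len(data)
    print(k, proportion_at_least(data, k * average), 1 / k)
```

The bound is often loose: here only 1 entry in 8 is at least twice the average, while the inequality merely guarantees that no more than half are.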
Variance: The mean of the squared deviations from the average.
Standard Deviation:
The square root of the variance. It measures roughly how far off the entries
are from their average. Because it is the square root of an average of squares,
it cannot be negative.
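When you add a constant to every value in a list, the average increases by that
constant but the SD does not change. If you multiply every value by a constant,
the new average is the old average times that constant, and the new SD is the
old SD times the absolute value of that constant (the SD is never negative).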
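The definitions of variance and SD, and the effect of adding or multiplying by a constant, can be checked with a short sketch. The data are made up, and the variance divides by the number of entries, as in the definition above, rather than by one less than that number as some sample formulas do.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Mean of the squared deviations from the average (divides by n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def sd(values):
    return math.sqrt(variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up values: mean 5, variance 4, SD 2
shifted = [v + 10 for v in data]       # add a constant
scaled = [-3 * v for v in data]        # multiply by a negative constant

print(mean(data), variance(data), sd(data))
print(mean(shifted), sd(shifted))      # mean moves by 10, SD unchanged
print(mean(scaled), sd(scaled))        # mean times -3, SD times |-3| = 3
```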
Correlation Coefficient
(r) is a number between -1 and 1. It measures linear association, i.e., how
tightly the points are clustered about a straight line.
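One common way to compute r is sketched below: the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The notes do not give a formula, so this particular form, the function name, and the data are assumptions for illustration.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r of paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4]
rising = [3, 5, 7, 9]      # exactly on a line with positive slope
falling = [9, 7, 5, 3]     # exactly on a line with negative slope
scattered = [3, 9, 2, 8]   # made-up values, not close to any line

print(correlation(xs, rising), correlation(xs, falling), correlation(xs, scattered))
```

Points lying exactly on a rising line give r = 1, points on a falling line give r = -1, and loosely clustered points give a value closer to 0.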
Distribution of a variable is the pattern of values in the
data for that variable, showing the frequency of occurrence of the values
relative to each other.