Sunday, February 21, 2016

Random Variables

Although I've taken probability and statistics as a course (numerous times, either in college or via MOOCs), I never was quite certain about the definition of a random variable.

However, now I'm more cognizant thanks to the MIT course on Probability. So basically, a random variable is a function that maps the outcome of a probabilistic experiment to a number. An example will make this clear. If I have 30 students in a class and the experiment is simply to pull out a student at random, then the probability of picking any particular student is 1/30. However, each student has attributes: height, weight, GPA, etc. If I then map the chosen student to their height, that mapping is a random variable. So based on the outcome of the experiment (i.e., who I choose), the mapping from the student to the height (or weight, or GPA) is a random variable!

So a random variable maps from the sample space (omega) to a number. You can combine random variables (e.g., add them), and the result is itself a random variable. Every outcome is mapped to a number, and a random variable can be discrete or continuous.

Now, the pmf (probability mass function, a term that is only used when the random variable under discussion takes on discrete values) is the probability law or probability distribution of the random variable X. It describes the probabilities assigned to the values the random variable can take. So P[X=1] = some probability; P[X=2] = another probability; P[X=3] = yet another probability. An exhaustive description like this constitutes the pmf.
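To make this concrete, here is a tiny R sketch of the classroom example (my own illustration, not from the course; the heights are made-up numbers):

# Hypothetical classroom: 30 students, each equally likely to be chosen
set.seed(42)
heights <- sample(c(160, 165, 170, 175), size = 30, replace = TRUE)  # made-up heights in cm

# The random variable X maps the chosen student (the outcome) to his/her height.
# Its pmf assigns a probability to each value that X can take:
pmf <- table(heights) / length(heights)
pmf        # e.g., P[X = 165] = (# of students who are 165 cm) / 30
sum(pmf)   # the probabilities sum to 1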


Sunday, February 14, 2016

Law of Large Numbers

Sources:
Elementary Statistics, Mario F. Triola
The Drunkard's Walk, Leonard Mlodinow

"When finding probabilities with the relative frequency approach (Rule 1), we obtain an approximation instead of an exact value. As the total number of observations increases, the corresponding approximations tend to get closer to the actual probability. [this is referred to as the law of large numbers which is stated as: As a procedure is repeated again and again, the relative frequency probability (from Rule 1) of an event tends to approach the actual probability."

Jakob Bernoulli discovered this law/theorem in the 1680s or so. It says that if you give him (1) a tolerance of error (+/- a percentage, e.g. + or - 5 percent) from the target value that you expect and (2) a tolerance of uncertainty (e.g., how certain you want to be of the result: 99% certain or 90% certain), then he can tell you how many trials you need to conduct. His formulas did not last because they were based on approximations, and modern mathematicians have improved on them; however, the concept behind his law is the important piece: namely, that it is always possible to conduct the procedure enough times to be almost certain that the percentage you observe will fall near the target.
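Here is a small R sketch (my own illustration, not Bernoulli's formula, which the book does not reproduce) showing the relative frequency of heads drifting toward the true probability of 0.5 as the number of coin flips grows:

set.seed(1)
true_p <- 0.5                                    # a fair coin
flips  <- rbinom(100000, size = 1, prob = true_p)

# Relative frequency of heads after n flips, for increasing n
for (n in c(10, 100, 1000, 10000, 100000)) {
  cat(sprintf("n = %6d   relative frequency = %.4f\n", n, mean(flips[1:n])))
}
# The approximations tend to get closer to 0.5 as n increases,
# which is exactly what the law of large numbers describes.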

Given the Law of Large Numbers, there is a funny/sarcastic version of it called the Law of Small Numbers, "which is based on a misconception [or mistaken intuition] that a small sample accurately reflects underlying probabilities. It is a sarcastic name describing the misguided attempt to apply the law of large numbers when the numbers aren't large." An example of this is seeing a CEO's performance over a range of years and then judging that performance as representative. One CEO's performance over a subset of years is hardly the basis with which to determine true performance.

Bernoulli had wanted to answer something of the sort: "Given that you view a certain number of roulette spins, how closely can you nail down the underlying probabilities, and with what level of confidence?" Instead he answered a closely related question: how well are underlying probabilities reflected in actual results? [He came up with a formula to determine how many trials would need to be conducted depending on how certain you wanted to be and how close you wanted to be to the true answer.]

So with the second question, we are really talking about fixed probabilities that are known (or assumed) to be the case [e.g., gambling examples where the a priori probabilities are known]. However, in most real-life cases we do not know the probabilities beforehand, and so we must actually answer the first question: given a set of data, how can we infer the underlying probabilities? [A much harder question, and one that Rev. Thomas Bayes and the science of Bayesian statistics and inference help us with.]

Probability Theory

I'm taking MIT's 6.041x : Introduction to Probability: The Science of Uncertainty in Winter 2016. It's been only the first week but it is an awesome class so far! The professor and teaching staff are remarkable: quite lucid and very engaging.

Even though I've studied probability & statistics before [I have an undergrad degree in App. Math & Statistics], I didn't have a full appreciation of the concepts. I could calculate things without really knowing what I was doing or what the underlying meaning was. Now, however, with data science coming into vogue and my own love for the topic, I've been revisiting my roots in probability & statistics. As a professional project manager, it is my job to deal with uncertainty, and I constantly need to find ways to combat it: via risk management and Monte Carlo simulations.

Highlights of the class:

  • A refresher in sets, sequences, limits, series, geometric series [the sum is 1/(1-r) for |r| < 1, when the first term is 1], Cantor's diagonalization argument on why the real #s are not countable [very cool!]
  • Using Sets as the basis, then layering probability onto sets; i.e., we calculate probabilities on events [which are subsets of the sample space]; one can then create a probability mapping/function that assigns probabilities to events; If Event A is said to occur, this means the outcome was one of the elements in that subset A. If the outcome fell outside of set A, then we say that Event A did not occur.
  • From a minimal # of axioms, proving various properties of probabilities [e.g., the union bound, and P(A ∪ B) = P(A) + P(B) - P(A ∩ B); these and others fall out of a small # of axioms]
  • How the concept of area relates to probability, and how measure theory is the underlying mathematical foundation that makes this work; some pathological subsets do not have a well-defined area and so cannot be assigned probabilities, but the unit square is just fine. In fact, areas within the unit square can be used to calculate probabilities [see the sketch after this list]
  • Paradoxes that arise when applying the properties to sets that are not finite or not countable [e.g., the union of all the points in the unit square is the entire unit square, which has area = 1 (thus probability = 1); yet each point is disjoint from every other point, so P(union of the points) should be the sum of the probabilities of the individual points, and each point has area 0 (probability 0); we have seemingly shown that 1 = 0: a paradox! It is resolved by noting that the (stronger, countable) additivity axiom only applies when the disjoint sets can be arranged into a sequence, and there is no way to do this with the points of the unit square. Countable additivity still lets us calculate probabilities over an infinite # of sets, as long as those sets can be arranged in a countable fashion]
  • Infinite sets can be discrete [like the integers] or continuous [like the reals]; this distinction is important because it determines whether a set is countable or not, and some of the laws/theorems cannot be applied in uncountable instances.
  • One way of interpreting a probability is as a frequency: how often something occurs when the experiment is repeated a huge # of times. So P(obtaining a head) = 0.5 can be read as heads coming up in about half of (whatever large # of times I toss the coin).
  • Statistics is a field that complements probability by using data to come up with good models
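As a toy illustration of the area-as-probability idea from the list above (my own sketch, not from the course), here is a Monte Carlo estimate of the probability of a region of the unit square, compared with its exact area:

# Probability that a uniformly random point (x, y) in the unit square
# falls in the triangle below the diagonal, i.e., {y < x}
set.seed(7)
n <- 100000
x <- runif(n)   # uniform on [0, 1]
y <- runif(n)

estimate <- mean(y < x)   # fraction of sampled points landing in the region
exact    <- 0.5           # the triangle's area
c(estimate = estimate, exact = exact)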

Sunday, February 7, 2016

Data Science Cool Learning Resources


1. http://scott.fortmann-roe.com/docs/BiasVariance.html
   [Understanding the Bias-Variance Tradeoff] from Udacity's Data Analysis with R course

2. http://www.perossi.org/home/bsm-1
   [Bayesian Statistics & Marketing]: features the Yogurt data set from the Data Analysis with R course

3. http://flowingdata.com/2014/02/27/how-to-read-histograms-and-use-them-in-r/
   [How to Read & Use Histograms in R]

4. http://rstudio-pubs-static.s3.amazonaws.com/4305_8df3611f69fa48c2ba6bbca9a8367895.html
   [Turning a Table into a Horizontal Bar Graph using ggplot2]

5. http://www.ats.ucla.edu/stat/r/faq/smooths.htm
   [Exploring Different Smooths in ggplot2]

6. http://dept.stat.lsa.umich.edu/~kshedden/Courses/Stat401/Notes/401-bivariate-slides.pdf
   [All about Bivariate Analysis]

7. http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html
   [Using R for Time-Series]

8. http://www.r-tutor.com/elementary-statistics/numerical-measures/percentile
   [Calculating Percentiles in R]

9. http://blog.rstudio.org/2014/07/22/introducing-tidyr/
   [Introducing tidyr]

10. http://www.statmethods.net/advgraphs/parameters.html
    [Using different plotting symbols for points, etc.]





R Graphing

Taking Data Science from Udacity [specifically the Data Analysis with R]; I love ggplot!!

To convert a variable in a dataframe to a factor variable, simply do this:
yo$id <- factor(yo$id)  # here id was an integer but will now be converted to a factor variable

Scaling data:
Use scale_x_log10() to scale the x-axis in log-10 units

  • scale_x_continuous : takes a breaks=seq(0,20000,1000) argument [where the x-axis will go from 0 -> 20,000 in increments of a thousand]; also takes a limits=c(0,20000) argument
  • scale_x_discrete
  • coord_cartesian(ylim=c(0,1000)) <= this doesn't remove points from the dataset when plotting; it just provides a zoom-in. Points outside the limits won't be shown, but they are kept and used by statistics/other layers (unlike scale limits, which drop them)
Can use the quantile function to set limits: quantile(diamonds$price, c(.90))
will find the value at the 90th percentile of diamonds$price.

xlim() and ylim() are shorthand ways of setting the axis limits (they behave like scale limits, dropping points outside the range); see the sketch below.
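A minimal sketch pulling these pieces together, using the diamonds data set that ships with ggplot2 (the particular breaks, limits, and quantile are just for illustration):

library(ggplot2)

# Limit the x scale at the 99th percentile of price (points above it are dropped),
# set explicit breaks on the x axis, and zoom the y range with coord_cartesian
# (which keeps all points; it only changes the visible window)
ggplot(diamonds, aes(x = price, y = carat)) +
  geom_point(alpha = 0.1) +
  scale_x_continuous(breaks = seq(0, 20000, 2500),
                     limits = c(0, quantile(diamonds$price, 0.99))) +
  coord_cartesian(ylim = c(0, 3))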

Examining data:

  • summary(yo)
  • unique(yo$price) # will list all the unique values that are in that variable
  • length(unique(yo$price)) # will give a count of all the unique values
  • table(yo$price) # will give a count of each unique value in the variable [similar to a tabular histogram]

Transforming Data:
  • Using dplyr & tidyr:
    • pf.fc_by_age_gender.wide <- subset(pf.fc_by_age_gender[c('age', 'gender', 'median_friend_count')], !is.na(gender)) %>% spread(gender, median_friend_count) %>% mutate(ratio = female / male)
    • the spread command spreads rows into columns; it is used to go from a long format to a wide format. So if you have grouped data as above, in long format (where the gender is repeated for each age), you can use spread to turn the gender values into columns; after executing the command above you end up with one row per age and separate female and male columns (see the toy sketch below):
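A tiny self-contained example of the long-to-wide reshape (toy numbers, not the course's Facebook data):

library(tidyr)
library(dplyr)

# Long format: gender repeated for each age
long <- data.frame(age    = c(18, 18, 19, 19),
                   gender = c('female', 'male', 'female', 'male'),
                   median_friend_count = c(200, 150, 210, 160))

wide <- long %>%
  spread(gender, median_friend_count) %>%   # one column per gender
  mutate(ratio = female / male)

wide   # columns: age, female, male, ratio (e.g., ratio = 200/150 for age 18)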



  • yo$all.purchases <- yo$strawberry + yo$blueberry + yo$pina.colada + yo$plain + yo$mixed.berry
# old school above; another way is to do transform [which is less verbose]

  • yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)

Putting Charts side-by-side:

  • install.packages('gridExtra') 
  • library(gridExtra) 
  • p1 <- ggplot(...)
  • p2 <- ggplot(...)
  • grid.arrange(p1,p2,ncol=2)


Histograms
Two ways:
(1) ggplot: ggplot(aes(price), data=yo) + geom_histogram(binwidth=10)

(2) qplot: qplot(x = price, data = yo, binwidth=10)

Both do the same thing; qplot is a quicker way to write the command

1. One of the exercises has us plotting the median data using this command:

ggplot(aes(x=age, y=friend_count, color=gender),
       data=subset(pf, !is.na(gender))) +
      geom_line(stat='summary', fun.y=median)

The above command uses a stat = summary and median function to plot the y data.

Another way to do it is by grouping the data using the dplyr package. Basically:

pf.fc_by_age_gender <- pf %>%
  group_by(age, gender) %>%
  summarise(mean_friend_count = mean(friend_count),
            median_friend_count = median(friend_count),
            n = n())

Then use:

ggplot(data=subset(pf.fc_by_age_gender,!is.na(gender)), aes(age, median_friend_count, color=gender)) + geom_line()

2. Cool way to add a horizontal line to your graph:

ggplot(data = pf.fc_by_age_gender.wide, aes(age,ratio)) +
  geom_line() + geom_hline(yintercept = 1, linetype=2) 

# The linetype parameter can take the values 0-6:
# 0 = blank, 1 = solid, 2 = dashed
# 3 = dotted, 4 = dotdash, 5 = longdash
# 6 = twodash

ScatterPlot Matrices:
Useful when you have a lot of different variables and you want to explore relationships all at once.

Use the GGally library; it needs to be installed:

install.packages('GGally')
library(GGally)
theme_set(theme_minimal(20))

set.seed(1836)
pf_subset <- pf[,c(2:15)]
names(pf_subset)
# This command may take a long time; axisLabels = 'internal' puts the variable
# labels ON the plots themselves rather than along the bottom and left-hand
# sides, which may make it easier to read
ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), ], axisLabels = 'internal')

Plotting 2 categorical variables
1. Use table command to get a frequency/contingency table: x <- table(data$g, data$h)
2. However, x will now be a datatype 'table'. You can verify this by doing: class(x)
3. To convert back to a data frame, do x <- as.data.frame(x)
4. Doing the conversion will also, as a byproduct, obliterate the column names, so you will end up with generic names like Var1, Var2, and so on.
5. You can rename these by doing: names(x)[1] = your_column_name
6. Now you can run a ggplot command on x, which is now a data frame, e.g., ggplot(x, aes(x=col_1, y=col_2, fill=col_3)) + geom_bar(stat='identity') [this will create a stacked bar chart; stat='identity' is needed because the y column already holds the counts]. A consolidated sketch follows below.
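Putting the steps above together in one runnable sketch (the data frame and column names here are hypothetical):

library(ggplot2)

# Hypothetical data frame with two categorical variables
d <- data.frame(g = sample(c('a', 'b', 'c'), 100, replace = TRUE),
                h = sample(c('yes', 'no'),   100, replace = TRUE))

x <- table(d$g, d$h)            # contingency table; class(x) is "table"
x <- as.data.frame(x)           # columns become Var1, Var2, Freq
names(x)[1:2] <- c('g', 'h')    # rename the auto-generated column names

# Stacked bar chart; stat='identity' because Freq already holds the counts
ggplot(x, aes(x = g, y = Freq, fill = h)) + geom_bar(stat = 'identity')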

The above was a workaround for using the table command. A quicker/more useful approach is to use the plyr or dplyr package, which has a count function (see the sketch below).
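For example, with dplyr's count (reusing the hypothetical data frame d from the sketch above):

library(dplyr)
library(ggplot2)

counts <- d %>% count(g, h)   # one row per (g, h) combination, with the counts in column n

ggplot(counts, aes(x = g, y = n, fill = h)) + geom_bar(stat = 'identity')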

See this very useful link: http://www.r-bloggers.com/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/

&

http://www.stat.wisc.edu/~larget/stat302/chap2.pdf



Histograms

This is a cool command to view histogram of diamond prices by multiple dimensions:
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(aes(fill = cut)) +
  scale_x_log10() +
  scale_fill_brewer(type = 'qual') +
  facet_wrap(~color)

The facet_wrap creates a separate histogram for each color grade [D = the best, colorless grade, for example]; the geom_histogram(aes(fill=cut)) adds yet another dimension: each histogram bar is filled by color according to the diamond's 'cut', so 'Fair' is colored one way, 'Good' another, etc.