Statistics Learning: R Graphing

I'm taking Data Science from Udacity (specifically the Data Analysis with R course); I love ggplot!!

To convert a variable in a data frame to a factor variable, do this:
yo$id <- factor(yo$id) # id was an integer; it is now a factor variable
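
A minimal sketch of the conversion on a toy data frame. The toy_yo data below is made up for illustration; it is not the course's yo dataset.

```r
# Hypothetical toy data standing in for the yo data frame
toy_yo <- data.frame(id = c(101L, 102L, 101L, 103L),
                     price = c(69, 69, 65, 69))

class(toy_yo$id)                # "integer"
toy_yo$id <- factor(toy_yo$id)  # convert the column in place
class(toy_yo$id)                # "factor"
levels(toy_yo$id)               # "101" "102" "103"
```

Once id is a factor, ggplot treats it as categorical (for example, separate colors or facets per id) rather than as a continuous number.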

Scaling data:
scale_x_log10 : scales the x axis in log-10 units

scale_x_continuous : takes a breaks=seq(0,20000,1000) argument [the x-axis tick marks will run from 0 -> 20,000 in increments of a thousand]; also takes a limits=c(0,20000) argument
scale_x_discrete : the counterpart for a categorical (factor) x axis
coord_cartesian(ylim=c(0,1000)) <= this doesn't remove points from the dataset when plotting; it just zooms in. Points outside the range aren't shown, but they are kept for other layers (and for stats such as smoothers) to make use of

Can use the quantile function to set limits: quantile(diamonds$price, c(.90))
returns the 90th percentile of diamonds$price, i.e. the price that 90% of the values fall at or below.
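
A small sketch of using that percentile as an axis limit. The variable name upper and the binwidth are my own choices for illustration, not from the course.

```r
library(ggplot2)  # also provides the diamonds dataset

# 90th percentile of price; unname() drops the "90%" label
upper <- unname(quantile(diamonds$price, 0.90))

# Zoom the x axis to the cheapest 90% of diamonds without dropping any data
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 100) +
  coord_cartesian(xlim = c(0, upper))
```

This keeps a long right tail from squashing the bulk of the distribution into the left edge of the plot.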

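xlim : shorthand for setting the x scale limits, e.g. xlim(0, 20000)
ylim : same for the y axis; unlike coord_cartesian, xlim/ylim drop the points outside the limits (they become NA) before any stats are computed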
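
A sketch contrasting the zoom behaviour of coord_cartesian with the ylim shortcut, using the mean price per cut as the stat. It assumes ggplot2 3.3.0 or later (for the fun argument); the variable names are mine.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = cut, y = price)) +
  stat_summary(fun = mean, geom = "point")

p_zoom <- base + coord_cartesian(ylim = c(0, 1000))  # zoom only, data kept
p_drop <- base + ylim(0, 1000)                       # prices above 1000 are dropped

# layer_data() returns the values ggplot2 computed for a layer
means_zoom <- layer_data(p_zoom)$y  # means over all diamonds
means_drop <- layer_data(p_drop)$y  # means over the remaining prices only
```

Building p_drop emits a warning about removed rows, which is the sign that the stat was computed on a reduced dataset.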

Examining data:

summary(yo)
unique(yo$price) # will list all the unique values that are in that variable
length(unique(yo$price)) # will give a count of all the unique values
table(yo$price) # will give a count of each unique value in the variable [similar to a tabular histogram]
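
A quick sketch of what these return, on made-up prices (toy_yo is hypothetical, not the real yo data):

```r
# Hypothetical toy data standing in for the yo data frame
toy_yo <- data.frame(price = c(69, 65, 69, 59, 69, 65))

unique(toy_yo$price)          # 69 65 59 (in order of first appearance)
length(unique(toy_yo$price))  # 3
table(toy_yo$price)           # counts per value: 59 -> 1, 65 -> 2, 69 -> 3
```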

Transforming Data:

Using dplyr & tidyr:

pf.fc_by_age_gender.wide <- subset(pf.fc_by_age_gender[c('age', 'gender', 'median_friend_count')], !is.na(gender)) %>% spread(gender, median_friend_count) %>% mutate(ratio = female / male)
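The spread command spreads rows into columns; it is used to go from a long format to a wide format. The grouped data used above is in long format: one row per age/gender pair, with the columns age, gender and median_friend_count, so the gender value is repeated for each age. spread turns each gender value into its own column.

After executing the above command you get one row per age, with the columns age, female, male and ratio.

[Newer versions of tidyr recommend pivot_wider(names_from = gender, values_from = median_friend_count) in place of spread; spread still works.]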
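
A toy sketch of the long-to-wide step. The numbers in toy_long are invented purely to show the shape of the result; they are not from the pf dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per age/gender pair
toy_long <- data.frame(
  age = c(13, 13, 14, 14),
  gender = c("female", "male", "female", "male"),
  median_friend_count = c(100, 50, 120, 80)
)

toy_wide <- toy_long %>%
  spread(gender, median_friend_count) %>%  # gender values become columns
  mutate(ratio = female / male)

toy_wide  # columns: age, female, male, ratio (one row per age)
```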

yo$all.purchases <- yo$strawberry + yo$blueberry + yo$pina.colada + yo$plain + yo$mixed.berry

# old school above; another way is to use transform [which is less verbose]

yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)

Putting Charts side-by-side:

install.packages('gridExtra')
library(gridExtra)
p1 <- ggplot(...)
p2 <- ggplot(...)
grid.arrange(p1,p2,ncol=2)
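
A runnable version of the same idea, filling in the two elided ggplot(...) calls with plots of my own choosing on the diamonds dataset (the specific plots are an assumption, not the ones from the course):

```r
library(ggplot2)
library(gridExtra)

# Same variable on a linear and on a log10 x axis
p1 <- ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 500)
p2 <- ggplot(diamonds, aes(x = price)) + geom_histogram(bins = 30) + scale_x_log10()

grid.arrange(p1, p2, ncol = 2)  # two columns, so the plots sit side by side
```

Using ncol = 1 instead would stack the plots vertically.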

Histograms
Two ways:
(1) ggplot: ggplot(aes(price), data=yo) + geom_histogram(binwidth=10)

(2) qplot: qplot(x = price, data = yo, binwidth=10)

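Both do the same thing; qplot is a quicker way to write the command [note that qplot is deprecated as of ggplot2 3.4.0, so the ggplot form is the safer one for new code]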

1. One of the exercises has us plotting the median data using this command:

ggplot(aes(x=age, y=friend_count, color=gender),
data=subset(pf, !is.na(gender))) +
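geom_line(stat='summary', fun=median) # use fun.y=median on ggplot2 older than 3.3.0

The above command uses stat='summary' with the median function, so ggplot computes the median friend_count for each age (per gender) and plots that as the y value.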

Another way to do it is by grouping the data using the dplyr package. Basically:

pf.fc_by_age_gender <- pf %>% group_by(age, gender) %>% summarise(mean_friend_count=mean(friend_count), median_friend_count = median(friend_count), n=n())

Then use:

ggplot(data=subset(pf.fc_by_age_gender,!is.na(gender)), aes(age, median_friend_count, color=gender)) + geom_line()
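
A sketch checking that the two approaches plot the same medians. toy_pf is invented stand-in data, not the pf dataset, and the code assumes ggplot2 3.3.0 or later for the fun argument.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for pf
toy_pf <- data.frame(
  age = c(13, 13, 13, 14, 14, 14, 13, 13, 14, 14),
  gender = c("female", "female", "female", "female", "female", "female",
             "male", "male", "male", "male"),
  friend_count = c(10, 30, 200, 5, 15, 25, 8, 12, 40, 60)
)

# Approach 1: let ggplot2 compute the median while plotting
p_stat <- ggplot(toy_pf, aes(x = age, y = friend_count, color = gender)) +
  geom_line(stat = "summary", fun = median)

# Approach 2: summarise first with dplyr, then plot the result as-is
toy_by_age_gender <- toy_pf %>%
  group_by(age, gender) %>%
  summarise(median_friend_count = median(friend_count), n = n())

p_pre <- ggplot(toy_by_age_gender, aes(age, median_friend_count, color = gender)) +
  geom_line()

# The y values ggplot2 computed in approach 1 match the dplyr medians
sort(layer_data(p_stat)$y)
sort(toy_by_age_gender$median_friend_count)
```

The pre-aggregated data frame is the more reusable of the two, since the same summary can feed later steps such as the wide-format ratio.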

2. Cool way to add a horizontal line to your graph:

ggplot(data = pf.fc_by_age_gender.wide, aes(age,ratio)) +
geom_line() + geom_hline(yintercept = 1, linetype=2)

# The linetype parameter can take the values 0-6:
# 0 = blank, 1 = solid, 2 = dashed
# 3 = dotted, 4 = dotdash, 5 = longdash
# 6 = twodash

ScatterPlot Matrices:
Useful when you have a lot of different variables and you want to explore relationships all at once.

Use the GGally package; it needs to be installed first

install.packages('GGally')
library(GGally)
theme_set(theme_minimal(20))

set.seed(1836)
pf_subset <- pf[,c(2:15)]
names(pf_subset)
# this command may take a long time; the axisLabels = 'internal' option puts each variable's label ON the plot (in the diagonal panels) rather than on the bottom and left-hand sides, which may make it easier to read
ggpairs(pf_subset[sample.int(nrow(pf_subset),1000),], axisLabels = 'internal')

Plotting 2 categorical variables
1. Use the table command to get a frequency/contingency table: x <- table(data$g, data$h)
2. However, x will now be of class 'table'. You can verify this by doing: class(x)
3. To convert back to a data frame, do x <- as.data.frame(x)
4. As a byproduct, the conversion loses the original variable names, so the columns will be called Var1, Var2, and so on, plus a Freq column holding the counts.
5. You can rename these by doing: names(x)[1] <- 'your_column_name'
6. Now you can run a ggplot command on x, which is now a data frame, e.g., ggplot(x, aes(x=col_1, y=Freq, fill=col_2)) + geom_bar(stat='identity') [This will create a stacked bar chart; stat='identity' is needed because the counts are already computed, and plain geom_bar() counts rows itself and errors when given both x and y]

The above was a workaround when using the table command. Quicker and more useful is the count function from the plyr or dplyr package, which returns a data frame directly.
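
A sketch of both routes using the built-in mtcars dataset as a stand-in (cyl and gear play the role of the two categorical variables; this dataset choice is mine, not from the course):

```r
library(ggplot2)
library(dplyr)

# Workaround: table -> data frame -> rename -> plot
x <- as.data.frame(table(mtcars$cyl, mtcars$gear))
names(x)[1] <- "cyl"
names(x)[2] <- "gear"  # the third column keeps its default name, Freq

p_table <- ggplot(x, aes(x = cyl, y = Freq, fill = gear)) +
  geom_bar(stat = "identity")  # stacked bars from precomputed counts

# Shortcut: dplyr's count() returns a data frame directly, with counts in n
y <- mtcars %>% count(cyl, gear)

p_count <- ggplot(y, aes(x = factor(cyl), y = n, fill = factor(gear))) +
  geom_bar(stat = "identity")
```

One difference to be aware of: table keeps combinations that never occur (with a count of 0), while count only returns the combinations present in the data.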

See this very useful link: http://www.r-bloggers.com/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/

&

http://www.stat.wisc.edu/~larget/stat302/chap2.pdf

Histograms

This is a cool command to view a histogram of diamond prices across multiple dimensions:
ggplot(data=diamonds, aes(x=price)) +
geom_histogram(aes(fill=cut)) +
scale_x_log10() +
scale_fill_brewer(type = 'qual') +
facet_wrap(~color)

facet_wrap creates a separate histogram for each color grade [e.g. D is the best, most colorless grade]; geom_histogram(aes(fill=cut)) adds another dimension by filling each histogram bar according to the diamond's 'cut'. So 'Fair' will be colored one way, 'Good' another, etc.


Sunday, February 7, 2016
