Taking Data Science from Udacity [specifically the Data Analysis with R]; I love ggplot!!
To convert a variable in a data frame to a factor variable, simply do this:
yo$id <- factor(yo$id) # Here id was an integer; it is now a factor variable
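Here is a small sketch of that conversion on a made-up data frame. The values below are hypothetical stand-ins for the course's yo dataset. It also shows a common gotcha when going back from a factor to numbers.

```r
# Toy stand-in for the course's `yo` data (values are invented for illustration)
yo <- data.frame(id = c(101L, 102L, 101L, 103L),
                 price = c(69, 58, 65, 69))

before <- class(yo$id)    # "integer"
yo$id  <- factor(yo$id)   # convert the column in place
after  <- class(yo$id)    # "factor"
lvls   <- levels(yo$id)   # one level per distinct id, stored as text

# Caution: as.integer() on a factor returns the internal level codes,
# not the original values; go through as.character() to recover them.
codes  <- as.integer(yo$id)
values <- as.integer(as.character(yo$id))
```

Treating id as a factor tells ggplot that it is a category rather than a continuous quantity, so it gets discrete colors and axis positions.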
Scaling data:
- scale_x_log10 : scales the x axis in log-10 units
- scale_x_continuous : takes a breaks=seq(0,20000,1000) argument [the x-axis ticks will go from 0 -> 20,000 in increments of a thousand]; also takes a limits=c(0,20000) argument
- scale_x_discrete : the equivalent for a categorical x axis
- coord_cartesian(ylim=c(0,1000)) : this doesn't remove points from the dataset when plotting; it just provides a zoom-in. Data points outside the window won't be shown but are kept for other layers and summary statistics to use
- quantile(diamonds$price, 0.9) : will find the value at the 90th percentile of diamonds$price
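- xlim, ylim : shorthand for setting the limits of an axis, e.g. xlim(0, 20000); unlike coord_cartesian, points outside the limits are dropped before any statistics are computed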
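To make the scaling options above concrete, here is a rough sketch using the diamonds dataset that ships with ggplot2. The bin widths and the variable names p90, p_zoom and p_log are my own choices, not from the course.

```r
library(ggplot2)  # also provides the diamonds dataset

# 90th percentile of price; unname() drops the "90%" label that quantile() attaches
p90 <- unname(quantile(diamonds$price, 0.9))

# Zoom in on the cheapest 90% of diamonds. coord_cartesian() only crops the view,
# so every diamond is still used when the histogram bins are computed.
p_zoom <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  coord_cartesian(xlim = c(0, p90))

# Same variable on a log-10 x axis, with no zooming
p_log <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30) +
  scale_x_log10()
```

Print p_zoom or p_log at the console to draw them. Swapping coord_cartesian(xlim = ...) for xlim(...) would throw away the expensive diamonds before binning instead of just hiding them.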
Examining data:
- summary(yo)
- unique(yo$price) # will list all the unique values that are in that variable
- length(unique(yo$price)) # will give a count of all the unique values
- table(yo$price) # will give a count of each unique value in the variable [similar to a tabular histogram]
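A quick sketch of what those commands give back, using a short made-up price vector instead of the real yo$price:

```r
# Hypothetical prices, standing in for yo$price
price <- c(69, 58, 65, 69, 58, 69)

stats    <- summary(price)        # min, quartiles, mean, max
uniq     <- unique(price)         # distinct values, in order of first appearance
n_unique <- length(unique(price)) # how many distinct values there are
counts   <- table(price)          # count of each distinct value, sorted by value
```

Here uniq is 69, 58, 65, while table() sorts its output by value, so counts lists 58, 65, 69 with counts 2, 1, 3.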
Transforming Data:
- Using dplyr & tidyr:
- pf.fc_by_age_gender.wide <- subset(pf.fc_by_age_gender[c('age', 'gender', 'median_friend_count')], !is.na(gender)) %>% spread(gender, median_friend_count) %>% mutate(ratio = female / male)
- the spread command spreads rows into columns; it is used to go from a long format to a wide format. The grouped data above is in long format: one row per age and gender, with the columns age, gender and median_friend_count, so each age is repeated once per gender. spread turns the gender values into columns, so after executing the above command you get one row per age with the columns age, female, male and ratio.
- yo$all.purchases <- yo$strawberry + yo$blueberry + yo$pina.colada + yo$plain + yo$mixed.berry
- yo <- transform(yo, all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)
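The long-to-wide step described above can be sketched on a tiny made-up table. The numbers are invented and the names long, wide and wide2 are mine. Note that newer tidyr versions mark spread as superseded by pivot_wider, so I show both.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format summary: one row per (age, gender) pair
long <- data.frame(
  age = c(13, 13, 14, 14),
  gender = c("female", "male", "female", "male"),
  median_friend_count = c(148, 55, 224, 92)
)

# spread(): the gender values become column names, the medians become the cells
wide <- long %>%
  spread(gender, median_friend_count) %>%
  mutate(ratio = female / male)

# Same reshape with pivot_wider() (needs tidyr 1.0.0 or later)
wide2 <- long %>%
  pivot_wider(names_from = gender, values_from = median_friend_count) %>%
  mutate(ratio = female / male)
```

Both results have one row per age and the columns age, female, male and ratio.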
Putting Charts side-by-side:
- install.packages('gridExtra')
- library(gridExtra)
- p1 <- ggplot(...)
- p2 <- ggplot(...)
- grid.arrange(p1,p2,ncol=2)
Histograms
Two ways:
(1) ggplot: ggplot(aes(price), data=yo) + geom_histogram(binwidth=10)
(2) qplot: qplot(x = price, data = yo, binwidth=10)
Both draw the same histogram; qplot is a quicker way to write the command [note that qplot is deprecated as of ggplot2 3.4.0, so the ggplot form is the safer habit]
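1. One of the exercises has us plot the median friend count by age using this command:
ggplot(aes(x=age, y=friend_count, color=gender),
data=subset(pf, !is.na(gender))) +
geom_line(stat='summary', fun=median) # ggplot2 versions before 3.3.0 call this argument fun.y
The above command uses stat = 'summary' with the median function, so ggplot computes the median of y at each x value [separately for each gender] while plotting.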
Another way to do it is by grouping the data using the dplyr package. Basically:
pf.fc_by_age_gender <- pf %>%
  group_by(age, gender) %>%
  summarise(mean_friend_count = mean(friend_count),
            median_friend_count = median(friend_count),
            n = n())
Then use:
ggplot(data=subset(pf.fc_by_age_gender,!is.na(gender)), aes(age, median_friend_count, color=gender)) + geom_line()
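To check that the dplyr route gives the numbers you expect, here is a sketch on a handful of invented rows. This pf is a toy stand-in for the course's pseudo-Facebook data, and by_age_gender is my own name.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for pf (values invented); one row has a missing gender
pf <- data.frame(
  age = c(13, 13, 13, 14, 14, 14, 13, 14),
  gender = c("female", "female", "male", "male", "female", "male", NA, "female"),
  friend_count = c(10, 30, 5, 7, 40, 9, 100, 60)
)

# Drop missing genders, then compute one median per (age, gender) group
by_age_gender <- pf %>%
  filter(!is.na(gender)) %>%
  group_by(age, gender) %>%
  summarise(median_friend_count = median(friend_count),
            n = n(),
            .groups = "drop")  # .groups needs dplyr 1.0.0 or later

# Plot the precomputed medians; no stat = 'summary' needed here
p <- ggplot(by_age_gender, aes(age, median_friend_count, color = gender)) +
  geom_line()
```

The advantage of this route is that the summary table sticks around, so you can inspect it or reuse it (for example to compute a female/male ratio) instead of having ggplot compute it on the fly.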
2. Cool way to add a horizontal line to your graph:
ggplot(data = pf.fc_by_age_gender.wide, aes(age,ratio)) +
geom_line() + geom_hline(yintercept = 1, linetype=2)
# The linetype parameter can take the values 0-6:
# 0 = blank, 1 = solid, 2 = dashed
# 3 = dotted, 4 = dotdash, 5 = longdash
# 6 = twodash
ScatterPlot Matrices:
Useful when you have a lot of different variables and you want to explore relationships all at once.
Use the GGally package; it needs to be installed first:
install.packages('GGally')
library(GGally)
theme_set(theme_minimal(20))
set.seed(1836)
pf_subset <- pf[,c(2:15)]
names(pf_subset)
# This command may take a long time. The axisLabels = 'internal' option puts each variable's label ON the plot rather than on the bottom and left-hand sides, which may make it easier to read.
ggpairs(pf_subset[sample.int(nrow(pf_subset), 1000), ], axisLabels = 'internal')
Plotting 2 categorical variables
1. Use the table command to get a frequency/contingency table: x <- table(data$g, data$h)
2. However, x will now be of class 'table'. You can verify this by doing: class(x)
3. To convert back to a data frame, do x <- as.data.frame(x)
4. As a byproduct, the conversion replaces the column names with generic ones, so you will have Var1, Var2, and so on, plus Freq for the counts.
5. You can rename these by doing: names(x)[1] <- 'your_column_name'
6. Now you can run a ggplot command on x, which is now a data frame, e.g., ggplot(x, aes(x=col_1, y=Freq, fill=col_2)) + geom_bar(stat='identity') [This will create a stacked bar chart; stat='identity' is needed because geom_bar() otherwise counts rows itself and will not accept a y aesthetic]
The above is a workaround when using the table command. Quicker and more useful is the count function from the plyr or dplyr package, which gives you a data frame directly.
See these very useful links:
http://www.r-bloggers.com/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/
http://www.stat.wisc.edu/~larget/stat302/chap2.pdf
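Putting the steps for two categorical variables together, here is a sketch on a tiny invented data frame. The columns g and h and their values are hypothetical, and I use dplyr's count for the shortcut.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data with two categorical variables
data <- data.frame(
  g = c("a", "a", "b", "b", "b"),
  h = c("x", "y", "x", "x", "y")
)

# Route 1: table -> data frame -> rename
x <- table(data$g, data$h)
x_class <- class(x)              # "table"
x <- as.data.frame(x)            # columns are now Var1, Var2, Freq
names(x)[1:2] <- c("g", "h")

# Naming the table dimensions up front avoids the rename step
x2 <- as.data.frame(table(g = data$g, h = data$h))

# Route 2: dplyr::count() returns a data frame directly (count column is n)
x3 <- count(data, g, h)

# Stacked bar chart from the precomputed counts
p <- ggplot(x, aes(x = g, y = Freq, fill = h)) +
  geom_bar(stat = "identity")
```

One difference to keep in mind: table() keeps combinations with a count of zero, while count() only returns combinations that actually occur in the data.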
Histograms
This is a cool command to view histogram of diamond prices by multiple dimensions:
ggplot(data=diamonds, aes(x=price)) +
geom_histogram(aes(fill=cut)) +
scale_x_log10() +
scale_fill_brewer(type = 'qual') +
facet_wrap(~color)
facet_wrap creates a separate histogram for each diamond color [D = best (colorless), J = worst]; geom_histogram(aes(fill=cut)) is a way of looking at another dimension: each histogram bar is filled according to the diamond's 'cut', so 'Fair' will be colored one way, 'Good' another, etc.