The boing boing candy dataset contains the results of a Halloween candy survey. The data covers three years (2015, 2016 and 2017) and includes information on the person completing the survey, such as age, gender and country. The ratings used were Joy, Despair, Meh or NA for the 2016 and 2017 data; Meh was not an option in 2015. The 2015 data also lacked information on the gender and country of the person completing the survey. The original csv file required extensive cleaning and the main steps conducted are summarised below in the cleaning section.
Assumptions have been made which are specific to the questions asked. These are detailed below as part of the relevant answers.
When I refer to “candy columns” I mean any column which contains the Joy, Despair, Meh rating. “Non-candy columns” are year, age, gender, country and going out.
A lot of columns were removed, these were columns which weren’t required to answer the analysis questions.
For full details see cleaning script comments. A summary of the main cleaning steps is provided here:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(here)
## here() starts at /Users/fionacarson/Documents/codeclan_work/dirty_data_codeclan_project_fiona_carson/task_4_candy
<- read_csv(here("clean_data/clean_candy_data.csv")) candy
## Rows: 9349 Columns: 123
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (121): going_out, gender, country, abstained_from_m_and_m_ing, anonymous...
## dbl (2): year, age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What is the total number of candy ratings given across the three years?
%>%
candy summarise(across(
.cols = 6:123,
.fns = ~sum(!is.na(.x)))) %>%
rowSums()
## [1] 772352
# Code to check answer
# sum(is.na(candy[6:123]))
There were 772,352 candy ratings given across the three years.
To check this number the number of NAs was determined and subtracted from the total number of entries.
118 candy columns * 9349 rows = 1,103,182 entries
Missing values (NAs) = 330830
1,103,183 - 330,830 = 772,352 which is our original answer!
What was the average age of people who are going out trick or treating?
%>%
candy filter(going_out == "Yes") %>%
summarise(avg_age_trick_or_treaters = round(mean(age, na.rm = TRUE)))
The average age of trick or treaters is 35 which seems rather high. Perhaps this is parents who are going out with their children.
What was the average age of people who are not going out trick or treating?
%>%
candy filter(going_out == "No") %>%
summarise(avg_age_trick_or_treaters = round(mean(age, na.rm = TRUE)))
The average age of people not going out trick or treating was 39. I can’t think of anything interesting to say about this answer.
For each of joy, despair and meh, which candy bar received the most of these ratings
Assumption - this means which candy received highest number of joy ratings, the highest number of meh ratings etc.
<- function(JOY_MEH_DESPAIR) {
highest_rating <- candy %>%
rating_results pivot_longer(cols = 6:123, names_to = "candy", values_to = "rating") %>%
group_by(candy) %>%
filter(rating == JOY_MEH_DESPAIR) %>%
summarise(total_ratings = n()) %>%
filter(total_ratings == max(total_ratings))
<- rating_results[1,1]
highest_candy <- rating_results[1,2]
num_ratings
paste0("The candy with the most ", JOY_MEH_DESPAIR, " ratings is ", rating_results$candy,
" with a count of ", rating_results$total_ratings, ".")
}
highest_rating("JOY")
## [1] "The candy with the most JOY ratings is any_full_sized_candy_bar with a count of 7589."
highest_rating("MEH")
## [1] "The candy with the most MEH ratings is lollipops with a count of 1570."
highest_rating("DESPAIR")
## [1] "The candy with the most DESPAIR ratings is broken_glow_stick with a count of 7905."
I wasted an extremely long time on this question by trying to answer it without pivoting :(
Any full sized candy bar brings the most joy, while broken glow stick bring the most despair which isn’t that surprising. Lollipops get the highest meh rating which was a little surprising, I would have thought they would have been more popular.
How many people rated Starburst as despair?
%>%
candy filter(starburst == "DESPAIR") %>%
summarise(starburst_as_despair = n())
Personally I am not a fan of starburst (or Opal Fruits as they were when I was a kid) so I feel this number should be higher!
I realise now I could have very easily answered this question by using pivot_longer but I love my function and it took me ages so I’ve left it in. It also displays all the results together in a nice dataframe. The code to answer this question using pivot_longer can be found in the Extra Analysis section at the end of the document.
# Converting JOY to 1, DESPAIR to -1 and MEH to O
# These JOY, DESPAIR, MEH strings won't be identified in the non-candy columns
# (e.g. if "meh" was part of a country name) as they are capitalised and the
# other columns are lower or sentence case.
<- candy %>%
candy_num_ratings mutate_all(~str_replace(., "JOY", "1")) %>%
mutate_all(~str_replace(., "DESPAIR", "-1")) %>%
mutate_all(~str_replace(., "MEH", "0")) %>%
mutate(across((6:123), as.numeric))
# The function below filters by gender, then sums each of the columns
# The results are converted into a dataframe, transposed and filtered to reveal
# the top scoring candy.
<- function(dataset, sex){
high_score_by_gender # sex options are: female, male, i'd rather not say, other
<- dataset %>%
gender_results filter(gender == sex) %>%
summarise(across(
.cols = 6:123,
.fns = ~ sum(., na.rm = TRUE)
)
)# Tried to keep the pipe going but couldn't get the as.data.frame(t()) function
# to work so left the above and below code separate
<- as.data.frame(t(gender_results)) %>%
gender_results rownames_to_column() %>%
cbind(sex) %>%
rename(score = V1, sweet = rowname, gender = sex) %>%
relocate(gender, .before = sweet) %>%
filter(score == max(score))
gender_results
}
# Running function on each of the genders and binding results together for output
rbind(high_score_by_gender(candy_num_ratings, "female"),
high_score_by_gender(candy_num_ratings, "male"),
high_score_by_gender(candy_num_ratings, "i'd rather not say"),
high_score_by_gender(candy_num_ratings, "other"))
As for the results, nothing interesting here, full sized candy bars popular across all genders.
What was the most popular candy bar in each year?
I adapted my beautiful function to work with year, again pivot_longer would have provided a much quicker and easier answer.
<- function(dataset, yr){
high_score_by_year # year options are: 2015, 2016 and 2017
<- dataset %>%
year_results filter(year == yr) %>%
summarise(across(
.cols = 6:123,
.fns = ~ sum(., na.rm = TRUE)
)
)# Tried to keep the pipe going but couldn't get the as.data.frame(t()) function
# to work so left the above and below code separate
<- as.data.frame(t(year_results)) %>%
year_results rownames_to_column() %>%
cbind(yr) %>%
rename(score = V1, sweet = rowname, year = yr) %>%
relocate(year, .before = sweet) %>%
filter(score == max(score))
year_results
}
# Running function on each of the 3 years and binding results together for output
rbind(high_score_by_year(candy_num_ratings, 2015),
high_score_by_year(candy_num_ratings, 2016),
high_score_by_year(candy_num_ratings, 2017))
Full sized candy bars proving popular across the years.
What was the most popular candy bar by this rating for people in US, Canada, UK, and all other countries
<- function(dataset, cntry){
high_score_by_country <- dataset %>%
country_results filter(country == cntry) %>%
summarise(across(
.cols = 6:123,
.fns = ~ sum(., na.rm = TRUE)
)
)# Tried to keep the pipe going but couldn't get the as.data.frame(t()) function
# to work so left the above and below code separate
<- as.data.frame(t(country_results)) %>%
country_results rownames_to_column() %>%
cbind(cntry) %>%
rename(score = V1, sweet = rowname, country = cntry) %>%
relocate(country, .before = sweet) %>%
filter(score == max(score))
country_results
}
# Running function on USA, Canada and UK and binding results together for output
rbind(high_score_by_country(candy_num_ratings, "usa"),
high_score_by_country(candy_num_ratings, "canada"),
high_score_by_country(candy_num_ratings, "uk"))
In the UK cash rules, here cash is rated higher than the full sized candy bar.
I spent a very, very long time try to get the function to take in a vector of countries as an argument. I added an if else statement at the cbind() line which worked on length of the cntry argument but ran into errors. I understood why I was getting the error but didn’t manage to fix it.
The code for “all other countries” is below.
# Creating a dataframe of countries other than US, UK and Canada
<- candy %>%
all_other_countries_df filter(!country %in% c("usa", "canada", "uk")) %>%
filter(!is.na(country))
# Extracting countries into a vector
<- all_other_countries_df$country
all_other_countries_vec
<- function(dataset, cntry){
high_score_by_multiple_countries <- dataset %>%
country_results filter(country == cntry) %>%
summarise(across(
.cols = 6:123,
.fns = ~ sum(., na.rm = TRUE)
)
)# Tried to keep the pipe going but couldn't get the as.data.frame(t()) function
# to work so left the above and below code separate
<- as.data.frame(t(country_results)) %>%
country_results rownames_to_column() %>%
rename(score = V1, sweet = rowname) %>%
filter(score == max(score))
country_results
}
high_score_by_multiple_countries(candy_num_ratings, all_other_countries_vec)
## Warning in country == cntry: longer object length is not a multiple of shorter
## object length
There is a draw for the most popular candy in all other countries. These numbers seem low but there is only 105 observations for countries other than USA, UK and Canada (not counting NAs). A manual check of the data shows these numbers are accurate.
For completeness I adapted the function slightly to determine the top rated candy for when country is NA. The results were surprising!
<- function(dataset){
high_score_by_NA_country <- dataset %>%
country_results filter(is.na(country)) %>%
summarise(across(
.cols = 6:123,
.fns = ~ sum(., na.rm = TRUE)
)
)# Tried to keep the pipe going but couldn't get the as.data.frame(t()) function
# to work so left the above and below code separate
<- as.data.frame(t(country_results)) %>%
country_results rownames_to_column() %>%
rename(score = V1, sweet = rowname) %>%
filter(score == max(score))
country_results
}
high_score_by_NA_country(candy_num_ratings)
Just kidding - no surprises here. Full sized candy bars are top again.
Question 6 What was the most popular candy bar by the -1, 0, 1 rating system for each gender in the dataset?
This is the code for answering question 6 by using pivot_longer. Could put this code in a function with column (e.g. country, year or gender) and desired group (e.g. usa, female, 2016) as the arguments and they use it to answer question 6 to 8 but I am done with functions for now.
# These JOY, DESPAIR, MEH strings won't be identified in the non-candy columns
# (e.g. if "meh" was part of a country name) as they are capitalised and the
# other columns are lower or sentence case.
<- candy %>%
candy_num_ratings mutate_all(~str_replace(., "JOY", "1")) %>%
mutate_all(~str_replace(., "DESPAIR", "-1")) %>%
mutate_all(~str_replace(., "MEH", "0")) %>%
mutate(across((6:123), as.numeric))
%>%
candy_num_ratings pivot_longer(cols = 6:123, names_to = "candy", values_to = "rating") %>%
filter(gender == "female") %>%
group_by(candy) %>%
summarise(total_ratings = sum(rating, na.rm = TRUE)) %>%
filter(total_ratings == max(total_ratings))
Investigating top 3 candy’s by country
I thought it would be interesting to see the top 3 rated candies to see if full size candy bars rated highly in UK and whether cash rated highly in UK or Canada.
<- function(dataset, cntry){
top_3_by_country <- dataset %>%
top3_country filter(country == cntry) %>%
summarise(across(
.cols = 6:123,
.fns = ~ sum(., na.rm = TRUE)
)
)# Tried to keep the pipe going but couldn't get the as.data.frame(t()) function
# to work so left the above and below code separate
<- as.data.frame(t(top3_country)) %>%
top3_country rownames_to_column() %>%
cbind(cntry) %>%
rename(score = V1, sweet = rowname, country = cntry) %>%
relocate(country, .before = sweet) %>%
arrange(desc(score)) %>%
head(3)
top3_country
}
rbind(top_3_by_country(candy_num_ratings, "usa"),
top_3_by_country(candy_num_ratings, "canada"),
top_3_by_country(candy_num_ratings, "uk"))
# Would be nice to add a rank but I'm done