The decathlon dataset contains the results from each of the 10 events in the decathlon. The data covers two competitions - the 2004 Olympic Games and the 2004 Decastar competition. The data also includes the place the athlete finished (“rank”) and their total points. Out of the 13 athletes in the Decastar competition only 3 did not compete in the Olympics that year. The Olympic field consisted of 28 athletes.
Assumptions have been made which are specific to the questions asked. These are detailed below as part of the relevant answers.
The dataset was cleaned using the “decathlon_cleaning_script.R” file. The main steps were:
As part of the cleaning the data was checked for NAs, NULLs and values outwith expected ranges. None were found, the dataset is complete.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(here)
## here() starts at /Users/fionacarson/Documents/codeclan_work/dirty_data_codeclan_project_fiona_carson/task_1_decathlon
::here() here
## [1] "/Users/fionacarson/Documents/codeclan_work/dirty_data_codeclan_project_fiona_carson/task_1_decathlon"
<- read_csv(here("clean_data/clean_decathlon_data.csv")) decathlon
## Rows: 41 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): athlete, competition
## dbl (12): x100m, long_jump, shot_put, high_jump, x400m, x110m_hurdle, discus...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Who had the longest long jump seen in the data?
%>%
decathlon filter(long_jump == max(long_jump))
Bryan Clay from the US had the longest result for the long jump from the data. This jump was at the 2004 Olympic Games and was a personal best at the time. Clay went on to win the silver medal in this competition.
What was the average 100m time in each competition?
%>%
decathlon group_by(competition) %>%
summarise(avg_100m_time = round(mean(x100m), 2))
The Olympic Games has a faster average time for the 100m than Decastar. I don’t think this is a surprising result. The Olympics Games is a higher profile competition and athletes design four year training programs around the Olympic schedule to ensure they peak for this competition.
Who had the highest total points across both competitions?
Assumption - highest total points across both competitions means the points are summed for each athlete (that competed in both competitions) and the athlete with the highest points identified.
%>%
decathlon group_by(athlete) %>%
summarise(total_points = sum(points)) %>%
filter(total_points == max(total_points))
Roman Sebrle, an athlete from the Czech Republic, has the highest total points across both competitions. He won the 2004 Olympics and the 2004 Decastar event so would have the highest points total for each event individually as well.
What was the shot-put scores for the top three competitors in each competition?
Assumption - the question means the shot-put distance since it is not possible to calculate the “shot-put scores” from this data.
%>%
decathlon # Slicing the 6 lowest numbers in rank column to catch the 1, 2, 3 ranked
# positions in each competition. There are no ties so don't need to worry about
# that.
slice_min(rank, n = 6) %>%
select(athlete, shot_put, rank, competition) %>%
arrange(competition, rank)
The first, second and third places in both competitions were taken by the same men in the same order. The shot-put distances in the Olympics are all higher than those thrown in the Decastar competition.
What was the average points for competitors who ran the 400m in less than 50 seconds vs. those than ran 400m in more than 50 seconds?
%>%
decathlon mutate(x400m_speed = ifelse(x400m > 50, "more than 50 sec",
"less than 50 sec")) %>%
group_by(x400m_speed) %>%
summarise(average_points = round(mean(points), 0))
The athletes who are slower at the 400 m have a lower points average than those who are faster. This isn’t surprising as those who are slower at the 400 m are more likely to be slower at the other running events which will affect their overall points total.