This project aims to clean up and analyse the data set of Ph. D. students salaries by universities and departments over time. I performed the analysis with R and the tidyverse libraries. For the data cleaning, I excluded the variables containing a majority of missing values, combined similar departments, and separated the universities per location. The significant highlights from the analysis are:
The majority of responders are from the USA, data collection from non-USA universities started in 2013
The students stipend and living wages are higher in the USA.
The stipends do not increase with experience
The Ph.D. stipends had a significant decrease around the 2008 crisis.
The Ph.D. stipends are not equal between departments.
Method
The Data set, version 7, was downloaded on Kaggle on the 2020-07-12.
Questions
What is the size of the data set, and was any data excluded?
What data cleaning steps did you perform and why?
Are there any outliers that should be treated separately? And how did you treat them?
What questions did you ask and what interesting relationships, subsets, or clusters can you find?
Data Cleaning
initial_obs <- nrow(df)
glue("This datasets contain: {ncol(df)} variables and {nrow(df)} observations")
This datasets contain: 11 variables and 9273 observations.
Rename variables
df <- df %>%
rename(Overall_Pay = `Overall Pay`,
LW_Ratio = `LW Ratio`,
Academic_Year = `Academic Year`,
Program_Year = `Program Year`,
Gross_Pay_12M = `12 M Gross Pay`,
Gross_Pay_9M = `9 M Gross Pay`,
Gross_Pay_3M = `3 M Gross Pay`
)
Identify missing values
plot_missing(df)
The variables with a majority of missing values are excluded from the rest of the analysis.
df <- df %>% select(-c('Fees', 'Gross_Pay_9M', 'Comments', 'Gross_Pay_3M'))
Convert Salary to numeric
# See https://stackoverflow.com/questions/10294284/remove-all-special-characters-from-a-string-in-r
df$Overall_Pay <- str_replace_all(df$Overall_Pay, "[[:punct:]]", "")
df$Overall_Pay <- str_replace_all(df$Overall_Pay, "[[$]]", "")
df$Overall_Pay <- as.numeric(df$Overall_Pay)
df$Gross_Pay_12M <- str_replace_all(df$Gross_Pay_12M, "[[:punct:]]", "")
df$Gross_Pay_12M <- str_replace_all(df$Gross_Pay_12M, "[[$]]", "")
df$Gross_Pay_12M <- as.numeric(df$Gross_Pay_12M)
p1 <- ggplot(df) +
aes(x = Overall_Pay)+
geom_density(fill = "#E69F00", alpha = 0.7)
p2 <- ggplot(df) +
aes(x = LW_Ratio)+
geom_density(fill = "#E69F00", alpha = 0.7)
p1 / p2
After research PhD salaries in the US are between $15k and 35k. Therefore salaries above $50K are likely entries errors. Salaries under $15K could be legitimates cases and will be examined separately.
Convert Program Year to numeric
# This variable could be considered both as categorical and continous
# I will create 2 variables for data analysis and visualisation
df$Program_Year <- str_replace_all(df$Program_Year, "1st", "1")
df$Program_Year <- str_replace_all(df$Program_Year, "2nd", "2")
df$Program_Year <- str_replace_all(df$Program_Year, "3rd", "3")
df$Program_Year <- str_replace_all(df$Program_Year, "4th", "4")
df$Program_Year <- str_replace_all(df$Program_Year, "5th", "5")
df$Program_Year <- str_replace_all(df$Program_Year, "6th and up", "6")
df$Seniority_Year <- as.numeric(df$Program_Year)
Convert Academic Years to factor
# https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
df$Academic_Year <- str_replace_all(df$Academic_Year, "-....", "")
df$Academic_Year <- as.Date(df$Academic_Year, "%Y")
# After conversion all year start on the 07-12 need to change for first September
df$Academic_Year <- as.character(df$Academic_Year)
df$Academic_Year <- str_replace_all(df$Academic_Year, "-07-18", "-01-09")
df$Academic_Year <- as.Date(df$Academic_Year)
Universities locations
By manual inspection we can identify a subset of non-american universities which represent the majority of this data set. In order to compare US and non US salaries I created an excel file to differentiate salaries by locations.
# Export universities names
university_names <- data.frame(unique(df$University))
# Order the names by alphabetic order
university_names <- university_names %>%
arrange(unique.df.University.)
# Number of Universities
nrow(university_names)
file_path <- file.path(output_directory,"universities_names.csv")
write_csv(university_names, file_path)
# Add us - non-us category
univ_locations <- read_xlsx(path = "Data/univ_location.xlsx", sheet=3)
df <- inner_join(df, univ_locations, by="University")
Edit departments names
Several identical departments have different spellings in the data set. Using stringr I corrected these difference in order to identify the most common department and the associated salaries.
df$department_cleaned <- df$Department
# https://stringr.tidyverse.org/articles/regular-expressions.html
# Convert to lower case
df$department_cleaned <- tolower(df$department_cleaned)
# n/a to NA
df <- df %>%
filter(!is.na(department_cleaned)) %>%
filter(department_cleaned != "n/a")
# Group all bioscience related department to biology
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}bio.{0,}", "biology")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}cell.{0,}", "biology")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}molecular.{0,}", "biology")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}neuro.{0,}", "biology")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}genetics.{0,}", "biology")
# Group all aerospace department
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}aero.{0,}", "aerospace")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}space.{0,}", "aerospace")
# Group all health departments
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}health.{0,}", "health")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}medical.{0,}", "health")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}clinical.{0,}", "health")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}pharmacy.{0,}", "health")
# Group all chemistry dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}chemical.{0,}", "chemistry")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}chemistry.{0,}", "chemistry")
# Group all Computer Science dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}comput.{0,}", "computer science")
df$department_cleaned <- str_replace_all(df$department_cleaned, "^computational.{0,}", "computer science")
# Group all Psychology dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}psych.{0,}", "psychology")
# Group all comm. dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}communication.{0,}", "communication")
# Group all physic departments
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}physics.{0,}", "physics")
# Group all agriculture dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}agri.{0,}", "agriculture")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}crop.{0,}", "agriculture")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}plant.{0,}", "agriculture")
# Group all engineering dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}engineering.{0,}", "engineering")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}mechanical.{0,}", "engineering")
# Group all business dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}business.{0,}", "business")
# Group all math dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}math.{0,}", "mathematics")
# Group all criminology dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}crim.{0,}", "criminology")
# Group all geoscience dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}earth.{0,}", "geoscience")
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}geo.{0,}", "geoscience")
# Group all english dept
df$department_cleaned <- str_replace_all(df$department_cleaned, ".{0,}english.{0,}", "english")
# List department with at least 50 students
top_dept <- df %>%
group_by(department_cleaned) %>%
count() %>%
arrange(desc(n)) %>%
filter(n >= 50) %>%
pull(department_cleaned)
Filter observation
Excludes observations with:
missing Overall Pay
Overall Pay above $50K which are likely entries errors
missing LW ratio
missing Universities
missing Program Year
df <- df %>%
filter(!(is.na(Overall_Pay))) %>%
filter(Overall_Pay <= 50000) %>%
filter(!is.na(LW_Ratio)) %>%
filter(!(is.na(Program_Year)))
Data Visualisation
Compare salaries and living wage ratio by locations
p1 <- df %>%
filter(Overall_Pay > 15000) %>%
ggplot() +
aes(x = location, y = Overall_Pay, fill = location) +
geom_boxplot(varwidth = TRUE, notch = TRUE) +
scale_fill_manual(values=c("us" = "#CC6666", "non_us" = "#9999CC"))
p2 <- df %>%
filter(Overall_Pay > 15000) %>%
ggplot() +
aes(x = location, y = LW_Ratio, fill = location) +
geom_boxplot(varwidth = TRUE, notch = TRUE) +
scale_fill_manual(values=c("us" = "#CC6666", "non_us" = "#9999CC"))
p1 | p2
We can see on this figure that:
a majority of observation are from US universities
the overall pay and living wage ratios are significantly higher in US universities
How many student are paid less than $15K are they located outside the USA?
# https://stackoverflow.com/questions/26553526/how-to-add-frequency-count-labels-to-the-bars-in-a-bar-graph-using-ggplot2
p1 <- df %>%
filter(Overall_Pay <= 15000) %>%
group_by(location) %>%
ggplot() +
aes(x = location, fill = location) +
geom_bar(width = 0.5) +
scale_fill_manual(values=c("us" = "#CC6666", "non_us" = "#9999CC")) +
geom_text(stat='count', aes(label=..count..), vjust=-0.2) +
ggtitle("student under $15K")
p2 <- df %>%
filter(Overall_Pay > 15000) %>%
group_by(location) %>%
ggplot() +
aes(x = location, fill = location) +
geom_bar(width = 0.5) +
scale_fill_manual(values=c("us" = "#CC6666", "non_us" = "#9999CC")) +
geom_text(stat='count', aes(label=..count..), vjust=-0.2) +
ggtitle("student over $15K")
p1 | p2
Salary per seniority
p1 <- df %>%
filter(Overall_Pay > 15000) %>%
ggplot() +
aes(x = Program_Year, fill = Program_Year) +
geom_bar() +
scale_fill_manual(values = palette)
p2 <- df %>%
filter(Overall_Pay > 15000) %>%
mutate(log_Overall_Pay = log10(Overall_Pay)) %>%
ggplot() +
aes(x = Program_Year, y = log_Overall_Pay, fill = Program_Year) +
geom_boxplot(notch = TRUE) +
scale_fill_manual(values = palette) +
guides(fill=FALSE)
p3 <- df %>%
filter(Overall_Pay > 15000) %>%
ggplot() +
aes(x = Program_Year, y = LW_Ratio, fill = Program_Year) +
geom_boxplot(notch = TRUE) +
scale_fill_manual(values = palette) +
guides(fill=FALSE)
(p1 | plot_spacer()) / (p2 | p3)
We can see on this figure that:
the majority of respondent are in their first year.
the program year is not linked to the Overall Pay or to the Living Wage ratios.
Pay and Living wage ration over time
p1 <- df %>%
filter(!(is.na(Overall_Pay))) %>%
filter(location %in% c('us', 'non_us')) %>%
filter(Overall_Pay > 15000) %>%
filter(!(is.na(Academic_Year))) %>%
group_by(Academic_Year, location) %>%
summarise(mean_Overall_Pay = mean(Overall_Pay)) %>%
ggplot() +
aes(x=Academic_Year, y=mean_Overall_Pay, group=location, colour=location ) +
scale_color_manual(values=c("us" = "#CC6666", "non_us" = "#9999CC")) +
geom_line(size=1) +
geom_vline(xintercept = as.Date("2013-01-09"), linetype="dotted", color = "black", size=0.5) +
geom_vline(xintercept = as.Date("2015-01-09"), linetype="dotted", color = "black", size=0.5)
p2 <- df %>%
filter(!(is.na(Overall_Pay))) %>%
filter(location %in% c('us', 'non_us')) %>%
filter(Overall_Pay > 15000) %>%
filter(!(is.na(Academic_Year))) %>%
group_by(Academic_Year, location) %>%
summarise(mean_LW_Ratio = mean(LW_Ratio)) %>%
ggplot() +
aes(x=Academic_Year, y=mean_LW_Ratio, group=location, colour=location ) +
scale_color_manual(values=c("us" = "#CC6666", "non_us" = "#9999CC")) +
geom_line(size=1) +
geom_vline(xintercept = as.Date("2013-01-09"), linetype="dotted", color = "black", size=0.5) +
geom_vline(xintercept = as.Date("2015-01-09"), linetype="dotted", color = "black", size=0.5)
p1 / p2
We can see on this figure that:
non us universities were included in the data set after 2013
The mean Overall Pay have been increased over time
The living wage ratios however have decreased internationally after 2014
Overall Pay and Living wage by department
df$department_cleaned <- as.factor(df$department_cleaned)
# US univp1 <- df %>%
filter(department_cleaned %in% top_dept) %>%
filter(location == 'us') %>%
filter(Overall_Pay > 15000) %>%
group_by(department_cleaned) %>%
summarise(mean_Overall_Pay = mean(Overall_Pay)) %>%
mutate(department_cleaned = fct_reorder(department_cleaned , mean_Overall_Pay)) %>%
ggplot() +
aes(x = department_cleaned, y = mean_Overall_Pay, fill = location) +
coord_flip() +
geom_bar(stat = "identity", fill = "#CC6666")
p2 <- df %>%
filter(department_cleaned %in% top_dept) %>%
filter(location == 'us') %>%
filter(Overall_Pay > 15000) %>%
group_by(department_cleaned) %>%
summarise(mean_LW_Ratio = mean(LW_Ratio)) %>%
mutate(department_cleaned = fct_reorder(department_cleaned , mean_LW_Ratio)) %>%
ggplot() +
aes(y = department_cleaned, x = mean_LW_Ratio, fill = location) +
geom_bar(stat = "identity", fill = "#CC6666")
(p1 | p2) + plot_annotation(title = 'Pay by department in US universities')
# non us univ
p1 <- df %>%
filter(department_cleaned %in% top_dept) %>%
filter(!(is.na(Overall_Pay))) %>%
filter(location == 'non_us') %>%
filter(Overall_Pay > 15000) %>%
group_by(department_cleaned) %>%
summarise(mean_Overall_Pay = mean(Overall_Pay)) %>%
mutate(department_cleaned = fct_reorder(department_cleaned , mean_Overall_Pay)) %>%
ggplot() +
aes(x = department_cleaned, y = mean_Overall_Pay, fill = location) +
coord_flip() +
geom_bar(stat = "identity", fill = "#9999CC")
p2 <- df %>%
filter(department_cleaned %in% top_dept) %>%
filter(!(is.na(Overall_Pay))) %>%
filter(location == 'non_us') %>%
filter(Overall_Pay > 15000) %>%
group_by(department_cleaned) %>%
summarise(mean_LW_Ratio = mean(LW_Ratio)) %>%
mutate(department_cleaned = fct_reorder(department_cleaned , mean_LW_Ratio)) %>%
ggplot() +
aes(y = department_cleaned, x = mean_LW_Ratio, fill = location) +
geom_bar(stat = "identity", fill = "#9999CC")
(p1 | p2) + plot_annotation(title = 'Pay by department in non-US universities')
We can see on this figure that:
The difference in pay by department are slight in the USA.
English student are the lowest paid in the USA.
Business students are the best paid in the USA.
The pay differences are more marked in non-US universities.
Healthcare students are the best paid in non-US universities.
Communication students are the least paid in non-US universities.
Conclusions
Data collection improvement
This data set required extensive data cleaning on salaries, universities and department names. The quality of the analysis could be improved by improving the questionnaires filled by students to ensure that:
Salaries are entered as numbers.
Department choices are entered in an drop down menu.
More international data are collected.
Principal findings
The Overall Pay and Living Wage ratio is higher in the USA.
PhD students Living Wage ratios have decreased internationally since 2013.
Living Wages ratios started to increase again after 2015.
Business students are the best paid in the USA.
English students have the lowest paid in the USA.
Future studies
The following areas could be explored further:
compare LW ratio to professional salaries for each given department.
compare LW ratio to average student debt by countries.
compare salaries to inflation in the USA.
Comments