We will be looking at college admissions data from r/ApplyingToCollege (aka r/A2C). All data here is self-reported from the class of 2018. This topic is relevant to me, being only one year removed from this process. Keep in mind that the data here is incomplete: it lacks information (or has only partial information) on important application components like extracurricular activities, essays, and letters of recommendation. Additionally, the data from r/A2C is extremely skewed, as the students on such a forum likely take college admissions very seriously. Furthermore, it has been hypothesized that those with higher standardized test scores and GPAs (collectively “stats”) are more likely to respond. For reference, a 99th percentile standardized test score (national) might be around the 50th percentile for survey respondents. See this for SAT percentile data (2017, refer to “SAT user”, page 5) and this for ACT percentile data. As this process has drastically changed within the past decade, check out r/A2C for more information on college admissions.
Keep in mind the data is also noisy: the survey could be filled out by anyone, and respondents are not held accountable for the accuracy of their answers. Case in point: one entry claims a 4.0 GPA but an SAT score of just above 400 (the minimum).
Note: I elected to only use class of 2018 data to avoid mixing “old” (/2400) and “new” (/1600) SAT scores.
Note: You can download the dataset here.
This sets up the appropriate libraries and reads in our dataset.
suppressMessages(library(data.table))
library(tidyr)
suppressMessages(library(dplyr))
library(ggplot2)
library(tibble)
df <- read.csv("a2c.csv")
Let’s take a look at the raw data.
as_tibble(df)
## # A tibble: 1,326 x 118
## Timestamp Gender Are.you.Hispanic… Race..check.all.… Are.you.a.first.…
## <fct> <fct> <fct> <fct> <fct>
## 1 4/1/2018 … Male No White No
## 2 4/1/2018 … Male No Asian No
## 3 4/1/2018 … Male No Asian No
## 4 4/1/2018 … Female No White Yes
## 5 4/1/2018 … Female No White No
## 6 4/1/2018 … Female No Asian No
## 7 4/1/2018 … Female No White No
## 8 4/1/2018 … Male No Asian No
## 9 4/1/2018 … Male No Asian No
## 10 4/1/2018 … Male No White No
## # ... with 1,316 more rows, and 113 more variables:
## # What.kind.of.school.do.you.attend. <fct>,
## # Did.you.apply.for.Questbridge. <fct>,
## # Did.you.apply.for.financial.aid..i.e...FAFSA..CSS.Profile.. <fct>,
## # Are.you.a.legacy.applicant. <fct>,
## # If.you.are.an.admitted.legacy.applicant..what.school.s..did.you.apply.to.. <fct>,
## # If.you.are.a.waitlisted.legacy.applicant..what.school.s..did.you.apply.to.. <fct>,
## # If.you.are.a.rejected.legacy.applicant..what.school.s..did.you.apply.to.. <fct>,
## # ACT.Composite..one.sitting..if.applicable. <int>,
## # ACT.Composite..superscored..if.applicable. <int>,
## # SAT.Composite..one.sitting..if.applicable. <int>,
## # SAT.Composite..superscored..if.applicable. <int>,
## # Did.you.choose.to.NOT.send.your.ACT.SAT.scores.to.any.test.optional.school.s.. <fct>,
## # Unweighted.GPA <dbl>, Weighted.GPA..if.applicable. <fct>,
## # How.many.Advanced.Placement..AP..courses.did.you.take. <fct>,
## # How.many.Dual.Enrollment..DE..courses.did.you.take. <fct>,
## # How.many.International.Baccalaureate..IB..courses.did.you.take. <fct>,
## # How.many.schools.did.you.apply.to. <dbl>,
## # How.many.of.these.schools.did.you.consider..safeties.. <int>,
## # How.many.of.these.schools.did.you.consider..matches.. <int>,
## # How.many.of.these.schools.did.you.consider..reaches.. <int>,
## # How.many.schools.accepted.you. <int>,
## # How.many.schools.rejected.you. <int>,
## # How.many.schools.waitlisted.you. <int>,
## # If.you.are.waiting.on.schools..list.the.number.that.have.not.released.below. <int>,
## # How.many.ECs.did.you.list.on.your.Common.App...if.applicable. <int>,
## # Finally..what.school.will.you.be.attending.next.year.. <fct>,
## # American.University <fct>, Amherst.College <fct>,
## # Babson.College <fct>, Barnard.College <fct>, Boston.College <fct>,
## # Boston.University <fct>, Bowdoin.College <fct>,
## # Brandeis.University <fct>, Brown.University <fct>,
## # California.Institute.of.Technology..Caltech. <fct>,
## # California.Polytechnic.State.University..Cal.Poly.SLO. <fct>,
## # Carleton.College <fct>, Carnegie.Mellon.University <fct>,
## # Case.Western.Reserve.University <fct>,
## # Claremont.McKenna.College <fct>, Colby.College <fct>,
## # Colgate.University <fct>, Columbia.University <fct>,
## # Cornell.University <fct>, Dartmouth.College <fct>,
## # Davidson.College <fct>, Duke.University <fct>, Emory.University <fct>,
## # Franklin...Marshall.College <fct>, Fordham.University <fct>,
## # Georgetown.University <fct>,
## # Georgia.Institute.of.Technology..Georgia.Tech. <fct>,
## # Grinnell.College <fct>, Hamilton.College <fct>,
## # Harvard.University <fct>, Harvey.Mudd.College <fct>,
## # Johns.Hopkins.University <fct>, Kenyon.College <fct>,
## # Lehigh.University <fct>,
## # Massachusetts.Institute.of.Technology..MIT. <fct>,
## # Middlebury.College <fct>, New.York.University..NYU. <fct>,
## # Northeastern.University <fct>, Northwestern.University <fct>,
## # Occidental.College <fct>, Pitzer.College <fct>, Pomona.College <fct>,
## # Princeton.University <fct>, Reed.College <fct>, Rice.University <fct>,
## # Scripps.College <fct>, Smith.College <fct>, Stanford.University <fct>,
## # Swarthmore.College <fct>, Syracuse.University <fct>,
## # Tufts.University <fct>,
## # University.of.North.Carolina..Chapel.Hill..UNC.Chapel.Hill. <fct>,
## # University.of.California..Berkeley..Cal. <fct>,
## # University.of.California..Davis..UC.Davis. <fct>,
## # University.of.California..Irvine..UCI. <fct>,
## # University.of.California..Los.Angeles..UCLA. <fct>,
## # University.of.California..Merced..UC.Merced. <fct>,
## # University.of.California..Riverside..UC.Riverside. <fct>,
## # University.of.California..San.Diego..UCSD. <fct>,
## # University.of.California..Santa.Barbara..UCSB. <fct>,
## # University.of.California..Santa.Cruz..UCSC. <fct>,
## # University.of.Chicago..UChicago. <fct>,
## # University.of.Massachusetts..Amherst..UMass.Amherst. <fct>,
## # University.of.Michigan..UMich. <fct>, University.of.Notre.Dame <fct>,
## # University.of.Pennsylvania..UPenn. <fct>,
## # University.of.Southern.California..USC. <fct>,
## # University.of.Virginia..UVA. <fct>,
## # University.of.Washington..UW. <fct>,
## # University.of.Texas..Austin..UT.Austin. <fct>,
## # Vanderbilt.University <fct>, Vassar.College <fct>,
## # Wake.Forest.University <fct>, …
We’ve got entries of students with attributes like Gender, Race, SAT composites (single-sitting and superscore), ACT composites (single-sitting and superscore), number of AP courses taken, and college admission results (Accepted, Rejected, Waitlisted, didn’t apply).
First, let’s rename some columns to make them easier to work with.
setnames(df,
old = c("Gender", "Are.you.Hispanic.or.Latino.", "Race..check.all.that.apply.",
"Are.you.a.first.generation.applicant.", "What.kind.of.school.do.you.attend.",
"Did.you.apply.for.Questbridge.", "Did.you.apply.for.financial.aid..i.e...FAFSA..CSS.Profile..",
"Are.you.a.legacy.applicant.", "ACT.Composite..one.sitting..if.applicable.",
"ACT.Composite..superscored..if.applicable.", "SAT.Composite..one.sitting..if.applicable.",
"SAT.Composite..superscored..if.applicable.", "Unweighted.GPA", "Weighted.GPA..if.applicable.",
"How.many.Advanced.Placement..AP..courses.did.you.take.",
"How.many.Dual.Enrollment..DE..courses.did.you.take.",
"How.many.International.Baccalaureate..IB..courses.did.you.take.",
"For.any.missed.colleges.put..College.Name...decision...Ex..Gonzaga.University..accepted."),
new = c("gender", "hisp_or_latino", "race", "first_gen", "HS_type", "app_qb", "app_finaid",
"legacy", "ACT_single", "ACT_super", "SAT_single", "SAT_super", "GPA", "WGPA",
"num_AP", "num_DE", "num_IB", "other_college")
)
Dropping some columns that won’t be used in this kernel.
df <- df %>%
select(-one_of("Timestamp", "X", "X.1", "X.2"))
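As a quick illustration of the noise mentioned in the introduction, a filter like the following (a sketch; the 600 cutoff is arbitrary) surfaces implausible rows such as a 4.0 GPA paired with a near-minimum SAT score.
df %>%
  filter(GPA == 4.0, !is.na(SAT_single), SAT_single < 600) %>%
  select(GPA, SAT_single)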
Let’s take a look at the dataset as a whole.
# source noted in Introduction
NAT_SAT_50P <- 1055
NAT_SAT_99P <- 1480
A2C_SAT_50P <- median(df$SAT_single, na.rm = TRUE)
# parameters for positioning labels
aes_v_offset <- 50
aes_h_offset <- 20
SAT_vals <- c(NAT_SAT_50P, NAT_SAT_99P, A2C_SAT_50P)
vline_cols <- c('blue', 'blue', 'green2')
df %>%
filter(!is.na(SAT_single)) %>%
ggplot(aes(x = SAT_single)) +
geom_histogram(binwidth = 10) +
geom_text(aes(x=NAT_SAT_50P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 50%%ile (%d)", NAT_SAT_50P)), angle = 90, color = 'blue') +
geom_text(aes(x=NAT_SAT_99P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 99%%ile (%d)", NAT_SAT_99P)), angle = 90, color = 'blue') +
geom_text(aes(x=A2C_SAT_50P + aes_h_offset, y = aes_v_offset - 12,
label = sprintf("r/A2C 50%%ile (%d)", A2C_SAT_50P)), angle = 90, color = 'green2') +
geom_vline(xintercept = SAT_vals, color = vline_cols) +
ggtitle('SAT scores from r/A2C 2018 dataset') +
xlab('best SAT score (single sitting, /1600)')
As mentioned earlier, the data from r/A2C is extremely skewed (a long left tail); in fact, the median survey respondent had a non-superscored SAT above the national 99th percentile.
Note: superscoring is when the highest Math and highest Reading/Writing subscores across sittings are considered together as one composite. For example, an applicant who had taken the SAT twice, scoring 1500 (700 RW / 800 M) and 1530 (800 RW / 730 M), would have a superscore of 1600 (\(\max(700, 800) + \max(800, 730) = 1600\)) but a best single-sitting score of 1530 (\(\max(1500, 1530) = 1530\)). Superscoring for the ACT is similar, considering the highest of the English, Math, Reading, and Science subscores. You can read more about superscoring here.
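A minimal sketch of that arithmetic in R, using the hypothetical sittings from the example above as (RW, Math) pairs:
sittings <- data.frame(RW = c(700, 800), Math = c(800, 730))
max(sittings$RW) + max(sittings$Math) # superscore: 800 + 800 = 1600
max(sittings$RW + sittings$Math)      # best single sitting: max(1500, 1530) = 1530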
Let’s take a look at ACT scores.
# source noted in Introduction
NAT_ACT_50P <- 20
NAT_ACT_99P <- 34
A2C_ACT_50P <- median(df$ACT_single, na.rm = TRUE)
# parameters for positioning labels
aes_v_offset <- 110
aes_h_offset <- 0.3
delta <- 0.02 # small offset so the overlapping vlines stay visible (the A2C median equals the national 99th %ile)
ACT_vals <- c(NAT_ACT_50P, NAT_ACT_99P - delta, A2C_ACT_50P + delta)
vline_cols <- c('blue', 'lightblue1', 'green2')
df %>%
filter(!is.na(ACT_single)) %>%
ggplot(aes(x = ACT_single)) +
geom_histogram(binwidth = 1) +
geom_text(aes(x=NAT_ACT_50P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 50%%ile (%d)", NAT_ACT_50P)), angle = 90, color = 'blue') +
geom_text(aes(x=NAT_ACT_99P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 99%%ile (%d)", NAT_ACT_99P)), angle = 90, color = 'lightblue1') +
geom_text(aes(x=A2C_ACT_50P + aes_h_offset, y = aes_v_offset,
label = sprintf("r/A2C 50%%ile (%d)", A2C_ACT_50P)), angle = 90, color = 'green2') +
geom_vline(xintercept = ACT_vals, color = vline_cols) +
ggtitle('ACT scores from r/A2C 2018 dataset') +
xlab('best ACT score (single sitting, /36)')
We see similar results, with the national 99th percentile score being the median for survey respondents.
We would like to investigate the marginal utility of attaining higher standardized test scores in college admissions. While I would have liked to use the University of Maryland as an example here, the admissions results for UMD were simply too sparse in the A2C 2018 dataset.
df %>%
filter(grepl(pattern = "UMD|college park|College Park", x = other_college)) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 26
Only 26 respondents indicated applying to UMD at all, so this is virtually unusable. Instead, we will be looking at admission into any Ivy League university.
In order to maximize our usable entries, we will convert SAT scores into ACT scores as per the official conversion table (see table 7). For this section, we will only consider respondents who applied to at least one Ivy League university. We will use superscored standardized test scores, as that is how universities typically consider applicants for admission. Weighted GPA will not be considered here, as different school systems use different weighting formulas.
Renaming the Ivy League columns to shorter names:
iv <- c('Brown', 'Columbia', 'Cornell', 'Dartmouth', 'Harvard', 'Princeton', 'UPenn', 'Yale')
setnames(df,
old = c("Brown.University", "Columbia.University", "Cornell.University", "Dartmouth.College", "Harvard.University", "Princeton.University", "University.of.Pennsylvania..UPenn.", "Yale.University"),
new = iv
)
Indicating TRUE if accepted, FALSE if rejected or waitlisted, and NA if the respondent did not apply:
for (uni in iv) {
  # map '' (did not apply) to NA, 'Accepted' to TRUE,
  # and anything else (Rejected, Waitlisted) to FALSE
  df[uni] <- as.vector(mapply(function(result) {
    ifelse('' == result, NA, 'Accepted' == result)
  }, df[uni]))
}
df2 <- df %>%
select(c(iv, c("gender", "hisp_or_latino", "race", "first_gen", "ACT_super", "ACT_single", "SAT_super", "SAT_single", "GPA", "num_AP", "num_IB", "num_DE")))
as_tibble(df2)
## # A tibble: 1,326 x 20
## Brown Columbia Cornell Dartmouth Harvard Princeton UPenn Yale gender
## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <fct>
## 1 TRUE NA NA NA NA NA NA NA Male
## 2 NA NA NA NA NA NA NA NA Male
## 3 NA NA FALSE NA NA NA NA NA Male
## 4 NA NA NA NA NA NA NA NA Female
## 5 NA NA NA NA NA NA NA NA Female
## 6 NA NA FALSE NA NA NA NA NA Female
## 7 NA NA NA NA NA NA FALSE NA Female
## 8 NA NA NA NA NA NA TRUE NA Male
## 9 NA NA NA NA NA NA NA NA Male
## 10 NA NA NA NA NA NA NA NA Male
## # ... with 1,316 more rows, and 11 more variables: hisp_or_latino <fct>,
## # race <fct>, first_gen <fct>, ACT_super <int>, ACT_single <int>,
## # SAT_super <int>, SAT_single <int>, GPA <dbl>, num_AP <fct>,
## # num_IB <fct>, num_DE <fct>
Now to clean up the data:
# function for SAT -> ACT
bin_SATs <- function(score) {
# from official conversion tables
if (score == 1600) {36}
else if (1560 <= score) {35} else if (1520 <= score) {34} else if (1490 <= score) {33}
else if (1450 <= score) {32} else if (1420 <= score) {31} else if (1390 <= score) {30}
else if (1350 <= score) {29} else if (1310 <= score) {28} else if (1280 <= score) {27}
else if (1240 <= score) {26} else if (1200 <= score) {25} else if (1160 <= score) {24}
else if (1130 <= score) {23} else if (1100 <= score) {22} else if (1060 <= score) {21}
else if (1020 <= score) {20} else if (980 <= score) {19} else if (940 <= score) {18}
else if (900 <= score) {17} else if (860 <= score) {16} else if (810 <= score) {15}
else if (760 <= score) {14} else if (720 <= score) {13} else if (630 <= score) {12}
else if (560 <= score) {11}
# official concordance table does not go lower than this
else if (490 <= score) {10}
else if (400 <= score) {9}
else {
# handles missing scores
# note: 0 is not a valid SAT score; it is just a placeholder
#warning(sprintf("Bad SAT Score: %d\n", score))
0
}
}
conv_GPA <- function(gpa) {
if (gpa <= 4.0) {
return(gpa)
}
else {
if (97 <= gpa && gpa <= 100) {return(4.0)} # A+
else if (93 <= gpa) {return(4.0)} # A
else if (90 <= gpa) {return(3.7)} # A-
else { # no applicant on 100pt scale had < 90 gpa
warning(sprintf("Bad GPA: %d\n", gpa))
3.5
}
}
}
handleAP <- function(how_many) {
if (is.na(how_many)) {0}
else if ('' == how_many) {0}
else if (grepl(pattern = "school does not offer", x = how_many)) {0}
else if (grepl(pattern = '20[+]', x = how_many)) {22}
else {as.integer(how_many)}
}
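# A few quick spot checks of these helpers (hypothetical inputs):
bin_SATs(1530)   # expect 34
bin_SATs(1600)   # expect 36
conv_GPA(95)     # expect 4.0
handleAP("20+")  # expect 22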
df2 <- df2 %>%
# consider only if applied to at least one Ivy League Uni
filter(!(is.na(Brown) & is.na(Columbia) & is.na(Cornell) & is.na(Dartmouth)
& is.na(Harvard) & is.na(Princeton) & is.na(UPenn) & is.na(Yale))) %>%
# drop if missing GPA or not on either 4.0 or 100pt scale
# because they entered their weighted GPA
filter(!is.na(GPA)) %>%
filter(GPA <= 4.0 | GPA > 90) %>% # everyone on the 100pt scale had > 90
# Filling in missing superscore data from single best scores
mutate(ACT_super = pmax(0, ACT_super, ACT_single, na.rm = TRUE)) %>%
mutate(SAT_super = pmax(0, SAT_super, SAT_single, na.rm = TRUE)) %>%
# SAT -> ACT, considering higher if applicant has both
mutate(ACT = pmax(0, ACT_super, mapply(bin_SATs, SAT_super), na.rm = TRUE)) %>%
filter(ACT != 0) %>% # note: 0 is not a valid ACT score; the minimum score is 1
# Converting GPAs to /4.0 scale (/100 scale used in NY)
mutate(GPA = mapply(conv_GPA, GPA)) %>%
# handle edge cases for num of AP/IB/DE courses
mutate(num_AP = mapply(handleAP, num_AP)) %>%
mutate(num_IB = mapply(handleAP, num_IB)) %>%
mutate(num_DE = mapply(handleAP, num_DE)) %>%
# counting college-level courses together
mutate(num_col_lv = num_AP + num_IB + num_DE) %>%
# turn yes/no answers into logical vectors
mutate(hisp_or_latino = (!is.na(hisp_or_latino) & ('Yes' == hisp_or_latino))) %>%
mutate(first_gen = (!is.na(first_gen) & ('Yes' == first_gen))) %>%
# merge admission results into logical disjunction
mutate(adm_ivy = Brown | Columbia | Cornell | Dartmouth | Harvard | Princeton | UPenn | Yale ) %>%
# NA means no acceptances among the Ivies applied to, so treat it as FALSE
mutate(adm_ivy = ifelse(is.na(adm_ivy), FALSE, adm_ivy)) %>%
select(-c(SAT_single, SAT_super, ACT_single, ACT_super)) %>%
select(-c(Brown, Columbia, Cornell, Dartmouth, Harvard, Princeton, UPenn, Yale))
as_tibble(df2)
## # A tibble: 725 x 11
## gender hisp_or_latino race first_gen GPA num_AP num_IB num_DE ACT
## <fct> <lgl> <fct> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male FALSE White FALSE 4 14 0 0 34
## 2 Male FALSE Asian FALSE 3.76 2 0 0 29
## 3 Female FALSE Asian FALSE 3.7 0 0 0 29
## 4 Male FALSE Asian FALSE 4 6 0 0 35
## 5 Male FALSE Asian FALSE 4 4 0 0 35
## 6 Male FALSE Black… FALSE 3.85 7 0 0 34
## 7 Male FALSE White FALSE 3.9 22 0 0 36
## 8 Male FALSE Asian FALSE 3.58 3 0 0 35
## 9 Male FALSE White TRUE 3.67 10 0 0 33
## 10 Male FALSE Asian FALSE 3.9 23 0 0 36
## # ... with 715 more rows, and 2 more variables: num_col_lv <dbl>,
## # adm_ivy <lgl>
It has often been asserted on r/A2C that ‘after a certain point, your scores don’t matter’. (The claim is usually along the lines of \(\geq 1500\) SAT / \(33\) ACT or \(\geq 1530\) SAT / \(34\) ACT making no difference.) But how much truth is there to this statement? Today we will take a look at results just for Ivy League admissions.
Null hypothesis
\(H_0: (\forall n \in \{35, 36\})[P(admit(n)) \leq P(admit(34))]\), that is to say, the chances of being admitted into at least one Ivy League university do not improve for scores above 34.
We will estimate the true probability with this sample probability.
num_34s <- df2 %>%
filter(ACT == 34) %>%
count()
num_34s_admit <- df2 %>%
filter(ACT == 34 & adm_ivy) %>%
count()
p_hat_34 <- as.double(num_34s_admit / num_34s)
num_35s <- df2 %>%
filter(ACT == 35) %>%
count()
num_35s_admit <- df2 %>%
filter(ACT == 35 & adm_ivy) %>%
count()
p_hat_35 <- as.double(num_35s_admit / num_35s)
num_36s <- df2 %>%
filter(ACT == 36) %>%
count()
num_36s_admit <- df2 %>%
filter(ACT == 36 & adm_ivy) %>%
count()
p_hat_36 <- as.double(num_36s_admit / num_36s)
vals <- c(p_hat_34, p_hat_35, p_hat_36)
cbind.data.frame("Score" = c(34, 35, 36), "p_hat" = vals, "percent better than 34" = 100 * (vals - p_hat_34) / p_hat_34)
## Score p_hat percent better than 34
## 1 34 0.3691589 0.000000
## 2 35 0.3771930 2.176327
## 3 36 0.4639175 25.668798
It seems like there is a large difference between 36 and 34 (>25%), and even 35 outperforms 34 by ~2%. But are these results statistically significant?
Notation: \(p_{34} = P[admit(34)]\), \(p_{35} = P[admit(35)]\), etc., and the sample-based estimates are \(\hat{p}_{35}\), etc. \(\bar{X}\) is the sample admit rate of \(n\) applicants who are each admitted with probability \(p_{34}\) under the null.
Using the conventional criterion of \(\alpha = 0.05\).
Null hypothesis: \(p_{35} \leq p_{34}\); we will reject if \(P[\bar{X} > \hat{p}_{35}] < \alpha\)
\(E[\bar{X}] = p_{34}\)
\(Var(\bar{X}) = \frac{p_{34}(1 - p_{34})}{n}\)
n <- num_35s
VarX_bar <- as.double((p_hat_34 * (1 - p_hat_34)) / n)
EX_bar <- p_hat_34
# P(X_bar > p_hat_35)
p <- pnorm(p_hat_35, mean = EX_bar, sd = sqrt(VarX_bar), lower.tail = FALSE)
p
## [1] 0.4007585
\(P[\bar{X} > \hat{p}_{35}] = 0.4007... > \alpha = 0.05\), so we fail to reject the null hypothesis for scores of 35.
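As a cross-check, a two-proportion test that also accounts for the sampling uncertainty in \(\hat{p}_{34}\) could be run on the raw counts (a sketch, not evaluated here; since it pools the two groups, its p-value will differ somewhat from the calculation above):
prop.test(x = c(num_35s_admit$n, num_34s_admit$n),
          n = c(num_35s$n, num_34s$n),
          alternative = "greater")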
What about 36?
Null hypothesis: \(p_{36} \leq p_{34}\); we will reject if \(P[\bar{X} > \hat{p_{36}}] < \alpha\)
n <- num_36s
VarX_bar <- as.double((p_hat_34 * (1 - p_hat_34)) / n)
EX_bar <- p_hat_34
# P(X_bar > p_hat_36)
p <- pnorm(p_hat_36, mean = EX_bar, sd = sqrt(VarX_bar), lower.tail = FALSE)
p
## [1] 0.0265616
\(P[\bar{X} > \hat{p}_{36}] = 0.0265... < \alpha = 0.05\), so we reject the null hypothesis for scores of 36; an applicant with an ACT (or equivalent SAT) score of 36 is significantly more likely to get into at least one Ivy League university. However, this does not account for how standardized test scores correlate with other parts of an application, like GPA, course rigor, and extracurriculars. (That is, a student who scores a 36 or equivalent is likely to have a higher GPA and course rigor as well.)
Let’s take a look at test scores vs GPA.
df2 %>%
ggplot(aes(x = factor(ACT), y = GPA)) +
geom_violin() +
ggtitle("Standardized Test Scores vs GPA") +
xlab("ACT (or equivalent SAT score)") +
ylab("unweighted HS GPA")
df2 %>%
ggplot(aes(x = ACT, y = GPA)) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x) +
ggtitle("Standardized Test Scores vs GPA") +
xlab("ACT (or equivalent SAT score)") +
ylab("unweighted HS GPA")
Clearly there is a positive correlation here.
df2 %>%
ggplot(aes(x = factor(ACT), y = num_col_lv)) +
geom_violin() +
ggtitle("Standardized Test Scores vs college-level courses") +
xlab("ACT (or equivalent SAT score)") +
ylab("# of college-level courses taken (AP+IB+DE)")
df2 %>%
ggplot(aes(x = ACT, y = num_col_lv)) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x) +
ggtitle("Standardized Test Scores vs college-level courses") +
xlab("ACT (or equivalent SAT score)") +
ylab("# of college-level courses taken (AP+IB+DE)")
The correlation here is less prominent, so let’s consider GPA for the following question: how do chances of admission scale with both GPA and test scores?
df2 <- df2 %>%
  # snap GPA to the nearest 0.05 so we can group GPAs into bins
  mutate(rGPA = round((GPA * 2), 1) / 2)
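# e.g. how the rounding above snaps GPAs to the nearest 0.05:
round(3.87 * 2, 1) / 2 # 7.74 -> 7.7 -> 3.85
round(3.98 * 2, 1) / 2 # 7.96 -> 8.0 -> 4.00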
df3 <- df2 %>%
  filter(rGPA >= 3.7 & ACT >= 34) %>%
  group_by(rGPA, ACT) %>%
  mutate(dummy_1 = 1) %>%
  mutate(n = sum(dummy_1)) %>% # group size
  mutate(adm_n = sum(ifelse(adm_ivy, dummy_1, 0))) %>% # admits in group
  mutate(adm_r = adm_n / n) %>% # admit rate per (GPA bin, ACT) cell
  select(c(ACT, rGPA, adm_r)) %>%
  arrange(ACT, rGPA) %>%
  unique()
df3
## # A tibble: 20 x 3
## # Groups: rGPA, ACT [20]
## ACT rGPA adm_r
## <dbl> <dbl> <dbl>
## 1 34 3.7 0.286
## 2 34 3.75 0.143
## 3 34 3.8 0.35
## 4 34 3.85 0.368
## 5 34 3.9 0.324
## 6 34 3.95 0.286
## 7 34 4 0.519
## 8 35 3.7 0.375
## 9 35 3.75 0.125
## 10 35 3.8 0.167
## 11 35 3.85 0.25
## 12 35 3.9 0.4
## 13 35 3.95 0.212
## 14 35 4 0.509
## 15 36 3.75 0
## 16 36 3.8 0.333
## 17 36 3.85 0.286
## 18 36 3.9 0.5
## 19 36 3.95 0.533
## 20 36 4 0.481
df3 %>%
group_by(ACT) %>%
ggplot(aes(x = rGPA, y = adm_r, color = factor(ACT))) +
geom_point() +
geom_line() +
ggtitle("standardized testing and gpa vs admission rate") +
ylab('admit rate') +
xlab('rounded GPA')
Keep in mind that the sample size here is pretty small, so the dip to 0% at (3.75 GPA, 36 ACT) is due to the lack of respondents who had a 3.75 GPA yet managed to score a 36 or 1600.
When accounting for GPA, the data seems a lot more inconclusive, but the lead that 36 holds at 3.90 and 3.95 over 34 and 35 is likely significant, as I suspect that is where the average GPA of admits lies.
df2 %>%
filter(adm_ivy) %>%
summarise(mean(GPA))
## mean(GPA)
## 1 3.932506
Right between 3.90 and 3.95.
For those curious, here’s a visualization showing scores down to 32 and GPAs down to 3.50.
df3 <- df2 %>%
filter(rGPA >= 3.5 & ACT >= 32) %>%
group_by(rGPA, ACT) %>%
mutate(dummy_1 = 1) %>%
mutate(n = sum(dummy_1)) %>%
mutate(adm_n = sum(ifelse(adm_ivy, dummy_1, 0))) %>%
mutate(adm_r = adm_n / n) %>%
select(c(ACT, rGPA, adm_r)) %>%
arrange(ACT, rGPA) %>%
unique()
df3 %>%
group_by(ACT) %>%
ggplot(aes(x = rGPA, y = adm_r, color = factor(ACT))) +
geom_point() +
geom_line() +
ggtitle("standardized testing and gpa vs admission rate") +
ylab('admit rate') +
xlab('rounded GPA')
Remember that the crazy spikes are due to the small dataset size. (No, a 36 with a 3.6 GPA does not give you a 100% chance; no, a 32 is not better than a 36 if you have exactly a 3.95 GPA.)
Applicants who score a 36 or 1600 are significantly more likely to be admitted into at least one Ivy League university. However, it is inconclusive whether this is directly caused by scoring higher, as we lack data on extracurriculars, essays, and the like.
We would like to investigate how SAT/ACT scores, GPA, and the number of AP/IB/DE (Advanced Placement, International Baccalaureate, and Dual Enrollment) courses affect college admissions. Standardized test scores, grade point average, and course rigor are often, if not always, some of the most important factors in college admissions. (Source: common data sets; you can read more about those here.)
While considering factors like race and gender in these kinds of models is usually unethical (or unconstitutional), they will be taken into account here because universities do consider them in their ‘holistic’ admissions process. Again, see the Common Data Sets for more details here.
Before providing data to the models, we will one-hot encode gender and race; respondents who indicated multiple races in the survey will get multiple 1s.
df3 <- df2 %>%
mutate(Male = ('Male' == gender)) %>%
mutate(Female = ('Female' == gender)) %>%
mutate(White = grepl(pattern = 'White', x = race)) %>%
mutate(Black = grepl(pattern = 'Black or African American', x = race)) %>%
mutate(Asian = grepl(pattern = 'Asian', x = race)) %>%
mutate(Native = grepl(pattern = 'American Indian or Alaska Native', x = race)) %>%
mutate(PIsland = grepl(pattern = 'Pacific Islander', x = race)) %>%
select(c(adm_ivy, GPA, ACT, num_col_lv, Male, Female, first_gen, hisp_or_latino, White, Black, Asian, Native, PIsland))
write.csv(df3, file = "preprocessed.csv") # note: row names are written too, as an extra first column
as_tibble(df3)
## # A tibble: 725 x 13
## adm_ivy GPA ACT num_col_lv Male Female first_gen hisp_or_latino
## <lgl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl>
## 1 TRUE 4 34 14 TRUE FALSE FALSE FALSE
## 2 FALSE 3.76 29 2 TRUE FALSE FALSE FALSE
## 3 FALSE 3.7 29 0 FALSE TRUE FALSE FALSE
## 4 TRUE 4 35 6 TRUE FALSE FALSE FALSE
## 5 TRUE 4 35 4 TRUE FALSE FALSE FALSE
## 6 TRUE 3.85 34 7 TRUE FALSE FALSE FALSE
## 7 FALSE 3.9 36 22 TRUE FALSE FALSE FALSE
## 8 TRUE 3.58 35 3 TRUE FALSE FALSE FALSE
## 9 TRUE 3.67 33 10 TRUE FALSE TRUE FALSE
## 10 FALSE 3.9 36 23 TRUE FALSE FALSE FALSE
## # ... with 715 more rows, and 5 more variables: White <lgl>, Black <lgl>,
## # Asian <lgl>, Native <lgl>, PIsland <lgl>
Let’s try to predict admissions using deep learning. I will be using Python for this part.
import sys
print(sys.version)
# suppresses the messages printed to stderr about tensorflow
#sys.stderr = open('/dev/null', 'w')
## 3.6.5 (default, Apr 1 2018, 05:46:30)
## [GCC 7.3.0]
import pandas as pd
import numpy as np
from keras import Sequential
## /home/george/.local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
## from ._conv import register_converters as _register_converters
## Using TensorFlow backend.
from keras.layers import Dense, Dropout
np.random.seed(320)
df = pd.read_csv('preprocessed.csv')
# note: the csv includes the R row-name column, so after dropping adm_ivy
# below there are 13 feature columns left (hence input_dim=13 in the model)
# turn booleans into 1s and 0s
for col in df.columns:
    if type(df[col].values[0]) is np.bool_:
        df[col] = df[col].apply(int)
# split up training and testing data
train = df.sample(frac=0.9)
test = df.drop(train.index)
y_train, y_test = train['adm_ivy'].values, test['adm_ivy'].values
del train['adm_ivy']
del test['adm_ivy']
x_train, x_test = train.values, test.values
# build the model
model = Sequential()
model.add(Dense(32, input_dim=13, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dropout(rate=0.5))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# train
model.fit(x_train, y_train, epochs=30, batch_size=10, verbose = False)
# evaluate
scores = model.evaluate(x_test, y_test)
##
## 32/73 [============>.................] - ETA: 0s
## 73/73 [==============================] - 0s 410us/step
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
##
## acc: 68.49%
Wow! 68% accuracy. That’s impressive for a task as complex as college admissions, especially with such a limited feature set. A little too impressive… What’s going on here?
n_admit <- df3 %>% filter(adm_ivy) %>% count()
n <- df3 %>% count()
1 - (n_admit / n)
## n
## 1 0.6565517
Suspiciously close to the percentage of applicants not admitted: the model just learned to guess “Not admitted” every time. Deviations from (1 - admit rate) are due to the random sampling of training and testing data. Let’s make sure that the training data is balanced.
import sys
print(sys.version)
# suppresses the messages printed to stderr about tensorflow
#sys.stderr = open('/dev/null', 'w')
## 3.6.5 (default, Apr 1 2018, 05:46:30)
## [GCC 7.3.0]
import pandas as pd
import numpy as np
from keras import Sequential
from keras.layers import Dense, Dropout
df = pd.read_csv('preprocessed.csv')
# turn booleans into 1s and 0s
for col in df.columns:
    if type(df[col].values[0]) is np.bool_:
        df[col] = df[col].apply(int)
# puts training data at 50/50 admit/not admit
def normalize(train):
    yes_admit = train.query('adm_ivy == 1')
    n_admit = yes_admit.shape[0]
    no_admit = train.query('adm_ivy == 0')
    no_admit = no_admit.sample(n = n_admit)
    frames = [yes_admit, no_admit]
    result = pd.concat(frames)
    result = result.sample(frac=1).reset_index(drop=True)
    return result
# split up training and testing data
train = df.sample(frac=0.85)
test = df.drop(train.index)
train = normalize(train)
y_train, y_test = train['adm_ivy'].values, test['adm_ivy'].values
del train['adm_ivy']
del test['adm_ivy']
x_train, x_test = train.values, test.values
# build the model
model = Sequential()
model.add(Dense(32, input_dim=13, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dropout(rate=0.5))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
# train
model.fit(x_train, y_train, epochs=15, batch_size=12, verbose = False)
# evaluate
scores = model.evaluate(x_test, y_test)
##
## 32/109 [=======>......................] - ETA: 0s
## 109/109 [==============================] - 0s 377us/step
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
##
## acc: 77.06%
n = sum(list(y_test)) / len(y_test)
print(f"{n} | {1-n}")
## 0.21100917431192662 | 0.7889908256880733
Even with the training set balanced 50/50, the model still guesses one result the vast majority of the time. Predicting admissions results is a very challenging task, especially from such a restricted feature set and small dataset.
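One alternative to downsampling worth trying is class weighting, which keeps every training row but penalizes mistakes on the minority class more heavily. A minimal sketch reusing the variables above (the 1:3 weights are illustrative, not tuned):
# reweight classes instead of downsampling the majority class
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=15, batch_size=12, verbose=False,
          class_weight={0: 1.0, 1: 3.0})
Even then, with only ~700 rows and a handful of features, I would not expect dramatic gains.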