We will be looking at college admissions data from r/ApplyingToCollege (aka r/A2C). All data here is self-reported from the class of 2018. This topic is relevant to me, being only one year removed from this process. Keep in mind that the data here is incomplete: it lacks information (or has only partial information) on important application components like extracurricular activities, essays, and letters of recommendation. Additionally, the data from r/A2C is extremely skewed, as the students on such a forum likely take college admissions very seriously. Furthermore, it has been hypothesized that those with higher standardized test scores and GPAs (collectively “stats”) are more likely to respond. For reference, a 99th percentile standardized test score (national) might be around the 50th percentile for survey respondents. See this for SAT percentile data (2017, refer to “SAT user”, page 5) and this for ACT percentile data. As this process has drastically changed within the past decade, check out r/A2C for more information on college admissions.
Keep in mind the data is also noisy: the survey could be filled out by anyone, and respondents are not held accountable for the accuracy of their answers. Case in point: one entry claims a 4.0 GPA but an SAT score of just above 400 (the minimum).
Note: I elected to only use class of 2018 data to avoid mixing “old” (/2400) and “new” (/1600) SAT scores.
Note: You can download the dataset here.
This sets up the appropriate libraries and reads in our dataset.
suppressMessages(library(data.table))
library(tidyr)
suppressMessages(library(dplyr))
library(ggplot2)
library(tibble)
df <- read.csv("a2c.csv")
Let’s take a look at the raw data.
as_tibble(df)
## # A tibble: 1,326 x 118
## Timestamp Gender Are.you.Hispanic… Race..check.all.… Are.you.a.first.…
## <fct> <fct> <fct> <fct> <fct>
## 1 4/1/2018 … Male No White No
## 2 4/1/2018 … Male No Asian No
## 3 4/1/2018 … Male No Asian No
## 4 4/1/2018 … Female No White Yes
## 5 4/1/2018 … Female No White No
## 6 4/1/2018 … Female No Asian No
## 7 4/1/2018 … Female No White No
## 8 4/1/2018 … Male No Asian No
## 9 4/1/2018 … Male No Asian No
## 10 4/1/2018 … Male No White No
## # ... with 1,316 more rows, and 113 more variables:
## # What.kind.of.school.do.you.attend. <fct>,
## # Did.you.apply.for.Questbridge. <fct>,
## # Did.you.apply.for.financial.aid..i.e...FAFSA..CSS.Profile.. <fct>,
## # Are.you.a.legacy.applicant. <fct>,
## # If.you.are.an.admitted.legacy.applicant..what.school.s..did.you.apply.to.. <fct>,
## # If.you.are.a.waitlisted.legacy.applicant..what.school.s..did.you.apply.to.. <fct>,
## # If.you.are.a.rejected.legacy.applicant..what.school.s..did.you.apply.to.. <fct>,
## # ACT.Composite..one.sitting..if.applicable. <int>,
## # ACT.Composite..superscored..if.applicable. <int>,
## # SAT.Composite..one.sitting..if.applicable. <int>,
## # SAT.Composite..superscored..if.applicable. <int>,
## # Did.you.choose.to.NOT.send.your.ACT.SAT.scores.to.any.test.optional.school.s.. <fct>,
## # Unweighted.GPA <dbl>, Weighted.GPA..if.applicable. <fct>,
## # How.many.Advanced.Placement..AP..courses.did.you.take. <fct>,
## # How.many.Dual.Enrollment..DE..courses.did.you.take. <fct>,
## # How.many.International.Baccalaureate..IB..courses.did.you.take. <fct>,
## # How.many.schools.did.you.apply.to. <dbl>,
## # How.many.of.these.schools.did.you.consider..safeties.. <int>,
## # How.many.of.these.schools.did.you.consider..matches.. <int>,
## # How.many.of.these.schools.did.you.consider..reaches.. <int>,
## # How.many.schools.accepted.you. <int>,
## # How.many.schools.rejected.you. <int>,
## # How.many.schools.waitlisted.you. <int>,
## # If.you.are.waiting.on.schools..list.the.number.that.have.not.released.below. <int>,
## # How.many.ECs.did.you.list.on.your.Common.App...if.applicable. <int>,
## # Finally..what.school.will.you.be.attending.next.year.. <fct>,
## # American.University <fct>, Amherst.College <fct>,
## # Babson.College <fct>, Barnard.College <fct>, Boston.College <fct>,
## # Boston.University <fct>, Bowdoin.College <fct>,
## # Brandeis.University <fct>, Brown.University <fct>,
## # California.Institute.of.Technology..Caltech. <fct>,
## # California.Polytechnic.State.University..Cal.Poly.SLO. <fct>,
## # Carleton.College <fct>, Carnegie.Mellon.University <fct>,
## # Case.Western.Reserve.University <fct>,
## # Claremont.McKenna.College <fct>, Colby.College <fct>,
## # Colgate.University <fct>, Columbia.University <fct>,
## # Cornell.University <fct>, Dartmouth.College <fct>,
## # Davidson.College <fct>, Duke.University <fct>, Emory.University <fct>,
## # Franklin...Marshall.College <fct>, Fordham.University <fct>,
## # Georgetown.University <fct>,
## # Georgia.Institute.of.Technology..Georgia.Tech. <fct>,
## # Grinnell.College <fct>, Hamilton.College <fct>,
## # Harvard.University <fct>, Harvey.Mudd.College <fct>,
## # Johns.Hopkins.University <fct>, Kenyon.College <fct>,
## # Lehigh.University <fct>,
## # Massachusetts.Institute.of.Technology..MIT. <fct>,
## # Middlebury.College <fct>, New.York.University..NYU. <fct>,
## # Northeastern.University <fct>, Northwestern.University <fct>,
## # Occidental.College <fct>, Pitzer.College <fct>, Pomona.College <fct>,
## # Princeton.University <fct>, Reed.College <fct>, Rice.University <fct>,
## # Scripps.College <fct>, Smith.College <fct>, Stanford.University <fct>,
## # Swarthmore.College <fct>, Syracuse.University <fct>,
## # Tufts.University <fct>,
## # University.of.North.Carolina..Chapel.Hill..UNC.Chapel.Hill. <fct>,
## # University.of.California..Berkeley..Cal. <fct>,
## # University.of.California..Davis..UC.Davis. <fct>,
## # University.of.California..Irvine..UCI. <fct>,
## # University.of.California..Los.Angeles..UCLA. <fct>,
## # University.of.California..Merced..UC.Merced. <fct>,
## # University.of.California..Riverside..UC.Riverside. <fct>,
## # University.of.California..San.Diego..UCSD. <fct>,
## # University.of.California..Santa.Barbara..UCSB. <fct>,
## # University.of.California..Santa.Cruz..UCSC. <fct>,
## # University.of.Chicago..UChicago. <fct>,
## # University.of.Massachusetts..Amherst..UMass.Amherst. <fct>,
## # University.of.Michigan..UMich. <fct>, University.of.Notre.Dame <fct>,
## # University.of.Pennsylvania..UPenn. <fct>,
## # University.of.Southern.California..USC. <fct>,
## # University.of.Virginia..UVA. <fct>,
## # University.of.Washington..UW. <fct>,
## # University.of.Texas..Austin..UT.Austin. <fct>,
## # Vanderbilt.University <fct>, Vassar.College <fct>,
## # Wake.Forest.University <fct>, …
We’ve got entries of students with attributes like Gender, Race, SAT composites (single-sitting and superscore), ACT composites (single-sitting and superscore), number of AP courses taken, and college admission results (Accepted, Rejected, Waitlisted, didn’t apply).
First, let’s rename some columns to make them easier to work with.
setnames(df,
old = c("Gender", "Are.you.Hispanic.or.Latino.", "Race..check.all.that.apply.",
"Are.you.a.first.generation.applicant.", "What.kind.of.school.do.you.attend.",
"Did.you.apply.for.Questbridge.", "Did.you.apply.for.financial.aid..i.e...FAFSA..CSS.Profile..",
"Are.you.a.legacy.applicant.", "ACT.Composite..one.sitting..if.applicable.",
"ACT.Composite..superscored..if.applicable.", "SAT.Composite..one.sitting..if.applicable.",
"SAT.Composite..superscored..if.applicable.", "Unweighted.GPA", "Weighted.GPA..if.applicable.",
"How.many.Advanced.Placement..AP..courses.did.you.take.",
"How.many.Dual.Enrollment..DE..courses.did.you.take.",
"How.many.International.Baccalaureate..IB..courses.did.you.take.",
"For.any.missed.colleges.put..College.Name...decision...Ex..Gonzaga.University..accepted."),
new = c("gender", "hisp_or_latino", "race", "first_gen", "HS_type", "app_qb", "app_finaid",
"legacy", "ACT_single", "ACT_super", "SAT_single", "SAT_super", "GPA", "WGPA",
"num_AP", "num_DE", "num_IB", "other_college")
)
Dropping some columns that won’t be used in this kernel.
df <- df %>%
select(-one_of("Timestamp", "X", "X.1", "X.2"))
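As a quick illustration of the noise mentioned in the introduction, a filter like the following (a sketch; the 600 cutoff is arbitrary) surfaces implausible rows such as a 4.0 GPA paired with a near-minimum SAT score.
df %>%
  filter(GPA == 4.0, !is.na(SAT_single), SAT_single < 600) %>%
  select(GPA, SAT_single)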
Let’s take a look at the dataset as a whole.
# source noted in Introduction
NAT_SAT_50P <- 1055
NAT_SAT_99P <- 1480
A2C_SAT_50P <- median(df$SAT_single, na.rm = TRUE)
# parameters for positioning labels
aes_v_offset <- 50
aes_h_offset <- 20
SAT_vals <- c(NAT_SAT_50P, NAT_SAT_99P, A2C_SAT_50P)
vline_cols <- c('blue', 'blue', 'green2')
df %>%
filter(!is.na(SAT_single)) %>%
ggplot(aes(x = SAT_single)) +
geom_histogram(binwidth = 10) +
geom_text(aes(x=NAT_SAT_50P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 50%%ile (%d)", NAT_SAT_50P)), angle = 90, color = 'blue') +
geom_text(aes(x=NAT_SAT_99P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 99%%ile (%d)", NAT_SAT_99P)), angle = 90, color = 'blue') +
geom_text(aes(x=A2C_SAT_50P + aes_h_offset, y = aes_v_offset - 12,
label = sprintf("r/A2C 50%%ile (%d)", A2C_SAT_50P)), angle = 90, color = 'green2') +
geom_vline(xintercept = SAT_vals, color = vline_cols) +
ggtitle('SAT scores from r/A2C 2018 dataset') +
xlab('best SAT score (single sitting, /1600)')
As mentioned earlier, the data from r/A2C is extremely skewed (a long left tail); in fact, the median survey respondent had a non-superscored SAT above the national 99th percentile.
Note: superscoring is when the highest Math and highest Reading/Writing subscores across sittings are considered together as one composite. For example, an applicant who had taken the SAT twice, scoring 1500 (700 RW / 800 M) and 1530 (800 RW / 730 M), would have a superscore of 1600 (\(\max(700, 800) + \max(800, 730) = 1600\)) but a best single-sitting score of 1530 (\(\max(1500, 1530) = 1530\)). Superscoring for the ACT is similar, considering the highest of the English, Math, Reading, and Science subscores. You can read more about superscoring here.
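A minimal sketch of that arithmetic in R, using the hypothetical sittings from the example above as (RW, Math) pairs:
sittings <- data.frame(RW = c(700, 800), Math = c(800, 730))
max(sittings$RW) + max(sittings$Math) # superscore: 800 + 800 = 1600
max(sittings$RW + sittings$Math)      # best single sitting: max(1500, 1530) = 1530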
Let’s take a look at ACT scores.
# source noted in Introduction
NAT_ACT_50P <- 20
NAT_ACT_99P <- 34
A2C_ACT_50P <- median(df$ACT_single, na.rm = TRUE)
# parameters for positioning labels
aes_v_offset <- 110
aes_h_offset <- 0.3
delta <- 0.02 # small offset so the overlapping vlines stay visible (the A2C median equals the national 99th %ile)
ACT_vals <- c(NAT_ACT_50P, NAT_ACT_99P - delta, A2C_ACT_50P + delta)
vline_cols <- c('blue', 'lightblue1', 'green2')
df %>%
filter(!is.na(ACT_single)) %>%
ggplot(aes(x = ACT_single)) +
geom_histogram(binwidth = 1) +
geom_text(aes(x=NAT_ACT_50P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 50%%ile (%d)", NAT_ACT_50P)), angle = 90, color = 'blue') +
geom_text(aes(x=NAT_ACT_99P - aes_h_offset, y = aes_v_offset,
label = sprintf("National 99%%ile (%d)", NAT_ACT_99P)), angle = 90, color = 'lightblue1') +
geom_text(aes(x=A2C_ACT_50P + aes_h_offset, y = aes_v_offset,
label = sprintf("r/A2C 50%%ile (%d)", A2C_ACT_50P)), angle = 90, color = 'green2') +
geom_vline(xintercept = ACT_vals, color = vline_cols) +
ggtitle('ACT scores from r/A2C 2018 dataset') +
xlab('best ACT score (single sitting, /36)')
We see similar results, with the national 99th percentile score being the median for survey respondents.
We would like to investigate the marginal utility of attaining higher standardized test scores in college admissions. While I would have liked to use the University of Maryland as an example here, the admissions results for UMD were simply too sparse in the A2C 2018 dataset.
df %>%
filter(grepl(pattern = "UMD|college park|College Park", x = other_college)) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 26
Only 26 respondents indicated applying to UMD at all, so this is virtually unusable. Instead, we will be looking at admission into any Ivy League university.
In order to maximize our usable entries, we will convert SAT scores into ACT scores as per the official conversion table (see table 7). For this section, we will only consider respondents who applied to at least one Ivy League university. We will use superscored standardized test scores, as that is how universities typically consider applicants for admission. Weighted GPA will not be considered here, as different school systems use different weighting formulas.
Renaming the Ivy League columns to shorter names:
iv <- c('Brown', 'Columbia', 'Cornell', 'Dartmouth', 'Harvard', 'Princeton', 'UPenn', 'Yale')
setnames(df,
old = c("Brown.University", "Columbia.University", "Cornell.University", "Dartmouth.College", "Harvard.University", "Princeton.University", "University.of.Pennsylvania..UPenn.", "Yale.University"),
new = iv
)
Indicating TRUE if accepted, FALSE if rejected or waitlisted, and NA if the respondent did not apply:
for (uni in iv) {
  # map '' (did not apply) to NA, 'Accepted' to TRUE,
  # and anything else (Rejected, Waitlisted) to FALSE
  df[uni] <- as.vector(mapply(function(result) {
    ifelse('' == result, NA, 'Accepted' == result)
  }, df[uni]))
}
df2 <- df %>%
select(c(iv, c("gender", "hisp_or_latino", "race", "first_gen", "ACT_super", "ACT_single", "SAT_super", "SAT_single", "GPA", "num_AP", "num_IB", "num_DE")))
as_tibble(df2)
## # A tibble: 1,326 x 20
## Brown Columbia Cornell Dartmouth Harvard Princeton UPenn Yale gender
## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <fct>
## 1 TRUE NA NA NA NA NA NA NA Male
## 2 NA NA NA NA NA NA NA NA Male
## 3 NA NA FALSE NA NA NA NA NA Male
## 4 NA NA NA NA NA NA NA NA Female
## 5 NA NA NA NA NA NA NA NA Female
## 6 NA NA FALSE NA NA NA NA NA Female
## 7 NA NA NA NA NA NA FALSE NA Female
## 8 NA NA NA NA NA NA TRUE NA Male
## 9 NA NA NA NA NA NA NA NA Male
## 10 NA NA NA NA NA NA NA NA Male
## # ... with 1,316 more rows, and 11 more variables: hisp_or_latino <fct>,
## # race <fct>, first_gen <fct>, ACT_super <int>, ACT_single <int>,
## # SAT_super <int>, SAT_single <int>, GPA <dbl>, num_AP <fct>,
## # num_IB <fct>, num_DE <fct>
Now to clean up the data:
# function for SAT -> ACT
bin_SATs <- function(score) {
# from official conversion tables
if (score == 1600) {36}
else if (1560 <= score) {35} else if (1520 <= score) {34} else if (1490 <= score) {33}
else if (1450 <= score) {32} else if (1420 <= score) {31} else if (1390 <= score) {30}
else if (1350 <= score) {29} else if (1310 <= score) {28} else if (1280 <= score) {27}
else if (1240 <= score) {26} else if (1200 <= score) {25} else if (1160 <= score) {24}
else if (1130 <= score) {23} else if (1100 <= score) {22} else if (1060 <= score) {21}
else if (1020 <= score) {20} else if (980 <= score) {19} else if (940 <= score) {18}
else if (900 <= score) {17} else if (860 <= score) {16} else if (810 <= score) {15}
else if (760 <= score) {14} else if (720 <= score) {13} else if (630 <= score) {12}
else if (560 <= score) {11}
# official concordance table does not go lower than this
else if (490 <= score) {10}
else if (400 <= score) {9}
else {
# handles missing scores
# note: 0 is not a valid SAT score; it is just a placeholder
#warning(sprintf("Bad SAT Score: %d\n", score))
0
}
}
conv_GPA <- function(gpa) {
if (gpa <= 4.0) {
return(gpa)
}
else {
if (97 <= gpa && gpa <= 100) {return(4.0)} # A+
else if (93 <= gpa) {return(4.0)} # A
else if (90 <= gpa) {return(3.7)} # A-
else { # no applicant on 100pt scale had < 90 gpa
warning(sprintf("Bad GPA: %d\n", gpa))
3.5
}
}
}
handleAP <- function(how_many) {
if (is.na(how_many)) {0}
else if ('' == how_many) {0}
else if (grepl(pattern = "school does not offer", x = how_many)) {0}
else if (grepl(pattern = '20[+]', x = how_many)) {22}
else {as.integer(how_many)}
}
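# A few quick spot checks of these helpers (hypothetical inputs):
bin_SATs(1530)   # expect 34
bin_SATs(1600)   # expect 36
conv_GPA(95)     # expect 4.0
handleAP("20+")  # expect 22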
df2 <- df2 %>%
# consider only if applied to at least one Ivy League Uni
filter(!(is.na(Brown) & is.na(Columbia) & is.na(Cornell) & is.na(Dartmouth)
& is.na(Harvard) & is.na(Princeton) & is.na(UPenn) & is.na(Yale))) %>%
# drop if missing GPA or not on either 4.0 or 100pt scale
# because they entered their weighted GPA
filter(!is.na(GPA)) %>%
filter(GPA <= 4.0 | GPA > 90) %>% # everyone on the 100pt scale had > 90
# Filling in missing superscore data from single best scores
mutate(ACT_super = pmax(0, ACT_super, ACT_single, na.rm = TRUE)) %>%
mutate(SAT_super = pmax(0, SAT_super, SAT_single, na.rm = TRUE)) %>%
# SAT -> ACT, considering higher if applicant has both
mutate(ACT = pmax(0, ACT_super, mapply(bin_SATs, SAT_super), na.rm = TRUE)) %>%
filter(ACT != 0) %>% # note: 0 is not a valid ACT score; the minimum score is 1
# Converting GPAs to /4.0 scale (/100 scale used in NY)
mutate(GPA = mapply(conv_GPA, GPA)) %>%
# handle edge cases for num of AP/IB/DE courses
mutate(num_AP = mapply(handleAP, num_AP)) %>%
mutate(num_IB = mapply(handleAP, num_IB)) %>%
mutate(num_DE = mapply(handleAP, num_DE)) %>%
# counting college-level courses together
mutate(num_col_lv = num_AP + num_IB + num_DE) %>%
# turn yes/no answers into logical vectors
mutate(hisp_or_latino = (!is.na(hisp_or_latino) & ('Yes' == hisp_or_latino))) %>%
mutate(first_gen = (!is.na(first_gen) & ('Yes' == first_gen))) %>%
# merge admission results into logical disjunction
mutate(adm_ivy = Brown | Columbia | Cornell | Dartmouth | Harvard | Princeton | UPenn | Yale ) %>%
# NA means no acceptances among the Ivies applied to, so treat it as FALSE
mutate(adm_ivy = ifelse(is.na(adm_ivy), FALSE, adm_ivy)) %>%
select(-c(SAT_single, SAT_super, ACT_single, ACT_super)) %>%
select(-c(Brown, Columbia, Cornell, Dartmouth, Harvard, Princeton, UPenn, Yale))
as_tibble(df2)
## # A tibble: 725 x 11
## gender hisp_or_latino race first_gen GPA num_AP num_IB num_DE ACT
## <fct> <lgl> <fct> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male FALSE White FALSE 4 14 0 0 34
## 2 Male FALSE Asian FALSE 3.76 2 0 0 29
## 3 Female FALSE Asian FALSE 3.7 0 0 0 29
## 4 Male FALSE Asian FALSE 4 6 0 0 35
## 5 Male FALSE Asian FALSE 4 4 0 0 35
## 6 Male FALSE Black… FALSE 3.85 7 0 0 34
## 7 Male FALSE White FALSE 3.9 22 0 0 36
## 8 Male FALSE Asian FALSE 3.58 3 0 0 35
## 9 Male FALSE White TRUE 3.67 10 0 0 33
## 10 Male FALSE Asian FALSE 3.9 23 0 0 36
## # ... with 715 more rows, and 2 more variables: num_col_lv <dbl>,
## # adm_ivy <lgl>
It has often been asserted on r/A2C that ‘after a certain point, your scores don’t matter’. (The claim is usually along the lines of \(\geq 1500\) SAT / \(33\) ACT or \(\geq 1530\) SAT / \(34\) ACT making no difference.) But how much truth is there to this statement? Today we will take a look at results just for Ivy League admissions.
Null hypothesis
\(H_0: (\forall n \in \{35, 36\})[P(admit(n)) \leq P(admit(34))]\), that is to say, the chances of being admitted into at least one Ivy League university do not improve for scores above 34.
We will estimate the true probability with this sample probability.
num_34s <- df2 %>%
filter(ACT == 34) %>%
count()
num_34s_admit <- df2 %>%
filter(ACT == 34 & adm_ivy) %>%
count()
p_hat_34 <- as.double(num_34s_admit / num_34s)
num_35s <- df2 %>%
filter(ACT == 35) %>%
count()
num_35s_admit <- df2 %>%
filter(ACT == 35 & adm_ivy) %>%
count()
p_hat_35 <- as.double(num_35s_admit / num_35s)
num_36s <- df2 %>%
filter(ACT == 36) %>%
count()
num_36s_admit <- df2 %>%
filter(ACT == 36 & adm_ivy) %>%
count()
p_hat_36 <- as.double(num_36s_admit / num_36s)
vals <- c(p_hat_34, p_hat_35, p_hat_36)
cbind.data.frame("Score" = c(34, 35, 36), "p_hat" = vals, "percent better than 34" = 100 * (vals - p_hat_34) / p_hat_34)
## Score p_hat percent better than 34
## 1 34 0.3691589 0.000000
## 2 35 0.3771930 2.176327
## 3 36 0.4639175 25.668798
It seems like there is a large difference between 36 and 34 (>25%), and even 35 outperforms 34 by ~2%. But are these results statistically significant?
Notation: \(p_{34} = P[admit(34)]\), \(p_{35} = P[admit(35)]\), etc., and the sample-based estimates are \(\hat{p}_{35}\), etc. \(\bar{X}\) is the sample admit rate of \(n\) applicants who are each admitted with probability \(p_{34}\) under the null.
Using the conventional criterion of \(\alpha = 0.05\).
Null hypothesis: \(p_{35} \leq p_{34}\); we will reject if \(P[\bar{X} > \hat{p}_{35}] < \alpha\)
\(E[\bar{X}] = p_{34}\)
\(Var(\bar{X}) = \frac{p_{34}(1 - p_{34})}{n}\)
n <- num_35s
VarX_bar <- as.double((p_hat_34 * (1 - p_hat_34)) / n)
EX_bar <- p_hat_34
# P(X_bar > p_hat_35)
p <- pnorm(p_hat_35, mean = EX_bar, sd = sqrt(VarX_bar), lower.tail = FALSE)
p
## [1] 0.4007585
\(P[\bar{X} > \hat{p}_{35}] = 0.4007... > \alpha = 0.05\), so we fail to reject the null hypothesis for scores of 35.
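As a cross-check, a two-proportion test that also accounts for the sampling uncertainty in \(\hat{p}_{34}\) could be run on the raw counts (a sketch, not evaluated here; since it pools the two groups, its p-value will differ somewhat from the calculation above):
prop.test(x = c(num_35s_admit$n, num_34s_admit$n),
          n = c(num_35s$n, num_34s$n),
          alternative = "greater")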
What about 36?
Null hypothesis: \(p_{36} \leq p_{34}\); we will reject if \(P[\bar{X} > \hat{p_{36}}] < \alpha\)
n <- num_36s
VarX_bar <- as.double((p_hat_34 * (1 - p_hat_34)) / n)
EX_bar <- p_hat_34
# P(X_bar > p_hat_36)
p <- pnorm(p_hat_36, mean = EX_bar, sd = sqrt(VarX_bar), lower.tail = FALSE)
p
## [1] 0.0265616
\(P[\bar{X} > \hat{p}_{36}] = 0.0265... < \alpha = 0.05\), so we reject the null hypothesis for scores of 36; an applicant with an ACT (or equivalent SAT) score of 36 is significantly more likely to get into at least one Ivy League university. However, this does not account for how standardized test scores correlate with other parts of an application, like GPA, course rigor, and extracurriculars. (That is, a student who scores a 36 or equivalent is likely to have a higher GPA and course rigor as well.)
Let’s take a look at test scores vs GPA.
df2 %>%
ggplot(aes(x = factor(ACT), y = GPA)) +
geom_violin() +
ggtitle("Standardized Test Scores vs GPA") +
xlab("ACT (or equivalent SAT score)") +
ylab("unweighted HS GPA")
df2 %>%
ggplot(aes(x = ACT, y = GPA)) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x) +
ggtitle("Standardized Test Scores vs GPA") +
xlab("ACT (or equivalent SAT score)") +
ylab("unweighted HS GPA")
Clearly there is a positive correlation here.
df2 %>%
ggplot(aes(x = factor(ACT), y = num_col_lv)) +
geom_violin() +
ggtitle("Standardized Test Scores vs college-level courses") +
xlab("ACT (or equivalent SAT score)") +
ylab("# of college-level courses taken (AP+IB+DE)")
df2 %>%
ggplot(aes(x = ACT, y = num_col_lv)) +
geom_point() +
geom_smooth(method = 'lm', formula = y~x) +
ggtitle("Standardized Test Scores vs college-level courses") +
xlab("ACT (or equivalent SAT score)") +
ylab("# of college-level courses taken (AP+IB+DE)")
The correlation here is less prominent, so let’s consider GPA for the following question: how do chances of admission scale with both GPA and test scores?
df2 <- df2 %>%
  # snap GPA to the nearest 0.05 so we can group GPAs into bins
  mutate(rGPA = round((GPA * 2), 1) / 2)
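# e.g. how the rounding above snaps GPAs to the nearest 0.05:
round(3.87 * 2, 1) / 2 # 7.74 -> 7.7 -> 3.85
round(3.98 * 2, 1) / 2 # 7.96 -> 8.0 -> 4.00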
df3 <- df2 %>%
  filter(rGPA >= 3.7 & ACT >= 34) %>%
  group_by(rGPA, ACT) %>%
  mutate(dummy_1 = 1) %>%
  mutate(n = sum(dummy_1)) %>% # group size
  mutate(adm_n = sum(ifelse(adm_ivy, dummy_1, 0))) %>% # admits in group
  mutate(adm_r = adm_n / n) %>% # admit rate per (GPA bin, ACT) cell
  select(c(ACT, rGPA, adm_r)) %>%
  arrange(ACT, rGPA) %>%
  unique()
df3
## # A tibble: 20 x 3
## # Groups: rGPA, ACT [20]
## ACT rGPA adm_r
## <dbl> <dbl> <dbl>
## 1 34 3.7 0.286
## 2 34 3.75 0.143
## 3 34 3.8 0.35
## 4 34 3.85 0.368
## 5 34 3.9 0.324
## 6 34 3.95 0.286
## 7 34 4 0.519
## 8 35 3.7 0.375
## 9 35 3.75 0.125
## 10 35 3.8 0.167
## 11 35 3.85 0.25
## 12 35 3.9 0.4
## 13 35 3.95 0.212
## 14 35 4 0.509
## 15 36 3.75 0
## 16 36 3.8 0.333
## 17 36 3.85 0.286
## 18 36 3.9 0.5
## 19 36 3.95 0.533
## 20 36 4 0.481
df3 %>%
group_by(ACT) %>%
ggplot(aes(x = rGPA, y = adm_r, color = factor(ACT))) +
geom_point() +
geom_line() +
ggtitle("standardized testing and gpa vs admission rate") +
ylab('admit rate') +
xlab('rounded GPA')
Keep in mind that the sample size here is pretty small, so the dip to 0% at (3.75 GPA, 36 ACT) is due to the lack of respondents who had a 3.75 GPA yet managed to score a 36 or 1600.
When accounting for GPA, the data seems a lot more inconclusive, but the lead that 36 holds at 3.90 and 3.95 over 34 and 35 is likely significant, as I suspect that is where the average GPA of admits lies.
df2 %>%
filter(adm_ivy) %>%
summarise(mean(GPA))
## mean(GPA)
## 1 3.932506
Right between 3.90 and 3.95.
For those curious, here’s a visualization showing scores down to 32 and GPAs down to 3.50.
df3 <- df2 %>%
filter(rGPA >= 3.5 & ACT >= 32) %>%
group_by(rGPA, ACT) %>%
mutate(dummy_1 = 1) %>%
mutate(n = sum(dummy_1)) %>%
mutate(adm_n = sum(ifelse(adm_ivy, dummy_1, 0))) %>%
mutate(adm_r = adm_n / n) %>%
select(c(ACT, rGPA, adm_r)) %>%
arrange(ACT, rGPA) %>%
unique()
df3 %>%
group_by(ACT) %>%
ggplot(aes(x = rGPA, y = adm_r, color = factor(ACT))) +
geom_point() +
geom_line() +
ggtitle("standardized testing and gpa vs admission rate") +
ylab('admit rate') +
xlab('rounded GPA')
Remember that the crazy spikes are due to the small dataset size. (No, a 36 with a 3.6 GPA does not give you a 100% chance; no, a 32 is not better than a 36 if you have exactly a 3.95 GPA.)
Applicants who score a 36 or 1600 are significantly more likely to be admitted into at least one Ivy League university. However, it is inconclusive whether this is directly caused by scoring higher, as we lack data on extracurriculars, essays, and the like.
We would like to investigate how SAT/ACT scores, GPA, and the number of AP/IB/DE (Advanced Placement, International Baccalaureate, and Dual Enrollment) courses affect college admissions. Standardized test scores, grade point average, and course rigor are often, if not always, some of the most important factors in college admissions. (Source: common data sets; you can read more about those here.)
While considering factors like race and gender in these kinds of models is usually unethical (or unconstitutional), they will be taken into account here because universities do consider them in their ‘holistic’ admissions process. Again, see the Common Data Sets for more details here.
Before providing data to the models, we will one-hot encode gender and race; respondents who indicated multiple races in the survey will get multiple 1s.
df3 <- df2 %>%
mutate(Male = ('Male' == gender)) %>%
mutate(Female = ('Female' == gender)) %>%
mutate(White = grepl(pattern = 'White', x = race)) %>%
mutate(Black = grepl(pattern = 'Black or African American', x = race)) %>%
mutate(Asian = grepl(pattern = 'Asian', x = race)) %>%
mutate(Native = grepl(pattern = 'American Indian or Alaska Native', x = race)) %>%
mutate(PIsland = grepl(pattern = 'Pacific Islander', x = race)) %>%
select(c(adm_ivy, GPA, ACT, num_col_lv, Male, Female, first_gen, hisp_or_latino, White, Black, Asian, Native, PIsland))
write.csv(df3, file = "preprocessed.csv") # note: row names are written too, as an extra first column
as_tibble(df3)
## # A tibble: 725 x 13
## adm_ivy GPA ACT num_col_lv Male Female first_gen hisp_or_latino
## <lgl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl>
## 1 TRUE 4 34 14 TRUE FALSE FALSE FALSE
## 2 FALSE 3.76 29 2 TRUE FALSE FALSE FALSE
## 3 FALSE 3.7 29 0 FALSE TRUE FALSE FALSE
## 4 TRUE 4 35 6 TRUE FALSE FALSE FALSE
## 5 TRUE 4 35 4 TRUE FALSE FALSE FALSE
## 6 TRUE 3.85 34 7 TRUE FALSE FALSE FALSE
## 7 FALSE 3.9 36 22 TRUE FALSE FALSE FALSE
## 8 TRUE 3.58 35 3 TRUE FALSE FALSE FALSE
## 9 TRUE 3.67 33 10 TRUE FALSE TRUE FALSE
## 10 FALSE 3.9 36 23 TRUE FALSE FALSE FALSE
## # ... with 715 more rows, and 5 more variables: White <lgl>, Black <lgl>,
## # Asian <lgl>, Native <lgl>, PIsland <lgl>
Let’s try to predict admissions using deep learning. I will be using Python for this part.
import sys
print(sys.version)
# suppresses the messages printed to stderr about tensorflow
#sys.stderr = open('/dev/null', 'w')
## 3.6.5 (default, Apr 1 2018, 05:46:30)
## [GCC 7.3.0]
import pandas as pd
import numpy as np
from keras import Sequential
## /home/george/.local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
## from ._conv import register_converters as _register_converters
## Using TensorFlow backend.
from keras.layers import Dense, Dropout
np.random.seed(320)
df = pd.read_csv('preprocessed.csv')
# note: the csv includes the R row-name column, so after dropping adm_ivy
# below there are 13 feature columns left (hence input_dim=13 in the model)
# turn booleans into 1s and 0s
for col in df.columns:
    if type(df[col].values[0]) is np.bool_:
        df[col] = df[col].apply(int)
# split up training and testing data
train = df.sample(frac=0.9)
test = df.drop(train.index)
y_train, y_test = train['adm_ivy'].values, test['adm_ivy'].values
del train['adm_ivy']
del test['adm_ivy']
x_train, x_test = train.values, test.values
# build the model
model = Sequential()
model.add(Dense(32, input_dim=13, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dropout(rate=0.5))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# train
model.fit(x_train, y_train, epochs=30, batch_size=10, verbose = False)
# evaluate
scores = model.evaluate(x_test, y_test)
##
## 32/73 [============>.................] - ETA: 0s
## 73/73 [==============================] - 0s 410us/step
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
##
## acc: 68.49%
Wow! 68% accuracy. That’s impressive for a task as complex as college admissions, especially with such a limited feature set. A little too impressive… What’s going on here?
n_admit <- df3 %>% filter(adm_ivy) %>% count()
n <- df3 %>% count()
1 - (n_admit / n)
## n
## 1 0.6565517
Suspiciously close to the percentage of applicants not admitted: the model just learned to guess “Not admitted” every time. Deviations from (1 - admit rate) are due to the random sampling of training and testing data. Let’s make sure that the training data is balanced.
import sys
print(sys.version)
# suppresses the messages printed to stderr about tensorflow
#sys.stderr = open('/dev/null', 'w')
## 3.6.5 (default, Apr 1 2018, 05:46:30)
## [GCC 7.3.0]
import pandas as pd
import numpy as np
from keras import Sequential
from keras.layers import Dense, Dropout
df = pd.read_csv('preprocessed.csv')
# turn booleans into 1s and 0s
for col in df.columns:
    if type(df[col].values[0]) is np.bool_:
        df[col] = df[col].apply(int)
# puts training data at 50/50 admit/not admit
def normalize(train):
    yes_admit = train.query('adm_ivy == 1')
    n_admit = yes_admit.shape[0]
    no_admit = train.query('adm_ivy == 0')
    no_admit = no_admit.sample(n = n_admit)
    frames = [yes_admit, no_admit]
    result = pd.concat(frames)
    result = result.sample(frac=1).reset_index(drop=True)
    return result
# split up training and testing data
train = df.sample(frac=0.85)
test = df.drop(train.index)
train = normalize(train)
y_train, y_test = train['adm_ivy'].values, test['adm_ivy'].values
del train['adm_ivy']
del test['adm_ivy']
x_train, x_test = train.values, test.values
# build the model
model = Sequential()
model.add(Dense(32, input_dim=13, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dropout(rate=0.5))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
# train
model.fit(x_train, y_train, epochs=15, batch_size=12, verbose = False)
# evaluate
scores = model.evaluate(x_test, y_test)
##
## 32/109 [=======>......................] - ETA: 0s
## 109/109 [==============================] - 0s 377us/step
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
##
## acc: 77.06%
n = sum(list(y_test)) / len(y_test)
print(f"{n} | {1-n}")
## 0.21100917431192662 | 0.7889908256880733
Even with the training set balanced 50/50, the model still guesses one result the vast majority of the time. Predicting admissions results is a very challenging task, especially from such a restricted feature set and small dataset.
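One alternative to downsampling worth trying is class weighting, which keeps every training row but penalizes mistakes on the minority class more heavily. A minimal sketch reusing the variables above (the 1:3 weights are illustrative, not tuned):
# reweight classes instead of downsampling the majority class
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=15, batch_size=12, verbose=False,
          class_weight={0: 1.0, 1: 3.0})
Even then, with only ~700 rows and a handful of features, I would not expect dramatic gains.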