I was fortunate to get access to a large volume of student swipe data from my university’s recreational center. The data covers every swipe for a complete academic year, anonymized of course.

Screen Shot 2020-03-16 at 12.15.16 PM.png

Screen Shot 2020-03-16 at 12.16.53 PM.png

Exploratory data analysis and data cleaning

Pandas Profiling has been a game-changer since I found it. It is an excellent way to get a good first dive into the data.

Screen Shot 2020-03-16 at 12.18.37 PM.png

Screen Shot 2020-03-16 at 12.20.13 PM.png

Screen Shot 2020-03-16 at 12.21.34 PM.png

Yikes, that’s 60 columns. Many of them are highly collinear, though, and some others are missing a lot of values. We will remove these because they add little or no new information. We will use the results from our earlier profiling to select the columns to delete: every column with more than 50% missing values is dropped, which is not an unreasonable threshold. It also happens that these columns are intuitively unlikely to have any effect on the outcome.
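A minimal sketch of this kind of column pruning, on a toy frame. The column names, the toy values, and the 0.95 correlation cutoff are assumptions for illustration; only the 50% missing-value threshold comes from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real swipe data (names and values are made up).
data = pd.DataFrame({
    "swipes": [10, 25, 3, 40, 18, 7],
    "gpa": [3.1, 3.0, 3.5, 3.2, 2.6, 3.8],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
    "swipes_copy": [20, 50, 6, 80, 36, 14],  # exactly 2 x "swipes"
})

# Step 1: columns where more than 50% of the values are missing.
missing_fraction = data.isna().mean()
sparse_cols = missing_fraction[missing_fraction > 0.5].index.tolist()

# Step 2: absolute pairwise correlations among the remaining numeric columns.
numeric = data.drop(columns=sparse_cols).select_dtypes(include="number")
corr = numeric.corr().abs()

# Keep only the upper triangle so each pair is considered once,
# then flag the later column of any highly correlated pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear_cols = [col for col in upper.columns if (upper[col] > 0.95).any()]

cleaned = data.drop(columns=sparse_cols + collinear_cols)
print(cleaned.columns.tolist())
```

Dropping the later column of each correlated pair is an arbitrary choice; in practice it is worth checking which of the two is easier to interpret before deleting one.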

Frequency of use versus student GPA

corec1.png

This is an encouraging early indication that the frequency of use is correlated with GPA.

data['Type of User'].value_counts()

Screen Shot 2020-03-16 at 12.26.37 PM.png

It looks like these classes are imbalanced.
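To put a number on the imbalance, relative frequencies are easier to read than raw counts. The sketch below uses invented labels ("Light", "Regular", "Heavy"); the actual categories in 'Type of User' may differ.

```python
import pandas as pd

# Invented labels and counts, for illustration only.
data = pd.DataFrame(
    {"Type of User": ["Light"] * 6 + ["Regular"] * 3 + ["Heavy"] * 1}
)

# Share of each class instead of raw counts.
shares = data["Type of User"].value_counts(normalize=True)
print(shares)

# Ratio of the largest class to the smallest one.
imbalance_ratio = shares.max() / shares.min()
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```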

Role of gender

corec2.png

International students

corec3.png

It was surprising to me that international students have such a high average GPA compared to domestic students.

Screen Shot 2020-03-16 at 12.31.40 PM.png

Screen Shot 2020-03-16 at 12.31.46 PM.png

It seems that being a regular recreational center user is more strongly associated with GPA for women.

corec4.png

There is a weak but visible correlation between number of swipes and overall GPA.
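One way to quantify such a relationship is the Pearson correlation coefficient. The sketch below runs on synthetic data generated to have a weak positive link; the column names "swipes" and "gpa" and all the numbers are assumptions, not values from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: GPA depends only slightly on the swipe count.
swipes = rng.integers(0, 150, size=500)
noise = rng.normal(0, 0.5, size=500)
gpa = np.clip(3.0 + 0.003 * swipes + noise, 0, 4)
df = pd.DataFrame({"swipes": swipes, "gpa": gpa})

# Series.corr defaults to the Pearson correlation coefficient.
r = df["swipes"].corr(df["gpa"])
print(f"Pearson r = {r:.2f}")
```

A small positive r like this is what a "weak correlation" looks like numerically; it says nothing about whether one variable causes the other.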

corec5.png

corec6.png

Graduate students tend to be more active.

Honors students

Screen Shot 2020-03-16 at 1.18.10 PM.png

corec7.png

Let’s try to predict GPA classes

Screen Shot 2020-03-16 at 1.19.20 PM.png

Total Features:  25 categorical + 15 numerical = 40 features

Screen Shot 2020-03-16 at 1.21.43 PM.png

We have to be careful about how we handle the missing values. For instance, NaN for residence hall could just mean that the student did not use on-campus housing, so NaN in that case could actually provide useful information. From the above breakdown, this is the plan for the NaN values:

  1. Geocluster: Impute with mode
  2. Citizenship type: Leave as is, i.e., NaN forms a unique value of its own
  3. Academic School Grouping, Program: Impute with mode
  4. Year GPA: Impute with median
  5. Semester Honors: Leave as is
  6. Spring Credits Attempted, Spring Credits Earned, Spring GPA: Impute with median

All this magic happens below:
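A minimal sketch of this imputation plan on a tiny made-up frame. The column names follow the list above, but the values are invented, and filling the "leave as is" columns with an explicit "Missing" label is one possible reading of that rule, not necessarily the exact approach used here.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; only a subset of the listed columns is shown.
data = pd.DataFrame({
    "Geocluster": ["A", "A", None, "B"],
    "Citizenship type": ["Domestic", None, "International", "Domestic"],
    "Program": ["BS", None, "BS", "MS"],
    "Year GPA": [3.2, np.nan, 3.8, 2.9],
    "Spring GPA": [3.0, 3.5, np.nan, 2.5],
})

mode_cols = ["Geocluster", "Program"]
median_cols = ["Year GPA", "Spring GPA"]
keep_nan_cols = ["Citizenship type"]

# Categorical columns: fill with the most frequent value.
for col in mode_cols:
    data[col] = data[col].fillna(data[col].mode().iloc[0])

# Numerical columns: fill with the median of the observed values.
for col in median_cols:
    data[col] = data[col].fillna(data[col].median())

# "Leave as is": give NaN an explicit label so it becomes its own category.
for col in keep_nan_cols:
    data[col] = data[col].fillna("Missing")

print(data)
```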

Now we are on to encoding the features. The categorical features are encoded using OneHotEncoder, and the output is encoded using LabelEncoder.
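As a rough illustration of the two encoders from scikit-learn, here is a sketch on made-up rows. The column names and the GPA class labels ("High", "Medium", "Low") are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up example rows; names and labels are hypothetical.
X_cat = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Citizenship type": ["Domestic", "International", "Missing", "Domestic"],
})
y = pd.Series(["High", "Low", "Medium", "High"])

# One binary column per category. With handle_unknown="ignore", a category
# not seen during fit is encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X_cat).toarray()

# Map the target labels to integers 0..n_classes-1 (classes sorted alphabetically).
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(X_encoded.shape)         # (4, 5): 2 gender columns + 3 citizenship columns
print(label_encoder.classes_)  # ['High' 'Low' 'Medium']
print(y_encoded)               # [0 1 2 0]
```

The numerical features are left out of this sketch; they would be concatenated with the one-hot columns before training.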

Screen Shot 2020-03-16 at 1.23.16 PM.png

Looks great. Let’s train and test then. We will start with a dummy classifier.
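A dummy classifier gives a baseline that any real model has to beat, which matters with imbalanced classes. A minimal sketch on synthetic data follows; the features, class proportions, and the "most_frequent" strategy are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and imbalanced synthetic classes.
X = rng.normal(size=(300, 4))
y = rng.choice([0, 1, 2], size=300, p=[0.6, 0.3, 0.1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Always predicts the most frequent class seen in the training labels.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, baseline_pred)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

With imbalanced classes this baseline can look deceptively good on accuracy alone, which is why it is worth reporting next to the real model.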

Screen Shot 2020-03-16 at 1.26.06 PM.png

Screen Shot 2020-03-16 at 1.26.44 PM.png

Screen Shot 2020-03-16 at 1.27.39 PM.png

Screen Shot 2020-03-16 at 1.28.20 PM.png

We are not done yet. Our results depend strongly on the train/test split of our dataset: the accuracy and F1 score will vary with the random_state value passed in. We can address this by running the model on many different train/test splits, as below:
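A minimal sketch of the repeated-split idea, using synthetic data and a logistic regression as a stand-in model. Both are assumptions for illustration, not the actual dataset or classifier used above, and the choice of 20 seeds is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the GPA classes.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)

accuracies, f1_scores = [], []
for seed in range(20):
    # A different random_state gives a different train/test split each time.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred, average="weighted"))

print(f"accuracy:    {np.mean(accuracies):.2f} +/- {np.std(accuracies):.2f}")
print(f"weighted F1: {np.mean(f1_scores):.2f} +/- {np.std(f1_scores):.2f}")
```

Plotting a histogram of `accuracies` and `f1_scores` then shows how much the metrics move from split to split.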

corec8.png

corec9.png

The above distribution gives us confidence that the accuracy and F1 score stay in the same range as we saw previously.

Conclusion

The moral of the story is “Study hard but also make sure you stay physically active”. Also, we can predict your GPA class with 72% accuracy. It seems like staying physically active is linked to academic performance in a non-trivial way.

The following were some of the interesting observations:

  1. In general, students who use the rec center regularly had a higher average GPA.
  2. Female students have a higher average GPA than their male counterparts, and the effect of being a heavy rec center user is more pronounced for them.
  3. International students tend to have a higher average GPA.
  4. Honors students use the rec center more and have a higher than average GPA.
  5. We can predict the GPA class of a student from other characteristics and frequency of use with a reasonable accuracy (72%).