---
title: "Predictive modelling: predicting course outcomes in a blended postgraduate course"
output: html_notebook
---

Load the required R packages (additional ones will be loaded as needed)
```{r message=FALSE}
library(ggplot2)
library(dplyr)
library(tidyr)
```

## Exploratory data analysis

### Load and explore logged events
```{r}
events = read.csv("data/events.csv")
```

```{r}
str(events)
```

Transform the timestamp data into a format suitable for processing date and time data (POSIXct)
```{r}
events$ts <- as.POSIXct(events$ts)
```

```{r}
events$ts[1:10]
```

We can order the events, for each user, based on the timestamp.

Note: we will be using R's native pipe operator (`|>`) to make the code easier to understand and follow
```{r}
events |>
  arrange(user, ts) -> events

head(events)
```
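As a quick toy illustration of the pipe (synthetic values, not course data), `x |> f()` is just another way of writing `f(x)`:
```{r}
# the pipe passes the left-hand side as the first argument of the right-hand call
c(2, 4, 6) |> mean()   # same as mean(c(2, 4, 6)), i.e. 4
```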

Let's start by examining the time range for which the data is available. It should (roughly) coincide with the start and end of the course
```{r}
course_start <- min(events$ts)
print(course_start)
```

```{r}
course_end <- max(events$ts)
print(course_end)
```

The course length (in weeks):
```{r}
difftime(course_end, course_start, units="week")
```

Since we want to make predictions based on the first few weeks of data, we need to add a variable denoting the (relative) course week
```{r}
events |>
  mutate(week = strftime(ts, "%V") |> as.integer(),
         current_week = week - min(week) + 1) |>
  select(-week) -> events
```

Check the distribution of action counts across the course weeks
```{r}
table(events$current_week)
```
Also in proportions
```{r}
table(events$current_week) |> prop.table() |> round(digits = 3)
```

Examine character variables that represent different types of actions and logged events
```{r}
apply(events |> select(action, log), 
      2, 
      function(x) length(unique(x)))
```

Let's examine the actions
```{r}
table(events$action)|> prop.table() |> round(digits = 3)
```
Some of these actions refer to individual course topics, that is, to accessing the lecture materials on distinct course topics. These are:
General, Applications, Theory, Ethics, Feedback, La_types.
We will rename these actions to make their meaning clearer
```{r}
course_topics <- c("General", "Applications", "Theory",  "Ethics", "Feedback", "La_types")

events |>
  mutate(action = ifelse(test = action %in% course_topics, 
                         yes = paste("Lecture", action, sep = "_"), 
                         no = action)) -> events
```

```{r}
unique(events$action)
```

Examine also the log column
```{r}
table(events$log) |> as.data.frame() |> arrange(desc(Freq)) |> head(10)
```
There are too many distinct values, so we will set this variable aside, at least for now

### Load and examine grades data
```{r}
grades <- read.csv("data/grades.csv")
```

```{r}
str(grades)
```
Examine the summary statistics and distribution of the final grade
```{r}
summary(grades$grade)
```

```{r}
ggplot(grades, aes(x = grade)) +
  geom_density() +
  labs(x = "Final grade", 
       title = "Distribution of the final grade") +
  scale_x_continuous(breaks = 1:10) +
  theme_minimal()
```
It is not normally distributed, but skewed towards higher grade values

Let's add *course_outcome* as a binary variable indicating whether a student had a good or weak course outcome.
Students whose final grade is above the 50th percentile (median) will be considered as having a good course outcome (High); the rest will be considered as having a weak course outcome (Low)
```{r}
grades |>
  mutate(course_outcome = ifelse(test=grade > median(grade),
                                 yes = "High", no = "Low")) |>
  mutate(course_outcome=factor(course_outcome)) -> grades
```

```{r}
table(grades$course_outcome)
```
This gives us a perfectly balanced data set for the outcome prediction (classification) task. 
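To double-check the balance, we can also look at the class proportions (for a perfectly balanced split, both should be 0.5):
```{r}
table(grades$course_outcome) |> prop.table()
```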


## Features

Two groups of action-based features will be computed and used for prediction
(note: active days are days with at least one learning action):

* Features based on learning action counts:
  * Total number of each type of learning action (considering active days only)
  * Average number of actions (of any type) per day (considering active days only)
  * Entropy of daily action counts (considering active days only)

* Features based on the number of active days:
  * Number of active days
  * Average time distance between two consecutive active days

Since the idea is to create prediction models based on different numbers of weeks of data, we will also need to compute the feature values for different numbers of course weeks. Thus, we will create functions that compute the features from the events data restricted to the given number of course weeks (passed in as the input parameter).

To compute features based on counts per day, we need to add the date variable
```{r}
events$date = as.Date(events$ts)
```

(1) Start with the total number of each type of learning action

Note: to avoid having too many features (as action counts), we will consider all actions related to access to the lecture materials on different topics as one kind of action ('Lecture')
```{r}
actions_tot_count <- function(events_data) {
  events_data |>
    mutate(action = ifelse(startsWith(action, "Lecture"), yes = "Lecture", no = action)) |>
    group_by(user, action) |>
    count() |>
    pivot_wider(id_cols = user, 
                names_from = action, 
                names_prefix = "cnt_",
                values_from = n, 
                values_fill = 0)
}
```

Check the function with the data from the first two weeks of the course
```{r}
events_2_weeks <- events |> filter(current_week <= 2)
actions_tot_count(events_2_weeks)
```

(2) Next, compute average number of actions (of any type) per day

```{r}
avg_actions_per_day <- function(events_data) {
  events_data |>
    count(user, date) |>
    group_by(user) |>
    # average (mean) number of actions per active day
    summarise(avg_daily = mean(n))
}
```

```{r}
avg_actions_per_day(events_2_weeks)
```

(3) Entropy of daily action counts

Entropy is a measure of disorder (unpredictability) in a system. Here it is used as an indicator of the regularity of learning: the more evenly a student's learning actions are spread across their active days, the higher the entropy, whereas activity concentrated in one or a few bursts yields low entropy.
Note: A nice explanation of the intuition behind the formula of Shannon entropy is given in [this video](https://www.youtube.com/watch?v=0GCGaw0QOhA).

Since we want to compute the entropy of daily action counts, we need to approximate the probability of a learning action falling on each day. We will do that by taking the proportion of each day's action count with respect to the total action count of the given student
```{r}
entropy_of_action_counts <- function(events_data) {
  events_data |>
    count(user, date, name = "daily_action_count") |>
    group_by(user) |>
    mutate(daily_action_prop = daily_action_count / sum(daily_action_count)) |>
    summarise(entropy = -sum(daily_action_prop * log(daily_action_prop)))
}
```

```{r}
entropy_of_action_counts(events_2_weeks)
```
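Before moving on, a toy example (synthetic counts, not course data) may help build intuition: a student whose 20 actions are spread evenly across four active days has maximal entropy, while concentrating almost all activity in a single day yields much lower entropy:
```{r}
# evenly spread daily counts (5, 5, 5, 5 out of 20 actions)
p_even <- c(5, 5, 5, 5) / 20
-sum(p_even * log(p_even))    # log(4), approx. 1.386, the maximum for 4 days

# bursty daily counts (17, 1, 1, 1 out of 20 actions)
p_burst <- c(17, 1, 1, 1) / 20
-sum(p_burst * log(p_burst))  # approx. 0.588, much lower
```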

(4) Number of active days (= days with at least one learning action)

```{r}
active_days_count = function(events_data) {
  events_data |>
    group_by(user) |>
    summarise(active_days_cnt = n_distinct(date))
}
```

```{r}
active_days_count(events_2_weeks)
```

(5) Average time distance between two consecutive active days

Note: for students with only one active day, avg_aday_dist will be missing. To avoid losing students due to this missing value, we will replace missing values with a large number (e.g., 2 x the maximum distance), thus indicating that such a student rarely (if ever) got back to the course activities
```{r}
avg_dist_active_days <- function(events_data) {
  events_data |>
    distinct(user, date) |>
    group_by(user) |>
    arrange(date, .by_group = TRUE) |>
    mutate(prev_aday = lag(date),
           aday_dist = as.numeric(difftime(date, prev_aday, units = "days"))) |>
    # average (mean) distance; missing for students with a single active day
    summarise(avg_aday_dist = mean(aday_dist, na.rm = TRUE)) -> df

  max_aday_dist <- max(df$avg_aday_dist, na.rm = TRUE)
  df |>
    mutate(avg_aday_dist = ifelse(is.na(avg_aday_dist),
                                  yes = 2 * max_aday_dist,
                                  no = avg_aday_dist))
}
```

```{r}
avg_dist_active_days(events_2_weeks) |> summary()
```

### Create feature set for 2 weeks of data and examine feature relevance

Create a function that will allow for creating a feature set for any (given) number of course weeks 
```{r}
create_feature_set = function(events_data) {
  f1 = actions_tot_count(events_data)
  f2 = avg_actions_per_day(events_data)
  f3 = entropy_of_action_counts(events_data)
  f4 = active_days_count(events_data)
  f5 = avg_dist_active_days(events_data)

  f1 |> 
    inner_join(f2, by='user') |>
    inner_join(f3, by='user') |>
    inner_join(f4, by='user') |>
    inner_join(f5, by='user') |>
    as.data.frame()
}
```

Create the feature set based on the first two weeks of data
```{r}
w2_feature_set = create_feature_set(events_2_weeks)

str(w2_feature_set)
```

```{r}
summary(w2_feature_set)
```

Add the outcome variable
```{r}
w2_feature_set |>
  inner_join(grades |> select(user, course_outcome), by = "user") -> w2_data
```

Examine the relevance of features for the prediction of the outcome variable

Let's first see how we can do it for one variable 
```{r}
ggplot(w2_data, aes(x = cnt_Course_view, fill=course_outcome)) +
  geom_density(alpha=0.5) +
  theme_minimal()
```

Now, let's do it for all features at once.

Note: the notation `.data[[fn]]` in the code below allows us to access a column of the 'current' data frame (in this case, `w2_data`) whose name is given as the function's input variable (`fn`)
```{r}
feature_names <- colnames(w2_feature_set |> select(-user))

lapply(feature_names,
       function(fn) {
         ggplot(w2_data, aes(x = .data[[fn]], fill=course_outcome)) +
           geom_density(alpha=0.5) +
           theme_minimal()
       })
```

The plots suggest that all features are potentially relevant for predicting the course outcome.


## Predictive modeling

Load additional R packages required for model building and evaluation 
```{r message=FALSE}
library(caret)
library(rpart)
```

We will use a decision tree (as implemented in the rpart package) as the classification method, and will build several decision tree (DT) models, one for each of the first five weeks of the course. We will build each model using the optimal value of the `cp` hyper-parameter, identified through 10-fold cross-validation (as we did before). 

We will evaluate the models using the same metrics as before: accuracy, precision, recall, and F1
```{r}
build_DT_model <- function(train_data) {
  
  # grid of candidate values for the cp (complexity) hyper-parameter
  cp_grid <- expand.grid(cp = seq(0.001, 0.1, 0.005))
  
  ctrl <- trainControl(method = "cv", 
                       number = 10,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)
  
  dt <- train(x = train_data |> select(-course_outcome),
              y = train_data$course_outcome,
              method = "rpart",
              metric = "ROC",
              tuneGrid = cp_grid,
              trControl = ctrl)
  
  dt$finalModel
}
```


```{r}
get_evaluation_measures <- function(model, test_data) {
  
  predicted_vals <- predict(model, 
                            test_data |> select(-course_outcome),
                            type = 'class')
  actual_vals <- test_data$course_outcome
  
  cm <- table(actual_vals, predicted_vals)
  
  # low achievement in the course is considered the positive class
  TP <- cm[2,2]
  TN <- cm[1,1]
  FP <- cm[1,2]
  FN <- cm[2,1]

  accuracy = sum(diag(cm)) / sum(cm)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  F1 <- (2 * precision * recall) / (precision + recall)
  
  c(Accuracy = accuracy, 
    Precision = precision, 
    Recall = recall, 
    F1 = F1)
  
}
```


### Create (classification) models for predicting course outcome, based on progressively more weeks of events data

Starting from week 1, up to week 5, create predictive models and examine their performance
```{r warning=FALSE, message=FALSE}
models <- list()
eval_measures <- list()

for(k in 1:5) {
  
  print(paste("Starting computations for week", k))
  
  week_k_events <- events |> filter(current_week <= k)
  ds <- create_feature_set(week_k_events)
  ds <- inner_join(ds, grades, by = "user")
  
  set.seed(2023)
  train_indices <- createDataPartition(ds$course_outcome, 
                                       p = 0.8, list = FALSE)
  train_ds <- ds[train_indices,] |> select(-c(user, grade))
  test_ds <- ds[-train_indices,] |> select(-c(user, grade))

  dt <- build_DT_model(train_ds)
  eval_dt <- get_evaluation_measures(dt, test_ds)
  
  models[[k]] <- dt
  eval_measures[[k]] <- eval_dt
}
```

Compare the models based on the evaluation measures
```{r}
# transform the eval_measures list into a df
eval_df <- bind_rows(eval_measures)

# embellish the evaluation report by: 
# 1) adding the week column; 
# 2) rounding the metric values to 4 digits; 
# 3) rearranging the order of columns 
eval_df |>
  mutate(week = 1:5) |>
  mutate(across(Accuracy:F1, \(x) round(x, digits = 4))) |>
  select(week, Accuracy:F1)
```
The results suggest that, already early in the course, it is possible to predict fairly well which students are at risk of low performance.

Examine the importance of features in an early-in-the-course model with good performance (here, the week 2 model)
```{r}
models[[2]]$variable.importance |> 
  as.data.frame() |>
  rename(importance = `models[[2]]$variable.importance`)
```
Regularity of daily action counts and the frequency of accessing information on the course home page (course info, syllabus, news, ...) are the most important features. Next come engagement in group work and the overall level of engagement in the course (number of active days and average number of actions per active day).