---
title: "Predictive modelling: predicting course outcomes in a blended postgraduate course"
output: html_notebook
---

Load the required R packages (additional ones will be loaded as needed)
```{r message=FALSE}
library(ggplot2)
library(dplyr)
library(tidyr)
```

## Exploratory data analysis

### Load and explore logged events
```{r}
events <- read.csv("data/events.csv")
```

```{r}
str(events)
```

Transform the timestamp data into a format suitable for processing date and time data (POSIXct)
```{r}
events$ts <- as.POSIXct(events$ts)
```

```{r}
events$ts[1:10]
```

```{r}
class(events$ts)
```


We can now order the events, for each user, based on the time stamp 

Note: we will be using R's native pipe operator (`|>`) to make the code easier to understand and follow 
```{r}
events |> arrange(user, ts) -> events
```
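
For readers new to the native pipe: `x |> f(y)` is just shorthand for `f(x, y)`, so a pipeline reads top to bottom as a sequence of steps. A minimal self-contained illustration (using the built-in `mtcars` data rather than our events):
```{r}
# the pipe passes the left-hand side as the first argument
# of the right-hand side call, so these two calls are equivalent
identical(head(mtcars, 3), mtcars |> head(3))  # TRUE
```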

```{r}
events |> head(20) 
```

Let's start by examining the time range the data is available for. 
It should (roughly) coincide with the start and the end of the course
```{r}
# start of the course
start_ts <- min(events$ts) 
start_ts
```

```{r}
# end of the course
end_ts <- max(events$ts)
end_ts
```

The course length (in weeks):
```{r}
difftime(end_ts, start_ts, units="weeks")
```

Since we want to make predictions based on the first few weeks of data, we need to add a course-week variable 
```{r}
# ?strftime
# strftime(events$ts[1:10], format = "%V") |> as.integer()

events |>
  mutate(week = strftime(ts, format = "%V") |> as.integer()) |>
  mutate(course_week = week - min(week) + 1) |> 
  select(-week) -> events 

```

Check the distribution of event counts across the course weeks
```{r}
table(events$course_week)
```

Also in proportions
```{r}
table(events$course_week) |> prop.table() |> round(digits = 4)
```

Examine character variables that represent different types of actions and logged events
```{r}
#unique(events$action)
unique(events$log)
```

Let's examine the actions closer
```{r}
table(events$action) |> prop.table() |> round(4)
```
Some of these actions represent access to lecture materials on distinct course topics. These are:
General, Applications, Theory, Ethics, Feedback, and La_types. 
We will rename these actions to make their meaning clearer
```{r}
course_topics <- c("General", "Applications", "Theory",  "Ethics", "Feedback", "La_types")

events |>
  mutate(action = ifelse(action %in% course_topics,
                         paste0("Lecture_", action),
                         action)) -> events
```

```{r}
table(events$action) |> prop.table() |> round(4)
```

Examine also the log column
```{r}
table(events$log) |> prop.table() |> round(4)
```



### Load and examine grades data
```{r}
grades <- read.csv("data/grades.csv")
```

```{r}
str(grades)
```

Examine the summary statistics and distribution of the final grade
```{r}
summary(grades$grade)
```

```{r}
ggplot(grades, aes(x=grade)) +
  geom_density() +
  scale_x_continuous(breaks = 1:10) +
  theme_minimal()
```

Let's add *course_outcome* as a binary variable derived from the final grade. 
Students whose final grade is above the 50th percentile (median) will be considered as having a good course outcome (HIGH); the rest will be considered as having a weak course outcome (LOW)
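
A tiny toy example of the median split (made-up grades, not the course data). Note that grades exactly equal to the median end up in LOW, so the two classes need not be perfectly balanced:
```{r}
g <- c(4, 6, 6, 7, 9)                 # toy grades; median(g) is 6
ifelse(g > median(g), "HIGH", "LOW")
# "LOW" "LOW" "LOW" "HIGH" "HIGH"
```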
```{r}
#sum(is.na(grades$grade))

grades |>
  mutate(course_outcome = ifelse(grade > median(grade), "HIGH", "LOW"),
         course_outcome = as.factor(course_outcome)) -> grades
```

Examine the distribution of the outcome variable (though we should already know it)
```{r}
table(grades$course_outcome)
```


## Features

Two groups of action-based features will be computed and used for prediction:

* Features based on learning action counts:
  * Total number of each type of learning action
  * Average number of actions (of any type) per day (considering active days only)
  * Entropy of daily action counts (considering active days only)

* Features based on the number of active days (= days with at least one learning action):
  * Number of active days
  * Average time distance between two consecutive active days

Since the idea is to create prediction models based on different numbers of weeks of data, we will need to compute feature values for different numbers of course weeks. Thus, we will write functions that compute features from the events data already filtered to the given number of course weeks. 

To compute features based on counts per day, we need to add the date variable
```{r}
events$date <- as.Date(events$ts)
head(events, 20)
```

(1) Start with the total number of each type of learning action 

Note: to avoid having too many features (one count per action type), we will treat all actions related to accessing lecture materials on different topics as a single kind of action ('Lecture')
```{r}
actions_tot_count <- function(events_data) {
  events_data |>
    mutate(action = ifelse(startsWith(action, "Lecture"), "Lecture", action)) |>
    group_by(user, action) |>
    count() |>
    pivot_wider(id_cols = user,
                names_from = action,
                names_prefix = "Cnt_",
                values_from = n,
                values_fill = 0)
}
```

Check the function with the data from the first two weeks of the course
```{r}
w2_data <- events |> filter(course_week <= 2)
actions_tot_count(w2_data)
```

(2) Next, compute the average number of actions (of any type) per day (the median is used as a robust average, since daily counts tend to be skewed)

```{r}
avg_actions_per_day <- function(events_data) {
  
  events_data |>
    count(user, date) |>
    group_by(user) |>
    # the median is used as a robust average of the daily action counts
    summarise(avg_action_cnt = median(n))
  
}
```

Check the function with the data from the first two weeks of the course
```{r}
avg_actions_per_day(w2_data)
```

(3) Entropy of daily action counts

Entropy is a measure of disorder in a system. Here it is used as an indicator of how (ir)regularly learning activity is distributed over time: the more evenly a student's actions are spread across their active days, the higher the entropy; activity concentrated in a few bursts yields low entropy. 
Note: A nice explanation of the intuition behind the formula of Shannon entropy is given in [this video](https://www.youtube.com/watch?v=0GCGaw0QOhA).

Since we want to compute the entropy of daily action counts, we need to approximate the probability of each day's action count. We will do that by taking the proportion of the daily action count with respect to the student's total action count
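
To build intuition for the function below, consider two made-up students, each with five active days: one spreads the activity evenly over the days, the other does almost everything in a single day:
```{r}
p_even  <- rep(1/5, 5)                     # equal share of actions each day
p_burst <- c(0.96, 0.01, 0.01, 0.01, 0.01) # nearly all actions on one day

-sum(p_even * log(p_even))    # log(5), approx. 1.61 -- high entropy
-sum(p_burst * log(p_burst))  # approx. 0.22 -- low entropy
```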
```{r}
entropy_of_action_counts <- function(events_data) {
  
  events_data |>
    # number of actions per student per day
    count(user, date, name = "daily_action_cnt") |>
    group_by(user) |>
    # proportion of each day's count in the student's total count
    mutate(daily_action_prop = daily_action_cnt / sum(daily_action_cnt)) |>
    summarise(action_cnt_entropy = -sum(daily_action_prop * log(daily_action_prop)))
  
}
```

Check the function with the data from the first two weeks of the course
```{r}
entropy_of_action_counts(w2_data)
```

(4) Number of active days (= days with at least one learning action)

```{r}
active_days_count <- function(events_data) {
 
  events_data |>
    group_by(user) |>
    summarise(aday_cnt = n_distinct(date))
  
}
```

Check the function with the data from the first two weeks of the course
```{r}
active_days_count(w2_data)
```

(5) Average (median) time distance between two consecutive active days

Note: for students with only one active day, `avg_aday_dist` will be NA. To avoid losing these students due to the missing value, we will replace NAs with a large number (e.g., 2 x the maximum observed distance), indicating that such a student rarely (if ever) returned to the course activities
```{r}
avg_dist_active_days <- function(events_data) {
  
  events_data |>
    distinct(user, date) |>
    arrange(user, date) |>
    group_by(user) |>
    mutate(prev_aday = lag(date)) |>
    mutate(aday_dist = as.numeric(difftime(date, prev_aday, units = "days"))) |>
    summarise(avg_aday_dist = median(aday_dist, na.rm = TRUE)) -> df
  
  # students with a single active day have no distances, hence NA;
  # replace NAs with a large value (2 x the maximum observed distance)
  max_dist <- max(df$avg_aday_dist, na.rm = TRUE)
  df |>
    mutate(avg_aday_dist = ifelse(is.na(avg_aday_dist),
                                  2 * max_dist,
                                  avg_aday_dist))
  
}
```

Check the function with the data from the first two weeks of the course
```{r}
avg_dist_active_days(w2_data) |> summary()
```

### Create feature set for 2 weeks of data and examine feature relevance

Create a function that merges all the features into a single feature set, so that a feature set can be created for any (given) number of course weeks 
```{r}
create_feature_set <- function(events_data) {
  
  f1 <- actions_tot_count(events_data)
  f2 <- avg_actions_per_day(events_data)
  f3 <- entropy_of_action_counts(events_data)
  f4 <- active_days_count(events_data)
  f5 <- avg_dist_active_days(events_data)
  
  f1 |> 
    inner_join(f2, by = "user") |>
    inner_join(f3, by = "user") |>
    inner_join(f4, by = "user") |>
    inner_join(f5, by = "user") |>
    as.data.frame()
  
}
```

Create the feature set based on the first two weeks of data
```{r}

w2_features <- create_feature_set(w2_data)

```

Examine the feature set
```{r}
str(w2_features)
```

Add the outcome variable
```{r}
w2_features |>
  inner_join(grades) |>
  select(-grade) -> w2_data

str(w2_data)
```

Examine the relevance of features for the prediction of the outcome variable

Let's first see how we can do it for one variable 
```{r}
ggplot(w2_data, aes(x=Cnt_Group_work, fill=course_outcome)) +
  geom_density(alpha=0.5) +
  theme_minimal()

```

Now, do it for all features at once

Note: the notation `.data[[fn]]` in the code below allows us to access a column of the 'current' data frame (in this case, `w2_data`) whose name is given by the function's input variable (`fn`) 
```{r}
feature_names <- colnames(w2_features |> select(-user))

lapply(feature_names,
      function(fn) {
        ggplot(w2_data, 
               aes(x=.data[[fn]], fill=course_outcome)) +
          geom_density(alpha=0.5) +
          theme_minimal()
      })

```




## Predictive modeling

Load additional R packages required for model building and evaluation 
```{r message=FALSE}
library(caret)
library(rpart)
library(rpart.plot)

```

We will use decision trees (as implemented in the rpart package) as the classification method, and will build five decision tree (DT) models, one for each of the first five weeks of the course. We will build each model using the optimal value of the `cp` hyper-parameter, identified through 10-fold cross-validation (as we did before). 

We will evaluate the models using the same metrics used before: accuracy, precision, recall, F1
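
As a refresher before building the models, here is how those measures relate to a confusion matrix, using made-up counts and treating LOW (weak course outcome) as the positive class:
```{r}
# made-up confusion matrix: rows = actual, columns = predicted
cm <- matrix(c(40, 10,
                8, 22),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("HIGH", "LOW"),
                             predicted = c("HIGH", "LOW")))

TP <- cm["LOW", "LOW"];  TN <- cm["HIGH", "HIGH"]
FP <- cm["HIGH", "LOW"]; FN <- cm["LOW", "HIGH"]

precision <- TP / (TP + FP)                   # 22/32 = 0.6875
recall    <- TP / (TP + FN)                   # 22/30, approx. 0.733
c(Accuracy  = (TP + TN) / sum(cm),            # 62/80 = 0.775
  Precision = precision,
  Recall    = recall,
  F1 = 2 * precision * recall / (precision + recall))  # approx. 0.710
```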
```{r}
build_DT_model <- function(dataset) {
  
  cp_grid <- expand.grid(cp = seq(0.001, 0.1, 0.005))
  
  ctrl <- trainControl(method = "cv", 
                       number = 10,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)
  
  dt <- train(x = dataset |> select(-course_outcome),
              y = dataset$course_outcome,
              method = "rpart",
              metric = "ROC",
              tuneGrid = cp_grid,
              trControl = ctrl)
  
  dt$finalModel
}
```


```{r}
get_evaluation_measures <- function(model, test_data) {
  
  predicted_vals <- predict(model, 
                            test_data |> select(-course_outcome),
                            type = 'class')
  actual_vals <- test_data$course_outcome
  
  cm <- table(actual_vals, predicted_vals)
  
  # low achievement in the course is considered the positive class
  TP <- cm[2,2]
  TN <- cm[1,1]
  FP <- cm[1,2]
  FN <- cm[2,1]

  accuracy = sum(diag(cm)) / sum(cm)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  F1 <- (2 * precision * recall) / (precision + recall)
  
  c(Accuracy = accuracy, 
    Precision = precision, 
    Recall = recall, 
    F1 = F1)
  
}
```


### Create (classification) models for predicting course outcome, based on progressively more weeks of events data

Starting from week 1, up to week 5, create predictive models and examine their performance
```{r warning=FALSE, message=FALSE}
models <- list()
eval_measures <- list()

for(k in 1:5) {
  
  print(paste("Starting computations for week", k))
  
  # create the dataset (features + outcome variable) for the given number of weeks (k) 
  wK_data <- events |> filter(course_week <= k)
  wK_features <- create_feature_set(wK_data)
  wK_features |>
    inner_join(grades) |> select(-grade) -> wK_ds 
  
  # split the data into train and test sets
  set.seed(2023)
  train_indices <- createDataPartition(wK_ds$course_outcome, p=0.8, list = FALSE)
  train_ds <- wK_ds[train_indices, ] |> select(-user)
  test_ds <- wK_ds[-train_indices, ] |> select(-user)

  # build the model (through CV) and compute eval.measures
  wK_dt <- build_DT_model(train_ds)
  wK_eval <- get_evaluation_measures(wK_dt, test_ds)
  
  # add the model and its evaluation measures to the corresponding lists 
  models[[k]] <- wK_dt
  eval_measures[[k]] <- wK_eval
  
}
```

Compare the models based on the evaluation measures
```{r}
# transform the eval_measures list into a df
eval_measures_df <- bind_rows(eval_measures)

# embellish the evaluation report by: 
# 1) adding the week column; 
# 2) rounding the metric values to 4 digits; 
# 3) rearranging the order of columns 
eval_measures_df |>
  mutate(across(everything(), \(x) round(x, digits = 4))) |>
  mutate(Week = 1:5) |>
  select(Week, Accuracy:F1)
  
```

Examine the importance of features in an early-in-the-course model with good performance
```{r}
chosen_dt <- models[[2]]
rpart.plot(chosen_dt)
```

```{r}
chosen_dt$variable.importance |> 
  as.data.frame() |>
  rename(VarImp = `chosen_dt$variable.importance`)
```

