---
title: "Predictive modelling: predicting course outcomes in a blended postgraduate course"
output: html_notebook
---

Load the required R packages (additional ones will be loaded as needed)
```{r message=FALSE}


```

## Exploratory data analysis

### Load and explore logged events
```{r}

```

```{r}

```

Transform the time stamp data into a format suitable for processing dates and times
```{r}

```

```{r}

```
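As an illustration, here is a minimal sketch of such a conversion on toy data, using base R only. The column name `ts` and the time stamp format are assumptions made for this example; the format string should be adjusted to whatever the logged data actually uses.

```r
# Toy events; the column names and the timestamp format are assumptions
toy_events <- data.frame(
  user = c("u1", "u1", "u2"),
  ts   = c("2019-09-09 14:08:01", "2019-09-10 09:30:00", "2019-09-09 18:45:10")
)

# Convert character time stamps to POSIXct, so that they can be
# ordered, compared, and subtracted from one another
toy_events$ts <- as.POSIXct(toy_events$ts,
                            format = "%Y-%m-%d %H:%M:%S",
                            tz = "UTC")

class(toy_events$ts)
range(toy_events$ts)
```

A value that does not match the format string becomes `NA`, so it is worth checking for missing values after the conversion.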

We can now order each user's events by the time stamp 

Note: we will be using R's native pipe operator (`|>`) to make the code easier to read and follow 
```{r}

```

Let's start by examining the time range the data is available for. 
It should (roughly) coincide with the start and the end of the course
```{r}
# start of the course

```

```{r}
# end of the course


```

The course length (in weeks):
```{r}

```

Since we want to make predictions based on the first few weeks of data, we need to add the week variable 
```{r}


```
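A minimal sketch of how a week index can be derived, assuming (for illustration only) that the course starts on 2019-09-09 and that week 1 covers the first seven days counted from that date. The variable names are hypothetical.

```r
# Toy time stamps and an assumed course start date
toy_ts <- as.POSIXct(c("2019-09-09 14:08:01", "2019-09-15 23:00:00",
                       "2019-09-16 00:05:00", "2019-09-30 10:00:00"),
                     tz = "UTC")
course_start <- as.Date("2019-09-09")

# Whole days elapsed since the start of the course
days_since_start <- as.numeric(as.Date(toy_ts, tz = "UTC") - course_start)

# Integer division by 7 gives 0 for the first week, hence the + 1
week <- days_since_start %/% 7 + 1
week
```

An alternative would be to use calendar weeks, but counting from the course start keeps week 1 aligned with the first week of the course regardless of the weekday on which it begins.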

Check the distribution of event counts across the course weeks
```{r}

```

Also in proportions
```{r}

```
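For example, counts and proportions per week can be obtained with `table()` and `prop.table()`. The sketch below uses a made-up vector of week indices, not the course data.

```r
# Made-up week index for ten logged events
toy_weeks <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4)

# Number of events per week
week_counts <- table(toy_weeks)
week_counts

# The same distribution expressed as proportions of all events
week_props <- prop.table(week_counts)
round(week_props, 2)
```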

Examine character variables that represent different types of actions and logged events
```{r}


```

Let's examine the actions more closely
```{r}

```
Some of these actions refer to individual course topics, that is, to accessing the lecture materials on distinct course topics. These are:
General, Applications, Theory, Ethics, Feedback, La_types. 
We will rename these actions to make their meaning clearer
```{r}
# course_topics <- c("General", "Applications", "Theory",  "Ethics", "Feedback", "La_types")


```

```{r}

```
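One possible way to do the renaming is sketched below on a toy vector. The `Lecture_` prefix and the non-topic action names (`Quiz`, `Forum`) are assumptions made for the example, not necessarily the names used in the data.

```r
course_topics <- c("General", "Applications", "Theory",
                   "Ethics", "Feedback", "La_types")

# Toy actions; "Quiz" and "Forum" are hypothetical non-topic actions
toy_actions <- c("Theory", "Quiz", "Ethics", "Forum")

# Prefix topic-related actions, leave all the other actions unchanged
renamed_actions <- ifelse(toy_actions %in% course_topics,
                          paste0("Lecture_", toy_actions),
                          toy_actions)
renamed_actions
```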

Also examine the `log` column
```{r}

```



### Load and examine grades data
```{r}

```

```{r}

```

Examine the summary statistics and distribution of the final grade
```{r}

```

```{r}

```


Let's add *course_outcome* as a binary variable indicating whether a student had a low grade. 
Students whose final grade is above the 50th percentile (median) will be considered as having a good course outcome (HIGH); the rest will be considered as having a weak course outcome (LOW)
```{r}

```
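A minimal sketch of this median split on toy grades; the data frame and the column name `final_grade` are assumptions made for the example.

```r
# Toy grades for six students
toy_grades <- data.frame(
  user        = paste0("u", 1:6),
  final_grade = c(4, 9, 6, 7, 5, 10)
)

# Grades strictly above the median are HIGH, all the others are LOW
grade_median <- median(toy_grades$final_grade)
toy_grades$course_outcome <- factor(
  ifelse(toy_grades$final_grade > grade_median, "HIGH", "LOW"),
  levels = c("HIGH", "LOW")
)

table(toy_grades$course_outcome)
```

Note that when several students have a grade exactly equal to the median, they all end up in the LOW group, so the two classes are not necessarily of equal size.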

Examine the distribution of the outcome variable (though we should already know it)
```{r}

```


## Features

Two groups of action-based features will be computed and used for prediction
(note: active days are days with at least one learning action):

* Features based on learning action counts:
    * Total number of each type of learning action
    * Average number of actions (of any type) per day
    * Entropy of daily action counts (considering active days only)

* Features based on the number of active days:
    * Number of active days
    * Average time distance between two consecutive active days

Since the idea is to create prediction models based on data from different numbers of course weeks, we will also need to compute feature values for different numbers of weeks. Thus, we will create functions that compute features from the events data passed in as the input parameter, that is, from the events logged in the given number of course weeks. 

To compute features based on counts per day, we need to add the date variable
```{r}

```

(1) Start with the total number of each type of learning actions 

Note: to avoid having too many features (as action counts), we will consider all actions related to access to the lecture materials on different topics as one kind of action ('Lecture')
```{r}
actions_tot_count <- function(events_data) {
  
}
```
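To illustrate the idea (not to fill in the function above), here is a sketch on toy data using base R. The column names `user` and `action`, and the `Quiz` action, are assumptions made for the example.

```r
course_topics <- c("General", "Applications", "Theory",
                   "Ethics", "Feedback", "La_types")

# Toy events; "Quiz" is a hypothetical non-topic action
toy_events <- data.frame(
  user   = c("u1", "u1", "u1", "u2", "u2"),
  action = c("Theory", "Ethics", "Quiz", "Quiz", "General")
)

# Collapse all topic-related actions into a single 'Lecture' action
toy_events$action <- ifelse(toy_events$action %in% course_topics,
                            "Lecture", toy_events$action)

# One row per user, one column per action type, cells hold the counts
action_counts <- as.data.frame.matrix(table(toy_events$user, toy_events$action))
action_counts
```

The cross-tabulation yields a zero for every action type that a student never performed, so no separate step for filling in missing counts is needed.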

Check the function with the data from the first two weeks of the course
```{r}

```

(2) Next, compute average number of actions (of any type) per day

```{r}
avg_actions_per_day = function(events_data) {
  
  
}
```
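A sketch of the computation on toy data, assuming columns named `user` and `date`, and assuming that the average is taken over active days only (days without any action are not counted as zeros).

```r
# Toy events: one row per logged action
toy_events <- data.frame(
  user = c("u1", "u1", "u1", "u1", "u2", "u2"),
  date = as.Date(c("2019-09-09", "2019-09-09", "2019-09-10",
                   "2019-09-12", "2019-09-09", "2019-09-11"))
)

# Step 1: number of actions per user and day
daily_counts <- aggregate(n ~ user + date,
                          data = transform(toy_events, n = 1),
                          FUN = sum)

# Step 2: average of the daily counts for each user
avg_per_day <- aggregate(n ~ user, data = daily_counts, FUN = mean)
names(avg_per_day) <- c("user", "avg_action_cnt")
avg_per_day
```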

Check the function with the data from the first two weeks of the course
```{r}

```

(3) Entropy of daily action counts

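Entropy is a measure of disorder in a system. Here it is used as an indicator of the regularity of learning: with the daily proportions defined below, entropy is highest when a student's actions are spread evenly across days and lowest when they are concentrated in a single day. 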
Note: A nice explanation of the intuition behind the formula of Shannon entropy is given in [this video](https://www.youtube.com/watch?v=0GCGaw0QOhA).

Since we want to compute entropy of daily action counts, we need to compute (approximate) the probability of action counts for each day. We will do that by taking the proportion of daily action counts with respect to the total action counts for the given student
```{r}
entropy_of_action_counts = function(events_data) {
  
  
}
```
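To make the formula concrete: with $p_d$ denoting the proportion of a student's actions that fall on day $d$, Shannon entropy is $H = -\sum_d p_d \log_2 p_d$. The sketch below applies it to made-up vectors of daily action counts; the helper name `shannon_entropy` and the choice of base-2 logarithm are assumptions made for the example.

```r
# Shannon entropy (in bits) of a vector of daily action counts
shannon_entropy <- function(daily_counts) {
  p <- daily_counts / sum(daily_counts)  # proportion of actions per day
  p <- p[p > 0]                          # days without actions contribute nothing
  -sum(p * log2(p))
}

# 20 actions spread evenly over 4 days
even_counts <- c(5, 5, 5, 5)
# 20 actions, almost all of them on one day
skewed_counts <- c(17, 1, 1, 1)

shannon_entropy(even_counts)     # the maximum for 4 days: log2(4) = 2
shannon_entropy(skewed_counts)   # lower than for the even distribution
shannon_entropy(20)              # a single active day gives 0
```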

Check the function with the data from the first two weeks of the course
```{r}

```

(4) Number of active days (= days with at least one learning action)

```{r}
active_days_count = function(events_data) {
 
  
}
```
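A sketch on toy data, assuming columns named `user` and `date`: the number of active days is simply the number of distinct dates on which a student has at least one logged action.

```r
# Toy events: one row per logged action
toy_events <- data.frame(
  user = c("u1", "u1", "u1", "u1", "u2", "u2"),
  date = as.Date(c("2019-09-09", "2019-09-09", "2019-09-10",
                   "2019-09-12", "2019-09-09", "2019-09-11"))
)

# Count distinct dates per user
active_days <- tapply(toy_events$date, toy_events$user,
                      function(d) length(unique(d)))
active_days
```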

Check the function with the data from the first two weeks of the course
```{r}

```

(5) Average time distance between two consecutive active days

Note: for a student with only one active day, `avg_aday_dist` will be NA. To avoid losing students due to a missing value of this feature, we will replace NAs with a large number (e.g., 2 x max distance), thus indicating that the student rarely (if ever) got back to the course activities
```{r}
avg_dist_active_days = function(events_data) {
  
  
}
```
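A sketch of both steps (the average gap and the NA replacement) on toy data. The column names and the helper `avg_gap` are hypothetical, and "max distance" is interpreted here as the largest average distance observed among the students, which is one possible reading.

```r
# Toy events; student u3 has a single active day
toy_events <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2", "u3"),
  date = as.Date(c("2019-09-09", "2019-09-10", "2019-09-12",
                   "2019-09-09", "2019-09-11", "2019-09-10"))
)

# Average number of days between consecutive active days of one student
avg_gap <- function(dates) {
  active <- sort(unique(dates))
  if (length(active) < 2) return(NA_real_)  # no gap can be computed
  mean(as.numeric(diff(active), units = "days"))
}

avg_aday_dist <- sapply(split(toy_events$date, toy_events$user), avg_gap)

# Replace NAs with twice the largest observed average distance
max_dist <- max(avg_aday_dist, na.rm = TRUE)
avg_aday_dist[is.na(avg_aday_dist)] <- 2 * max_dist
avg_aday_dist
```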

Check the function with the data from the first two weeks of the course
```{r}

```

### Create feature set for 2 weeks of data and examine feature relevance

Create a function that will allow for creating a feature set for any (given) number of course weeks 
```{r}
create_feature_set = function(events_data) {
  
  
}
```
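Assuming each feature function returns a data frame with one row per student and a shared `user` column (an assumption, since the functions above are yet to be written), the individual feature tables can be combined with successive joins. A sketch with toy feature tables:

```r
# Toy outputs of three feature functions; u3 is missing from the last one
f_counts <- data.frame(user = c("u1", "u2", "u3"), cnt_Lecture = c(2, 1, 0))
f_days   <- data.frame(user = c("u1", "u2", "u3"), active_days = c(3, 2, 1))
f_gap    <- data.frame(user = c("u1", "u2"), avg_aday_dist = c(1.5, 2))

# Join all feature tables on 'user', keeping students absent from some tables
feature_set <- Reduce(function(a, b) merge(a, b, by = "user", all = TRUE),
                      list(f_counts, f_days, f_gap))
feature_set
```

With `all = TRUE`, a student missing from one of the tables is kept and gets `NA` for the corresponding feature, which makes such gaps visible instead of silently dropping the student.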

Create the feature set based on the first two weeks of data
```{r}

```

Examine the feature set
```{r}

```

Add the outcome variable
```{r}


```

Examine the relevance of features for the prediction of the outcome variable

Let's first see how we can do it for one variable 
```{r}


```

Now, do it for all the features at once

Note: the notation `.data[[f]]` in the code below allows us to access the column of the 'current' data frame (in this case, `w2_data`) whose name is given by the function's input variable (`f`) 
```{r}


```
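To show how `.data[[f]]` behaves, here is a small sketch that uses dplyr on toy data. The helper `median_by_outcome` and the comparison of group medians are only a stand-in for whatever relevance check is used in the chunk above.

```r
library(dplyr)

# Toy feature set with the outcome variable
toy_data <- data.frame(
  course_outcome = factor(c("HIGH", "HIGH", "LOW", "LOW")),
  active_days    = c(6, 8, 2, 3),
  avg_aday_dist  = c(1.5, 1.2, 4, 3)
)

# f is the feature name given as a string; .data[[f]] looks up
# the column with that name in the data frame being summarised
median_by_outcome <- function(data, f) {
  data |>
    group_by(course_outcome) |>
    summarise(median_value = median(.data[[f]]), .groups = "drop")
}

median_by_outcome(toy_data, "active_days")

# The same function applied to every feature
features <- c("active_days", "avg_aday_dist")
lapply(setNames(features, features),
       function(f) median_by_outcome(toy_data, f))
```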




## Predictive modeling

Load additional R packages required for model building and evaluation 
```{r message=FALSE}


```

We will use a decision tree (as implemented in the rpart package) as the classification method, and will build several decision tree (DT) models, one for each of the first five weeks of the course (the model for week k is based on the events from weeks 1 to k). We will build each model using the optimal value of the `cp` hyper-parameter, identified through 10-fold cross-validation (as we did before). 

We will evaluate the models using the same metrics as before: accuracy, precision, recall, and F1
```{r}
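build_DT_model <- function(train_data) {
  
  # candidate values of the complexity parameter (cp) to be evaluated
  cp_grid <- expand.grid(.cp = seq(0.001, 0.1, 0.005))
  
  # 10-fold cross-validation; class probabilities are needed by twoClassSummary
  ctrl <- trainControl(method = "cv", 
                       number = 10,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)
  
  # choose the cp value with the highest cross-validated area under the ROC curve
  dt <- train(x = train_data |> select(-course_outcome),
              y = train_data$course_outcome,
              method = "rpart",
              metric = "ROC",
              tuneGrid = cp_grid,
              trControl = ctrl)
  
  # return the rpart model fitted on the whole training set with the best cp
  dt$finalModel
}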
```


```{r}
get_evaluation_measures <- function(model, test_data) {
  
  predicted_vals <- predict(model, 
                            test_data |> select(-course_outcome),
                            type = 'class')
  actual_vals <- test_data$course_outcome
  
  cm <- table(actual_vals, predicted_vals)
  
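  # low achievement in the course (LOW) is considered the positive class;
  # rows of cm are actual values, columns are predicted values, and the
  # positional indexing below assumes that LOW is the second factor level
  TP <- cm[2,2]
  TN <- cm[1,1]
  FP <- cm[1,2]
  FN <- cm[2,1]

  accuracy <- sum(diag(cm)) / sum(cm)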
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  F1 <- (2 * precision * recall) / (precision + recall)
  
  c(Accuracy = accuracy, 
    Precision = precision, 
    Recall = recall, 
    F1 = F1)
  
}
```
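To see how the four measures follow from the confusion matrix, here is a worked example on made-up actual and predicted labels, with LOW as the positive class. Cells are indexed by name here (an illustrative choice), which avoids relying on the order of the factor levels.

```r
# Made-up labels for eight students
actual    <- factor(c("LOW", "LOW", "LOW", "LOW", "HIGH", "HIGH", "HIGH", "HIGH"),
                    levels = c("HIGH", "LOW"))
predicted <- factor(c("LOW", "LOW", "LOW", "HIGH", "HIGH", "HIGH", "LOW", "LOW"),
                    levels = c("HIGH", "LOW"))

# Rows are actual values, columns are predicted values
cm <- table(actual, predicted)
cm

TP <- cm["LOW", "LOW"]     # LOW students predicted as LOW:   3
FN <- cm["LOW", "HIGH"]    # LOW students predicted as HIGH:  1
FP <- cm["HIGH", "LOW"]    # HIGH students predicted as LOW:  2
TN <- cm["HIGH", "HIGH"]   # HIGH students predicted as HIGH: 2

accuracy  <- (TP + TN) / sum(cm)     # 5 / 8 = 0.625
precision <- TP / (TP + FP)          # 3 / 5 = 0.6
recall    <- TP / (TP + FN)          # 3 / 4 = 0.75
F1 <- 2 * precision * recall / (precision + recall)

c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = F1)
```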


### Create (classification) models for predicting course outcome, based on progressively more weeks of events data

Starting from week 1, up to week 5, create predictive models and examine their performance
```{r warning=FALSE, message=FALSE}
models <- list()
eval_measures <- list()

for(k in 1:5) {
  
  print(paste("Starting computations for week", k))
  
  # create the dataset (features + outcome variable) for the given number of weeks (k) 
  
  
  # split the data into train and test sets
  

  # build the model (through CV) and compute eval.measures
  
  
  # add the model and its evaluation measures to the corresponding lists 
  
}
```
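For the train/test split step, a stratified split keeps the proportion of HIGH and LOW students the same in both sets. The sketch below does this with base R on toy data; the 80/20 ratio, the seed, and the variable names are assumptions made for the example, and a dedicated function from a modelling package could be used instead.

```r
set.seed(1)

# Toy data set: one feature and a balanced outcome
toy_data <- data.frame(
  active_days    = sample(1:14, 20, replace = TRUE),
  course_outcome = factor(rep(c("HIGH", "LOW"), each = 10))
)

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy_data)), toy_data$course_outcome)

# Sample 80% of the rows within each class
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_data <- toy_data[train_idx, ]
test_data  <- toy_data[-train_idx, ]

table(train_data$course_outcome)
table(test_data$course_outcome)
```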

Compare the models based on the evaluation measures
```{r}
# transform the eval_measures list into a df


# embellish the evaluation report by: 
# 1) adding the week column; 
# 2) rounding the metric values to 4 digits; 
# 3) rearranging the order of columns 

```
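A sketch of these steps in base R. The numbers in the list below are placeholders that only mimic the structure of `eval_measures` (one named vector per week); they are not results obtained from the course data.

```r
# Placeholder values only, standing in for the output of get_evaluation_measures()
eval_measures <- list(
  c(Accuracy = 0.61234, Precision = 0.60000, Recall = 0.70000, F1 = 0.64615),
  c(Accuracy = 0.66667, Precision = 0.65432, Recall = 0.72000, F1 = 0.68559)
)

# Stack the named vectors row by row and convert the matrix to a data frame
eval_df <- as.data.frame(do.call(rbind, eval_measures))

# 1) add the week column; 2) round the metrics; 3) put week first
eval_df <- round(eval_df, 4)
eval_df$week <- seq_along(eval_measures)
eval_df <- eval_df[, c("week", "Accuracy", "Precision", "Recall", "F1")]
eval_df
```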


Examine the importance of features in a well-performing model from early in the course
```{r}

```
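Since `build_DT_model()` returns an rpart object, feature importance can be read from the model's `variable.importance` element. The sketch below demonstrates this on the `kyphosis` data set that ships with rpart, used here only as a stand-in for the course models.

```r
library(rpart)

# Fit a small classification tree on a data set bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             method = "class")

# Named numeric vector, sorted from the most to the least important variable
fit$variable.importance

# Importance expressed as a share of the total, for easier comparison
round(fit$variable.importance / sum(fit$variable.importance), 3)
```

For the course models, the same element would be accessed on the stored model of the chosen week, for example `models[[k]]$variable.importance`, assuming the models are kept in the `models` list by week number.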