---
title: "Predictive modelling: predicting course outcomes in a blended postgraduate course"
output: html_notebook
---

Load the required R packages (additional ones will be loaded as needed)
```{r message=FALSE}


```

## Exploratory data analysis

### Load and explore logged events
```{r}

```

```{r}

```

Convert the time stamp data into a format suitable for processing dates and times
```{r}

```

```{r}

```
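
As a minimal sketch of this conversion, the example below parses character time stamps with base R's `as.POSIXct()`. The toy data frame, its column names (`user`, `ts`, `action`), and the time stamp format are assumptions made for illustration; the real log file may use different names and a different format.

```r
# Hypothetical raw events; column names and values are assumptions
events <- data.frame(
  user = c("u1", "u1", "u2"),
  ts = c("2019-09-10 14:05:00", "2019-09-09 08:30:00", "2019-09-11 19:45:10"),
  action = c("Theory", "General", "Ethics"),
  stringsAsFactors = FALSE
)

# Convert the character time stamps into POSIXct date-time values
events$ts <- as.POSIXct(events$ts, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Inspect the class of the converted column
class(events$ts)
```

If the format string does not match the data, `as.POSIXct()` returns `NA` for the affected rows, so it is worth checking for missing values after the conversion.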

We can now order the events, for each user, based on the time stamp 

Note: we will be using R's pipe notation (|>) to make the code easier to understand and follow 
```{r}

```
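
One possible way to order the events, sketched with dplyr's `arrange()` and the native pipe. The toy data and the column names `user` and `ts` are assumptions.

```r
library(dplyr)

# Hypothetical events, deliberately out of order
events <- data.frame(
  user = c("u2", "u1", "u1"),
  ts = as.POSIXct(c("2019-09-11 19:45:10",
                    "2019-09-10 14:05:00",
                    "2019-09-09 08:30:00"), tz = "UTC")
)

# Sort by user first, then chronologically within each user
events <- events |> arrange(user, ts)
events
```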

Let's start by examining the time range the data is available for. 
It should (roughly) coincide with the start and the end of the course
```{r}
# start of the course

```

```{r}
# end of the course


```

The course length (in weeks):
```{r}

```
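
The course length can be estimated from the first and the last time stamp. The sketch below uses base R's `difftime()` with two made-up time stamps; the variable names are illustrative only.

```r
# Hypothetical first and last logged time stamps
ts <- as.POSIXct(c("2019-09-09 08:00:00", "2019-10-21 08:00:00"), tz = "UTC")

course_start <- min(ts)
course_end <- max(ts)

# Difference between the two time stamps, expressed in weeks (42 days here)
course_length_weeks <- as.numeric(difftime(course_end, course_start, units = "weeks"))
course_length_weeks
```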

Since we want to make predictions based on the first few weeks of data, we need to add the week variable
```{r}


```
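
A minimal sketch of one way to derive the course week: count the days since the first logged day and split them into 7-day blocks, so that the first day of the course falls in week 1. This is an assumption about how weeks are defined; calendar (ISO) weeks would be an alternative.

```r
# Hypothetical time stamps spanning several course weeks
ts <- as.POSIXct(c("2019-09-09 08:00:00", "2019-09-15 23:00:00",
                   "2019-09-16 08:00:00", "2019-09-30 12:00:00"), tz = "UTC")

# Calendar date of each event and of the course start
event_date <- as.Date(ts, tz = "UTC")
course_start <- min(event_date)

# Whole days since the start, grouped into 7-day blocks, numbered from 1
days_since_start <- as.numeric(event_date - course_start)
week <- days_since_start %/% 7 + 1
week
```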

Check the distribution of event counts across the course weeks
```{r}

```

Also in proportions
```{r}

```
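
Counts and proportions per week can be obtained with base R's `table()` and `prop.table()`. The `week` vector below is a toy stand-in for the week column of the events data.

```r
# Hypothetical week number of six logged events
week <- c(1, 1, 1, 2, 2, 3)

# Number of events per week
week_counts <- table(week)

# The same distribution expressed as proportions
week_props <- round(prop.table(week_counts), 3)

week_counts
week_props
```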

Examine character variables that represent different types of actions and logged events
```{r}


```

Let's examine the actions more closely
```{r}

```
Some of these actions refer to individual course topics, that is, to the access to lecture materials on distinct course topics. These are:
General, Applications, Theory,  Ethics, Feedback, La_types. 
We will rename the actions to make the meaning clearer
```{r}
# course_topics <- c("General", "Applications", "Theory",  "Ethics", "Feedback", "La_types")


```

```{r}

```
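
One way to make the meaning of the topic-related actions explicit is to give them a common prefix. The sketch below assumes a `Lecture_` prefix and an `action` column; the `Quiz` value is a hypothetical non-topic action added only to show that other actions stay unchanged.

```r
course_topics <- c("General", "Applications", "Theory", "Ethics", "Feedback", "La_types")

# Hypothetical action values; "Quiz" is invented for illustration
events <- data.frame(action = c("General", "Quiz", "La_types", "Theory"),
                     stringsAsFactors = FALSE)

# Prefix topic-related actions, leave all other actions as they are
events$action <- ifelse(events$action %in% course_topics,
                        paste0("Lecture_", events$action),
                        events$action)
events$action
```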

Also examine the log column
```{r}

```



### Load and examine grades data
```{r}

```

```{r}

```

Examine the summary statistics and distribution of the final grade
```{r}

```

```{r}

```


Let's add *course_outcome* as a binary variable indicating whether a student had a high or a low final grade. 
Students whose final grade is above the 50th percentile (median) will be considered as having a good course outcome (HIGH); the rest will be considered as having a weak course outcome (LOW)
```{r}

```
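
The median split described above can be sketched as follows. The grades and the column names `user` and `grade` are made up; only `course_outcome` and its two labels come from the text.

```r
# Hypothetical final grades
grades <- data.frame(user = paste0("u", 1:6),
                     grade = c(4.5, 6, 7.5, 8, 9, 5))

# Median of the final grade (6.75 for these toy values)
median_grade <- median(grades$grade)

# Strictly above the median -> HIGH, otherwise LOW
grades$course_outcome <- factor(
  ifelse(grades$grade > median_grade, "HIGH", "LOW"),
  levels = c("HIGH", "LOW")
)

table(grades$course_outcome)
```

Because students exactly at the median fall into LOW, the two classes are not guaranteed to be perfectly balanced when grades are tied at the median.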

Examine the distribution of the outcome variable (though we should already know it)
```{r}

```


## Features

Two groups of action-based features will be computed and used for prediction
(note: active days are days with at least one learning action):

* Features based on learning action counts:
    * Total number of learning actions of each type
    * Average number of actions (of any type) per day
    * Entropy of daily action counts (considering active days only)

* Features based on the number of active days:
    * Number of active days
    * Average time distance between two consecutive active days

Since the idea is to create prediction models based on different numbers of weeks of data, we will also need to compute feature values for different numbers of course weeks. Thus, we will create functions that compute features from the data for a given number of course weeks (passed in as the input parameter).

To compute features based on counts per day, we need to add the date variable
```{r}

```

(1) Start with the total number of learning actions of each type

Note: to avoid having too many features (as action counts), we will consider all actions related to access to the lecture materials on different topics as one kind of action ('Lecture')
```{r}
actions_tot_count <- function(events_data) {
  
}
```
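
A possible shape for this function, shown under a separate name so it is not mistaken for the intended solution. It assumes `user` and `action` columns, that lecture-related actions start with `Lecture`, and that dplyr and tidyr are available; `Quiz` is a hypothetical action.

```r
library(dplyr)
library(tidyr)

# Hypothetical events; action names are assumptions
events <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2"),
  action = c("Lecture_Theory", "Lecture_Ethics", "Quiz", "Quiz", "Lecture_General"),
  stringsAsFactors = FALSE
)

actions_tot_count_sketch <- function(events_data) {
  events_data |>
    # collapse all lecture-related actions into a single 'Lecture' action
    mutate(action = ifelse(startsWith(action, "Lecture"), "Lecture", action)) |>
    # count actions per user and action type
    count(user, action) |>
    # one row per user, one column per action type; missing combinations become 0
    pivot_wider(names_from = action, values_from = n, values_fill = 0)
}

action_counts <- actions_tot_count_sketch(events)
action_counts
```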

Check the function with the data from the first two weeks of the course
```{r}

```

(2) Next, compute average number of actions (of any type) per day

```{r}
avg_actions_per_day <- function(events_data) {
  
  
}
```
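
A minimal sketch of this feature, assuming `user` and `date` columns. It averages over active days only, which is one reading of "per day"; dividing by the number of calendar days in the observed period would be another. The function and column names are illustrative.

```r
library(dplyr)

# Hypothetical events with a date column
events <- data.frame(
  user = c("u1", "u1", "u1", "u2"),
  date = as.Date(c("2019-09-09", "2019-09-09", "2019-09-11", "2019-09-10"))
)

avg_actions_per_day_sketch <- function(events_data) {
  events_data |>
    # number of actions per user and day
    count(user, date, name = "daily_count") |>
    group_by(user) |>
    # mean of the daily counts across the user's active days
    summarise(avg_action_cnt = mean(daily_count), .groups = "drop")
}

avg_counts <- avg_actions_per_day_sketch(events)
avg_counts
```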

Check the function with the data from the first two weeks of the course
```{r}

```

(3) Entropy of daily action counts

Entropy is a measure of disorder in a system. Here it is used as an indicator of regularity of learning: the lower the entropy, the higher the regularity, and vice versa. 
Note: A nice explanation of the intuition behind the formula of Shannon entropy is given in [this video](https://www.youtube.com/watch?v=0GCGaw0QOhA).

Since we want to compute entropy of daily action counts, we need to compute (approximate) the probability of action counts for each day. We will do that by taking the proportion of daily action counts with respect to the total action counts for the given student
```{r}
entropy_of_action_counts <- function(events_data) {
  
  
}
```
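
For one student, the computation described above amounts to $H = -\sum_{d} p_d \log p_d$, where $p_d$ is the proportion of the student's actions that fall on active day $d$. The helper below is a sketch for a single vector of daily counts; the function name and the use of the natural logarithm are assumptions.

```r
# Shannon entropy of a vector of daily action counts (active days only,
# so every count is > 0 and log(0) cannot occur)
shannon_entropy <- function(daily_counts) {
  p <- daily_counts / sum(daily_counts)  # proportion of actions per day
  -sum(p * log(p))
}

# Four active days with equal counts: entropy equals log(4)
even_counts <- shannon_entropy(c(5, 5, 5, 5))

# All actions on a single active day: entropy is 0
single_day <- shannon_entropy(c(20))

c(even = even_counts, single = single_day)
```

In the per-student version, the same helper would be applied to each student's daily counts after grouping the events by user.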

Check the function with the data from the first two weeks of the course
```{r}

```

(4) Number of active days (= days with at least one learning action)

```{r}
active_days_count <- function(events_data) {
 
  
}
```
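
Counting active days comes down to counting distinct dates per student. A sketch with dplyr's `n_distinct()`, assuming `user` and `date` columns and an illustrative output column `aday_cnt`:

```r
library(dplyr)

# Hypothetical events; u1 is active on two distinct days
events <- data.frame(
  user = c("u1", "u1", "u1", "u2"),
  date = as.Date(c("2019-09-09", "2019-09-09", "2019-09-11", "2019-09-10"))
)

active_days_count_sketch <- function(events_data) {
  events_data |>
    group_by(user) |>
    # number of distinct dates with at least one action
    summarise(aday_cnt = n_distinct(date), .groups = "drop")
}

active_days <- active_days_count_sketch(events)
active_days
```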

Check the function with the data from the first two weeks of the course
```{r}

```

(5) Average time distance between two consecutive active days

Note: for students with only one active day, avg_aday_dist will be NA. To avoid losing students due to a missing value of this feature, we will replace NAs with a large number (e.g., 2 x max distance), thus indicating that the student rarely (if ever) returned to the course activities
```{r}
avg_dist_active_days <- function(events_data) {
  
  
}
```
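
The sketch below illustrates the idea, including the NA replacement. It assumes `user` and `date` columns and interprets "max distance" as the largest average distance observed among the other students; both are assumptions, as is the function name.

```r
library(dplyr)

# Hypothetical events; u2 has a single active day
events <- data.frame(
  user = c("u1", "u1", "u1", "u1", "u2"),
  date = as.Date(c("2019-09-09", "2019-09-09", "2019-09-11",
                   "2019-09-15", "2019-09-10"))
)

avg_dist_active_days_sketch <- function(events_data) {
  dist_df <- events_data |>
    distinct(user, date) |>      # one row per active day
    arrange(user, date) |>
    group_by(user) |>
    # mean gap (in days) between consecutive active days; NA if only one day
    summarise(avg_aday_dist = if (n() > 1) mean(as.numeric(diff(date))) else NA_real_,
              .groups = "drop")

  # replace NAs with twice the largest observed average distance
  max_dist <- max(dist_df$avg_aday_dist, na.rm = TRUE)
  dist_df |>
    mutate(avg_aday_dist = ifelse(is.na(avg_aday_dist), 2 * max_dist, avg_aday_dist))
}

aday_dist <- avg_dist_active_days_sketch(events)
aday_dist
```

The sketch does not handle the corner case where every student has a single active day; `max()` would then return `-Inf` with a warning.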

Check the function with the data from the first two weeks of the course
```{r}

```

### Create feature set for 2 weeks of data and examine feature relevance

Create a function that will allow for creating a feature set for any (given) number of course weeks 
```{r}
create_feature_set <- function(events_data) {
  
  
}
```
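
One way to assemble a feature set is to apply each feature function to the same events and join the resulting tables on the student identifier. The sketch below uses two simplified stand-in feature functions and an extra `feature_fns` argument; all names, and the `user`, `week`, and `date` columns, are assumptions.

```r
library(dplyr)

# Hypothetical events with week and date columns
events <- data.frame(
  user = c("u1", "u1", "u2", "u2", "u2"),
  week = c(1, 2, 1, 3, 3),
  date = as.Date(c("2019-09-09", "2019-09-17", "2019-09-10",
                   "2019-09-24", "2019-09-25"))
)

# Two simplified stand-ins for the feature functions
n_actions <- function(d) d |> count(user, name = "action_cnt")
n_active_days <- function(d) {
  d |> group_by(user) |> summarise(aday_cnt = n_distinct(date), .groups = "drop")
}

create_feature_set_sketch <- function(events_data, feature_fns) {
  # compute each feature table, then join them one by one on 'user'
  feature_tables <- lapply(feature_fns, function(f) f(events_data))
  Reduce(function(a, b) inner_join(a, b, by = "user"), feature_tables)
}

# Features based on the first two weeks only
w2_features <- events |>
  filter(week <= 2) |>
  create_feature_set_sketch(list(n_actions, n_active_days))
w2_features
```

Filtering by week before calling the function keeps the function itself independent of how many weeks are used.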

Create the feature set based on the first two weeks of data
```{r}

```

Examine the feature set
```{r}

```

Add the outcome variable
```{r}


```

Examine the relevance of features for the prediction of the outcome variable

Let's first see how we can do it for one variable 
```{r}


```

Now, do it for all features at once

Note: the notation `.data[[f]]` in the code below allows us to access the column of the 'current' data frame (in this case, `w2_data`) whose name is given as the input variable of the function (`f`)
```{r}


```
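
To illustrate the `.data[[f]]` notation, the sketch below compares the median of each feature across the two outcome groups. Comparing medians is a simple stand-in chosen for this example, not necessarily the method used in the notebook (plots of the feature distributions would be a common alternative). The toy `w2_data` values and feature names are made up.

```r
library(dplyr)

# Hypothetical feature set with the outcome variable
w2_data <- data.frame(
  course_outcome = factor(c("HIGH", "HIGH", "LOW", "LOW")),
  aday_cnt = c(5, 7, 1, 3),
  avg_aday_dist = c(1, 2, 6, 4)
)

# f is a column name given as a string; .data[[f]] looks that column up
median_by_outcome <- function(f) {
  w2_data |>
    group_by(course_outcome) |>
    summarise(feature = f,
              median_value = median(.data[[f]]),
              .groups = "drop")
}

# Apply the function to every feature and stack the results
feature_names <- setdiff(names(w2_data), "course_outcome")
relevance <- lapply(feature_names, median_by_outcome) |> bind_rows()
relevance
```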




## Predictive modeling

Load additional R packages required for model building and evaluation 
```{r message=FALSE}


```

We will use decision trees (as implemented in the rpart package) as the classification method, and will build several decision tree (DT) models, one for each of the first five weeks of the course. We will build each model using the optimal value of the `cp` hyper-parameter, identified through 10-fold cross-validation (as we did before). 

We will evaluate the models using the same metrics used before: accuracy, precision, recall, F1
```{r}
build_DT_model <- function(train_data) {
  
  cp_grid <- expand.grid(.cp = seq(0.001, 0.1, 0.005))
  
  ctrl <- trainControl(method = "cv",
                       number = 10,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary)
  
  dt <- train(x = train_data |> select(-course_outcome),
              y = train_data$course_outcome,
              method = "rpart",
              metric = "ROC",
              tuneGrid = cp_grid,
              trControl = ctrl)
  
  dt$finalModel
}
```


```{r}
get_evaluation_measures <- function(model, test_data) {
  
  predicted_vals <- predict(model, 
                            test_data |> select(-course_outcome),
                            type = 'class')
  actual_vals <- test_data$course_outcome
  
  cm <- table(actual_vals, predicted_vals)
  
  # low achievement in the course is considered the positive class
  TP <- cm[2,2]
  TN <- cm[1,1]
  FP <- cm[1,2]
  FN <- cm[2,1]

  accuracy <- sum(diag(cm)) / sum(cm)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
  F1 <- (2 * precision * recall) / (precision + recall)
  
  c(Accuracy = accuracy, 
    Precision = precision, 
    Recall = recall, 
    F1 = F1)
  
}
```
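
To see how the confusion matrix indices map to TP, FP, and FN, here is a small worked example with made-up labels. It assumes the factor levels are ordered `HIGH`, `LOW`, so that `LOW` (the positive class) sits in the second row and column.

```r
# Hypothetical actual and predicted labels for eight students
actual <- factor(c("HIGH", "HIGH", "HIGH", "LOW", "LOW", "LOW", "LOW", "HIGH"),
                 levels = c("HIGH", "LOW"))
predicted <- factor(c("HIGH", "LOW", "HIGH", "LOW", "LOW", "HIGH", "LOW", "HIGH"),
                    levels = c("HIGH", "LOW"))

# Rows are actual values, columns are predicted values
cm <- table(actual, predicted)

TP <- cm[2, 2]  # actual LOW,  predicted LOW
FP <- cm[1, 2]  # actual HIGH, predicted LOW
FN <- cm[2, 1]  # actual LOW,  predicted HIGH

precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
F1 <- (2 * precision * recall) / (precision + recall)

c(Precision = precision, Recall = recall, F1 = F1)
```

With these labels, TP = 3, FP = 1, and FN = 1, so precision, recall, and F1 all equal 0.75.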


### Create (classification) models for predicting course outcome, based on progressively more weeks of events data

Starting from week 1, up to week 5, create predictive models and examine their performance
```{r warning=FALSE, message=FALSE}
models <- list()
eval_measures <- list()

for(k in 1:5) {
  
  print(paste("Starting computations for week", k))
  
  # create the dataset (features + outcome variable) for the given number of weeks (k) 
  
  
  # split the data into train and test sets
  

  # build the model (through CV) and compute eval.measures
  
  
  # add the model and its evaluation measures to the corresponding lists 
  
}
```
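
For the train/test split step, the sketch below draws a stratified 80/20 split in base R, sampling within each outcome class so that both sets keep the class proportions. The 80/20 ratio, the toy data, and the variable names are assumptions; caret's `createDataPartition()` is a commonly used alternative for the same purpose.

```r
set.seed(2023)

# Hypothetical data set: one feature plus the outcome variable
w_data <- data.frame(
  aday_cnt = 1:20,
  course_outcome = factor(rep(c("HIGH", "LOW"), each = 10))
)

# Row indices grouped by class, then 80% sampled from each class
idx_by_class <- split(seq_len(nrow(w_data)), w_data$course_outcome)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}))

train_data <- w_data[train_idx, ]
test_data <- w_data[-train_idx, ]

table(train_data$course_outcome)
table(test_data$course_outcome)
```

Note that `sample()` behaves differently when given a single number, so this sketch assumes each class has more than one observation.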

Compare the models based on the evaluation measures
```{r}
# transform the eval_measures list into a df


# embellish the evaluation report by: 
# 1) adding the week column; 
# 2) rounding the metric values to 4 digits; 
# 3) rearranging the order of columns 

```
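
The steps listed in the comments above can be sketched in base R as follows. The metric values are placeholders invented for the example, not results from the course data, and only two weeks are shown.

```r
# Placeholder evaluation measures for two models (made-up numbers)
eval_measures <- list(
  c(Accuracy = 0.61234, Precision = 0.6, Recall = 0.58333, F1 = 0.59155),
  c(Accuracy = 0.70123, Precision = 0.68, Recall = 0.72222, F1 = 0.70048)
)

# Stack the named vectors row-wise and convert to a data frame
eval_df <- as.data.frame(do.call(rbind, eval_measures))

# Round the metrics, add the week column, and move it to the front
eval_df <- round(eval_df, 4)
eval_df$week <- seq_along(eval_measures)
eval_df <- eval_df[, c("week", "Accuracy", "Precision", "Recall", "F1")]
eval_df
```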


Examine the importance of features in a model from early in the course that performs well
```{r}

```
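
A fitted rpart tree stores feature importance scores in its `variable.importance` element, which is one way to inspect the models built above. The sketch below uses the `kyphosis` data set that ships with rpart as a stand-in, since the course data is not available here.

```r
library(rpart)

# Stand-in model fitted on rpart's built-in kyphosis data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Named numeric vector of importance scores, largest first
importance <- sort(fit$variable.importance, decreasing = TRUE)

# Express the scores as shares of the total to ease comparison
importance_share <- round(importance / sum(importance), 3)
importance_share
```

The same element should be available on the object returned by `build_DT_model()`, since that function returns the underlying rpart model.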