Load the required R packages (additional ones will be loaded as needed)
library(ggplot2)
library(dplyr)
library(tidyr)
#library(lubridate)
events = read.csv("data/events.csv")
str(events)
'data.frame': 95626 obs. of 4 variables:
$ user : chr "9d744e5bf" "91489f7a9" "278a75edf" "53d6ab60c" ...
$ log : chr "Assignment: Final Project" "Assignment: Final Project" "Assignment: Final Project" "Assignment: Final Project" ...
$ action: chr "Assignment" "Assignment" "Assignment" "Assignment" ...
$ ts : chr "2019-10-26 09:37:12" "2019-10-26 09:09:34" "2019-10-18 12:05:28" "2019-10-19 13:28:37" ...
Transform time stamp data into the format suitable for processing date and time data
events$ts <- as.POSIXct(events$ts)
events$ts[1:10]
[1] "2019-10-26 09:37:12 CEST" "2019-10-26 09:09:34 CEST" "2019-10-18 12:05:28 CEST" "2019-10-19 13:28:37 CEST"
[5] "2019-10-15 23:38:13 CEST" "2019-10-18 17:51:43 CEST" "2019-10-18 15:22:56 CEST" "2019-10-22 13:46:51 CEST"
[9] "2019-10-15 14:58:17 CEST" "2019-10-19 13:28:38 CEST"
We can order the events, for each user, based on the time stamp
Note: we will be using R’s pipe notation (|>) to make the code easier to understand and follow
events |>
arrange(user, ts) -> events
head(events)
Let’s start by examining the time range the data is available for. It should (roughly) coincide with the start and the end of the course
course_start <- min(events$ts)
print(course_start)
[1] "2019-09-09 14:08:01 CEST"
#print(wday(course_start, week_start = 1, label = T))
course_end <- max(events$ts)
print(course_end)
[1] "2019-10-27 19:27:41 CET"
#print(wday(course_end, week_start = 1, label = T))
The course length (in weeks):
difftime(course_end, course_start, units="week")
Time difference of 6.894808 weeks
Since we want to make predictions based on the first couple of weeks data, we need to add the week variable
events |>
mutate(week = strftime(ts, "%V") |> as.integer(),
current_week = week - min(week) + 1) |>
select(-week) -> events
Check the distribution of action counts across the course weeks
table(events$current_week)
1 2 3 4 5 6 7
9271 11471 13699 18115 21838 15742 5490
Also in proportions
table(events$current_week) |> prop.table() |> round(digits = 3)
1 2 3 4 5 6 7
0.097 0.120 0.143 0.189 0.228 0.165 0.057
Examine character variables that represent different types of actions and logged events
apply(events |> select(action, log),
2,
function(x) length(unique(x)))
action log
12 80
Let’s examine the actions
table(events$action)|> prop.table() |> round(digits = 3)
Applications Assignment Course_view Ethics Feedback General Group_work Instructions La_types
0.008 0.077 0.264 0.011 0.033 0.035 0.342 0.068 0.020
Practicals Social Theory
0.105 0.023 0.014
Some of these actions refer to individual course topics, that is, to the access to lecture materials on distinct course topics. These are: General, Applications, Theory, Ethics, Feedback, La_types. We will rename the actions to make the meaning clearer
course_topics <- c("General", "Applications", "Theory", "Ethics", "Feedback", "La_types")
events |>
mutate(action = ifelse(test = action %in% course_topics,
yes = paste("Lecture", action, sep = "_"),
no = action)) -> events
unique(events$action)
[1] "Course_view" "Instructions" "Group_work" "Lecture_General" "Practicals"
[6] "Social" "Lecture_La_types" "Assignment" "Lecture_Feedback" "Lecture_Applications"
[11] "Lecture_Ethics" "Lecture_Theory"
Examine also the log column
table(events$log) |> as.data.frame() |> arrange(desc(Freq)) |> View()
Too many distinct values, we will leave this variable aside, at least for now
grades <- read.csv("data/grades.csv")
str(grades)
'data.frame': 130 obs. of 2 variables:
$ user : chr "6eba3ff82" "05b604102" "111422ee7" "b4658c3a9" ...
$ grade: num 2.63 4.67 9.24 0 8.24 ...
Examine the summary statistics and distribution of the final grade
summary(grades$grade)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 5.666 7.954 7.254 9.006 10.000
ggplot(grades, aes(x = grade)) +
geom_density() +
labs(x = "Final grade",
title = "Distribution of the final grade") +
scale_x_continuous(breaks = 1:10) +
theme_minimal()
It is not normally distributed, but skewed towards higher grade values
Let’s add course_outcome as a binary variable indicating if a student had a good or weak course outcome. Students whose final grade is above 50th percentile (median) will be considered as having good course outcome (HIGH), the rest will be considered as having weak course outcome (LOW)
grades |>
mutate(course_outcome = ifelse(test=grade > median(grade),
yes = "High", no = "Low")) |>
mutate(course_outcome=factor(course_outcome)) -> grades
table(grades$course_outcome)
High Low
65 65
This gives us a perfectly balanced data set for the outcome prediction (classification) task.
Two groups of action-based features will be computed and used for prediction: (note: active days are days with at least one learning action)
Features based on learning action counts: ** Total number of each type of learning actions (considering active days only) ** Average number of actions (of any type) per day (considering active days only) ** Entropy of daily action counts (considering active days only)
Features based on number of active days ** Number of active days ** Average time distance between two consecutive active days
Since the idea is to create prediction models based on different number of weeks data, we will also need to compute feature values for different number of course weeks. Thus, we will create functions that compute features based on the data for the given number of course weeks (the input parameter).
To compute features based on counts per day, we need to add the date variable
events$date = as.Date(events$ts)
Note: to avoid having too many features (as action counts), we will consider all actions related to access to the lecture materials on different topics as one kind of action (‘Lecture’)
actions_tot_count <- function(events_data) {
events_data |>
mutate(action = ifelse(startsWith(action, "Lecture"), yes = "Lecture", no = action)) |>
group_by(user, action) |>
count() |>
pivot_wider(id_cols = user,
names_from = action,
names_prefix = "cnt_",
values_from = n,
values_fill = 0)
}
Check the function with the data from the first two weeks of the course
events_2_weeks <- events %>% filter(current_week <= 2)
actions_tot_count(events_2_weeks)
avg_actions_per_day = function(events_data) {
events_data |>
count(user, date) |>
group_by(user) |>
summarise(avg_daily = median(n))
}
avg_actions_per_day(events_2_weeks)
Entropy is a measure of disorder in a system. Here it is used as an indicator of regularity of learning: lower the entropy, higher is the regularity and vice versa. Note: A nice explanation of the intuition behind the formula of Shannon entropy is given in this video.
Since we want to compute entropy of daily action counts, we need to compute (approximate) the probability of action counts for each day. We will do that by taking the proportion of daily action counts with respect to the total action counts for the given student
entropy_of_action_counts = function(events_data) {
events_data |>
group_by(user) |>
mutate(tot_action_count = n())|>
ungroup() |>
group_by(user, date) |>
summarise(daily_action_count = n(),
daily_action_prop = daily_action_count/tot_action_count) |>
ungroup() |>
distinct() |>
group_by(user) |>
summarise(entropy = -sum(daily_action_prop*log(daily_action_prop)))
}
entropy_of_action_counts(events_2_weeks)
`summarise()` has grouped output by 'user', 'date'. You can override using the `.groups` argument.
active_days_count = function(events_data) {
events_data |>
group_by(user) |>
summarise(active_days_cnt = n_distinct(date))
}
active_days_count(events_2_weeks)
Note: for student with only 1 active day, avg_aday_dist will be NA. To avoid losing students due to the missing value of this feature, we will replace NAs with a large number (e.g., 2 x max distance), thus indicating that a student rarely (if ever) got back to the course activities
avg_dist_active_days = function(events_data) {
events_data |>
group_by(user) |>
distinct(date) |>
arrange(date) |>
mutate(prev_aday = lag(date)) |>
mutate(aday_dist = ifelse(is.na(prev_aday),
yes = NA,
no = difftime(date, prev_aday, units = "days"))) |>
summarise(avg_aday_dist = median(aday_dist, na.rm = TRUE)) |>
ungroup() -> df
max_aday_dist = max(df$avg_aday_dist, na.rm = TRUE)
df |>
mutate(avg_aday_dist = ifelse(is.na(avg_aday_dist),
yes = 2*max_aday_dist,
no = avg_aday_dist))
}
avg_dist_active_days(events_2_weeks) |> summary()
user avg_aday_dist
Length:128 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.500
Mean : 2.469
3rd Qu.: 2.000
Max. :20.000
Create a function that will allow for creating a feature set for any (given) number of course weeks
create_feature_set = function(events_data) {
f1 = actions_tot_count(events_data)
f2 = avg_actions_per_day(events_data)
f3 = entropy_of_action_counts(events_data)
f4 = active_days_count(events_data)
f5 = avg_dist_active_days(events_data)
f1 |>
inner_join(f2, by='user') |>
inner_join(f3, by='user') |>
inner_join(f4, by='user') |>
inner_join(f5, by='user') |>
as.data.frame()
}
Create the feature set based on the first two weeks of data
w2_feature_set = create_feature_set(events_2_weeks)
`summarise()` has grouped output by 'user', 'date'. You can override using the `.groups` argument.
str(w2_feature_set)
'data.frame': 128 obs. of 12 variables:
$ user : chr "00a05cc62" "042b07ba1" "046c35846" "05b604102" ...
$ cnt_Course_view : int 31 40 15 14 21 92 65 96 45 77 ...
$ cnt_Group_work : int 18 45 1 1 32 86 31 79 27 123 ...
$ cnt_Instructions: int 9 19 6 7 4 32 29 28 13 27 ...
$ cnt_Lecture : int 11 9 1 1 0 48 9 35 22 23 ...
$ cnt_Practicals : int 5 1 1 0 3 16 16 3 3 2 ...
$ cnt_Social : int 4 16 0 0 0 14 2 29 1 11 ...
$ cnt_Assignment : int 0 3 0 1 0 3 3 8 0 12 ...
$ avg_daily : num 10 28 12 12 26 20 25 29 11 8 ...
$ entropy : num 1.373 1.043 0.512 0.512 0.811 ...
$ active_days_cnt : int 5 4 2 2 3 11 7 10 5 9 ...
$ avg_aday_dist : num 2 2 10 10 2 1 1.5 1 2 1 ...
summary(w2_feature_set)
user cnt_Course_view cnt_Group_work cnt_Instructions cnt_Lecture cnt_Practicals
Length:128 Min. : 3.00 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
Class :character 1st Qu.: 20.50 1st Qu.: 20.00 1st Qu.:10.75 1st Qu.: 4.00 1st Qu.: 1.000
Mode :character Median : 42.50 Median : 32.00 Median :18.00 Median :15.00 Median : 3.000
Mean : 50.88 Mean : 49.00 Mean :21.16 Mean :21.57 Mean : 6.016
3rd Qu.: 66.50 3rd Qu.: 66.25 3rd Qu.:29.00 3rd Qu.:34.00 3rd Qu.: 9.000
Max. :212.00 Max. :205.00 Max. :73.00 Max. :80.00 Max. :33.000
cnt_Social cnt_Assignment avg_daily entropy active_days_cnt avg_aday_dist
Min. : 0.000 Min. : 0.000 Min. : 8.00 Min. :0.000 Min. : 1.000 Min. : 1.000
1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.:11.00 1st Qu.:1.043 1st Qu.: 4.000 1st Qu.: 1.000
Median : 4.500 Median : 1.000 Median :17.00 Median :1.593 Median : 7.000 Median : 1.500
Mean : 9.445 Mean : 3.977 Mean :18.12 Mean :1.423 Mean : 6.586 Mean : 2.469
3rd Qu.:15.000 3rd Qu.: 4.000 3rd Qu.:23.00 3rd Qu.:1.890 3rd Qu.: 9.000 3rd Qu.: 2.000
Max. :45.000 Max. :20.000 Max. :41.50 Max. :2.316 Max. :13.000 Max. :20.000
Add the outcome variable
w2_feature_set |>
inner_join(grades |> select(user, course_outcome) ) -> w2_data
Joining, by = "user"
Examine the relevance of features for the prediction of the outcome variable
Let’s first see how we can do it for one variable
ggplot(w2_data, aes(x = cnt_Course_view, fill=course_outcome)) +
geom_density(alpha=0.5) +
theme_minimal()
Now, do for all at once
Note: the notation .data[[fn]]
in the code below allow
us to access column from the ‘current’ data frame (in this case,
w2_data
) with the name given as the input variable of the
function (fn
)
feature_names <- colnames(w2_feature_set |> select(-user))
lapply(feature_names,
function(fn) {
ggplot(w2_data, aes(x = .data[[fn]], fill=course_outcome)) +
geom_density(alpha=0.5) +
theme_minimal()
})
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
[[7]]
[[8]]
[[9]]
[[10]]
[[11]]
The plots suggest that all features are potentially relevant for predicting the course outcome.
Load additional R packages required for model building and evaluation
library(caret)
library(rpart)
We will use decision tree (as implemented in the rpart package) as
the classification method, and will build a couple of decision tree (DT)
models, one for each of the first five weeks of the course. We will
build each model using the optimal value of the cp
hyper-parameter, identified through 10-fold cross-validation (as we did
before).
We will evaluate the models using the same metrics used before: accuracy, precision, recall, F1
build_DT_model <- function(train_data) {
cp_grid <- expand.grid(.cp = seq(0.001, 0.1, 0.005))
ctrl <- trainControl(method = "CV",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
dt <- train(x = train_data |> select(-course_outcome),
y = train_data$course_outcome,
method = "rpart",
metric = "ROC",
tuneGrid = cp_grid,
trControl = ctrl)
dt$finalModel
}
get_evaluation_measures <- function(model, test_data) {
predicted_vals <- predict(model,
test_data |> select(-course_outcome),
type = 'class')
actual_vals <- test_data$course_outcome
cm <- table(actual_vals, predicted_vals)
# low achievement in the course is considered the positive class
TP <- cm[2,2]
TN <- cm[1,1]
FP <- cm[1,2]
FN <- cm[2,1]
accuracy = sum(diag(cm)) / sum(cm)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
F1 <- (2 * precision * recall) / (precision + recall)
c(Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1 = F1)
}
Starting from week 1, up to week 5, create predictive models and examine their performance
models <- list()
eval_measures <- list()
for(k in 1:5) {
print(paste("Starting computations for week", k))
week_k_events <- events |> filter(current_week <= k)
ds <- create_feature_set(week_k_events)
ds <- inner_join(ds, grades)
set.seed(2023)
train_indices <- createDataPartition(ds$course_outcome,
p = 0.8, list = FALSE)
train_ds <- ds[train_indices,] |> select(-c(user, grade))
test_ds <- ds[-train_indices,] |> select(-c(user, grade))
dt <- build_DT_model(train_ds)
eval_dt <- get_evaluation_measures(dt, test_ds)
models[[k]] <- dt
eval_measures[[k]] <- eval_dt
}
[1] "Starting computations for week 1"
[1] "Starting computations for week 2"
[1] "Starting computations for week 3"
[1] "Starting computations for week 4"
[1] "Starting computations for week 5"
Compare the models based on the evaluation measures
# transform the eval_measures list into a df
eval_df <- bind_rows(eval_measures)
# embellish the evaluation report by:
# 1) adding the week column;
# 2) rounding the metric values to 4 digits;
# 3) rearanging the order of columns
eval_df |>
mutate(week = 1:5) |>
mutate(across(Accuracy:F1, round, digits=4)) |>
select(week, Accuracy:F1)
The results suggest that already early in this course, it is possible to fairly well predict students at risk of low performance.
Examine the importance of features in an early in the course model with good performance
models[[2]]$variable.importance |>
as.data.frame() |>
rename(importance = `models[[2]]$variable.importance`)
Regularity of daily action counts and the frequency of accessing the information on the course homepage (course info, syllabus, news, …) are the most important features. Next come the engagement in group work as well as the level of the overall engagement in the course (number of active days and average number of actions per (active) day)