Title: | Process Data Analysis |
---|---|
Description: | Provides tools for exploratory process data analysis. Process data refers to the data describing participants' problem-solving processes in computer-based assessments. It is often recorded in computer log files. This package provides functions to read, process, and write process data. It also implements two feature extraction methods to compress the information stored in process data into standard numerical vectors. This package also provides recurrent neural network based models that relate response processes with other binary or scale variables of interest. The functions that involve training and evaluating neural networks are wrappers of functions in 'keras'. |
Authors: | Xueying Tang [aut, cre], Susu Zhang [aut], Zhi Wang [aut], Jingchen Liu [aut], Zhiliang Ying [aut] |
Maintainer: | Xueying Tang <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.4.0 |
Built: | 2025-02-15 05:55:47 UTC |
Source: | https://github.com/xytangtang/procdata |
General tools for exploratory process data analysis. Process data refers to the data describing participants' problem-solving processes in computer-based assessments. It is often recorded in computer log files. This package provides a process dataset and functions for reading processes from a csv file, manipulating processes, and generating action sequences. It also implements two automatic feature extraction methods that compress the information stored in process data, which often has a nonstandard format, into standard numerical vectors. This package also provides recurrent neural network based models that relate response processes with other binary or scale variables of interest. The functions that involve training and evaluating neural networks are based on functions in keras.
ProcData
organizes response processes as an object of class proc
.
Some functions are provided for summarizing and manipulating proc
objects.
summary.proc
calculates summary statistics of a proc
object.
remove_action
removes actions and the corresponding timestamps
replace_action
replaces an action by another action
combine_actions
combines consecutive actions into one action.
read.seqs
reads response processes from a csv file.
seq_gen
generates action sequences of an imaginary simulation-based item.
seq_gen2
generates action sequences according to a given probability
transition matrix.
seq_gen3
generates action sequences according to a recurrent neural network.
seq2feature_mds
extracts features from response processes by
multidimensional scaling.
seq2feature_seq2seq
extracts features from response processes by
autoencoder.
seq2feature_ngram
extracts ngram features from response processes.
seqm
fits a neural network model that relates response processes
with a response variable.
predict.seqm
makes predictions from the models fitted by seqm
.
Maintainer: Xueying Tang [email protected]
Authors:
Susu Zhang [email protected]
Zhi Wang [email protected]
Jingchen Liu [email protected]
Zhiliang Ying [email protected]
Useful links:
Report bugs at https://github.com/xytangtang/ProcData/issues
Summarize action sequences
action_seqs_summary(action_seqs)
action_seqs |
a list of action sequences. |
a list containing the following objects:
n_seq |
the number of action sequences |
n_action |
the number of distinct actions |
action |
the action set |
seq_length |
sequence lengths |
action_freq |
action counts |
action_seqfreq |
the number of sequences in which each action appears |
trans_count |
a |
time_seqs_summary
for summarizing timestamp sequences.
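The summary components listed above can be pictured with a few lines of base R. This is a minimal sketch on hypothetical toy data, not the package's implementation; it only reproduces the documented quantities `n_seq`, `n_action`, `action`, `seq_length`, `action_freq`, and `action_seqfreq`.

```r
# Hypothetical toy data: three action sequences (character vectors).
action_seqs <- list(
  c("Start", "OPT1_1", "RUN", "End"),
  c("Start", "End"),
  c("Start", "OPT1_1", "OPT1_1", "RUN", "End")
)

all_actions <- unlist(action_seqs)
action      <- sort(unique(all_actions))   # the action set
n_seq       <- length(action_seqs)         # number of action sequences
n_action    <- length(action)              # number of distinct actions
seq_length  <- lengths(action_seqs)        # length of each sequence

# Total number of times each action occurs across all sequences.
action_freq <- table(factor(all_actions, levels = action))

# Number of sequences in which each action appears at least once.
action_seqfreq <- sapply(action, function(a) {
  sum(sapply(action_seqs, function(s) a %in% s))
})
```

With the package loaded, `action_seqs_summary(action_seqs)` is the documented way to obtain these values.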
action2entropy
fit a recurrent-neural-network-based action prediction
model to a set of action sequences action_seqs
.
action2entropy(action_seqs, rnn_dim = 20, n_epoch = 50, step_size = 0.001, batch_size = 1, optimizer_name = "rmsprop", index_valid = 0.2, verbose = FALSE)
action_seqs |
a list of action sequences |
rnn_dim |
latent dimension of RNN |
n_epoch |
the number of training epochs. |
step_size |
the learning rate of optimizer. |
batch_size |
the batch size used in training. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
index_valid |
proportion of sequences used as the validation set or a vector of indices specifying the validation set. |
verbose |
logical. If TRUE, training progress is printed. |
action2entropy
returns a list containing
entropy_seqs |
a list of entropy sequences. The length of each entropy sequence is one less than that of the corresponding action sequence. |
loss_history |
a |
rnn_dim |
the latent dimension of the recurrent neural network |
model_fit |
a vector of class |
actions |
a vector of the actions in |
max_len |
maximum length of the action sequences. |
Wang, Z., Tang, X., Liu, J., and Ying, Z. (2020) Subtask analysis of process data through a predictive model. https://arxiv.org/abs/2009.00717
entropy2segment
and segment2subtask
for
steps 2 and 3 of the subtask analysis procedure; subtask_analysis
for the complete procedure.
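The relationship between an action sequence and its entropy sequence can be sketched in base R. This sketch assumes that each entropy value is the Shannon entropy of a predicted distribution over the next action; that reading, and the names `pred_probs` and `shannon_entropy`, are assumptions for illustration and are not taken from the package. Under this assumption, a sequence of 4 actions yields predictions for actions 2 to 4, so the entropy sequence has 3 elements.

```r
# Hypothetical predicted distributions for a sequence of 4 actions:
# one row per predicted step (actions 2, 3, 4), one column per action.
pred_probs <- rbind(
  c(0.25, 0.25, 0.25, 0.25),  # very uncertain prediction
  c(0.70, 0.10, 0.10, 0.10),  # fairly confident prediction
  c(1.00, 0.00, 0.00, 0.00)   # fully certain prediction
)

# Shannon entropy of one probability vector (natural log).
shannon_entropy <- function(p) {
  p <- p[p > 0]               # treat 0 * log(0) as 0
  -sum(p * log(p))
}

entropy_seq <- apply(pred_probs, 1, shannon_entropy)
```

High entropy marks steps where the next action is hard to predict; low entropy marks steps where it is nearly determined.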
aseq2feature_seq2seq
extract features from action sequences by action
sequence autoencoder.
aseq2feature_seq2seq(aseqs, K, rnn_type = "lstm", n_epoch = 50, method = "last", step_size = 1e-04, optimizer_name = "adam", samples_train, samples_valid, samples_test = NULL, pca = TRUE, verbose = TRUE, return_theta = TRUE)
aseqs |
a list of |
K |
the number of features to be extracted. |
rnn_type |
the type of recurrent unit to be used for modeling
response processes. |
n_epoch |
the number of training epochs for the autoencoder. |
method |
the method for computing features from the output of a
recurrent neural network in the encoder. Available options are
|
step_size |
the learning rate of optimizer. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
samples_train |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
samples_valid |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
samples_test |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
pca |
logical. If TRUE, the principal components of features are returned. Default is TRUE. |
verbose |
logical. If TRUE, training progress is printed. |
return_theta |
logical. If TRUE, extracted features are returned. |
This function trains a sequence-to-sequence autoencoder using keras. The encoder of the autoencoder consists of an embedding layer and a recurrent neural network. The decoder consists of another recurrent neural network and a fully connected layer with softmax activation. The outputs of the encoder are the extracted features.
The output of the encoder is a function of the encoder recurrent neural network.
It is the last output of the encoder recurrent neural network if method="last"
and the average of the encoder recurrent neural network outputs if method="avg"
.
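The difference between the two `method` options can be illustrated on a small matrix. This is a sketch only: `rnn_out` is a hypothetical matrix of encoder outputs for one sequence (one row per time step, one column per latent unit), not an object returned by the package.

```r
# Hypothetical encoder outputs for one sequence: 4 time steps, K = 3 units.
rnn_out <- matrix(c(0.1, 0.2, 0.3,
                    0.0, 0.4, 0.1,
                    0.5, 0.1, 0.2,
                    0.2, 0.3, 0.6), nrow = 4, byrow = TRUE)

theta_last <- rnn_out[nrow(rnn_out), ]  # method = "last": final step only
theta_avg  <- colMeans(rnn_out)         # method = "avg": mean over all steps
```

Either way, each sequence is reduced to a vector of length K, regardless of how many steps it contains.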
aseq2feature_seq2seq
returns a list containing
theta |
a matrix containing |
train_loss |
a vector of length |
valid_loss |
a vector of length |
test_loss |
a vector of length |
chooseK_seq2seq
for choosing K
through cross-validation.
Other feature extraction methods: atseq2feature_seq2seq
,
seq2feature_mds_large
,
seq2feature_mds
,
seq2feature_ngram
,
seq2feature_seq2seq
,
tseq2feature_seq2seq
# Run only if Python can import tensorflow
if (!system("python -c 'import tensorflow as tf'", ignore.stdout = TRUE, ignore.stderr = TRUE)) {
  n <- 50
  seqs <- seq_gen(n)
  seq2seq_res <- aseq2feature_seq2seq(seqs$action_seqs, 5, rnn_type = "lstm", n_epoch = 5,
                                      samples_train = 1:40, samples_valid = 41:50)
  features <- seq2seq_res$theta
  # training loss in blue, validation loss in red
  plot(seq2seq_res$train_loss, col = "blue", type = "l")
  lines(seq2seq_res$valid_loss, col = "red")
}
atseq2feature_seq2seq
extract features from action and timestamp sequences by a
sequence autoencoder.
atseq2feature_seq2seq(atseqs, K, weights = c(1, 0.5), cumulative = FALSE, log = TRUE, rnn_type = "lstm", n_epoch = 50, method = "last", step_size = 1e-04, optimizer_name = "rmsprop", samples_train, samples_valid, samples_test = NULL, pca = TRUE, verbose = TRUE, return_theta = TRUE)
atseqs |
a list of two elements, first element is the list of |
K |
the number of features to be extracted. |
weights |
a vector of 2 elements for the weight of the loss of action sequences (categorical_crossentropy) and time sequences (mean squared error), respectively. The total loss is calculated as the weighted sum of the two losses. |
cumulative |
logical. If TRUE, the sequence of cumulative time up to each event is used as input to the neural network. If FALSE, the sequence of inter-arrival time (gap time between an event and the previous event) will be used as input to the neural network. Default is FALSE. |
log |
logical. If TRUE, for the timestamp sequences, input of the neural net is the base-10 log of the original sequence of times plus 1 (i.e., log10(t+1)). If FALSE, the original sequence of times is used. |
rnn_type |
the type of recurrent unit to be used for modeling
response processes. |
n_epoch |
the number of training epochs for the autoencoder. |
method |
the method for computing features from the output of a
recurrent neural network in the encoder. Available options are
|
step_size |
the learning rate of optimizer. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
samples_train |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
samples_valid |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
samples_test |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
pca |
logical. If TRUE, the principal components of features are returned. Default is TRUE. |
verbose |
logical. If TRUE, training progress is printed. |
return_theta |
logical. If TRUE, extracted features are returned. |
This function trains a sequence-to-sequence autoencoder using keras. The encoder of the autoencoder consists of a recurrent neural network. The decoder consists of another recurrent neural network followed by a fully connected layer with softmax activation for actions and another fully connected layer with ReLU activation for times. The outputs of the encoder are the extracted features.
The output of the encoder is a function of the encoder recurrent neural network.
It is the last latent state of the encoder recurrent neural network if method="last"
and the average of the encoder recurrent neural network latent states if method="avg"
.
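The effect of the `cumulative` and `log` arguments on the timestamp input can be sketched in base R. The timestamps below are hypothetical, and giving the first event a gap of 0 is an assumption of this sketch rather than documented package behaviour.

```r
# Hypothetical timestamps (time elapsed since the start) for one process.
time_seq <- c(0, 2, 9, 99)

# cumulative = FALSE: gap between each event and the previous event.
# The first event is assigned a gap of 0 here (an assumption).
gap_time <- c(0, diff(time_seq))

# log = TRUE: base-10 log of the times plus 1, i.e. log10(t + 1).
input_gap <- log10(gap_time + 1)   # cumulative = FALSE, log = TRUE
input_cum <- log10(time_seq + 1)   # cumulative = TRUE,  log = TRUE
```

The log transform compresses long pauses, so a few very slow steps do not dominate the mean squared error part of the loss.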
atseq2feature_seq2seq
returns a list containing
theta |
a matrix containing |
train_loss |
a vector of length |
valid_loss |
a vector of length |
test_loss |
a vector of length |
chooseK_seq2seq
for choosing K
through cross-validation.
Other feature extraction methods: aseq2feature_seq2seq
,
seq2feature_mds_large
,
seq2feature_mds
,
seq2feature_ngram
,
seq2feature_seq2seq
,
tseq2feature_seq2seq
# Run only if Python can import tensorflow
if (!system("python -c 'import tensorflow as tf'", ignore.stdout = TRUE, ignore.stderr = TRUE)) {
  n <- 50
  data(cc_data)
  samples <- sample(1:length(cc_data$seqs$time_seqs), n)
  atseqs <- sub_seqs(cc_data$seqs, samples)
  action_and_time_seq2seq_res <- atseq2feature_seq2seq(atseqs, 5, rnn_type = "lstm", n_epoch = 5,
                                                       samples_train = 1:40, samples_valid = 41:50)
  features <- action_and_time_seq2seq_res$theta
  # training loss in blue, validation loss in red, on a common y range
  plot(action_and_time_seq2seq_res$train_loss, col = "blue", type = "l",
       ylim = range(c(action_and_time_seq2seq_res$train_loss,
                      action_and_time_seq2seq_res$valid_loss)))
  lines(action_and_time_seq2seq_res$valid_loss, col = "red", type = 'l')
}
Calculate "oss_action" dissimilarity matrix through Rcpp
calculate_dist_cpp(seqs)
seqs |
a list of action sequences |
calculate_dist_cpp
returns the "oss_action" dissimilarity matrix of
the action sequences in seqs
.
A dataset containing the response processes and binary response outcomes of 16763 respondents.
cc_data
A list with two elements.
An object of class "proc"
containing the action sequences and
the time sequences of the respondents.
Binary responses of 16763 respondents. The order of the respondents
matches that in seqs
.
item interface: http://www.oecd.org/pisa/test-2012/testquestions/question3/
chooseK_mds
choose the number of multidimensional scaling features
to be extracted by cross-validation.
chooseK_mds(seqs = NULL, K_cand, dist_type = "oss_action", n_fold = 5, max_epoch = 100, step_size = 0.01, tot = 1e-06, return_dist = FALSE, L_set = 1:3)
seqs |
a |
K_cand |
the candidates of the number of features. |
dist_type |
a character string specifying the dissimilarity measure for two response processes. See 'Details'. |
n_fold |
the number of folds for cross-validation. |
max_epoch |
the maximum number of epochs for stochastic gradient descent. |
step_size |
the step size of stochastic gradient descent. |
tot |
the accuracy tolerance for determining convergence. |
return_dist |
logical. If |
L_set |
length of ngrams considered |
chooseK_mds
returns a list containing
K |
the value in |
K_cand |
the candidates of the number of features. |
cv_loss |
the cross-validation loss for each candidate in |
dist_mat |
the dissimilarity matrix. This element exists only if |
Gomez-Alonso, C. and Valls, A. (2008). A similarity measure for sequences of categorical data based on the ordering of common elements. In V. Torra & Y. Narukawa (Eds.) Modeling Decisions for Artificial Intelligence, (pp. 134-145). Springer Berlin Heidelberg.
seq2feature_mds
for feature extraction after choosing
the number of features.
n <- 50
set.seed(12345)
seqs <- seq_gen(n)
# choose the number of features among 5 to 10 and keep the dissimilarity matrix
K_res <- chooseK_mds(seqs, 5:10, return_dist = TRUE)
theta <- seq2feature_mds(K_res$dist_mat, K_res$K)$theta
chooseK_seq2seq
chooses the number of features to be extracted
by cross-validation.
chooseK_seq2seq(seqs, ae_type, K_cand, rnn_type = "lstm", n_epoch = 50, method = "last", step_size = 1e-04, optimizer_name = "adam", n_fold = 5, cumulative = FALSE, log = TRUE, weights = c(1, 0.5), valid_prop = 0.1, verbose = TRUE)
seqs |
an object of class |
ae_type |
a string specifying the type of autoencoder. The autoencoder can be an action sequence autoencoder ("action"), a time sequence autoencoder ("time"), or an action-time sequence autoencoder ("both"). |
K_cand |
the candidates of the number of features. |
rnn_type |
the type of recurrent unit to be used for modeling
response processes. |
n_epoch |
the number of training epochs for the autoencoder. |
method |
the method for computing features from the output of a
recurrent neural network in the encoder. Available options are
|
step_size |
the learning rate of optimizer. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
n_fold |
the number of folds for cross-validation. |
cumulative |
logical. If TRUE, the sequence of cumulative time up to each event is used as input to the neural network. If FALSE, the sequence of inter-arrival time (gap time between an event and the previous event) will be used as input to the neural network. Default is FALSE. |
log |
logical. If TRUE, for the timestamp sequences, input of the neural net is the base-10 log of the original sequence of times plus 1 (i.e., log10(t+1)). If FALSE, the original sequence of times is used. |
weights |
a vector of 2 elements for the weight of the loss of action sequences (categorical_crossentropy) and time sequences (mean squared error), respectively. The total loss is calculated as the weighted sum of the two losses. |
valid_prop |
the proportion of validation samples in each fold. |
verbose |
logical. If TRUE, training progress is printed. |
chooseK_seq2seq
returns a list containing
K |
the candidate in |
K_cand |
the candidates of number of features. |
cv_loss |
the cross-validation loss for each candidate in |
seq2feature_seq2seq
for feature extraction given the number of features.
Combine the action pattern described in old_actions
into a single action
new_action
. The timestamp of the combined action can be the timestamp of the
first action in the action pattern, the timestamp of the last action in the action
pattern, or the average of the two timestamps.
combine_actions(seqs, old_actions, new_action, timestamp = "first")
seqs |
an object of class |
old_actions |
a character vector giving consecutive actions to be replaced. |
new_action |
a string giving the combined action |
timestamp |
"first", "last", or "avg", specifying how the timestamp of the combined action should be derived. |
an object of class "proc"
seqs <- seq_gen(100)
new_seqs <- combine_actions(seqs, old_actions = c("OPT1_3", "OPT2_2", "RUN"), new_action = "KEY_ACTION")
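The combining rule and the three `timestamp` options can be sketched for a single sequence in base R. The helper `combine_one` is hypothetical and is not part of the package; it scans the sequence from left to right and replaces each non-overlapping occurrence of the consecutive pattern.

```r
# Hypothetical helper: combine one consecutive action pattern in one sequence.
combine_one <- function(actions, times, old_actions, new_action,
                        timestamp = c("first", "last", "avg")) {
  timestamp <- match.arg(timestamp)
  m <- length(old_actions)
  out_a <- character(0)
  out_t <- numeric(0)
  i <- 1
  while (i <= length(actions)) {
    j <- i + m - 1
    if (j <= length(actions) && all(actions[i:j] == old_actions)) {
      # Pattern found at positions i..j: emit one combined action.
      t_new <- switch(timestamp,
                      first = times[i],
                      last  = times[j],
                      avg   = (times[i] + times[j]) / 2)
      out_a <- c(out_a, new_action)
      out_t <- c(out_t, t_new)
      i <- j + 1
    } else {
      # No match here: copy the action and its timestamp unchanged.
      out_a <- c(out_a, actions[i])
      out_t <- c(out_t, times[i])
      i <- i + 1
    }
  }
  list(actions = out_a, times = out_t)
}

res <- combine_one(c("Start", "OPT1_3", "OPT2_2", "RUN", "End"),
                   c(0, 3, 5, 9, 12),
                   old_actions = c("OPT1_3", "OPT2_2", "RUN"),
                   new_action = "KEY_ACTION", timestamp = "avg")
```

Here the three pattern actions at times 3, 5, and 9 collapse into one `"KEY_ACTION"` whose timestamp is the average of the first and last timestamps in the pattern.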
This function counts the appearances of each action in actions
in
action sequence x
.
count_actions(x, actions)
x |
an action sequence. |
actions |
a set of actions whose number of appearances will be counted. |
an integer vector of counts.
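The counting can be reproduced in base R. This is a minimal sketch with hypothetical inputs, not the package's code.

```r
x <- c("Start", "OPT1_1", "RUN", "OPT1_1", "End")  # hypothetical action sequence
actions <- c("OPT1_1", "RUN", "CHECK_A")           # actions to count

# Count each requested action; actions absent from x get a count of 0.
counts <- vapply(actions, function(a) sum(x == a), integer(1))
```

The result has one entry per element of `actions`, in the same order, so it can be used directly as a row of an action frequency profile.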
entropy2segment
segments the entropy sequences in entropy_seqs
using segment_function
.
entropy2segment(entropy_seqs, lambda = 0.3, verbose = FALSE)
entropy_seqs |
a list of entropy sequences |
lambda |
a number between 0 and 1 |
verbose |
print progress if TRUE. default is FALSE |
a list containing the segment boundaries of each entropy sequence.
Wang, Z., Tang, X., Liu, J., and Ying, Z. (2020) Subtask analysis of process data through a predictive model. https://arxiv.org/abs/2009.00717
action2entropy
and segment2subtask
for
steps 1 and 3 of the subtask analysis procedure; subtask_analysis
for the complete procedure.
Plot Subtask Analysis Results for One Sequence
plot_subtask_seq(action_seq, entropy_seq, subtask_seq, subtasks, col.subtask = 1:length(subtasks), cex.action = 0.5, lty = 1, pch = 16, srt = -90, plot_legend = TRUE, legend_pos = "topleft", ...)
action_seq |
an action sequence |
entropy_seq |
an entropy sequence |
subtask_seq |
a subtask sequence |
subtasks |
a vector of all subtasks |
col.subtask |
a vector of colors for subtasks |
lty |
line types |
pch |
point characters |
plot_legend |
a logical value. If |
legend_pos |
a character string or the coordinates to be used to position the legend. |
... |
other arguments passed to |
this function does not return values
plot_subtask_seqs
for plotting results for all sequences.
plot.subtask
for the plot method of "subtask"
object.
Plot Subtask Analysis Results for Entire Dataset
plot_subtask_seqs(subtask_seqs, subtasks, max_len = 5, col.subtask = 1:length(subtasks), plot_legend = TRUE, legend_pos = "topright", ...)
subtask_seqs |
a list of subtask sequences |
subtasks |
a vector of all subtasks |
max_len |
maximum length of plotted subtasks |
col.subtask |
a vector of colors for subtasks |
plot_legend |
a logical value. If |
legend_pos |
a character string or the coordinates to be used to position the legend. |
... |
other arguments passed to |
this function does not return values
plot_subtask_seq
for plotting results for one sequence.
plot.subtask
for plotting an object of class "subtask".
Plot the subtask analysis results for either the entire dataset or individual sequences.
## S3 method for class 'subtask'
plot(object, type = "all", index = NULL, max_len = 5, col.subtask = 1:length(object$subtasks), cex.action = 0.5, lty = 1, pch = 16, srt = -90, plot_legend = TRUE, legend_pos = "topright", ...)
object |
an object of class |
type |
|
index |
a vector of indices of sequences to plot |
max_len |
maximum length of plotted subtasks |
col.subtask |
a vector of colors for subtasks |
lty |
line types |
pch |
point characters |
plot_legend |
a logical value. If |
legend_pos |
a character string or the coordinates to be used to position the legend. |
... |
other arguments passed to |
this function does not return values
plot_subtask_seq
, plot_subtask_seqs
.
Obtains predictions from a fitted sequence model object.
## S3 method for class 'seqm'
predict(object, new_seqs, new_covariates = NULL, type = "response", ...)
object |
a fitted object of class |
new_seqs |
an object of class |
new_covariates |
a new covariate matrix with which to predict. |
type |
a string specifying whether to predict responses ( |
... |
further arguments to be passed to |
It unserializes object$model_fit
to obtain a keras model of class
"keras.engine.training.Model"
and then calls predict
to obtain predictions.
If type="response"
, a vector of predictions. The vector gives the
probabilities of the response variable being one if response_type="binary"
.
If type="feature"
, a matrix of rnn outputs. If type="both"
, a list
containing both the vector of response variable prediction and the rnn output matrix.
seqm
for fitting sequence models.
"proc"
Print method for class "proc"
## S3 method for class 'proc'
print(x, n = 5, index = NULL, quote = FALSE, ...)
x |
an object of class |
n |
number of processes to be printed. |
index |
indices of processes to be printed. |
quote |
logical, indicating whether or not strings should be printed with surrounding quotes. |
... |
not used. |
print.proc
invisibly returns the "proc"
object it prints.
"summary.proc"
Print method for class "summary.proc"
## S3 method for class 'summary.proc'
print(x, ...)
x |
an object of class |
... |
not used. |
No return value.
"proc"
constructor
Create a "proc" object from given action sequences and timestamp sequences.
proc(action_seqs, time_seqs, ids = NULL)
action_seqs |
a list of action sequences. |
time_seqs |
a list of timestamp sequences. |
ids |
identifiers of response processes. |
An object of
class "proc"
is a list containing the following components:
action_seqs |
a list of action sequences. |
time_seqs |
a list of timestamp sequences. |
The names of the elements in seqs$action_seqs
and seqs$time_seqs
are
process identifiers.
an object of class "proc"
containing the provided action and
timestamp sequences.
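The structure described above can be mimicked with a plain list. This sketch uses hypothetical toy data and builds only a stand-in (`seqs_like`) for illustration; the actual constructor call is shown in the final comment and needs the package to be loaded.

```r
# Hypothetical toy data for two response processes.
ids <- c("p01", "p02")
action_seqs <- list(c("Start", "RUN", "End"), c("Start", "End"))
time_seqs   <- list(c(0, 4, 7), c(0, 2))

# Each timestamp sequence should line up with its action sequence.
stopifnot(all(lengths(action_seqs) == lengths(time_seqs)))

# Plain-list stand-in for the documented structure; element names are
# the process identifiers.
seqs_like <- list(action_seqs = setNames(action_seqs, ids),
                  time_seqs   = setNames(time_seqs, ids))

# With the package loaded, the documented call would be:
# seqs <- proc(action_seqs, time_seqs, ids = ids)
```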
Reads a csv file and creates response process data.
read.seqs(file, style, id_var = NULL, action_var = NULL, time_var = NULL, step_sep = ",", ...)
file |
the name of the csv file from which the response processes are to be read. |
style |
the style in which the response processes are stored. See 'Details'. |
id_var |
a string giving the name of the variable storing the process identifier. |
action_var |
a string giving the name of the variable storing action sequences. |
time_var |
a string giving the name of the variable storing timestamp sequences. |
step_sep |
the step separator characters. It is only used if |
... |
further arguments to be passed to |
read.seqs
calls read.csv
to read process data stored in a csv file into R
.
The csv file to be read should at least include an identifier of distinct response processes,
and action sequences. It can also include timestamp sequences.
The response processes (action sequences and timestamp sequences) stored in csv files can
be in one of the two styles, "single"
and "multiple"
. In "single"
style,
each response process occupies a single line. Actions and timestamps at different steps
are separated by step_sep
. In "multiple"
style, each response process occupies
multiple lines with each step taking up one line.
read.seqs
returns an object of class "proc"
.
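The two storage styles can be compared on a small example. The sketch below uses base R only; the column names (`id`, `actions`, `times`, `action`, `time`) and the data are hypothetical, and the parsing shown is an illustration of the layouts rather than the package's own reader.

```r
# "single" style: one line per process; steps separated by step_sep (",").
single_csv <- 'id,actions,times
p01,"Start,RUN,End","0,4,7"
p02,"Start,End","0,2"'
single <- read.csv(text = single_csv, stringsAsFactors = FALSE)
action_seqs_single <- strsplit(single$actions, ",", fixed = TRUE)
names(action_seqs_single) <- single$id

# "multiple" style: one line per step; rows sharing an id form one process.
multiple_csv <- 'id,action,time
p01,Start,0
p01,RUN,4
p01,End,7
p02,Start,0
p02,End,2'
multiple <- read.csv(text = multiple_csv, stringsAsFactors = FALSE)
action_seqs_multiple <- split(multiple$action, multiple$id)

# With the package loaded and the data saved to files, the documented calls
# would look like (file names are hypothetical):
# read.seqs("single.csv", style = "single", id_var = "id",
#           action_var = "actions", time_var = "times", step_sep = ",")
# read.seqs("multiple.csv", style = "multiple", id_var = "id",
#           action_var = "action", time_var = "time")
```

Both layouts describe the same two processes; only the arrangement of steps across lines differs.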
Remove actions in actions
and the corresponding timestamps
in response processes seqs
.
remove_action(seqs, actions)
seqs |
an object of class |
actions |
a character vector. Each element is an action to be removed. |
an object of class "proc"
with actions in actions
and the corresponding timestamps removed.
seqs <- seq_gen(10)
new_seqs <- remove_action(seqs, c("RUN", "Start"))
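What removing actions "and the corresponding timestamps" means for a single process can be sketched in base R with hypothetical data: the same logical mask is applied to both vectors, so they stay aligned.

```r
actions <- c("Start", "OPT1_1", "RUN", "End")  # hypothetical action sequence
times   <- c(0, 3, 5, 8)                       # matching timestamps

# Keep only the steps whose action is not in the removal set.
keep <- !(actions %in% c("RUN", "Start"))
new_actions <- actions[keep]
new_times   <- times[keep]
```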
Remove repeated actions
remove_repeat(seqs, ignore = NULL)
seqs |
an object of class |
ignore |
repeated actions in ignore will not be deleted. |
an object of class "proc"
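One way to picture repeat removal on a single process is sketched below in base R. It assumes that a repeat is an action identical to the one immediately before it, that the first occurrence in a run is the one kept, and that timestamps are dropped along with the removed actions; these details are assumptions of the sketch, not statements about the package.

```r
actions <- c("Start", "RUN", "RUN", "OPT1_1", "OPT1_1", "RUN", "End")
times   <- c(0, 2, 3, 5, 6, 8, 9)
ignore  <- "OPT1_1"   # repeats of these actions are left in place

# An action is a repeat if it equals the action immediately before it.
is_repeat <- c(FALSE, actions[-1] == actions[-length(actions)])
keep <- !is_repeat | actions %in% ignore

new_actions <- actions[keep]
new_times   <- times[keep]
```

The second `"RUN"` is dropped, the repeated `"OPT1_1"` survives because it is listed in `ignore`, and the later `"RUN"` stays because it does not directly follow another `"RUN"`.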
Replace old_action
with new_action
in seqs
. Timestamp
sequences are not affected.
replace_action(seqs, old_action, new_action)
seqs |
an object of class |
old_action |
a string giving the action to be replaced. |
new_action |
a string giving the action replacing |
an object of class "proc"
seqs <- seq_gen(10)
new_seqs <- replace_action(seqs, "Start", "Begin")
segment_function
segments the entropy sequence entropy_seq
by identifying deep U-shaped curves in it.
segment_function(entropy_seq, lambda)
entropy_seq |
a vector of entropies |
lambda |
a number between 0 and 1. |
a vector of segment boundaries
segment2subtask
clusters action sequence segments according to their action
frequency profiles. Each cluster forms a subtask.
segment2subtask(action_seqs, seg_seqs, n_subtask, actions, verbose = FALSE, ...)
action_seqs |
a list of action sequences |
seg_seqs |
a list of segment locations |
n_subtask |
the desired number of subtasks or a vector of candidate number of subtasks |
actions |
a set of actions |
verbose |
logical. If TRUE, training progress is printed. |
... |
additional arguments passed to |
a list containing
n_subtask |
the number of subtasks |
subtasks |
a vector of subtasks |
subtask_seqs |
a list of subtask sequences |
tot.withinss |
a vector of total within cluster sum of squares |
relative_cluster_profiles |
a |
Wang, Z., Tang, X., Liu, J., and Ying, Z. (2020) Subtask analysis of process data through a predictive model. https://arxiv.org/abs/2009.00717
action2entropy
and entropy2segment
for
steps 1 and 2 of the subtask analysis procedure; subtask_analysis
for the complete procedure.
seq_gen
generates action sequences of an imaginary simulation-based item.
seq_gen(n, action_set1 = c("OPT1_1", "OPT1_2", "OPT1_3"), action_set2 = c("OPT2_1", "OPT2_2"), answer_set = c("CHECK_A", "CHECK_B", "CHECK_C", "CHECK_D"), p1 = rep(1, length(action_set1)), p2 = rep(1, length(action_set2)), p_answer = rep(1, length(answer_set)), p_continue = 0.5, p_choose = 0.5, include_time = FALSE, time_intv_dist = list("exp", 1))
n |
An integer. The number of action sequences to be generated. |
action_set1, action_set2 |
Character vectors giving the choices for the first and the second conditions. |
answer_set |
A character vector giving the choices for the answer. |
p1, p2 |
Nonnegative numeric vectors. They are the weights for sampling
from |
p_answer |
A nonnegative numeric vector giving the weights for sampling
from |
p_continue |
Probability of running an/another experiment. |
p_choose |
Probability of choosing an answer. |
include_time |
logical. Indicate if timestamp sequences should be generated. Default is FALSE. |
time_intv_dist |
A list specifying the distribution of the inter-arrival time. |
The format of the generated sequences resembles that of the response processes of simulation-based items. In these items, participants are asked to answer a question by running simulated experiments in which two conditions can be controlled. A simulated experiment can be run by setting the two conditions at one of the given choices and clicking the "Run" button.
The possible actions are "Start", "End", "Run", and the elements in action_set1
,
action_set2
, and answer_set
. The generated sequences begin with "Start"
and continue with groups of three actions. Each group of three actions, representing
one experiment, consists of an action chosen from action_set1
according to
p1
, an action chosen from action_set2
according to p2
, and "Run".
The probability of performing an experiment after "Start" or one experiment is
p_continue
. After the experiment process, with probability p_choose
, an
answer will be chosen. The chosen answer is randomly sampled from answer_set
according to p_answer
. All generated sequences end with "End".
An object of class "proc"
with time_seqs = NULL
.
Other sequence generators: seq_gen2
,
seq_gen3
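The generating process described for seq_gen can be sketched for a single sequence in base R. The helper `gen_one` is hypothetical and follows the description only; it is not the package's code. The run action is labelled `"RUN"` here to match the package examples, while the description writes it as "Run".

```r
# Hypothetical helper: generate one action sequence following the description.
gen_one <- function(action_set1 = c("OPT1_1", "OPT1_2", "OPT1_3"),
                    action_set2 = c("OPT2_1", "OPT2_2"),
                    answer_set = c("CHECK_A", "CHECK_B", "CHECK_C", "CHECK_D"),
                    p1 = rep(1, length(action_set1)),
                    p2 = rep(1, length(action_set2)),
                    p_answer = rep(1, length(answer_set)),
                    p_continue = 0.5, p_choose = 0.5) {
  s <- "Start"
  # After "Start" or after an experiment, run another with prob. p_continue.
  while (runif(1) < p_continue) {
    s <- c(s,
           sample(action_set1, 1, prob = p1),  # first condition
           sample(action_set2, 1, prob = p2),  # second condition
           "RUN")
  }
  # With probability p_choose, an answer is chosen before ending.
  if (runif(1) < p_choose) {
    s <- c(s, sample(answer_set, 1, prob = p_answer))
  }
  c(s, "End")
}

set.seed(1)
sim <- replicate(20, gen_one(), simplify = FALSE)
```

The weights `p1`, `p2`, and `p_answer` do not need to sum to one, because `sample()` normalizes them.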
seq_gen2
generates action sequences according to a given probability
transition matrix.
seq_gen2(n, Pmat = NULL, events = letters, start_index = 1, end_index = length(events), max_len = 200, include_time = FALSE, time_intv_dist = list("exp", 1))
n |
An integer. The number of action sequences to be generated. |
Pmat |
An |
events |
A character vector specifying the set of |
start_index |
Index of the action indicating the start of an item in
|
end_index |
Index of the action indicating the end of an item in
|
max_len |
Maximum length of generated sequences. |
include_time |
logical. Indicate if timestamp sequences should be generated. Default is FALSE. |
time_intv_dist |
A list specifying the distribution of the inter-arrival time. |
This function generates n action sequences according to Pmat. The set of possible actions is events. All generated sequences start with events[start_index] and end with events[end_index]. If Pmat is not supplied, actions are uniformly drawn from events[-start_index] until events[end_index] appears.
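The idea of drawing actions from a transition matrix can be sketched in base R as follows. The helper gen_markov_seq is hypothetical and is not the package's code; it assumes that row i of Pmat holds the probabilities of moving from events[i] to each event, and that a sequence reaching max_len - 1 actions is closed with the end event.

```r
# Illustrative sketch only; gen_markov_seq is a hypothetical helper.
gen_markov_seq <- function(Pmat, events, start_index = 1,
                           end_index = length(events), max_len = 200) {
  idx <- start_index
  cur <- start_index
  while (cur != end_index && length(idx) < max_len - 1) {
    # Draw the next event from the row of the current event.
    cur <- sample.int(length(events), 1, prob = Pmat[cur, ])
    idx <- c(idx, cur)
  }
  # Assumption: force termination if the end event was not reached.
  if (cur != end_index) idx <- c(idx, end_index)
  events[idx]
}

events <- c("Start", "A", "B", "End")
# Uniform transitions over every event except the start event,
# mirroring the behaviour described for a missing Pmat.
Pmat <- matrix(c(0, 1, 1, 1), nrow = 4, ncol = 4, byrow = TRUE) / 3
set.seed(2)
s2 <- gen_markov_seq(Pmat, events)
```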
An object of class "proc" with time_seqs = NULL.
Other sequence generators: seq_gen3, seq_gen
seq_gen3 generates action sequences according to a recurrent neural network.
seq_gen3(n, events = letters, rnn_type = "lstm", K = 10, weights = NULL, max_len = 100, initial_state = NULL, start_index = 1, end_index = length(events), include_time = FALSE, time_intv_dist = list("exp", 1))
n |
An integer. The number of action sequences to be generated. |
events |
A character vector specifying the set of |
rnn_type |
the type of recurrent unit to be used for generating sequences.
|
K |
the latent dimension of the recurrent unit. |
weights |
a list containing the weights in the embedding layer, the recurrent unit, the fully connected layer. If not (properly) specified, randomly generated weights are used. |
max_len |
Maximum length of generated sequences. |
initial_state |
a list containing the initial state of the recurrent neural
network. If |
start_index |
Index of the action indicating the start of an item in
|
end_index |
Index of the action indicating the end of an item in
|
include_time |
logical. Indicate if timestamp sequences should be generated. Default is FALSE. |
time_intv_dist |
A list specifying the distribution of the inter-arrival time. |
A list containing the following elements
seqs |
an object of class |
weights |
a list containing the weights used for generating sequences. |
Other sequence generators: seq_gen2, seq_gen
seq2feature_mds extracts K features from response processes by multidimensional scaling.
seq2feature_mds(seqs = NULL, K = 2, method = "auto", dist_type = "oss_action", pca = TRUE, subset_size = 100, subset_method = "random", n_cand = 10, return_dist = FALSE, L_set = 1:3)
seqs |
a |
K |
the number of features to be extracted. |
method |
a character string specifying the algorithm used for performing MDS. See 'Details'. |
dist_type |
a character string specifying the dissimilarity measure for two response processes. See 'Details'. |
pca |
logical. If |
subset_size , n_cand
|
two parameters used in the large data algorithm. See 'Details'
and |
subset_method |
a character string specifying the method for choosing the subset
in the large data algorithm. See 'Details' and |
return_dist |
logical. If |
L_set |
length of ngrams considered |
Since classical MDS has a computational complexity of order n^3, where n is the number of response processes, it is computationally expensive to perform classical MDS when a large number of response processes is considered. In addition, storing an n × n dissimilarity matrix when n is large requires a large amount of memory. In seq2feature_mds, the algorithm proposed in Paradis (2018) is implemented to obtain MDS for large datasets. method specifies the algorithm to be used for obtaining MDS features. If method = "small", classical MDS is used by calling cmdscale. If method = "large", the algorithm for large datasets will be used. If method = "auto" (default), seq2feature_mds selects the algorithm automatically based on the sample size.
dist_type specifies the dissimilarity to be used for measuring the discrepancy between two response processes. If dist_type = "oss_action", the order-based sequence similarity (oss) proposed in Gomez-Alonso and Valls (2008) is used for action sequences. If dist_type = "oss_both", both action sequences and timestamp sequences are used to compute a time-weighted oss.
The number of features to be extracted, K, can be selected by cross-validation using chooseK_mds.
seq2feature_mds returns a list containing
theta |
a numeric matrix giving the |
dist_mat |
the dissimilarity matrix. This element exists only if
|
Gomez-Alonso, C. and Valls, A. (2008). A similarity measure for sequences of categorical data based on the ordering of common elements. In V. Torra & Y. Narukawa (Eds.) Modeling Decisions for Artificial Intelligence, (pp. 134-145). Springer Berlin Heidelberg.
Paradis, E. (2018). Multidimensional scaling with very large datasets. Journal of Computational and Graphical Statistics, 27(4), 935-939.
Tang, X., Wang, Z., He, Q., Liu, J., and Ying, Z. (2020) Latent Feature Extraction for Process Data via Multidimensional Scaling. Psychometrika, 85, 378-397.
chooseK_mds for choosing K.
Other feature extraction methods: aseq2feature_seq2seq, atseq2feature_seq2seq, seq2feature_mds_large, seq2feature_ngram, seq2feature_seq2seq, tseq2feature_seq2seq
n <- 50
set.seed(12345)
seqs <- seq_gen(n)
theta <- seq2feature_mds(seqs, 5)$theta
seq2feature_mds_large extracts MDS features from a large number of response processes. The algorithm proposed in Paradis (2018) is implemented with minor variations to perform MDS. The algorithm first selects a relatively small subset of response processes to perform the classical MDS. Then the coordinates of each of the other response processes are obtained by minimizing, through BFGS, the loss function relating the target response process to those in the subset.
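The two-stage idea can be sketched with base R functions only. This is not the package's implementation: the toy data stand in for response processes, Euclidean distances stand in for the sequence dissimilarity, the helper place_one is hypothetical, and the loss (squared difference between target dissimilarities and distances to the subset coordinates) is an assumption about its general form.

```r
set.seed(3)
n <- 60
K <- 2
X <- matrix(rnorm(n * K), n, K)   # toy data standing in for response processes
D <- as.matrix(dist(X))           # stand-in dissimilarity matrix

# Stage 1: classical MDS on a random subset.
sub <- sample(n, 20)
theta <- matrix(NA_real_, n, K)
theta[sub, ] <- cmdscale(D[sub, sub], k = K)

# Stage 2: place each remaining point by minimizing an assumed squared loss
# between its dissimilarities to the subset and the embedded distances.
place_one <- function(d_target, anchors) {
  loss <- function(x) {
    sum((sqrt(rowSums(sweep(anchors, 2, x)^2)) - d_target)^2)
  }
  optim(colMeans(anchors), loss, method = "BFGS")$par
}
for (i in setdiff(seq_len(n), sub)) {
  theta[i, ] <- place_one(D[i, sub], theta[sub, , drop = FALSE])
}
```

Only the 20 × 20 block of D is decomposed, and each remaining point needs just its 20 dissimilarities to the subset, which is what keeps the cost and memory manageable for large n.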
seq2feature_mds_large(seqs, K, dist_type = "oss_action", subset_size, subset_method = "random", n_cand = 10, pca = TRUE, L_set = 1:3)
seqs |
an object of class |
K |
the number of features to be extracted. |
dist_type |
a character string specifying the dissimilarity measure for two response processes. See 'Details'. |
subset_size |
the size of the subset on which classical MDS is performed. |
subset_method |
a character string specifying the method for choosing the subset.
It must be one of |
n_cand |
The size of the candidate set when selecting the subset. It is only used when
|
pca |
logical. If |
L_set |
length of ngrams considered |
seq2feature_mds_large returns a matrix of extracted features.
Paradis, E. (2018). Multidimensional Scaling with Very Large Datasets. Journal of Computational and Graphical Statistics, 27, 935–939.
Other feature extraction methods: aseq2feature_seq2seq, atseq2feature_seq2seq, seq2feature_mds, seq2feature_ngram, seq2feature_seq2seq, tseq2feature_seq2seq
Feature extraction by stochastic MDS
seq2feature_mds_stochastic(seqs = NULL, K = 2, dist_type = "oss_action", max_epoch = 100, step_size = 0.01, pca = TRUE, tot = 1e-06, return_dist = FALSE, L_set = 1:3)
seqs |
a |
K |
the number of features to be extracted. |
dist_type |
a character string specifying the dissimilarity measure for two response processes. See 'Details'. |
max_epoch |
the maximum number of epochs for stochastic gradient descent. |
step_size |
the step size of stochastic gradient descent. |
pca |
a logical scalar. If |
tot |
the accuracy tolerance for determining convergence. |
return_dist |
logical. If |
L_set |
length of ngrams considered. |
seq2feature_mds_stochastic returns a list containing
theta |
a numeric matrix giving the |
loss |
the value of the multidimensional scaling objective function. |
dist_mat |
the dissimilarity matrix. This element exists only if |
seq2feature_ngram extracts ngram features from response processes.
seq2feature_ngram(seqs, level = 2, type = "binary", sep = "\t")
seqs |
an object of class |
level |
an integer specifying the max length of ngrams |
type |
a character string ( |
sep |
action separator within ngram. |
Three types of ngram features can be extracted. type = "binary" gives binary ngram features indicating whether an ngram appears in a response process. type = "freq" gives ngram frequency features. Each feature is the count of the corresponding ngram in a response process. type = "weighted" gives the weighted ngram features proposed in He and von Davier (2015).
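The binary and frequency feature types can be illustrated with a short base R sketch. The helper ngrams and the two toy sequences are hypothetical and this is not the package's code; the weighted type is not covered because its formula is not given here.

```r
# Hypothetical helper: all ngrams of length 1..level in one action sequence.
ngrams <- function(s, level = 2, sep = "\t") {
  unlist(lapply(seq_len(level), function(k) {
    if (length(s) < k) return(character(0))
    vapply(seq_len(length(s) - k + 1),
           function(i) paste(s[i:(i + k - 1)], collapse = sep),
           character(1))
  }))
}

toy_seqs <- list(c("Start", "A", "A", "End"), c("Start", "B", "End"))
grams <- lapply(toy_seqs, ngrams)
vocab <- sort(unique(unlist(grams)))

# Frequency features: one row per sequence, one column per ngram.
freq <- t(vapply(grams,
                 function(g) as.numeric(table(factor(g, levels = vocab))),
                 numeric(length(vocab))))
colnames(freq) <- vocab

# Binary features: 1 if the ngram appears at least once, 0 otherwise.
binary <- (freq > 0) * 1
```

In this toy example the unigram "A" has frequency 2 in the first sequence while its binary feature is 1.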
a matrix of ngram features
He Q., von Davier M. (2015). Identifying Feature Sequences from Process Data in Problem-Solving Items with N-Grams. In: van der Ark L., Bolt D., Wang WC., Douglas J., Chow SM. (eds) Quantitative Psychology Research. Springer Proceedings in Mathematics & Statistics, vol 140. Springer, Cham.
Other feature extraction methods: aseq2feature_seq2seq, atseq2feature_seq2seq, seq2feature_mds_large, seq2feature_mds, seq2feature_seq2seq, tseq2feature_seq2seq
seqs <- seq_gen(100)
theta <- seq2feature_ngram(seqs)
seq2feature_seq2seq extracts features from response processes by an autoencoder.
seq2feature_seq2seq(seqs, ae_type = "action", K, rnn_type = "lstm", n_epoch = 50, method = "last", step_size = 1e-04, optimizer_name = "adam", cumulative = FALSE, log = TRUE, weights = c(1, 0.5), samples_train, samples_valid, samples_test = NULL, pca = TRUE, verbose = TRUE, return_theta = TRUE)
seqs |
an object of class |
ae_type |
a string specifying the type of autoencoder. The autoencoder can be an action sequence autoencoder ("action"), a time sequence autoencoder ("time"), or an action-time sequence autoencoder ("both"). |
K |
the number of features to be extracted. |
rnn_type |
the type of recurrent unit to be used for modeling
response processes. |
n_epoch |
the number of training epochs for the autoencoder. |
method |
the method for computing features from the output of a
recurrent neural network in the encoder. Available options are
|
step_size |
the learning rate of optimizer. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
cumulative |
logical. If TRUE, the sequence of cumulative time up to each event is used as input to the neural network. If FALSE, the sequence of inter-arrival time (gap time between an event and the previous event) will be used as input to the neural network. Default is FALSE. |
log |
logical. If TRUE, for the timestamp sequences, input of the neural net is the base-10 log of the original sequence of times plus 1 (i.e., log10(t+1)). If FALSE, the original sequence of times is used. |
weights |
a vector of 2 elements for the weight of the loss of action sequences (categorical_crossentropy) and time sequences (mean squared error), respectively. The total loss is calculated as the weighted sum of the two losses. |
samples_train , samples_valid , samples_test
|
vectors of indices specifying the training, validation and test sets for training autoencoder. |
pca |
logical. If TRUE, the principal components of features are returned. Default is TRUE. |
verbose |
logical. If TRUE, training progress is printed. |
return_theta |
logical. If TRUE, extracted features are returned. |
This function wraps aseq2feature_seq2seq, tseq2feature_seq2seq, and atseq2feature_seq2seq.
seq2feature_seq2seq returns a list containing
theta |
a matrix containing |
train_loss |
a vector of length |
valid_loss |
a vector of length |
test_loss |
a vector of length |
Tang, X., Wang, Z., Liu, J., and Ying, Z. (2020) An exploratory analysis of the latent structure of process data via action sequence autoencoders. British Journal of Mathematical and Statistical Psychology. 74(1), 1-33.
chooseK_seq2seq for choosing K through cross-validation.
Other feature extraction methods: aseq2feature_seq2seq, atseq2feature_seq2seq, seq2feature_mds_large, seq2feature_mds, seq2feature_ngram, tseq2feature_seq2seq
if (!system("python -c 'import tensorflow as tf'", ignore.stdout = TRUE, ignore.stderr= TRUE)) {
  n <- 50
  data(cc_data)
  samples <- sample(1:length(cc_data$seqs$time_seqs), n)
  seqs <- sub_seqs(cc_data$seqs, samples)
  # action sequence autoencoder
  K_res <- chooseK_seq2seq(seqs=seqs, ae_type="action", K_cand=c(5, 10),
                           n_epoch=5, n_fold=2, valid_prop=0.2)
  seq2seq_res <- seq2feature_seq2seq(seqs=seqs, ae_type="action", K=K_res$K,
                                     n_epoch=5, samples_train=1:40, samples_valid=41:50)
  theta <- seq2seq_res$theta
  # time sequence autoencoder
  K_res <- chooseK_seq2seq(seqs=seqs, ae_type="time", K_cand=c(5, 10),
                           n_epoch=5, n_fold=2, valid_prop=0.2)
  seq2seq_res <- seq2feature_seq2seq(seqs=seqs, ae_type="time", K=K_res$K,
                                     n_epoch=5, samples_train=1:40, samples_valid=41:50)
  theta <- seq2seq_res$theta
  # action and time sequence autoencoder
  K_res <- chooseK_seq2seq(seqs=seqs, ae_type="both", K_cand=c(5, 10),
                           n_epoch=5, n_fold=2, valid_prop=0.2)
  seq2seq_res <- seq2feature_seq2seq(seqs=seqs, ae_type="both", K=K_res$K,
                                     n_epoch=5, samples_train=1:40, samples_valid=41:50)
  theta <- seq2seq_res$theta
  plot(seq2seq_res$train_loss, col="blue", type="l")
  lines(seq2seq_res$valid_loss, col="red")
}
seqm is used to fit a neural network model relating a response process with a variable.
seqm(seqs, response, covariates = NULL, response_type, actions = unique(unlist(seqs$action_seqs)), rnn_type = "lstm", include_time = FALSE, time_interval = TRUE, log_time = TRUE, K_emb = 20, K_rnn = 20, n_hidden = 0, K_hidden = NULL, index_valid = 0.2, verbose = FALSE, max_len = NULL, n_epoch = 20, batch_size = 16, optimizer_name = "rmsprop", step_size = 0.001)
seqs |
an object of class |
response |
response variable. |
covariates |
covariate matrix. |
response_type |
"binary" or "scale". |
actions |
a character vector giving all possible actions. It will be
expanded to include all actions that appear in |
rnn_type |
the type of recurrent unit to be used for modeling
response processes. |
include_time |
logical. Whether the timestamp sequence should be included in the model. |
time_interval |
logical. Whether the timestamp sequence is included as a sequence of inter-arrival times. |
log_time |
logical. Whether to take the logarithm of the time sequence. |
K_emb |
the latent dimension of the embedding layer. |
K_rnn |
the latent dimension of the recurrent neural network. |
n_hidden |
the number of hidden fully-connected layers. |
K_hidden |
a vector of length |
index_valid |
proportion of sequences used as the validation set or a vector of indices specifying the validation set. |
verbose |
logical. If TRUE, training progress is printed. |
max_len |
the maximum length of response processes. |
n_epoch |
the number of training epochs. |
batch_size |
the batch size used in training. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
step_size |
the learning rate of optimizer. |
The model consists of an embedding layer, a recurrent layer and one or more fully connected layers. The embedding layer takes an action sequence and outputs a sequence of K dimensional numeric vectors to the recurrent layer. If include_time = TRUE, the embedding sequence is combined with the timestamp sequence in the response process as the input to the recurrent layer. The last output of the recurrent layer and the covariates specified in covariates are used as the input of the subsequent fully connected layer. If response_type="binary", the last layer uses the sigmoid activation to produce the probability of the response being one. If response_type="scale", the last layer uses the linear activation. The dimension of the output of other fully connected layers (if any) is specified by K_hidden.
The action sequences are re-coded into integer sequences and are padded with zeros to length max_len before feeding into the model. If the provided max_len is smaller than the length of the longest sequence in seqs, it will be overridden.
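The re-coding and padding step can be illustrated in base R. This sketch is not the package's code: the helper pad_one and the toy sequences are hypothetical, and it assumes that zeros are appended at the end of each sequence; the side on which padding is actually applied is not stated here.

```r
action_seqs <- list(c("Start", "A", "End"),
                    c("Start", "A", "B", "A", "End"))
actions <- unique(unlist(action_seqs))
max_len <- max(lengths(action_seqs))

# Map each action to a positive integer code; 0 is reserved for padding.
# Assumption: padding zeros are appended after the observed actions.
pad_one <- function(s) {
  codes <- match(s, actions)
  c(codes, rep(0L, max_len - length(codes)))
}
X <- t(vapply(action_seqs, pad_one, integer(max_len)))
```

Each row of X is one response process, so all processes share the same length and can be passed to the network as a single integer matrix.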
seqm returns an object of class "seqm", which is a list containing
structure |
a string describing the neural network structure. |
coefficients |
a list of fitted coefficients. The length of the list is
6 + 2 * |
model_fit |
a vector of class |
feature_model |
a vector of class |
include_time |
if the timestamp sequence is included in the model. |
time_interval |
if inter-arrival time is used. |
log_time |
if the logarithm time is used. |
actions |
all possible actions. |
max_len |
the maximum length of action sequences. |
history |
a |
predict.seqm for the predict method for seqm objects.
if (!system("python -c 'import tensorflow as tf'", ignore.stdout = TRUE, ignore.stderr= TRUE)) {
  n <- 100
  data(cc_data)
  samples <- sample(1:length(cc_data$responses), n)
  seqs <- sub_seqs(cc_data$seqs, samples)
  y <- cc_data$responses[samples]
  x <- matrix(rnorm(n*2), ncol=2)
  index_test <- 91:100
  index_train <- 1:90
  seqs_train <- sub_seqs(seqs, index_train)
  seqs_test <- sub_seqs(seqs, index_test)
  actions <- unique(unlist(seqs$action_seqs))
  ## no covariate is used
  res1 <- seqm(seqs = seqs_train, response = y[index_train],
               response_type = "binary", actions=actions,
               K_emb = 5, K_rnn = 5, n_epoch = 5)
  pred_res1 <- predict(res1, new_seqs = seqs_test)
  mean(as.numeric(pred_res1 > 0.5) == y[index_test])
  ## add more fully connected layers after the recurrent layer.
  res2 <- seqm(seqs = seqs_train, response = y[index_train],
               response_type = "binary", actions=actions,
               K_emb = 5, K_rnn = 5, n_hidden=2, K_hidden=c(10,5), n_epoch = 5)
  pred_res2 <- predict(res2, new_seqs = seqs_test)
  mean(as.numeric(pred_res2 > 0.5) == y[index_test])
  ## add covariates
  res3 <- seqm(seqs = seqs_train, response = y[index_train],
               covariates = x[index_train, ], response_type = "binary",
               actions=actions, K_emb = 5, K_rnn = 5, n_epoch = 5)
  pred_res3 <- predict(res3, new_seqs = seqs_test, new_covariates=x[index_test, ])
  ## include time sequences
  res4 <- seqm(seqs = seqs_train, response = y[index_train],
               response_type = "binary", actions=actions,
               include_time=TRUE, K_emb=5, K_rnn=5, n_epoch=5)
  pred_res4 <- predict(res4, new_seqs = seqs_test)
}
Subset response processes
sub_seqs(seqs, ids)
seqs |
an object of class |
ids |
a vector of indices |
an object of class "proc"
data(cc_data)
seqs <- sub_seqs(cc_data$seqs, 1:10)
subtask_analysis performs the subtask identification procedure.
subtask_analysis(action_seqs, lambda = 0.3, n_subtask, rnn_dim = 20, n_epoch = 20, step_size = 0.001, batch_size = 1, optimizer_name = "rmsprop", index_valid = 0.2, verbose = FALSE, ...)
action_seqs |
a list of action sequences |
lambda |
a number between 0 and 1 |
n_subtask |
the desired number of subtasks or a vector of candidate number of subtasks |
rnn_dim |
latent dimension of RNN |
n_epoch |
the number of training epochs. |
step_size |
the learning rate of optimizer. |
batch_size |
the batch size used in training. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
index_valid |
proportion of sequences used as the validation set or a vector of indices specifying the validation set. |
verbose |
logical. If TRUE, training progress is printed. |
... |
additional arguments passed to |
an object of class "subtask". It is a list containing
action_seqs |
a list of action sequences |
entropy_seqs |
a list of entropy sequences |
seg_seqs |
a list of segment boundaries |
subtask_seqs |
a list of subtask sequences |
subtasks |
a vector of subtasks |
n_subtask |
the number of subtasks |
tot.withinss |
a vector of total within cluster sum of squares |
relative_cluster_profiles |
a |
loss_history |
a |
rnn_dim |
the latent dimension of the recurrent neural network |
model_fit |
a vector of class |
actions |
a vector of the actions in |
max_len |
maximum length of the action sequences. |
Wang, Z., Tang, X., Liu, J., and Ying, Z. (2020) Subtask analysis of process data through a predictive model. https://arxiv.org/abs/2009.00717
action2entropy, entropy2segment, and segment2subtask for the three steps of subtask analysis.
Summary method for class "proc"
The summary of a "proc" object combines the summary of the action sequences and the summary of the timestamp sequences.
## S3 method for class 'proc'
summary(object, ...)
object |
an object of class |
... |
not used. |
a list. Its components are the components returned by action_seqs_summary and time_seqs_summary.
action_seqs_summary and time_seqs_summary
Summarize timestamp sequences
time_seqs_summary(time_seqs)
time_seqs |
a list of timestamp sequences |
a list containing the following objects
total_time |
total response time of |
mean_react_time |
mean reaction time of |
tseq2feature_seq2seq extracts features from timestamps of action sequences by a sequence autoencoder.
tseq2feature_seq2seq(tseqs, K, cumulative = FALSE, log = TRUE, rnn_type = "lstm", n_epoch = 50, method = "last", step_size = 1e-04, optimizer_name = "rmsprop", samples_train, samples_valid, samples_test = NULL, pca = TRUE, verbose = TRUE, return_theta = TRUE)
tseqs |
a list of |
K |
the number of features to be extracted. |
cumulative |
logical. If TRUE, the sequence of cumulative time up to each event is used as input to the neural network. If FALSE, the sequence of inter-arrival time (gap time between an event and the previous event) will be used as input to the neural network. Default is FALSE. |
log |
logical. If TRUE, for the timestamp sequences, input of the neural net is the base-10 log of the original sequence of times plus 1 (i.e., log10(t+1)). If FALSE, the original sequence of times is used. |
rnn_type |
the type of recurrent unit to be used for modeling
response processes. |
n_epoch |
the number of training epochs for the autoencoder. |
method |
the method for computing features from the output of a
recurrent neural network in the encoder. Available options are
|
step_size |
the learning rate of optimizer. |
optimizer_name |
a character string specifying the optimizer to be used
for training. Available options are |
samples_train |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
samples_valid |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
samples_test |
vectors of indices specifying the training, validation and test sets for training autoencoder. |
pca |
logical. If TRUE, the principal components of features are returned. Default is TRUE. |
verbose |
logical. If TRUE, training progress is printed. |
return_theta |
logical. If TRUE, extracted features are returned. |
This function trains a sequence-to-sequence autoencoder using keras. The encoder of the autoencoder consists of a recurrent neural network. The decoder consists of another recurrent neural network and a fully connected layer with ReLU activation. The outputs of the encoder are the extracted features.
The output of the encoder is a function of the encoder recurrent neural network. It is the last latent state of the encoder recurrent neural network if method="last" and the average of the encoder recurrent neural network latent states if method="avg".
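The difference between the two options can be shown on a small matrix of latent states. The matrix H below is made up for illustration (one row per time step, one column per latent dimension) and is not produced by the package.

```r
# H: hypothetical latent states of the encoder, 3 time steps by K = 2 dimensions.
H <- matrix(c(1, 2,
              3, 4,
              5, 6), nrow = 3, byrow = TRUE)

theta_last <- H[nrow(H), ]  # method = "last": state at the final time step
theta_avg  <- colMeans(H)   # method = "avg": average over all time steps
```

Here theta_last is (5, 6) and theta_avg is (3, 4); both have length K, so either choice yields K features per response process.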
tseq2feature_seq2seq returns a list containing
theta |
a matrix containing |
train_loss |
a vector of length |
valid_loss |
a vector of length |
test_loss |
a vector of length |
chooseK_seq2seq for choosing K through cross-validation.
Other feature extraction methods: aseq2feature_seq2seq, atseq2feature_seq2seq, seq2feature_mds_large, seq2feature_mds, seq2feature_ngram, seq2feature_seq2seq
if (!system("python -c 'import tensorflow as tf'", ignore.stdout = TRUE, ignore.stderr= TRUE)) {
  n <- 50
  data(cc_data)
  samples <- sample(1:length(cc_data$seqs$time_seqs), n)
  tseqs <- cc_data$seqs$time_seqs[samples]
  time_seq2seq_res <- tseq2feature_seq2seq(tseqs, 5, rnn_type="lstm", n_epoch=5,
                                           samples_train=1:40, samples_valid=41:50)
  features <- time_seq2seq_res$theta
  plot(time_seq2seq_res$train_loss, col="blue", type="l",
       ylim = range(c(time_seq2seq_res$train_loss, time_seq2seq_res$valid_loss)))
  lines(time_seq2seq_res$valid_loss, col="red", type = 'l')
}
Transform a timestamp sequence into an inter-arrival time sequence
tseq2interval(x)
x |
a timestamp sequence |
a numeric vector of the same length as x. The first element in the returned vector is 0. The t-th returned element is x[t] - x[t-1].
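The transformation described above corresponds to a one-line base R expression. The snippet below is an equivalent sketch and not the package source; the timestamp vector is made up.

```r
x <- c(0, 1.5, 4, 4.2)       # a made-up timestamp sequence
intervals <- c(0, diff(x))   # first element 0, then x[t] - x[t-1]

# When the first timestamp is 0, cumulative sums recover the timestamps.
recovered <- cumsum(intervals)
```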
Write process data to csv files
write.seqs(seqs, file, style, id_var = "ID", action_var = "Event", time_var = "Time", step_sep = ",", ...)
seqs |
an object of class |
file |
the name of the csv file to which the response processes are to be written. |
style |
the style in which the response processes are stored. See 'Details'. |
id_var |
a string giving the name of the variable storing the process identifier. |
action_var |
a string giving the name of the variable storing action sequences. |
time_var |
a string giving the name of the variable storing timestamp sequences. |
step_sep |
the step separator characters. It is only used if |
... |
further arguments to be passed to |
No return value.