Title: | Bayesian Context Trees for Discrete Time Series |
---|---|
Description: | An implementation of a collection of tools for exact Bayesian inference with discrete time series. This package contains functions that can be used for prediction, model selection, estimation, segmentation/change-point detection and other statistical tasks. Specifically, the functions provided can be used for the exact computation of the prior predictive likelihood of the data, for the identification of the a posteriori most likely (MAP) variable-memory Markov models, for calculating the exact posterior probabilities and the AIC and BIC scores of these models, for prediction with respect to log-loss and 0-1 loss, and for segmentation/change-point detection. Example data sets from finance, genetics, animal communication and meteorology are also provided. Detailed descriptions of the underlying theory and algorithms can be found in [Kontoyiannis et al. 'Bayesian Context Trees: Modelling and exact inference for discrete time series.' Journal of the Royal Statistical Society: Series B (Statistical Methodology), April 2022. Available at: <arXiv:2007.14900> [stat.ME], July 2020] and [Lungu et al. 'Change-point Detection and Segmentation of Discrete Data using Bayesian Context Trees' <arXiv:2203.04341> [stat.ME], March 2022]. |
Authors: | Ioannis Papageorgiou, Valentinian Mihai Lungu, Ioannis Kontoyiannis |
Maintainer: | Valentinian Mihai Lungu <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2 |
Built: | 2025-02-05 04:25:26 UTC |
Source: | https://github.com/cran/BCT |
Finds the maximum a posteriori probability (MAP) tree model.
BCT(input_data, depth, beta = NULL)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
beta |
hyper-parameter of the model prior. Takes values between 0 and 1. If not initialised in the call function, the default value is 1 - 2^(-m+1), where m is the size of the alphabet; for more information see Kontoyiannis et al. (2020). |
returns a list object which includes:
Contexts |
MAP model given as a list object containing the contexts of its leaves. |
Results |
a dataframe with the following columns: prior probability, log(prior probability), posterior probability, log(posterior probability), number of leaves, maximum depth, BIC score, AIC score and maximum log-likelihood. |
# Finding the MAP model with maximum depth <= 10
# for the SP500 dataset (with default value beta):
BCT(SP500, 10)

# For custom beta (e.g. 0.7):
BCT(SP500, 10, 0.7)

# The type of the input dataset is "character".
# If the dataset is contained within a vector:
q <- c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0)

# Convert a vector to a "character" object:
s <- paste(q, collapse = "")

BCT(s, 2)

# Reading a file using the readChar function:

# Example 1: The dataset is stored in a .txt file
# fileName <- '~/example_data.txt' # fileName stores the path to the dataset
# s <- readChar(fileName, file.info(fileName)$size)

# Make sure that s does not contain any "\n" at the end of the string.
# To remove the last entry:
# s <- gsub('.$', '', s)

# To remove any unwanted characters (e.g. "\n"):
# s <- gsub('\n', '', s)

# Example 2: The dataset is stored in a .csv file
# fileName <- '~/example_data.csv' # fileName stores the path to the dataset
# s <- readChar(fileName, file.info(fileName)$size)

# Depending on the running environment,
# s might contain unwanted characters such as "\n" or "\r\n".
# Remove any unwanted characters (e.g. "\r\n"):
# s <- gsub('\r\n', '', s)

# Always make sure that s does not contain any unwanted characters.
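As a quick worked instance of the default hyper-parameter above: for a binary alphabet (m = 2) the default is beta = 1 - 2^(-m+1) = 1 - 2^(-1) = 0.5, and for the four-letter DNA alphabet (m = 4) it is 1 - 2^(-3) = 0.875.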
This function calculates the exact posterior for a sequence with a single change-point.
calculate_exact_changepoint_posterior(input_data, depth, alphabet)
input_data |
the sequence to be analysed. |
depth |
maximum memory length. |
alphabet |
symbols appearing in the sequence. |
returns the exact posterior of the change-point locations.
# Use the first 300 samples of the simian_40 dataset.
# Run the function with a single change-point, a maximum depth of 2
# and the ["a", "c", "g", "t"] alphabet.
res <- calculate_exact_changepoint_posterior(substr(simian_40, 1, 300), 2, c("acgt"))
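If the returned posterior is a numeric vector of probabilities over the candidate change-point locations (an assumption; check str(res) for the exact structure), it can be visualised directly with base R:

# Sketch: posterior probability of a change at each location
plot(res, type = "h", xlab = "location", ylab = "posterior probability")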
Computes the count vectors of all contexts up to a certain length (D) for a given dataset. The first D characters are used to construct the initial context and the counting is performed on the remaining characters. These counts are needed for intermediate computations in BCT and kBCT, and can also be viewed as maximum likelihood estimates of associated parameters; see Kontoyiannis et al. (2020).
compute_counts(input_data, depth)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section of the BCT/kBCT functions on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
a list containing the counts of all contexts of length ≤ depth. If a context of length smaller than the maximum depth does not appear in the output, its associated count vector is zero. 'Root' indicates the empty context.
# For the pewee dataset:
compute_counts(pewee, 3)
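To make the counting convention concrete, here is a minimal sketch on a tiny binary string; the arithmetic in the comments follows the convention described above, while the exact labels and ordering of the returned list are as produced by the package:

# For "01001" with depth 1, the first character "0" is the initial context
# and counting runs over the remaining "1001":
#   Root        : symbols "1","0","0","1" are counted -> counts (2, 2)
#   context "0" : followed by "1","0","1"             -> counts (1, 2)
#   context "1" : followed by "0"                     -> counts (1, 0)
compute_counts("01001", 1)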
Computes the prior predictive likelihood of the data given a specific alphabet. This function is used for change-point/segmentation problems.
CTW(input_data, depth, desired_alphabet = NULL, beta = NULL)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section of the BCT/kBCT functions on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
desired_alphabet |
set containing the symbols of the process. If not initialised, the default set contains all the unique symbols which appear in the sequence. This parameter is needed for the segmentation problem where short segments might not contain all the symbols in the alphabet. |
beta |
hyper-parameter of the model prior. Takes values between 0 and 1. If not initialised in the call function, the default value is 1 - 2^(-m+1), where m is the size of the alphabet; for more information see Kontoyiannis et al. (2020). |
returns the natural logarithm of the prior predictive likelihood of the data.
# For the gene_s dataset with a maximum depth of 10 (with default value of beta):
CTW(gene_s, 10)

# With the ["0", "1", "2", "3"] alphabet:
CTW(gene_s, 10, "0123")

# For custom beta (e.g. 0.8):
CTW(gene_s, 10, beta = 0.8)
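To see why this quantity is useful in segmentation, one can compare the prior predictive likelihood of a whole sequence against the sum over two candidate segments; a minimal sketch using the three_changes dataset described later in this reference (the split at 2500 is one of its true change-points):

# Evidence for a change-point near position 2500:
whole <- CTW(three_changes, 5, "012")
left  <- CTW(substr(three_changes, 1, 2500), 5, "012")
right <- CTW(substr(three_changes, 2501, nchar(three_changes)), 5, "012")
(left + right) - whole  # a large positive value favours the split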
This function plots the models produced by the BCT and kBCT functions.
draw_models(lst)
lst |
output of the BCT/kBCT function. |
plots of the BCT/kBCT output models.
# Use the pewee dataset as an example:
q <- BCT(pewee, 5) # maximum depth of 5
draw_models(q)

r <- kBCT(pewee, 5, 3) # maximum depth of 5, and k = 3 (top 3 a posteriori most likely models)
draw_models(r)
This dataset consists of 495 annual observations from 1525 to 2020, with 0 representing the absence of an El Nino event and 1 indicating its presence. El Nino is one of the most influential natural climate patterns on Earth. It impacts ocean temperatures, the strength of ocean currents, and the local weather in South America.
el_nino
An object of class "character".
W.H. Quinn, V.T. Neal, and S.E. Antunez De Mayolo. El Nino occurrences over the past four and a half centuries. Journal of Geophysical Research: Oceans, 92(C13):14449–14461, 1987.
This dataset contains the 48502 base-pair-long genome of the bacteriophage lambda virus. The bacteriophage lambda is a parasite of the intestinal bacterium Escherichia coli. Its genome is a benchmark sequence for the comparison of segmentation algorithms.
enterophage
An object of class "character".
Enterobacteria phage lambda, complete genome.
This dataset contains the spike (S) gene, in positions 21,563–25,384 of the SARS-CoV-2 genome. The importance of this gene is that it codes for the surface glycoprotein whose function was identified in Yan et al. (2020) and Lan et al. (2020) as critical, in that it binds onto the Angiotensin Converting Enzyme 2 (ACE2) receptor on human epithelial cells, giving the virus access to the cell and thus facilitating the COVID-19 disease. The gene sequence is mapped to the alphabet 0,1,2,3 via the obvious map A->0, C->1, G->2, T->3.
gene_s
An object of class "character".
Wu, F., S. Zhao, B. Yu, et al. (2020). A new coronavirus associated with human respiratory disease in China. Nature 579(7798), 265–269.
Yan, R., Y. Zhang, Y. Li, L. Xia, Y. Guo, and Q. Zhou (2020). Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science 367(6485), 1444–1448.
Lan, J., J. Ge, J. Yu, et al. (2020). Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature 581, 215–220.
BCT(gene_s, 5)
Generates a simulated sequence of data according to a given model and associated parameters. An initial context of length equal to the maximum depth of the model is first generated uniformly and independently, and it is deleted after the desired number of samples has been generated.
generate_data(ct_theta, N)
ct_theta |
a list containing the contexts that specify a model, and also a parameter vector for each context. |
N |
length of the sequence to be generated. |
a simulated sequence as a "character" object.
# Create a list containing contexts and associated parameters.
d1 <- list("0" = c(0.2, 0.8), "10" = c(0.9, 0.1), "11" = c(0, 1))

# The contexts need to correspond to the leaves of a proper tree.
# The key of each vector is the context. For example, for context "0":
# P(x_{i+1} = 0 | x_i = 0) = 0.2 and P(x_{i+1} = 1 | x_i = 0) = 0.8

# If a dataset containing only letters is desired:
d2 <- list("ab" = c(0.3, 0.7), "b" = c(0.8, 0.2), "aa" = c(0.5, 0.5))

# Generate data from d2:
gd <- generate_data(d2, 10000)

# Use the BCT function to find the MAP model:
BCT(gd, 10) # maximum depth of 10

# or the kBCT function can be used:
kBCT(gd, 10, 5) # maximum depth of 10 and top 5 models
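As a quick sanity check on the generated data (a sketch; it assumes the entries of each parameter vector follow the sorted alphabet, so "b" = c(0.8, 0.2) means P(a | b) = 0.8):

# Empirically check P(next = "a" | previous symbol = "b") for gd above:
sym  <- strsplit(gd, "")[[1]]   # split the "character" object into symbols
prev <- sym[-length(sym)]
nxt  <- sym[-1]
mean(nxt[prev == "b"] == "a")   # should be close to 0.8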
This function implements the Metropolis-Hastings sampling algorithm for inferring the locations of the change-points.
infer_fixed_changepoints(input_data, l, depth, alphabet, iters, fileName = NULL)
input_data |
the sequence to be analysed. |
l |
number of change-points. |
depth |
maximum memory length. |
alphabet |
symbols appearing in the sequence. |
iters |
number of iterations; for more information see Lungu et al. (2022). |
fileName |
file path for storing the results. |
returns a list object which includes:
positions |
the sampled locations of the change-points. |
acceptance_prob |
the empirical acceptance ratio. |
# Use as an example the three_changes dataset.
# Run the function with 3 change-points, a maximum depth of 5 and the [0, 1, 2] alphabet.
# The sampler is run for 100 iterations.
output <- infer_fixed_changepoints(three_changes, 3, 5, c("012"), 100, fileName = NULL)

# If the fileName is not set to NULL,
# the output file will contain on each line the sampled locations of the change-points.
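A simple post-processing sketch, assuming output$positions is a matrix with one row per iteration and one column per change-point (an assumption; check str(output) for the exact structure):

post <- output$positions
burn <- seq_len(floor(0.2 * nrow(post)))  # discard the first 20% as burn-in
colMeans(post[-burn, , drop = FALSE])     # posterior mean of each location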
This function implements the Metropolis-Hastings sampling algorithm for inferring the number of change-points and their locations.
infer_unknown_changepoints(input_data, l_max, depth, alphabet, iters, fileName = NULL)
input_data |
the sequence to be analysed. |
l_max |
maximum number of change-points. |
depth |
maximum memory length. |
alphabet |
symbols appearing in the sequence. |
iters |
number of iterations; for more information see Lungu et al. (2022). |
fileName |
file path for storing the results. |
returns a list object which includes:
number_changes |
sampled number of change-points. |
positions |
sampled locations of the change-points. |
acceptance_prob |
the empirical acceptance ratio. |
# Use as an example the three_changes dataset.
# Run the function with at most 5 change-points (l_max = 5), a maximum depth of 5
# and the [0, 1, 2] alphabet. The sampler is run for 100 iterations.
output <- infer_unknown_changepoints(three_changes, 5, 5, c("012"), 100, fileName = NULL)

# If the fileName is not set to NULL,
# the output file will contain on each line the sampled number of change-points
# and the associated sampled locations of the change-points.
Finds the top k a posteriori most likely tree models.
kBCT(input_data, depth, k, beta = NULL)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
k |
number of the a posteriori most likely tree models to be identified. |
beta |
hyper-parameter of the model prior. Takes values between 0 and 1. If not initialised in the call function, the default value is 1 - 2^(-m+1), where m is the size of the alphabet; for more information see Kontoyiannis et al. (2020). |
a list object which includes:
Contexts |
top k a posteriori most likely models. Each model given as a list object containing the contexts of its leaves. |
Results |
a dataframe with the following columns: prior probability, log(prior probability), posterior probability, log(posterior probability), posterior odds, number of leaves, maximum depth, BIC score, AIC score and maximum log-likelihood. |
# Finding the first 2 a posteriori most likely models with maximum depth <= 5
# for the SP500 dataset (with default value beta):
kBCT(SP500, 5, 2)

# For custom beta (e.g. 0.8):
kBCT(SP500, 5, 2, 0.8)

# The type of the input dataset is "character".
# If the dataset is contained within a vector:
q <- c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0)

# Convert a vector to a "character" object:
s <- paste(q, collapse = "")

kBCT(s, 2, 2)

# Reading a file using the readChar function:

# Example 1: The dataset is stored in a .txt file
# fileName <- '~/example_data.txt' # fileName stores the path to the dataset
# s <- readChar(fileName, file.info(fileName)$size)

# Make sure that s does not contain any "\n" at the end of the string.
# To remove the last entry:
# s <- gsub('.$', '', s)

# To remove any unwanted characters (e.g. "\n"):
# s <- gsub('\n', '', s)

# Example 2: The dataset is stored in a .csv file
# fileName <- '~/example_data.csv' # fileName stores the path to the dataset
# s <- readChar(fileName, file.info(fileName)$size)

# Depending on the running environment,
# s might contain unwanted characters such as "\n" or "\r\n".
# Remove any unwanted characters (e.g. "\r\n"):
# s <- gsub('\r\n', '', s)

# Always make sure that s does not contain any unwanted characters.
Compute the log-loss incurred in BCT prediction with memory length D. Given an initial context (x_{-D+1}, ..., x_0) and training data (x_1, ..., x_n), the log-loss is computed in sequentially predicting the test data (x_{n+1}, ..., x_{n+T}). The function outputs the cumulative, normalized (per-sample) log-loss at each prediction step; for more information see Kontoyiannis et al. (2020).
log_loss(input_data, depth, train_size, beta = NULL)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section of BCT/kBCT functions on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
train_size |
number of samples used in the training set. The training set size should be at least equal to the depth. |
beta |
hyper-parameter of the model prior. Takes values between 0 and 1. If not initialised in the call function, the default value is 1 - 2^(-m+1), where m is the size of the alphabet; for more information see Kontoyiannis et al. (2020). |
returns a vector containing the averaged log-loss incurred in the sequential prediction at each time-step.
# Compute the log-loss in the prediction of the last 10 elements of a dataset.
log_loss(pewee, 5, nchar(pewee) - 10)

# For custom beta (e.g. 0.7):
log_loss(pewee, 5, nchar(pewee) - 10, 0.7)
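Since the returned vector is the per-sample log-loss after each prediction step, it is natural to inspect it as a curve; a minimal sketch with base R plotting:

# Predict the last 100 symbols and plot how the per-sample log-loss evolves:
ll <- log_loss(pewee, 5, nchar(pewee) - 100)
plot(ll, type = "l", xlab = "prediction step", ylab = "per-sample log-loss")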
Returns the parameters of each leaf contained in the MAP model.
MAP_parameters(input_data, depth, beta = NULL)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section of BCT/kBCT functions on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
beta |
hyper-parameter of the model prior. Takes values between 0 and 1. If not initialised in the call function, the default value is 1 - 2^(-m+1), where m is the size of the alphabet; for more information see Kontoyiannis et al. (2020). |
list of parameters for each of the contexts within the MAP model.
# Use the gene_s dataset:
q <- BCT(gene_s, 10)
expected_contexts <- q[['Contexts']]
expected_contexts
# [1] "3"  "1"  "0"  "23" "20" "21" "22"

# For default beta:
v <- MAP_parameters(gene_s, 10)

# For custom beta (e.g. 0.8):
MAP_parameters(gene_s, 10, 0.8)

# Generate a sequence of data using the generate_data function:
s <- generate_data(v, 20000)

# Use BCT:
r <- BCT(s, 10)

# Check the resulting contexts:
r[['Contexts']]
# [1] "3"  "0"  "1"  "20" "22" "23" "21"
# The resulting contexts are as expected.
Computes the logarithm of the likelihood of the observations, maximised over all models and parameters.
ML(input_data, depth)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section of the BCT/kBCT functions on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
returns the natural logarithm of the maximum likelihood.
# Computing the maximum likelihood of the gene_s dataset
# with a maximum depth of 5:
ML(gene_s, 5)
The twilight song of the wood pewee bird can be described as a sequence consisting of an arrangement of musical phrases taken from an alphabet of three specific, distinct phrases. The dataset consists of a single continuous song by a wood pewee, of length n = 1327 phrases.
pewee
An object of class "character".
W. Craig. The song of the wood pewee (Myiochanes virens Linnaeus): A study of bird music. New York State Museum Bulletin No. 334. University of the State of New York, Albany, NY, 1943.
BCT(pewee, 5)
This function plots the empirical posterior distribution of the change-points.
plot_changepoint_posterior(res, burn)
res |
the output obtained from the Metropolis-Hastings algorithms. |
burn |
the proportion of the samples discarded as burn-in. |
returns a plot of the empirical posterior of the number of change-points (if the results from the infer_unknown_changepoints function were used).
returns a plot of the empirical posterior of the change-point locations.
infer_unknown_changepoints, infer_fixed_changepoints
# Use as an example the el_nino dataset.
# Run the function with l_max = 3 change-points, a maximum depth of 5 and the [0, 1] alphabet.
# The sampler is run for 100 iterations.
res_unknown <- infer_unknown_changepoints(el_nino, 3, 5, c("01"), 100, fileName = NULL)

# Plot the posterior distribution of the locations
# and the posterior of the number of change-points:
plot_changepoint_posterior(res_unknown, 0.2)

# This function can also be used with infer_fixed_changepoints.
# Assume l = 2.
res_fixed <- infer_fixed_changepoints(el_nino, 2, 5, c("01"), 100, fileName = NULL)

# Now, the function will only output the posterior distribution of the change-points
# (the number is fixed).
plot_changepoint_posterior(res_fixed, 0.2)
This function plots the conditional posterior distribution of the change-point locations given a specific number of change-points.
plot_individual_changepoint_posterior(res, burn, pm, l = NULL)
res |
the output obtained from the Metropolis-Hastings algorithms (either from infer_fixed_changepoints or infer_unknown_changepoints). |
burn |
the proportion of the samples discarded as burn-in. |
pm |
the desired range around the MAP location for each change-point location. |
l |
condition on the number of change-points. If not initialised, the function expects as input the results obtained from the infer_fixed_changepoints function. |
plots of the empirical posterior distributions of the change-points given a specific number of change-points.
infer_fixed_changepoints, infer_unknown_changepoints
# Use as an example the el_nino dataset.
# Run the function with l_max = 3 change-points, a maximum depth of 5 and the [0, 1] alphabet.
# The sampler is run for 100 iterations.
res_unknown <- infer_unknown_changepoints(el_nino, 3, 5, c("01"), 100, fileName = NULL)

# Because l_max = 3, there can be 0, 1, 2 or 3 changes.
# Let's see the posterior distribution on the number of changes:
plot_changepoint_posterior(res_unknown, 0.2)

# The MAP l is 2. Let's see the distribution of changes given l = 2.
plot_individual_changepoint_posterior(res_unknown, 0.2, 20, 2)

# One can also see the distribution of changes given l = 1.
plot_individual_changepoint_posterior(res_unknown, 0.2, 500, 1)

# This function can also be used with infer_fixed_changepoints.
# Assume l = 2.
res_fixed <- infer_fixed_changepoints(el_nino, 2, 5, c("01"), 100, fileName = NULL)

# The function is now called without l = 2, as the number of changes is fixed
# (all sampled vectors have 2 values).
plot_individual_changepoint_posterior(res_fixed, 0.2, 20)
Computes the posterior predictive distribution at each time step, and predicts the next symbol as its most likely value. Given an initial context (x_{-D+1}, ..., x_0) and training data (x_1, ..., x_n), the posterior predictive distribution is computed sequentially for the test data (x_{n+1}, ..., x_{n+T}). The function outputs the predicted distribution at each time step, along with the most likely symbol; for more information see Kontoyiannis et al. (2020).
prediction(input_data, depth, train_size, beta = NULL)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
train_size |
number of samples used for training. |
beta |
hyper-parameter of the model prior. Takes values between 0 and 1. If not initialised in the call function, the default value is 1 - 2^(-m+1), where m is the size of the alphabet; for more information see Kontoyiannis et al. (2020). |
returns a "list" containing the posterior predictive distribution at each time step. The last entry in the list, named "Prediction", contains the most likely character at each time step according to the posterior predictive distribution.
# Predicting the 2 last characters of a dataset using a model with a maximum depth of 5.
# The training size is the total number of characters within the dataset minus 2:
nchar(pewee) - 2

q <- prediction(pewee, 5, nchar(pewee) - 2)
q
# [[1]]
# [1] 0.56300039 0.05899728 0.37800233
#
# [[2]]
# [1] 0.08150306 0.76293065 0.15556628
#
# $Prediction
# [1] "0" "1"

# To access the "Prediction" from result list q:
q[["Prediction"]]

# For custom beta (e.g. 0.8):
prediction(pewee, 5, nchar(pewee) - 10, 0.8)
The severe acute respiratory syndrome coronavirus 2, SARS-CoV-2, is the novel coronavirus responsible for the COVID-19 global pandemic in 2019-20. This dataset contains the SARS-CoV-2 genome, available in the GenBank database as the sequence MN908947.3. It consists of n = 29903 base pairs. The genome sequence is mapped to the alphabet 0,1,2,3 via the obvious map A->0, C->1, G->2, T->3.
sars_cov_2
An object of class "character".
K. Clark, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and E.W. Sayers. GenBank. Nucleic Acids Research, 44(D1):D67–D72, January 2016.
F. Wu, S. Zhao, B. Yu, et al. A new coronavirus associated with human respiratory disease in China. Nature, 579(7798):265–269, February 2020.
BCT(sars_cov_2, 5)
Plots a tree depicting a model with the given set of contexts.
show_tree(s)
s |
vector containing the contexts of the leaves of the desired tree. |
plot of the desired tree model.
# Construct an example vector:
r <- c("a", "ab", "aab", "b", "ba")
show_tree(r)

# If the input contains digits:
q <- c(11, 1, 0)
show_tree(q)
This dataset contains the 5243 base-pair-long genome of the simian vacuolating virus 40 (SV40). SV40 is a polyomavirus that is found in both monkeys and humans. The expression of SV40 genes is regulated by two major transcripts (early and late), suggesting the presence of a single major change-point in the entire genome.
simian_40
An object of class "character".
Simian virus 40 (SV40), complete genome.
This dataset contains the quantised daily changes x_i in the Standard & Poor's index price, from January 2, 1928 until October 7, 2016. The price changes are quantised to 7 values as follows: if the change between two successive trading days (i-1) and i is smaller than -3%, x_i is set equal to 0; if the change is between -3% and -2%, x_i = 1; for changes in the intervals (-2%, -1%], (-1%, 1%], (1%, 2%], and (2%, 3%], x_i is set equal to 2, 3, 4 and 5, respectively; and for changes greater than 3%, x_i = 6.
SP500
An object of class "character".
Yahoo! Finance.
BCT(SP500, 10)
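The quantisation rule described above is easy to reproduce; a minimal sketch, assuming a numeric vector pct_change of daily percentage changes (hypothetical input, not included in the package; the handling of values exactly on the break points is a choice):

# 7-level quantisation of daily % changes, as described above:
quantise_sp500 <- function(pct_change) {
  breaks <- c(-Inf, -3, -2, -1, 1, 2, 3, Inf)
  # cut() bins each change into one of the seven intervals; as.integer()
  # on the resulting factor gives the interval index 1..7, so subtract 1
  # to obtain the symbols 0..6
  as.integer(cut(pct_change, breaks = breaks)) - 1L
}

x <- quantise_sp500(c(-3.5, -2.2, 0.4, 1.7, 2.9, 4.1))
paste(x, collapse = "")  # "013456", a "character" object ready for BCT/kBCT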
This dataset contains synthetically generated data. The sequence has three change-points, located at 2500, 3500 and 4000. The tree models and associated parameters are chosen to be quite similar, so that the segmentation problem is nontrivial.
three_changes
An object of class "character".
V. Lungu, I. Papageorgiou and I. Kontoyiannis. Change-point Detection and Segmentation of Discrete Data using Bayesian Context Trees. arXiv:2203.04341.
Compute the 0-1 loss, i.e., the proportion of incorrectly predicted values, incurred in BCT prediction with memory length D. Given an initial context (x_{-D+1}, ..., x_0) and training data (x_1, ..., x_n), the 0-1 loss is computed in sequentially predicting the test data (x_{n+1}, ..., x_{n+T}). The function outputs the cumulative, normalized (per-sample) 0-1 loss at each prediction step; for more information see Kontoyiannis et al. (2020).
zero_one_loss(input_data, depth, train_size, beta = NULL)
input_data |
the sequence to be analysed. The sequence needs to be a "character" object. See the examples section of kBCT/BCT functions on how to transform any dataset to a "character" object. |
depth |
maximum memory length. |
train_size |
number of samples used in the training set. The training set size should be at least equal to the depth. |
beta |
hyper-parameter of the model prior. Takes values between 0 and 1. If not initialised in the call function, the default value is 1 - 2^(-m+1), where m is the size of the alphabet; for more information see Kontoyiannis et al. (2020). |
returns a vector containing the averaged number of errors at each time-step.
# Use the pewee dataset and look at the last 8 elements:
substring(pewee, nchar(pewee) - 7, nchar(pewee))
# [1] "10001001"

# Predict the last 8 elements using the prediction function:
pred <- prediction(pewee, 10, nchar(pewee) - 8)[["Prediction"]]

# Taking only the "Prediction" vector:
pred
# [1] "1" "0" "0" "1" "1" "0" "0" "1"

# To transform the result of the prediction function into a "character" object:
paste(pred, collapse = "")
# [1] "10011001"

# As observed, there is only 1 error (the fourth predicted element is 1 instead of a 0).
# Thus, up to the 3rd place the averaged error is 0, the fourth averaged error
# is expected to be 1/4, and it then decays as 1/5, 1/6, ...

# Indeed, the zero_one_loss function yields the expected answer:
zero_one_loss(pewee, 10, nchar(pewee) - 8)
# [1] 0.0000000 0.0000000 0.0000000 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000