Title: | Calculates information-content-informed Kendall-tau |
---|---|
Description: | Provides functions for calculating information-content-informed Kendall-tau. This version of Kendall-tau allows for the inclusion of missing values. |
Authors: | Robert M Flight [aut, cre], Hunter NB Moseley [aut] |
Maintainer: | Robert M Flight <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.2.8 |
Built: | 2024-11-18 02:53:53 UTC |
Source: | https://github.com/MoseleyBioinformaticsLab/ICIKendallTau |
Adds uniform noise to values, generating replicates with noise added to the original.
add_uniform_noise(value, n_rep, sd, use_zero = FALSE)
value | a single numeric value or a vector of numeric values |
n_rep | the number of replicates to make (numeric). Default is 1. |
sd | the standard deviation of the data |
use_zero | logical, should returned values be around zero or not? |
numeric matrix
Given a matrix of data, calculates the median value in each column or row.
calculate_matrix_medians(in_matrix, use = "col", ...)
in_matrix | numeric matrix of values |
use | character of "col" or "row" defining columns or rows |
... | extra parameters to the median function |
numeric
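As a rough illustration of the behaviour described above, the following base R sketch computes per-column or per-row medians with `apply()`. The function name `matrix_medians_sketch` is hypothetical and this is not the package's own implementation; it only mirrors the documented arguments.

```r
# Hypothetical stand-in for calculate_matrix_medians(), written in base R.
matrix_medians_sketch <- function(in_matrix, use = "col", ...) {
  use <- match.arg(use, c("col", "row"))
  # apply() margin: 2 walks over columns, 1 walks over rows
  margin <- if (use == "col") 2 else 1
  # extra arguments (for example na.rm = TRUE) are passed on to median()
  apply(in_matrix, margin, stats::median, ...)
}

m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 3,
            dimnames = list(NULL, c("s1", "s2")))
col_medians <- matrix_medians_sketch(m, use = "col", na.rm = TRUE)
row_medians <- matrix_medians_sketch(m, use = "row", na.rm = TRUE)
```

Passing `na.rm = TRUE` through `...` is what lets the medians ignore missing entries.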
Runs cor.test on a matrix of inputs.
cor_fast( x, y = NULL, use = "everything", method = "pearson", alternative = "two.sided", continuity = FALSE, return_matrix = TRUE )
x | a numeric vector, matrix, or data frame. |
y | NULL (default) or a vector. |
use | an optional character string giving a method for computing correlations in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", or "pairwise.complete.obs". |
method | which correlation method to use, "pearson" or "spearman" |
alternative | how to perform the statistical test |
continuity | should a continuity correction be applied |
return_matrix | should the matrices of values be returned, or a long data.frame |
Although the interface is mostly identical to the built-in stats::cor.test() method, there are some differences:
If only x is provided as a matrix, the columns must be named.
If providing both x and y, it is assumed they are both single vectors.
If NA values are present, this function does not error, but will either remove them or return NA, depending on the option.
"na.or.complete" is not a valid option for use.
A named list with matrices or data.frame is returned, with the rho and pvalue values.
a list of matrices, rho, pvalue, or a data.frame.
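To make the shape of the result concrete, here is a minimal base R sketch that loops `stats::cor.test()` over every pair of named columns and collects `rho` and `pvalue` matrices. The helper name `pairwise_cor_test_sketch` is hypothetical, and the loop is only an illustration of the idea; the package's own implementation is expected to be faster and may differ in how it handles the diagonal.

```r
# Hypothetical helper: run cor.test() on every pair of columns of a matrix.
pairwise_cor_test_sketch <- function(x, method = "pearson") {
  stopifnot(is.matrix(x), !is.null(colnames(x)))
  n <- ncol(x)
  rho <- matrix(1, n, n, dimnames = list(colnames(x), colnames(x)))
  # assumption: no test is run for a column against itself, so leave NA there
  pvalue <- matrix(NA_real_, n, n, dimnames = dimnames(rho))
  for (i in seq_len(n - 1)) {
    for (j in seq(i + 1, n)) {
      res <- stats::cor.test(x[, i], x[, j], method = method)
      rho[i, j] <- rho[j, i] <- unname(res$estimate)
      pvalue[i, j] <- pvalue[j, i] <- res$p.value
    }
  }
  list(rho = rho, pvalue = pvalue)
}

set.seed(1)
x <- matrix(rnorm(60), nrow = 20, ncol = 3,
            dimnames = list(NULL, c("a", "b", "c")))
res <- pairwise_cor_test_sketch(x)
```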
Given a square correlation matrix, converts it to a long data.frame, with three columns.
cor_matrix_2_long_df(in_matrix)
in_matrix | the correlation matrix |
The data.frame contains three columns:
s1: the first entry of comparison
s2: the second entry of comparison
cor: the correlation value
data.frame
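The conversion can be pictured with a short base R sketch. The function name `cor_to_long_sketch` is hypothetical, and whether the package returns every cell or only unique pairs is not stated here, so this version simply emits one row per matrix cell.

```r
# Hypothetical helper: square correlation matrix -> long data.frame (s1, s2, cor).
cor_to_long_sketch <- function(in_matrix) {
  stopifnot(nrow(in_matrix) == ncol(in_matrix),
            !is.null(rownames(in_matrix)), !is.null(colnames(in_matrix)))
  # every (row, column) index combination, in column-major order
  idx <- expand.grid(row = seq_len(nrow(in_matrix)),
                     col = seq_len(ncol(in_matrix)))
  data.frame(
    s1 = rownames(in_matrix)[idx$row],
    s2 = colnames(in_matrix)[idx$col],
    cor = in_matrix[cbind(idx$row, idx$col)],
    stringsAsFactors = FALSE
  )
}

ids <- c("a", "b", "c")
m <- matrix(c(1, 0.5, 0.2, 0.5, 1, 0.3, 0.2, 0.3, 1), nrow = 3,
            dimnames = list(ids, ids))
long <- cor_to_long_sketch(m)
```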
There may be good reasons to turn the logging off after it's been turned on. This basically tells the package that the logger isn't available.
disable_logging()
Choose to enable logging, to a specific file if desired.
enable_logging(log_file = NULL, memory = FALSE)
log_file | the file to log to |
memory | provide memory logging too? Only available on Linux and macOS |
Uses the logger package under the hood, which is suggested in the dependencies. Having logging enabled is a convenient way to see when things start and stop, and what exactly has been done, without needing to write messages to the console. It is especially useful if you are getting errors that you can't see directly: in that case, you can add "memory" logging to check whether you are running out of memory.
Default log file has the pattern:
YYYY.MM.DD.HH.MM.SS_ICIKendallTau_run.log
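For orientation, a file name following that pattern could be built in base R as shown below. This assumes the pattern reads as year, month, day, hour, minute, second; it is a sketch of the naming scheme, not the package's own code.

```r
# Build a timestamped log file name matching the documented pattern.
# Assumption: the fields are year.month.day.hour.minute.second.
time_stamp <- format(Sys.time(), "%Y.%m.%d.%H.%M.%S")
default_log_name <- paste0(time_stamp, "_ICIKendallTau_run.log")
```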
Given a data-matrix, computes the information-theoretic Kendall-tau-b between all samples.
ici_kendalltau( data_matrix, global_na = c(NA, Inf, 0), perspective = "global", scale_max = TRUE, diag_good = TRUE, include_only = NULL, alternative = "two.sided", continuity = FALSE, check_timing = FALSE, return_matrix = TRUE )
data_matrix | matrix or data.frame of values, samples are columns, features are rows |
global_na | numeric vector defining which values should be treated as NA globally |
perspective | how to treat missing data in denominator and ties, character |
scale_max | logical, should everything be scaled compared to the maximum correlation? |
diag_good | logical, should the diagonal entries reflect how many entries in the sample were "good"? |
include_only | only run the correlations that include the members (as a vector) or combinations (as a list or data.frame) |
alternative | what is the alternative for the p-value test? |
continuity | should a continuity correction be applied? |
check_timing | logical, should we try to estimate the run time for the full dataset? (default is FALSE) |
return_matrix | logical, should the data.frame or matrix result be returned? |
For more details, see the vignette: vignette("ici-kendalltau", package = "ICIKendallTau")
The default for global_na includes what values in the data to replace with NA for the Kendall-tau calculation. By default these are global_na = c(NA, Inf, 0). If you want to replace something other than 0, for example, you might use global_na = c(NA, Inf, -2), and all values of -2 will be replaced instead of 0.
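The replacement step can be illustrated in base R. The helper `replace_global_na_sketch` below is hypothetical and only shows the effect of different `global_na` vectors on a small matrix; it is not taken from the package source.

```r
# Hypothetical helper: set every value listed in global_na to NA.
replace_global_na_sketch <- function(data_matrix, global_na = c(NA, Inf, 0)) {
  # %in% matches NA as well, so existing NA values simply stay NA
  data_matrix[data_matrix %in% global_na] <- NA
  data_matrix
}

m <- matrix(c(5, 0, Inf, -2, NA, 7), nrow = 3)
default_na <- replace_global_na_sketch(m)
custom_na <- replace_global_na_sketch(m, global_na = c(NA, Inf, -2))
```

With the default, the 0 and Inf entries become NA and -2 is kept; with the custom vector, -2 becomes NA and 0 is kept as a real value.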
When check_timing = TRUE, 5 random pairwise comparisons will be run to generate timings on a single core, and then estimates of how long the full set will take are calculated. The timings are returned as a data.frame. The estimates will be on the low side, but they should give you a good idea of how long your data will take.
Returned is a list containing matrices with:
cor: scaled correlations
raw: raw Kendall-tau correlations
pvalue: p-values
taumax: the theoretical maximum Kendall-tau value possible
completeness: how complete the two samples are (i.e. how many entries are not missing in either sample)
Eventually, we plan to provide two more parameters for replacing values, feature_na for feature specific NA values and sample_na for sample specific NA values.
If you want to know if the missing values in your data are possibly due to left-censorship, we recommend testing that hypothesis with test_left_censorship() first.
list with cor, raw, pvalue, taumax, completeness
test_left_censorship()
pairwise_completeness()
kt_fast()
## Not run:
set.seed(1234)
s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10
matrix_1 = cbind(s1, s2)
r_1 = ici_kendalltau(matrix_1)
r_1$cor
#    s1 s2
# s1  1  1
# s2  1  1
names(r_1)
# "cor", "raw", "pvalue", "taumax", "completeness", "keep", "run_time"

s3 = s1
s3[sample(100, 50)] = NA
s4 = s2
s4[sample(100, 50)] = NA
matrix_2 = cbind(s3, s4)
r_2 = ici_kendalltau(matrix_2)
r_2$cor
#           s3        s4
# s3 1.0000000 0.9944616
# s4 0.9944616 1.0000000

# using include_only
set.seed(1234)
x = t(matrix(rnorm(5000), nrow = 100, ncol = 50))
# x has 50 rows and 100 columns, so the column names must be built from ncol(x)
colnames(x) = paste0("s", seq(1, ncol(x)))

# only calculate correlations of other columns with "s1"
include_s1 = "s1"
s1_only = ici_kendalltau(x, include_only = include_s1)

# only calculate correlations that involve "s1" or "s3"
include_s1s3 = c("s1", "s3")
s1s3_only = ici_kendalltau(x, include_only = include_s1s3)

# only specify certain pairs, either as a list
include_pairs = list(g1 = "s1", g2 = c("s2", "s3"))
s1_other = ici_kendalltau(x, include_only = include_pairs)

# or as a data.frame
include_df = as.data.frame(list(g1 = "s1", g2 = c("s2", "s3")))
s1_df = ici_kendalltau(x, include_only = include_df)
## End(Not run)
Calculates Kendall-tau, treating missingness as a source of information. Uses the calculation of tau-b.
ici_kt( x, y, perspective = "local", alternative = "two.sided", continuity = FALSE, output = "simple" )
x | numeric vector |
y | numeric vector |
perspective | should we consider the "local" or "global" perspective? |
alternative | what is the alternative for the p-value test? |
continuity | logical: if true, a continuity correction is applied to the p-value |
output | used to control reporting of values for debugging |
Calculates the information-content-informed Kendall-tau correlation measure. This correlation is based on concordant and discordant ranked pairs, like Kendall-tau, but also includes missing values (as NA). Missing values are assumed to be primarily due to lack of detection due to instrumental sensitivity, and therefore encode some information.
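One way to build intuition for "missing values encode information" is to rank every NA below all observed values and then compute an ordinary tau-b. The base R sketch below does that with `stats::cor(method = "kendall")`. Placing NA just below the minimum (here `min - 0.1`) and the helper name `fill_low` are assumptions made for illustration; this is not the package's implementation and it ignores the "local" versus "global" perspective.

```r
set.seed(1234)
x <- sort(rnorm(100))
y <- x + 1
y2 <- y
y2[1:10] <- NA  # pretend the ten lowest values fell below the detection limit

# Assumption for illustration: treat each NA as lower than any observed value.
fill_low <- function(v) {
  v[is.na(v)] <- min(v, na.rm = TRUE) - 0.1
  v
}

# tau-b where the missing values take part in the ranking (as ties at the bottom)
tau_with_missing <- stats::cor(fill_low(x), fill_low(y2), method = "kendall")
# tau-b where pairs containing NA are simply dropped
tau_complete_only <- stats::cor(x, y2, method = "kendall",
                                use = "pairwise.complete.obs")
```

In the first call the ten filled values are tied with each other in `y2`, so they still contribute concordant pairs against the observed values; in the second call they are discarded entirely.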
For more details see the ICI-Kendall-tau vignette:
browseVignettes("ICIKendallTau")
kendall tau correlation, p-value, max-correlation, completeness
x = sort(rnorm(100))
y = x + 1
y2 = y
y2[1:10] = NA
ici_kt(x, y)
ici_kt(x, y2, "global")
ici_kt(x, y2)
Uses the underlying C++ implementation of ici_kt to provide a fast version of Kendall-tau correlation.
kt_fast( x, y = NULL, use = "everything", alternative = "two.sided", continuity = FALSE, return_matrix = TRUE )
x | a numeric vector, matrix, or data frame. |
y | NULL (default) or a vector. |
use | an optional character string giving a method for computing correlations in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", or "pairwise.complete.obs". |
alternative | the type of test |
continuity | should a continuity correction be applied |
return_matrix | should the matrices of values be returned, or a long data.frame |
Although the interface is mostly identical to the built-in stats::cor() method, there are some differences:
If only x is provided as a matrix or data.frame, the columns must be named.
If providing both x and y, it is assumed they are both single vectors.
If NA values are present, this function does not error, but will either remove them or return NA, depending on the option.
"na.or.complete" is not a valid option for use.
A named list with matrices or data.frame is returned, with the tau and pvalue values.
a list of matrices, tau, pvalue, or a data.frame.
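Because the interface is described as close to `stats::cor()`, the effect of the `use` options on missing values can be previewed with the base function itself. The snippet below only demonstrates base R behaviour on a tiny example; it does not call `kt_fast()`.

```r
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2, 1, 4, 3, 5, 6)

# "everything": the NA propagates, so the result is NA
tau_everything <- stats::cor(x, y, method = "kendall", use = "everything")

# "pairwise.complete.obs": the pair containing NA is removed first
tau_pairwise <- stats::cor(x, y, method = "kendall",
                           use = "pairwise.complete.obs")
```

After dropping the incomplete pair, 8 of the 10 remaining pairs are concordant and 2 are discordant, giving a tau of 0.6.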
Logs the amount of memory being used to a log file if one is available, and generates warnings if the amount of RAM hits zero.
log_memory()
If a log_appender is available, logs the given message at the info level.
log_message(message_string)
message_string | the string to put in the message |
Given a long data.frame, converts it to a possibly square correlation matrix
long_df_2_cor_matrix(long_df, is_square = TRUE)
long_df | the long data.frame |
is_square | should it be a square matrix? |
matrix
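A minimal base R sketch of the square case follows. It assumes the long data.frame uses the column names `s1`, `s2` and `cor` (as produced by `cor_matrix_2_long_df()`); the helper name `long_to_matrix_sketch` is hypothetical and the non-square case is not covered.

```r
# Hypothetical helper: long data.frame (s1, s2, cor) -> square matrix.
long_to_matrix_sketch <- function(long_df) {
  s1 <- as.character(long_df$s1)
  s2 <- as.character(long_df$s2)
  ids <- unique(c(s1, s2))
  out <- matrix(NA_real_, nrow = length(ids), ncol = length(ids),
                dimnames = list(ids, ids))
  # a two-column character matrix indexes cells by (row name, column name)
  out[cbind(s1, s2)] <- long_df$cor
  out
}

long_df <- data.frame(
  s1 = c("a", "a", "b", "b"),
  s2 = c("a", "b", "a", "b"),
  cor = c(1, 0.5, 0.5, 1),
  stringsAsFactors = FALSE
)
cor_matrix <- long_to_matrix_sketch(long_df)
```

Pairs that are absent from the long data.frame stay NA in the resulting matrix.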
An example dataset that has missingness from left-censorship
missing_dataset
A matrix with 1000 rows and 20 columns, where rows are features and columns are samples.
Robert M Flight
Calculates the completeness between any two samples using "or": an entry counts as missing if it is missing in either X "or" Y.
pairwise_completeness( data_matrix, global_na = c(NA, Inf, 0), include_only = NULL, return_matrix = TRUE )
data_matrix | samples are columns, features are rows |
global_na | globally, what should be treated as NA? |
include_only | are there certain comparisons to do? |
return_matrix | should the matrix or data.frame be returned? |
matrix of degree of completeness
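The "or" rule can be sketched in base R: a feature counts towards a pair of samples only if it is present in both. Expressing completeness as a fraction of all features is an assumption made here for illustration, and the helper name `pairwise_completeness_sketch` is hypothetical; the package may scale or report the value differently.

```r
# Hypothetical helper: fraction of features present in both samples of each pair.
pairwise_completeness_sketch <- function(data_matrix, global_na = c(NA, Inf, 0)) {
  is_missing <- matrix(data_matrix %in% global_na,
                       nrow = nrow(data_matrix),
                       dimnames = dimnames(data_matrix))
  present <- (!is_missing) * 1
  # crossprod() counts, for each pair of columns, the rows present in both,
  # which is the same as "not missing in X or Y"
  crossprod(present) / nrow(data_matrix)
}

m <- cbind(
  s1 = c(1, 0, 3, 4),
  s2 = c(NA, 2, 3, 4),
  s3 = c(1, 2, 3, Inf)
)
completeness <- pairwise_completeness_sketch(m)
```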
Given a data-matrix of numeric data, calculates the rank of each row in each column (feature in sample), gets the median rank across all columns, and returns the original data with missing values set to NA, the reordered data, and a data.frame of the ranks of each feature and the number of missing values.
rank_order_data(data_matrix, global_na = c(NA, Inf, 0), sample_classes = NULL)
data_matrix | matrix or data.frame of values |
global_na | the values to consider as missing |
sample_classes | are the columns defined by some metadata? |
list with two matrices and a data.frame
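The steps in the description can be sketched in base R as below. The helper name `rank_order_sketch`, the output element names, and the choice to order by decreasing median rank are assumptions for illustration; `sample_classes` is ignored in this sketch.

```r
# Hypothetical helper following the described steps, without sample classes.
rank_order_sketch <- function(data_matrix, global_na = c(NA, Inf, 0)) {
  if (is.null(rownames(data_matrix))) {
    rownames(data_matrix) <- paste0("f", seq_len(nrow(data_matrix)))
  }
  data_matrix[data_matrix %in% global_na] <- NA
  # rank of each feature within each sample, keeping NA as NA
  ranks <- apply(data_matrix, 2, rank, na.last = "keep")
  median_rank <- apply(ranks, 1, stats::median, na.rm = TRUE)
  n_na <- rowSums(is.na(data_matrix))
  # assumption: highest median rank first
  ord <- order(median_rank, decreasing = TRUE)
  rank_df <- data.frame(feature = rownames(data_matrix),
                        median_rank = median_rank,
                        n_na = n_na,
                        stringsAsFactors = FALSE)
  list(original = data_matrix,
       ordered = data_matrix[ord, , drop = FALSE],
       ranks = rank_df[ord, ])
}

m <- cbind(s1 = c(10, 0, 30, 40), s2 = c(12, 5, NA, 50))
rownames(m) <- paste0("f", 1:4)
res <- rank_order_sketch(m)
```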
Allows the user to turn progress messages to the console on and off. Default is to provide messages to the console.
show_progress(progress = TRUE)
progress | logical to have it on or off |
Does a binomial test to check if the most likely cause of missing values is due to values being below the limit of detection, or coming from a left-censored distribution.
test_left_censorship( data_matrix, global_na = c(NA, Inf, 0), sample_classes = NULL )
data_matrix | matrix or data.frame of numeric data |
global_na | what represents zero or missing? |
sample_classes | which samples are in which class |
Each feature that is missing in a group of samples is saved as a candidate to test. For each sample, we calculate the median value with any missing values removed. For each feature that had a missing value, we test whether the remaining non-missing values are below the sample median in those samples where the feature is non-missing. The binomial test takes the total number of feature instances (minus missing values) as the number of trials, and the number of feature instances below the sample medians as the number of successes.
There is a bit more detail in the vignette: vignette("testing-for-left-censorship", package = "ICIKendallTau")
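The procedure described above can be approximated in base R with `stats::binom.test()`. The sketch below ignores `sample_classes`, and the null probability of 0.5 with a one-sided "greater" alternative is an assumption made for illustration rather than a statement of what the package does. The helper name `left_censor_sketch` and the simulated data are likewise hypothetical.

```r
# Hypothetical helper: binomial test for left-censorship on a single group.
left_censor_sketch <- function(data_matrix, global_na = c(NA, Inf, 0)) {
  data_matrix[data_matrix %in% global_na] <- NA
  # per-sample (column) medians, ignoring missing values
  sample_medians <- apply(data_matrix, 2, stats::median, na.rm = TRUE)
  # keep only features that are missing in at least one sample
  has_missing <- rowSums(is.na(data_matrix)) > 0
  candidates <- data_matrix[has_missing, , drop = FALSE]
  # compare each observed value with the median of its own sample
  below <- sweep(candidates, 2, sample_medians, FUN = "<")
  trials <- sum(!is.na(candidates))
  success <- sum(below, na.rm = TRUE)
  list(
    values = data.frame(trials = trials, success = success),
    # assumption: p = 0.5 under the null, one-sided test
    binomial_test = stats::binom.test(success, trials, p = 0.5,
                                      alternative = "greater")
  )
}

# Simulated data: each feature has its own abundance level, and everything
# below the overall 20th percentile is censored to NA.
set.seed(1234)
feature_level <- rnorm(200, mean = 10, sd = 2)
sim <- matrix(feature_level + rnorm(200 * 6, sd = 0.5), nrow = 200, ncol = 6)
sim[sim < stats::quantile(sim, 0.2)] <- NA
sim_result <- left_censor_sketch(sim)
```

In this simulation, features with missing entries sit near the censoring threshold, so their observed values fall mostly below the sample medians and the test leans towards left-censorship.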
data.frame of trials / successes, and binom.test result
# this example has 80% missing due to left-censorship
data(missing_dataset)
missingness = test_left_censorship(missing_dataset)
missingness$values
missingness$binomial_test
An example dataset from an RNA-seq experiment on yeast, created by Gierliński et al., "Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment", Bioinformatics, 31, 2015 https://doi.org/10.1093/bioinformatics/btv425.
yeast_missing
A matrix with 6887 rows (genes) and 96 columns (samples).
https://dx.doi.org/10.6084/M9.FIGSHARE.1425502.V1 https://dx.doi.org/10.6084/M9.FIGSHARE.1425503.V1