NA’s) in calculating a correlation between two variables, because in this setting an NA is informative. A Kendall Tau Correlation coefficient calculates correlation based on the number of concordant and discordant pairs:

$$\tau = \frac{ | pairs_{concordant} | - | pairs_{discordant} |}{\binom{n}{2}} $$

But these definitions can be expanded to handle missing observations, by treating a missing value as if it were smaller than every observed value: a pair with a single NA can still be concordant or discordant, and a pair of two NAs counts as a tie.

The base Kendall tau correlation must be adjusted to handle tied values, i.e., the tau-b version of the equation.
$$\tau = \frac{ | pairs_{concordant} | - | pairs_{discordant} |}{\sqrt{ ( n_0 - n_{xtie} ) ( n_0 - n_{ytie} )}} $$

where:

- $n_0 = n(n-1)/2$, the number of possible pairs among $n$ observations
- $n_{xtie} = \sum_i t_i(t_i-1)/2$, the number of tied pairs in X, with $t_i$ the size of the $i$-th group of tied X values
- $n_{ytie} = \sum_j u_j(u_j-1)/2$, the number of tied pairs in Y, defined analogously
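As a concrete illustration, a brute-force count of concordant, discordant, and tied pairs reproduces this tau-b formula; this is a Python sketch with made-up vectors, not the package's optimized implementation:

```python
from itertools import combinations
from math import sqrt

def tau_b(x, y):
    """Brute-force tau-b: classify every pair as concordant,
    discordant, or tied in x and/or y."""
    concordant = discordant = xtie = ytie = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            xtie += 1          # pair tied in x
        if dy == 0:
            ytie += 1          # pair tied in y
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2  # total number of pairs
    return (concordant - discordant) / sqrt((n0 - xtie) * (n0 - ytie))

x = [12, 2, 1, 12, 2]
y = [1, 4, 7, 1, 0]
print(round(tau_b(x, y), 4))  # -0.4714
```

The quadratic pair loop is only for clarity; as noted below, the package uses a merge-sort based algorithm that computes the same quantity much faster.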
When generating a correlation matrix (heatmap) for large analytical datasets, the number of observations in common can become quite low between any two variables. It becomes advantageous to scale by the pair of variables with the highest information content. One objective scaling factor is the highest possible absolute correlation at the maximum information content observed across a dataset, and dividing by this maximum possible absolute correlation would scale the whole dataset appropriately.
$$maxcorr = \frac{\binom{n-m}{2} + n * m}{\binom{n-m}{2} + n * m + \binom{m}{2}}$$

where:

- $n$ is the number of observations
- $m$ is the number of missing observations
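This scaling factor can be transcribed directly; a minimal Python sketch of the formula above, with a function name (max_corr) of our choosing:

```python
from math import comb

def max_corr(n, m):
    """Maximum possible absolute correlation with m missing values
    among n observations, transcribed from the formula above."""
    present_pairs = comb(n - m, 2)  # the binom(n-m, 2) term
    mixed_term = n * m              # the n*m term
    missing_ties = comb(m, 2)       # the binom(m, 2) term
    return (present_pairs + mixed_term) / (
        present_pairs + mixed_term + missing_ties
    )

print(max_corr(1000, 0))              # 1.0: no missing values, no penalty
print(round(max_corr(1000, 100), 4))  # 0.9903: missing values cap the correlation
```

Dividing every correlation in a dataset by the max_corr at the dataset's maximum observed information content rescales the whole matrix, as described above.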
The functions that implement this include:

ici_kt
: the workhorse, actually calculating a correlation between X and Y vectors. Its perspective argument controls how the NA values influence ties (e.g., perspective = "global").

ici_kendalltau
: handles comparisons for a matrix.
library(furrr)
plan(multisession)
We’ve also included a function for testing if the missingness in your
data comes from left-censorship, test_left_censorship
. We
walk through creating example data and testing it in the vignette Testing
for Left Censorship.
It turns out, if we think about it really hard, all that is truly necessary is to replace the missing values in each vector with a value smaller than that vector's minimum. For the local version, we first remove values that are missing from both vectors. Our C++ implementation does this explicitly so that we gain speed, instead of wrapping the {stats::cor} function. We also use the double merge-sort algorithm, translating the {scipy.stats} kendalltau function into C++ using {Rcpp}.
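The replacement trick can be sketched in Python (made-up data; a brute-force pair count stands in for the package's merge-sort implementation):

```python
from itertools import combinations
from math import sqrt

def tau_b(x, y):
    """Brute-force tau-b over all pairs."""
    conc = disc = xtie = ytie = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            xtie += 1
        if dy == 0:
            ytie += 1
        if dx * dy > 0:
            conc += 1
        elif dx * dy < 0:
            disc += 1
    n0 = len(x) * (len(x) - 1) // 2
    return (conc - disc) / sqrt((n0 - xtie) * (n0 - ytie))

def replace_missing(v):
    """Replace None with a value below the vector's minimum, so
    missing values become low-end ties."""
    present = [e for e in v if e is not None]
    fill = min(present) - 1
    return [fill if e is None else e for e in v]

x = [None, None, 5, 6, 7]
y = [None, 1, None, 8, 9]
print(round(tau_b(replace_missing(x), replace_missing(y)), 4))  # 0.6667
```

After the substitution, an ordinary tau-b computation handles the missing values for free: pairs of NAs become ties, and pairs with a single NA contribute concordance or discordance.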
x = rnorm(1000)
y = rnorm(1000)
library(microbenchmark)
microbenchmark(
cor(x, y, method = "kendall"),
ici_kt(x, y, "global"),
times = 5
)
#> Unit: microseconds
#> expr min lq mean median uq
#> cor(x, y, method = "kendall") 15632.208 15648.228 15737.67 15664.638 15707.71
#> ici_kt(x, y, "global") 323.474 347.479 382.66 347.728 370.18
#> max neval
#> 16035.550 5
#> 524.439 5
Just like R’s cor
function, we can also calculate
correlations between many variables. Let’s make some fake data and try
it out.
set.seed(1234)
s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10
s2[sample(length(s1), 100)] = s1[1:100]
s3 = s1
s3[c(1:10, sample(length(s1), 5))] = NA
matrix_1 = cbind(s1, s2, s3)
r_1 = ici_kendalltau(matrix_1)
r_1$cor
#> s1 s2 s3
#> s1 1.0000000 0.8049209 0.9907488
#> s2 0.8049209 1.0000000 0.7956652
#> s3 0.9907488 0.7956652 0.9850000
If you have the {future} and {furrr} packages installed, then it is also possible to split up a set of matrix comparisons across compute resources for any multiprocessing engine registered with {future}.
In the case of hundreds of thousands of comparisons to be done, the result matrices can become very, very large, and require lots of memory for storage. They are also inefficient, as both the lower and upper triangular components are stored. An alternative storage format is a data.frame, where there is a single row for each comparison performed. This is actually how the results are stored internally; they are converted to a matrix form if requested (the default). To keep the data.frame output, add the argument return_matrix = FALSE to the call of ici_kendalltau.
r_3 = ici_kendalltau(matrix_1, return_matrix = FALSE)
r_3$cor
#> s1 s2 core raw pvalue taumax completeness cor
#> 1 s1 s2 1 0.8049209 0 1.0000000 1.000 0.8049209
#> 2 s1 s3 1 0.9907488 0 0.9998949 0.985 0.9907488
#> 3 s2 s3 1 0.7956652 0 0.9998949 0.985 0.7956652
#> 4 s1 s1 0 1.0000000 0 1.0000000 1.000 1.0000000
#> 5 s2 s2 0 1.0000000 0 1.0000000 1.000 1.0000000
#> 6 s3 s3 0 0.9850000 0 1.0000000 0.985 0.9850000
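For illustration, the symmetric matrix form can be rebuilt from such one-row-per-comparison records; a minimal Python sketch using the cor values from the output above:

```python
def long_to_matrix(rows):
    """Build a symmetric correlation matrix (dict-of-dicts) from
    one-row-per-comparison records."""
    names = sorted({r["s1"] for r in rows} | {r["s2"] for r in rows})
    mat = {a: {b: None for b in names} for a in names}
    for r in rows:
        mat[r["s1"]][r["s2"]] = r["cor"]
        mat[r["s2"]][r["s1"]] = r["cor"]  # mirror into the other triangle
    return mat

rows = [
    {"s1": "s1", "s2": "s2", "cor": 0.8049209},
    {"s1": "s1", "s2": "s3", "cor": 0.9907488},
    {"s1": "s2", "s2": "s3", "cor": 0.7956652},
    {"s1": "s1", "s2": "s1", "cor": 1.0},
    {"s1": "s2", "s2": "s2", "cor": 1.0},
    {"s1": "s3", "s2": "s3", "cor": 0.985},
]
m = long_to_matrix(rows)
print(m["s1"]["s3"], m["s3"]["s1"])  # 0.9907488 0.9907488
```

The long form stores each off-diagonal value once; the matrix form duplicates it into both triangles, which is where the memory overhead comes from.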
It is possible to log the steps being performed and how much memory is being used (on Linux, at least) as correlations are calculated. This can be useful when running very large sets of correlations, for example to make sure that too much memory is not being used.
To enable logging, the {logger} package must be installed. If a
log_file
is not supplied, one will be created with the
current date and time.
By default, ici_kendalltau also shows progress messages; if you want to turn them off, you can do: