Changed some functions to treat columns as samples and rows as features:
keep_non_zero_percentagesummarize_datacalculate_fratioAdded keep_non_missing_percentage, which allows using multiple values to represent missingnes.
Made summarize_data handle possible missing values.
Removed correlation calculation functions, those have been superseded by ICIKendallTau.
only_high to determine_outliers to only look at the high end of the score distribution for outliers, as sometimes boxplot.stats will pick up outliers at the low end as well.Making the splitup version of ICI-Kendall-tau the "implementation" (visqc_ici_kendallt), and using a single core if the user doesn't setup a "plan" first.
A reference version still exists so we can run tests against it, but it is no longer exported for general users.
Also inlined the C++ sign function, which gave us another 3X speedup on my 8 core machine on a larger test data set.
ici_kendallt.ici_kendallt, and variants around calculating all pairwise correlations
between samples; visqc_ici_kendallt and visqc_ici_kendallt_splitup for parallel processing.Removed requirement for ggbiplot, instead we added a function for calculating
the variances of each of the PCs in the scores.
updated the vignette accordingly.
Now using globally_it_weighted_correlation and locally_it_weighted_correlation
instead of pairwise_correlation.
keep_non_zero_percentage gains an argument, all, that defaults to FALSE
to keep previous behavior. Setting all = TRUE means that the value must be
non-zero in at least X% of all of the sample classes.median_correlations gains a new argument, between_classes to generate the
median values to samples in other classes. This causes the appearance of two
more columns when set to TRUE. The default is FALSE, so hopefully this does not
cause current code to misbehave, but I've bumped the version number as a warning.Augmented correlations (weight = TRUE) should be much more useful and interpretable.
information_volume and correspondence calculations improved. Namely that
information_volume is being scaled by the maximum.
correspondence by default does not consider presence of zeros in both
samples to be informative, this can be changed by setting not_both = TRUE. The
default is more useful in cases where there are lots of features and the data is
sparse, and zeros are likely to happen by chance.
In addition to returning the cor matrix and keep matrix, pairwise_correlations
now returns the raw correlations, and the weighting matrices info and correspondence
so that each one can be examined.
The diagonal of info weighting corresponds to how many features a sample has
compared to the sample with the most features.
Added two functions, information_volume and correspondence to calculate
weights based on the amount of things that are non-zero in both things when
doing pairwise correlation.
Added logical argument weight to pairwise_correlation to weight the correlations. If weight = TRUE, the diagonal will not be 1 anymore, but instead will reflect how many features out of the total are in that sample.
median_correlations that meant the wrong sample ids
might be added to the output data, making detection of real problems difficultpairwise_correlation now uses cor internally directly, whereas previously
it did a for loop to allow pairwise comparisons. This makes the correlations
3x faster.
count has been removed from the list returned by pairwise_correlation
new function pairwise_correlation_count to get the counts in each pairwise
comparison
cor), counts in each correlation (count),
and which points passed the criteria (keep).