Title: | Development of visualization methods for quality control |
---|---|
Description: | Provides utilities useful quality control of high-throughput -omics datasets. |
Authors: | Robert M Flight [aut, cre], Hunter NB Moseley [aut] |
Maintainer: | Robert M Flight <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.1 |
Built: | 2024-11-12 05:57:54 UTC |
Source: | https://github.com/MoseleyBioinformaticsLab/visualizationQualityControl |
given a data.frame of means and variances, calculate mean sd at low end and mean rsd at high end.
calc_sd_rsd(data, low_cut, hi_cut = NULL)
calc_sd_rsd(data, low_cut, hi_cut = NULL)
data |
data.frame of means and variances |
low_cut |
means <= this value used for average sd |
hi_cut |
means >= this value used for average rsd |
vector
given a data.frame of means and variances, use a two step non-linear least squares. The first step is done on the mean vs sd, then the estimates are used in a second that estimates them using the mean vs rsd.
calc_sd_rsd_nls(data, ...)
calc_sd_rsd_nls(data, ...)
data |
data.frame of means and variances |
... |
other nls parameters |
vector
given a data matrix of samples (columns) and features (rows), and a vector of classes (character or factor), calculate an F-ratio for each feature.
calculate_fratio(data, data_classes)
calculate_fratio(data, data_classes)
data |
the data matrix, with samples (columns) and features (rows) |
data_classes |
what are the classes of the samples (columns) |
vector
For a given set of feature-sample matrices, calculates how many features are in that sample, as well as in common to all other samples, and if provided, in common within and outside the same sample group.
count_matching_features(feature_matrix, zero_value = NA, groups = NULL)
count_matching_features(feature_matrix, zero_value = NA, groups = NULL)
feature_matrix |
the feature to sample matrix. |
zero_value |
what is the zero value? Default is NA |
groups |
what are the groups |
data.frame
determine outliers
determine_outliers( median_correlations = NULL, outlier_fraction = NULL, cor_weight = 1, frac_weight = 1, only_high = TRUE )
determine_outliers( median_correlations = NULL, outlier_fraction = NULL, cor_weight = 1, frac_weight = 1, only_high = TRUE )
median_correlations |
median correlations |
outlier_fraction |
outlier fractions |
cor_weight |
how much weight for the correlation score? |
frac_weight |
how much weight for the outlier fraction? |
only_high |
should only things at the low end of score be removed? |
For outlier sample detection, one should first generate median correlations using 'median_correlations', and outlier fractions using 'outlier_fraction'. If you only have one or the other, than you should use named arguments to only pass the one or the other.
Alternatively, you can change the weighting used for median correlations or outlier fraction, including setting them to 0.
data.frame
Given a value matrix (features are columns, samples are rows), and sample classes, find those things that are not zero in at least a certain number of one of the classes, and keep them
filter_non_zero_percentage(data_matrix, sample_classes = NULL, keep_num = 0.75)
filter_non_zero_percentage(data_matrix, sample_classes = NULL, keep_num = 0.75)
data_matrix |
the matrix of values to work with |
sample_classes |
the classes of each sample |
keep_num |
what number of samples in each class need a non-zero value (see Details) |
This function is being deprecated and all code should use the
keep_non_zero_percentage
function instead.
matrix
keep_non_zero_percentage
When multiple sample classes need to be visualized on a heatmap, it is useful to be able to distinguish them by color. This function generates a set of colors for sample classes
generate_group_colors(n_group, randomize = NULL)
generate_group_colors(n_group, randomize = NULL)
n_group |
how many groups should there be colors for |
randomize |
should colors be randomized? (default is |
the default for randomize
is NULL
, so that reordering
the colors randomly is decided purely based on the number of colors requested.
Currently, that cutoff is 5 colors, less than that the colors will
always be in the same order, for 5 colors or more, they will
be in a scrambled order, different each time unless set.seed
is
used. If randomize
is TRUE
or FALSE
, then it overrides
the defaults.
test data of 10 sample to sample correlations where samples are drawn from two groups. Generated by rmflight from random distributions
matrix with 10 rows and 10 columns, with row and colnames
generated by rmflight
Example data used for demonstrating median correlation
. A list with
2 named entries:
grp_cor_data
grp_cor_data
List with 2 entries, data and class
a data matrix with 100 rows and 20 columns
a character vector of 20 entries denoting classes
The data comes from two groups of samples, where there is ~0.85 correlation within each group, and ~0.53 correlation between groups.
Robert M Flight
Example data that requires log-transformation before doing PCA or other QC. A list with 2 named entries:
grp_exp_data
grp_exp_data
List with 2 entries, data and class
a data matrix with 1000 rows and 20 columns
a character vector of 20 entries denoting classes
The data comes from two groups of samples, where there is ~0.80 correlation within each group, and ~0.38 correlation between groups.
Robert M Flight
meta-data for grp_cor
.
data.frame with 10 rows, and 2 columns, grp
defining with group
and set
, defining the set.
generated by rmflight
Given a value matrix (features are rows, samples are columns), and sample classes, find those things that are not missing in at least a certain number of samples in one of the classes, and keep those features for further processing.
keep_non_missing_percentage( data_matrix, sample_classes = NULL, keep_num = 0.75, missing_value = NA, all = FALSE )
keep_non_missing_percentage( data_matrix, sample_classes = NULL, keep_num = 0.75, missing_value = NA, all = FALSE )
data_matrix |
the matrix of values to work with |
sample_classes |
the classes of each sample |
keep_num |
what number of samples in each class need a non-missing value (see Details) |
missing_value |
what number(s) represents missing values (default NA) |
all |
is this an either / or OR does it need to be present in all? |
The number of samples that must be non-missing can be expressed either as a whole
number (that is greater than one), or as a fraction that will be be multiplied
by the number of samples in each class to get the lower limits for each of the classes.
If there are multiple values that represent missingness, use a vector. For example, to
to use both 0 and NA, you can do missing_value = c(NA, 0)
.
logical
Given a value matrix (features are rows, samples are columns), and sample classes, find those things that are not zero in at least a certain number of samples in one of the classes, and keep those features for further processing.
keep_non_zero_percentage( data_matrix, sample_classes = NULL, keep_num = 0.75, zero_value = 0, all = FALSE )
keep_non_zero_percentage( data_matrix, sample_classes = NULL, keep_num = 0.75, zero_value = 0, all = FALSE )
data_matrix |
the matrix of values to work with |
sample_classes |
the classes of each sample |
keep_num |
what number of samples in each class need a non-zero value (see Details) |
zero_value |
what number represents zero values |
all |
is this an either / or OR does it need to be present in all? |
The number of samples that must be non-zero can be expressed either as a whole number (that is greater than one), or as a fraction that will be be multiplied by the number of samples in each class to get the lower limits for each of the classes.
logical
Given a correlation matrix the sample class information, calculates the median correlations of the samples within the class and between classes.
median_class_correlations(cor_matrix, sample_classes = NULL)
median_class_correlations(cor_matrix, sample_classes = NULL)
cor_matrix |
the sample - sample correlations |
sample_classes |
the sample classes as a character or factor |
matrix
Given a correlation matrix and optionally the sample class information, calculates the median correlations of each sample to all other samples in the same class. May be useful for determining outliers.
median_correlations(cor_matrix, sample_classes = NULL, between_classes = FALSE)
median_correlations(cor_matrix, sample_classes = NULL, between_classes = FALSE)
cor_matrix |
the sample - sample correlations |
sample_classes |
the sample classes as a character or factor |
between_classes |
should the between class correlations be evaluated? |
The data.frame may have 5 columns, first three are always present,
the second two come up if between_classes = TRUE
:
the median correlation with other samples
the sample id, either the rowname or an index
the class of the sample. If not provided, set to "C1"
the class of the other sample
sample_class::compare_class
for easy grouping
data.frame
Calculates the fraction of entries in each sample that are more than X
standard deviations from the trimmed mean. See Details.
outlier_fraction( data, sample_classes = NULL, n_trim = 3, n_sd = 5, remove_missing = NA )
outlier_fraction( data, sample_classes = NULL, n_trim = 3, n_sd = 5, remove_missing = NA )
data |
the data matrix (samples are columns, rows are features) |
sample_classes |
the sample classes |
n_trim |
how many features to trim at each end (default is 3) |
n_sd |
how many SD before treated as outlier (default is 5) |
remove_missing |
what missing values be removed before calculating? (default is NA) |
Based on the Gerlinski paper link
for each feature (in a sample class), take the range across all the samples,
remove the n_trim
lowest and highest values, and calculate the mean
and sd
, and the actual upper and lower ranges of n_sd
from the
mean
. For each sample and feature, determine if within or outside
that limit. Fraction is reported as the number of features outside the range.
Returns a data.frame
with:
the sample id, rownames
are used if available, otherwise
this is an index
the class of the sample if sample_classes
were provided,
otherwise given a default of "C1"
the actual outlier fraction calculated for that sample
data.frame
given a matrix (maybe a distance matrix), cluster and then re-order using dendsort.
similarity_reorder( similarity_matrix, matrix_indices = NULL, transform = "none", hclust_method = "complete", dendsort_type = "min" )
similarity_reorder( similarity_matrix, matrix_indices = NULL, transform = "none", hclust_method = "complete", dendsort_type = "min" )
similarity_matrix |
matrix of similarities |
matrix_indices |
indices to reorder |
transform |
should a transformation be applied to the data first |
hclust_method |
which method for clustering should be used? |
dendsort_type |
how should the reordering be done? (default is "min") |
a dendrogram object. To get the order use order.dendogram
.
to avoid spurious visualization problems, it is useful in a heatmap visualization to reorder the samples within each sample class. This function uses hierarchical clustering and dendsort to sort entries in a distance matrix.
similarity_reorderbyclass( similarity_matrix, sample_classes = NULL, transform = "none", hclust_method = "complete", dendsort_type = "min" )
similarity_reorderbyclass( similarity_matrix, sample_classes = NULL, transform = "none", hclust_method = "complete", dendsort_type = "min" )
similarity_matrix |
matrix of similarities between objects |
sample_classes |
data.frame or factor denoting classes |
transform |
a transformation to apply to the data |
hclust_method |
which method for clustering should be used |
dendsort_type |
how should dendsort do reordering? |
The similarity_matrix
should be either a square matrix of similarity values
or a distance matrix of class dist
. If your matrix does not encode a "true"
distance, you can use a transform
to turn it into a true distance
(for example, if you have correlation, then a distance would be 1 - correlation,
use "sub_1" as the transform argument).
The sample_classes
should be either a data.frame or factor argument. If
a data.frame is passed, all columns of the data.frame will be pasted together
to create a factor for splitting the data into groups. If the rownames of the
data.frame do not correspond to the rownames or colnames of the matrix, then
it is assumed that the ordering in the matrix and the data.frame are identical.
a list containing the reordering of the matrix in a:
dendrogram
numeric vector
character vector (will be NULL if rownames are not set on the matrix)
library(visualizationQualityControl) set.seed(1234) mat <- matrix(rnorm(100, 2, sd = 0.5), 10, 10) rownames(mat) <- colnames(mat) <- letters[1:10] neworder <- similarity_reorderbyclass(mat) mat[neworder$indices, neworder$indices] sample_class <- data.frame(grp = rep(c("grp1", "grp2"), each = 5), stringsAsFactors = FALSE) rownames(sample_class) <- rownames(mat) neworder2 <- similarity_reorderbyclass(mat, sample_class[, "grp", drop = FALSE]) # if there is a class with only one member, it is dropped, with a warning sample_class[10, "grp"] = "grp3" neworder3 <- similarity_reorderbyclass(mat, sample_class[, "grp", drop = FALSE]) neworder3$indices # 10 should be missing mat[neworder2$indices, neworder2$indices] cbind(neworder$names, neworder2$names)
library(visualizationQualityControl) set.seed(1234) mat <- matrix(rnorm(100, 2, sd = 0.5), 10, 10) rownames(mat) <- colnames(mat) <- letters[1:10] neworder <- similarity_reorderbyclass(mat) mat[neworder$indices, neworder$indices] sample_class <- data.frame(grp = rep(c("grp1", "grp2"), each = 5), stringsAsFactors = FALSE) rownames(sample_class) <- rownames(mat) neworder2 <- similarity_reorderbyclass(mat, sample_class[, "grp", drop = FALSE]) # if there is a class with only one member, it is dropped, with a warning sample_class[10, "grp"] = "grp3" neworder3 <- similarity_reorderbyclass(mat, sample_class[, "grp", drop = FALSE]) neworder3$indices # 10 should be missing mat[neworder2$indices, neworder2$indices] cbind(neworder$names, neworder2$names)
Given a matrix and a data.frame, character vector or data.frame of groups, splits the indices / names of the matrix into groups appropriately. This function assumes that the matrix and the groups are in the correct order!!
split_groups(in_matrix, groups = NULL)
split_groups(in_matrix, groups = NULL)
in_matrix |
the matrix we want to split up |
groups |
a data.frame, character vector or factor |
list of groups
summarizes a matrix or data.frame, where columns are samples and rows are features
summarize_data( in_data, sample_classes = NULL, avg_function = mean, log_transform = FALSE, remove_missing = NA )
summarize_data( in_data, sample_classes = NULL, avg_function = mean, log_transform = FALSE, remove_missing = NA )
in_data |
matrix or data.frame |
sample_classes |
which samples are in which class |
avg_function |
which function to use for summary |
log_transform |
apply a log-transform to the mean |
remove_missing |
remove missing values before summarizing |
data.frame
Given a matrix of PCA scores, set of sample attributes to test, goes through and performs an ICI-Kt of the scores versus the attribute.
visqc_cor_pca_scores(pca_scores, sample_info)
visqc_cor_pca_scores(pca_scores, sample_info)
pca_scores |
the scores matrix to test |
sample_info |
data.frame of sample attributes to test Important: All of the attributes must be numeric, or character. If character, they will be transformed to a factor, and the numeric factor levels will be used instead. If missing values are present, that is OK, as long as they are missing-not-at-random (i.e. missing at the low end of the values). |
data.frame
rolls some of the common Heatmap
options into a single function call
to make life easier when creating lots of heatmaps. Note: clustering
of rows and columns is disabled, it is expected that you are reordering the
matrix beforehand, or passing in column_order
and row_order
as
arguments to be passed to Heatmap
(see example). Matrices can be reordered
using similarity_reorderbyclass
, and nice class colors generated
using generate_group_colors
visqc_heatmap( matrix_data, color_values, title = "", row_color_data = NULL, row_color_list = NULL, col_color_data = NULL, col_color_list = NULL, ... )
visqc_heatmap( matrix_data, color_values, title = "", row_color_data = NULL, row_color_list = NULL, col_color_data = NULL, col_color_list = NULL, ... )
matrix_data |
the matrix you want to plot as a heatmap |
color_values |
the color mapping of values to colors (see Details) |
title |
what do the values represent |
row_color_data |
data for row annotations |
row_color_list |
list for row annotations |
col_color_data |
data for column annotations |
col_color_list |
list for column annotations |
... |
other |
This function uses the ComplexHeatmap
package to produce
heatmaps with complex row- and column-color annotations. Both row_color_data
and col_color_data
should be data.frame
's where each column describes
meta-data about the rows or columns of the matrix. The row_color_list
and
col_color_list
provide the mapping of color to annotation, where each
list
entry should be a named vector of colors, with the list entry
corresponding to a column entry in the data.frame, and the names of the colors
corresponding to annotations in that column.
## Not run: library(circlize) data(grp_cor) data(grp_info) colormap <- colorRamp2(c(0, 1), c("black", "white")) annotation_color <- c(grp1 = "green", grp2 = "red", set1 = "blue", set2 = "yellow") row_data <- grp_info[, "grp", drop = FALSE] col_data <- grp_info[, "set", drop = FALSE] row_annotation = list(grp = annotation_color[1:2]) col_annotation = list(set = annotation_color[3:4]) visqc_heatmap(grp_cor, colormap, row_color_data = row_data, row_color_list = row_annotation, col_color_data = col_data, col_color_list = col_annotation) reorder_sim <- similarity_reorderbyclass(grp_cor, transform = "sub_1") visqc_heatmap(grp_cor, colormap, "reorder1", row_data, row_annotation, col_data, col_annotation, column_order = reorder_sim$indices, row_order = reorder_sim$indices) sample_classes <- grp_info[, "grp", drop = FALSE] reorder_sim2 <- similarity_reorderbyclass(grp_cor, sample_classes, "sub_1") visqc_heatmap(grp_cor, colormap, "reorder2", row_data, row_annotation, col_data, col_annotation, column_order = reorder_sim2$indices, row_order = reorder_sim2$indices) ## End(Not run)
## Not run: library(circlize) data(grp_cor) data(grp_info) colormap <- colorRamp2(c(0, 1), c("black", "white")) annotation_color <- c(grp1 = "green", grp2 = "red", set1 = "blue", set2 = "yellow") row_data <- grp_info[, "grp", drop = FALSE] col_data <- grp_info[, "set", drop = FALSE] row_annotation = list(grp = annotation_color[1:2]) col_annotation = list(set = annotation_color[3:4]) visqc_heatmap(grp_cor, colormap, row_color_data = row_data, row_color_list = row_annotation, col_color_data = col_data, col_color_list = col_annotation) reorder_sim <- similarity_reorderbyclass(grp_cor, transform = "sub_1") visqc_heatmap(grp_cor, colormap, "reorder1", row_data, row_annotation, col_data, col_annotation, column_order = reorder_sim$indices, row_order = reorder_sim$indices) sample_classes <- grp_info[, "grp", drop = FALSE] reorder_sim2 <- similarity_reorderbyclass(grp_cor, sample_classes, "sub_1") visqc_heatmap(grp_cor, colormap, "reorder2", row_data, row_annotation, col_data, col_annotation, column_order = reorder_sim2$indices, row_order = reorder_sim2$indices) ## End(Not run)
Given a set of PCA scores, calculates their variance contributions, cumulative contributions, and generates a percent label that can be used for labeling plots.
visqc_score_contributions(pca_scores)
visqc_score_contributions(pca_scores)
pca_scores |
matrix of scores, columns are each PC |
data.frame
Given a matrix of loadings for principal components, and a set of components to test, for each loading in each component, generates a null distribution from the other loadings in all the other components, and reports a p-value for that loading.
visqc_test_pca_loadings( loadings, test_columns, progress = FALSE, direction = FALSE )
visqc_test_pca_loadings( loadings, test_columns, progress = FALSE, direction = FALSE )
loadings |
matrix of loadings from pca decomposition |
test_columns |
names of the columns of the loadings to test |
progress |
should progress be reported |
direction |
should direction of loading be tested? |
named list
Given a matrix of PCA scores, set of sample attributes to test, goes through and performs an ANOVA of the scores versus the attribute.
visqc_test_pca_scores(pca_scores, sample_info)
visqc_test_pca_scores(pca_scores, sample_info)
pca_scores |
matrix of scores from a PCA decomposition |
sample_info |
data.frame of sample attributes to test |
data.frame