---
title: "Quality Control Example"
author: "Robert M Flight"
date: "`r Sys.time()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quality Control Example}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{visualizationQualityControl, ggplot2, ComplexHeatmap,
  viridis, circlize, ICIKendallTau}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

## Introduction

This vignette shows how the various functions in this package can be used for 
basic quality control, in addition to advocating a particular workflow for 
examining experimental data prior to analysis. These steps consist of:

* Principal components analysis
* Correlation heatmap
* Median correlation
* Feature outliers

In all of the examples, we will compare and contrast a dataset that is composed
of two different conditions, and the same dataset wherein two samples have had
their class labels switched by mistake, and how the visualizations above can
illustrate potential problems.

## Data

```{r setup, echo = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)
```

```{r load_data, message=FALSE}
library(visualizationQualityControl)
library(ggplot2)
data(grp_exp_data)
exp_data <- grp_exp_data$data
rownames(exp_data) <- paste0("f", seq(1, nrow(exp_data)))
colnames(exp_data) <- paste0("s", seq(1, ncol(exp_data)))

sample_info <- data.frame(id = colnames(exp_data), class = grp_exp_data$class)
sample_classes <- sample_info$class

mix_data <- exp_data
mix_data[, 5] <- exp_data[, 19]
mix_data[, 19] <- exp_data[, 5]
```


The data used here are from a simulated experiment, where there are two classes
of samples. `r nrow(exp_data)` features measured, across `r ncol(exp_data)` samples
from `r length(unique(sample_classes))` groups, with `r sum(sample_classes == "grp1")`
in each group.

The actual values we are going to use are in `exp_data`. The classes of samples are
defined in `sample_classes`. We will also create a version of the data that has
two samples completely switched by accident, labels and all, in `mix_data`.

```{r show_data}
str(exp_data)
exp_data[1:5, 1:5]
sample_info
```

## Data Transformation

Before we do anything else, we need to transform the data. This is because
this and other -omics data frequently have a distribution and error structure
that many statistical methods will completely choke on, or at least give you
incorrect results.

**Make sure to ask** if data has been transformed in any way!

Here I will show what the raw and transformed data looks like.

```{r raw_plot, figure.width = 5, figure.height = 4}
exp_df = as.data.frame(exp_data)
ggplot(exp_df, aes(x = s1, y = s2)) + geom_point() + labs(title = "Raw Data")
```

Notice the dispersion in the values as the actual values increase. Most statistical
methods don't like this. 

```{r transform}
log_data <- log(exp_data)
```

```{r log_plot}
log_df = as.data.frame(log_data)
ggplot(log_df, aes(x = s1, y = s2)) + geom_point() + labs(title = "Log Transform")
```

But, lots of other methods don't deal well with `NA` or `Inf` values, which is what you
get when you do a log-transform on negative or zero values. There are a couple of 
solutions:

* If have negatives, add the most negative + small offset
* If have zeros, use `log1p` to handle zero and small values

```{r transform_data, eval=FALSE}
log1_data <- log1p(exp_data)

small_value <- min(exp_data[exp_data != 0]) / 100
log2_data <- log2(exp_data + small_value)
```

Most of the methods in this package can handle the presence of `NA` or `Inf`, 
except for the principal components analysis (PCA). Because we have physical data,
the lowest value should be zero. Therefore we can use `log1p`, which will keep
the zeros as zeros. Alternatively, we could use `log1p` for PCA, and `log` for
the correlations.

```{r log_transform}
log_data <- log1p(exp_data)
log_mix <- log1p(mix_data)
```


## Principal Components Analysis

As a first step, we will use principal components analysis (PCA) to decompose
the data. PCA is trying to find linear combinations of the original variables
that account for the maximum amount of variance, and then finding the next
combination that is orthogonal to the first, and so on and so on. It is extemely
useful for confirming that the largest source of variance is the biological one,
and for examining if any confounds can explain some sources of variance.

```{r do_pca}
pca_data <- prcomp(t(log_data), center = TRUE)
pca_mixed <- prcomp(t(log_mix), center = TRUE)
```

### Visualize Them

To visualize the data, we plot the scores.
If we want to know how much a PC contributes to the variances, we can get a summary of them using `visqc_score_contributions`.

```{r plot_good}
gd_scores = cbind(as.data.frame(pca_data$x), sample_info)
gd_pca = ggplot(gd_scores, aes(x = PC1, y = PC2, color = class)) + geom_point() + ggtitle("Good Data")
gd_pca
```

```{r good_contributions}
knitr::kable(visqc_score_contributions(pca_data$x))
```

```{r plot_bad}
bad_scores = cbind(as.data.frame(pca_mixed$x), sample_info)
bad_pca <- ggplot(bad_scores, aes(x = PC1, y = PC2, color = class)) + geom_point() + ggtitle("Bad Data")
bad_pca
```

```{r bad_contributions}
knitr::kable(visqc_score_contributions(pca_mixed$x))
```

Note in this case **PC1** contains a large proportion of variance, and separates
the samples very well. The PCA plots rarely look this good in practice!

## Correlation Heatmap

Correlation heatmaps show much of the same information as the PCA plots, but in
a different way.

### Calculate Correlations

We recommend to use our information-content-informed Kendall-tau {ICIKendallTau::ici_kendalltau} correlation, that is scale invariant, and includes some effects of missing values.
Note that we take the transpose of the data, because this function assumes
that data are organized with *features* as *columns* and *samples* as *rows*. 

```{r calc_correlations}
data_cor <- ICIKendallTau::ici_kendalltau(exp_data)
```

This returns a list with some useful information, the actual correlations in `cor`,
the number of points in each correlation in `count`, and which points pass the
various filters in `keep`. We are really only interested in `cor` in this case.

We can also check that we did the right correlations by looking at the dimensions
of the matrix, in this case we expect a `r ncol(exp_data)` *by* `r ncol(exp_data)`
matrix.

```{r keep_cor}
data_cor <- data_cor$cor
dim(data_cor)
```

We also do the same for our **mixed-up** data.

```{r mix_cor}
mix_cor <- ICIKendallTau::ici_kendalltau(mix_data)$cor
```

### Reorder Correlations

To make the heatmap more useful, we also do clustering within each of the **sample classes**
and reorder the correlations, this highlights sub-groups within classes as well
as potential outliers.

```{r reorder_cor}
data_order <- similarity_reorderbyclass(data_cor, sample_info[, "class", drop = FALSE],
                                        transform = "sub_1")
mix_order <- similarity_reorderbyclass(mix_cor, sample_info[, "class", drop = FALSE],
                                       transform = "sub_1")
```

### Color by Class

We also want to color them by their class.

```{r color_class}
data_legend <- generate_group_colors(2)
names(data_legend) <- c("grp1", "grp2")

row_data <- sample_info[, "class", drop = FALSE]
row_annotation <- list(class = data_legend)
```

### Map Correlation to Color

Correlation values are mapped to colors using the `colorRamp2` function, by specifying
the range of correlations, and what colors to map them to. Here the `viridis`
color-scale is used because it is perceptually uniform and is good for those
suffering from various types of color blindness. Other choices might be
**black -> white**, or the other color maps in the `viridis` package. More
information about `viridis` is available [here](http://bids.github.io/colormap/).

```{r correlation_to_color}
library(viridis)
library(circlize)
colormap <- colorRamp2(seq(0.5, 1, length.out = 20), viridis::viridis(20))
```


### Heatmap!

Finally we can make the heatmaps!

```{r heatmap1, fig.width=5, fig.height=4}
visqc_heatmap(data_cor, colormap, "Good Data", row_color_data = row_data,
              row_color_list = row_annotation, col_color_data = row_data,
              col_color_list = row_annotation, row_order = data_order$indices,
              column_order = data_order$indices)
```


```{r heatmap2, fig.width=5, fig.height=4}
visqc_heatmap(mix_cor, colormap, "Bad Data", row_color_data = row_data,
              row_color_list = row_annotation, col_color_data = row_data,
              col_color_list = row_annotation, row_order = mix_order$indices,
              column_order = mix_order$indices)
```

Note in the second example, **s5** and **s19** look odd.

## Median Correlation

Lets also calculate the median correlation within each class.

### Good

```{r med_cor}
data_medcor <- median_correlations(data_cor, sample_info$class)
```

And plot it using facets in `ggplot2`.

```{r plot_med_good, fig.width=5}
ggplot(data_medcor, aes(x = sample_id, y = med_cor)) + geom_point() + 
  facet_grid(. ~ sample_class, scales = "free") + ggtitle("Good Data")
```

### Bad

```{r med_corbad}
mix_medcor <- median_correlations(mix_cor, sample_info$class)
```

Plot it!

```{r plot_med_mix, fig.width=5}
ggplot(mix_medcor, aes(x = sample_id, y = med_cor)) + geom_point() +
  facet_grid(. ~ sample_class, scales = "free") + ggtitle("Bad Dat")
```

## Proportion of Outlier Features

For every feature (the rows in our matrix), within each **sample-class**, calculate
the **trimmed** (remove the *x* highest and lowest values) *mean* and *sd*, which
features are outside a given number of *sd's*, and then calculate the proportion
of outliers in each sample. Samples with a deviated proportion of outliers should
be examined more closely.

If you are using **metabolomics** data with a large number of zeros, you probably
want to ignore zeros.

### Good

```{r calc_outliers}
data_outlier <- outlier_fraction(log_data, sample_info$class, remove_missing = NA)
```

Plot them!

```{r plot_good_outliers, fig.width=5}
ggplot(data_outlier, aes(x = sample_id, y = frac)) + geom_point() + 
  facet_grid(. ~ sample_class, scales = "free") + ggtitle("Good Data")
```

### Bad

```{r calc_mixoutliers}
mix_outlier <- outlier_fraction(mix_data, sample_info$class, remove_missing = 0)
```

Plot them.

```{r plot_mixoutliers, fig.width = 5}
ggplot(mix_outlier, aes(x = sample_id, y = frac)) + geom_point() + 
  facet_grid(. ~ sample_class, scales = "free") + ggtitle("Mix Data")
```

## Conclusion

Hopefully this has shown how you can use the various tools to examine your 
high-througput -omics data for potential problems.