---
title: "Testing for Left Censorship"
output: rmarkdown::html_vignette
author: "Robert M Flight and Hunter NB Moseley"
vignette: >
  %\VignetteIndexEntry{Testing for Left Censorship}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  \usepackage[utf8]{visdat, ggplot2}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(ICIKendallTau)
library(ggplot2)
```

## Why?

{ICIKendallTau} has a very specific assumption, that the missing values are **largely** due to being below the limit of detection, or the result of left-censorship.
Therefore, it should only be used **if** the missing values are from left-censorship.
Ideally, it would be nice to have a way to test for it.

## Strategy

To test this, we do the following (implemented in `test_left_censorship`).
For any feature that is missing in one or more samples in a class of samples, we check if the non-missing entries are below their sample's median values.
We count all of the entries below sample medians as successes, and the total number of possible entries as the number of trials in a binomial test, aggregating across all features that had a missing value.
We can then perform a one-tailed binomial test with the expectation that the successes are greater than 0.5.

## Fake Data

To start, we need to make some fake data that we can evaluate the test on.
We will make a smallish dataset, with 1000 features across 20 samples, and we will have 80% of the missing values be due to being left-censored.

We sort the initial data so we know where we can easily put missingness due to left-censoring.
We also use a log-normal distribution initially, just because.

```{r}
#| label: create-fake-data
set.seed(1234)
n_feature = 1000
n_sample = 20
n_miss = 100
n_low = 80
test_dataset = rlnorm(n_feature, 10, 1)
test_dataset = sort(test_dataset)
  
noisy_dataset = add_uniform_noise(log(test_dataset), n_sample, 0.1)
sample_medians = calculate_matrix_medians(noisy_dataset)
```

## Add Missingness

Now we can sample some low and high indices to add our missingness, as well as some samples to add them to.

```{r}
#| label: add-missingness
low_indices = sample(seq_len(300), n_low)
hi_indices = sample(seq(800, 1000), n_miss - n_low)
  
all_indices = c(low_indices, hi_indices)
sample_indices = sample(n_sample, n_miss, replace = TRUE)

missing_dataset = noisy_dataset
for (i_loc in seq_along(all_indices)) {
  missing_dataset[all_indices[i_loc], sample_indices[i_loc]] = NA
}
```

We can actually visualize the missingness using the `visdat` package.

```{r}
#| label: examine-missingness
#| fig-cap: Missing entries in the test dataset.
visdat::vis_miss(as.data.frame(missing_dataset))
```

## Test

Now we can actually test the missingness entries.

```{r}
#| label: test-missingness
missing_test = test_left_censorship(missing_dataset)
missing_test
```

## RNA-Seq Example

Gierliński et al ([link](https://doi.org/10.1093/bioinformatics/btv425)) did a really neat experiment in yeast, with a large number of replicates, 48 WT and 48 SNF2 deletion mutant.
We've included a summarized experiment as part of this package, and use it to demonstrate the left-censorship in a real data context.

```{r}
#| label: load-yeast
data(yeast_missing)
yeast_classes = rep(c("snf2", "wt"), each = 48)
```

```{r}
#| label: test-yeast
yeast_stats = test_left_censorship(yeast_missing, sample_classes = yeast_classes)
yeast_stats
```

In addition to testing, we can re-order the features based on their median ranking across samples.

```{r}
#| label: order-yeast
yeast_order = rank_order_data(yeast_missing, sample_classes = yeast_classes)
```

```{r}
#| label: fig-yeast-original
#| fig-cap: Missingness in yeast data set, no ordering.
visdat::vis_miss(as.data.frame(yeast_order$wt$original))
```

```{r}
#| label: fig-yeast-reordered
#| fig-cap: Missingness in yeast data set, after ordering features by rank.
visdat::vis_miss(as.data.frame(yeast_order$wt$ordered))
```

```{r}
#| label: fig-yeast-nna
#| fig-cap: Feature rank as a function of the number of missing values.
yeast_n_na = dplyr::bind_rows(yeast_order$wt$n_na_rank,
                              yeast_order$snf2$n_na_rank)
yeast_n_na |>
  ggplot(aes(x = n_na, y = median_rank)) +
  geom_point(size = 0.1, position = "jitter") +
  facet_wrap(~ split, nrow = 1)
```