Title: | Functionality for Characterizing Peaks in Mass Spectrometry in a Scan-Centric Manner |
---|---|
Description: | Provides functions and classes for detecting, characterizing, and integrating peaks in a scan-centric manner from direct-injection mass spectrometry data. |
Authors: | Robert M Flight [aut, cre] |
Maintainer: | Robert M Flight <[email protected]> |
License: | file LICENSE |
Version: | 0.3.65 |
Built: | 2024-11-11 06:17:42 UTC |
Source: | https://github.com/MoseleyBioinformaticsLab/ScanCentricPeakCharacterization |
takes a list from xmlToList for "run" and looks at whether all scans are positive, negative, or mixed
.get_scan_polarity(spectrum_list)
spectrum_list |
the list of spectra |
removes a list entry called ".attrs" from a list, and promotes its contents to first-level entries
.remove_attrs(in_list)
in_list |
the list to work on |
transform to data frame
.to_data_frame(in_list)
in_list |
the list of xml nodes to work on |
add scan level info
add_scan_info(mzml_data)
mzml_data |
the MSnbase mzml data object |
returns a data.frame with:
scanIndex
: the indices of the scans
scan
: the scan number. This will be used to name scans.
polarity
: +1 or -1 depending on if the scan is positive or negative
rtime
: the retention time, or the injection time of the scan for direct-injection data
tic
: the total intensity of the scan
rtime_lag
: how long between this scan and the previous scan
rtime_lead
: how long between this scan and the next scan
After running predict_frequency(), the following fields are added from the information returned from frequency conversion:
mad
: mean absolute deviation of residuals
frequency model coefficients
: the coefficients from the frequency fit, named according to the names in the fit description
mz model coefficients
: similar, but for the m/z model
data.frame, see Details
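A minimal usage sketch (the file name is hypothetical, and reading via MSnbase::readMSData is one way to get a suitable object):
library(MSnbase)
mzml_data = readMSData("sample.mzML", mode = "onDisk", msLevel. = 1)
scan_info = add_scan_info(mzml_data)
# inspect the described columns
head(scan_info[, c("scanIndex", "scan", "polarity", "rtime", "tic")])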
given an output object, filename, and zip file, write the output object to the file, and then add it to the zip file
add_to_zip(object, filename, zip_file)
object |
the object to write |
filename |
the file name it should be written to |
zip_file |
the zip file to add to |
a directory created by tempdir() is used to hold the file, which is then added to the zip file.
calculate the area based on summing the points
area_sum_points(peak_mz, peak_intensity, zero_value = 0)
peak_mz |
the mz in the peak |
peak_intensity |
the peak intensities |
zero_value |
what value actually represents zero |
numeric
characterize peaks from points and picked peaks
characterize_peaks(peak_region, calculate_peak_area = FALSE)
peak_region |
the PeakRegion object to work on |
list
check r2
check_frequency_r2(mz_frequency_list)
mz_frequency_list |
the list of predicted frequency data.frames |
Given M/Z point data in a data.frame, create IRanges based point "regions" of width 1, using the frequency_multiplier argument to convert from the floating point double to an integer.
check_ranges_convert_to_regions(frequency_list, frequency_multiplier = 400)
frequency_list |
a list of data.frames with a frequency column |
frequency_multiplier |
a value used to convert to integers. |
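A conceptual sketch of the multiplier trick on toy frequency values (not the package internals):
library(IRanges)
frequency = c(1000.1052, 1000.1077, 1000.1101)
# multiply by frequency_multiplier and round, giving integer positions of width 1
point_regions = IRanges(start = round(frequency * 400), width = 1)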
checks that the zip file has the basic contents it should have, and that files listed in the metadata actually exist.
check_zip_file(zip_dir)
zip_dir |
the directory of the unzipped data |
default single frequency model chooser
choose_frequency_model_builtin(sc_mzml)
sc_mzml |
the sc_mzml object |
This is the default function to choose a single frequency and mz model. It takes the scan_info after filtering scans, calculates the median of the square-root terms, and chooses the model closest to the median value.
Please examine this function and write your own if needed. You can view the function definition using choose_frequency_model_builtin
SCMzml
Given a data.frame of m/z, generate frequency values for the data.
convert_mz_frequency(mz_data, keep_all = TRUE)
mz_data |
a data.frame of m/z values |
keep_all |
keep all the variables generated, or just the original + frequency? |
The M/Z values from FTMS data do not have constant spacing between them. This produces challenges in working with ranged intervals and windows. The solution for FTMS data then is to convert them to frequency space. This is done by:
taking subsequent M/Z points
averaging their M/Z
taking the difference to get an offset value
dividing the averaged M/Z by the offset to generate frequency
taking subsequent differences of frequency points
keeping points with a difference in the supplied range as valid for modeling
After deciding on the valid points for modeling, each point gets an interpolated frequency value using the two averaged points to the left and right in M/Z.
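On adjacent m/z points, the steps above amount to the following toy calculation (an illustration, not the package internals):
mz = c(200.0000, 200.0015, 200.0030, 200.0046)
mean_mz = (mz[-1] + mz[-length(mz)]) / 2   # average adjacent M/Z points
offset = diff(mz)                          # difference between adjacent points
frequency = mean_mz / offset               # averaged M/Z divided by the offset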
list
mz_scans_to_frequency
Given a corrected SD, corrects the mean assuming that it is the result of a truncated normal distribution.
correct_mean(observed_mean, corrected_sd, fraction)
observed_mean |
the observed mean |
corrected_sd |
a corrected sd, generated by correct_variance() |
fraction |
the fraction of total observations |
corrected mean
https://en.wikipedia.org/wiki/Truncated_normal_distribution
correct_peak()
correct_variance()
Corrects an observed mean (intensity) and sd, assuming they come from a normal distribution that is truncated on one side only.
correct_peak(observed_mean, observed_sd, n_observed, n_should_observe)
observed_mean |
the observed mean |
observed_sd |
the observed sd |
n_observed |
how many observations went into this mean |
n_should_observe |
how many observations should there have been? |
data.frame, with corrected mean and sd
https://en.wikipedia.org/wiki/Truncated_normal_distribution
correct_mean()
correct_variance()
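A minimal usage sketch with made-up values, for a peak observed in 60 of 100 kept scans:
corrected = correct_peak(observed_mean = 1e6, observed_sd = 2e5,
                         n_observed = 60, n_should_observe = 100)
corrected   # data.frame with the corrected mean and sd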
correct peak height and sd
correct_peak_sd_height( original_height, list_of_heights, n_observed, n_should_observe )
original_height |
the original height estimate to correct |
list_of_heights |
the set of peak heights |
n_observed |
how many were observed |
n_should_observe |
how many should have been observed |
data.frame
Given a variance observed from a truncated normal distribution, correct it assuming that it should have had 100% of observations
correct_variance(observed_variance, fraction)
observed_variance |
the observed variance |
fraction |
what fraction was it observed in |
corrected variance
https://en.wikipedia.org/wiki/Truncated_normal_distribution
Given a point-point spacing and a frequency range, create IRanges based regions of specified width. Overlapping sliding regions can be created by specifying a region_size bigger than delta; adjacent tiled regions can be created by specifying a region_size == delta.
create_frequency_regions( point_spacing = 0.5, frequency_range = NULL, n_point = 10, delta_point = 1, multiplier = 500 )
point_spacing |
how far away are subsequent points. |
frequency_range |
the range of frequency to use |
n_point |
how many points you want to cover |
delta_point |
the step size between the beginning of each subsequent region |
multiplier |
multiplier to convert from frequency to integer space |
For Fourier-transform mass spec, points are equally spaced in frequency space, which will lead to unequal spacing in M/Z space. Therefore, we create regions using the point-point differences in frequency space.
What will be returned is an IRanges object, where the widths are constantly increasing over M/Z space.
IRanges
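A usage sketch with an assumed (made-up) frequency range, creating sliding regions 10 points wide that start one point apart:
regions = create_frequency_regions(point_spacing = 0.5,
                                   frequency_range = c(1e5, 5e5),
                                   n_point = 10, delta_point = 1,
                                   multiplier = 500)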
given a mantissa and exponent, returns the actual value as a numeric
create_value(mantissa, exponent)
mantissa |
the base part of the number |
exponent |
the exponent part |
numeric
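A tiny sketch, assuming the usual mantissa x 10^exponent reconstruction:
create_value(1.234, 3)   # expected: 1234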
Given a MasterPeakList object and the MultiScansPeakList that generated it, correct the m/z values using offset predictions
default_correct_offset_function( master_peak_list, multi_scan_peaklist, min_scan = 0.1 )
master_peak_list |
the MasterPeakList object of correspondent peaks |
multi_scan_peaklist |
the MultiScansPeakList to be corrected |
min_scan |
what is the minimum number of scans a peak should be in to be used for correction. |
list
The offset predictor using loess
default_offset_predict_function(model, x)
model |
the model to use |
x |
the new values |
numeric
There may be good reasons to turn the logging off after it's been turned on. This basically tells the package that the logger isn't available.
disable_logging()
Choose to enable logging, to a specific file if desired.
enable_logging(log_file = NULL, memory = FALSE)
log_file |
the file to log to |
memory |
provide memory logging too? Only available on Linux and MacOS |
Uses the logger package under the hood, which is suggested in the dependencies. Having logging enabled is nice to see when things are starting and stopping, and what exactly has been done, without needing to write messages to the console. It is especially useful if you are getting errors but can't really see them; in that case you can add memory logging to see if you are running out of memory.
Default log file has the pattern:
YYYY.MM.DD.HH.MM.SS_ScanCentricPeakCharacterization_run.log
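A short usage sketch (the log file name here is arbitrary):
enable_logging("scpc_run.log", memory = TRUE)   # memory logging: Linux and MacOS only
# ... run the characterization ...
disable_logging()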
Often we want to transform a number into its exponential representation, having the number itself and the number of decimal places. This function provides that functionality.
extract(x)
x |
the number to extract the parts from |
list
Given a Thermo ".raw" file, attempts to extract the "method" definition from a translated hexdump of the file.
extract_raw_method(in_file, output_type = "data.frame")
in_file |
The Thermo raw file to extract |
output_type |
string, data.frame or json |
string or data.frame
built in filter scan function
filter_scans_builtin(sc_mzml)
sc_mzml |
the sc_mzml object |
This is the built in filtering and removing outliers function. It is based on the Moseley groups normal samples and experience. However, it does not reflect everyone's experience and needs. We expect that others have different use cases and needs, and therefore they should create their own function and use it appropriately.
Please examine this function and write your own as needed.
It must take an SCMzml object, work on the scan_info slot, and then create a column with the name "keep" denoting which scans to keep.
To view the current definition, you can do filter_scans_builtin
SCMzml
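A rough sketch of a custom filter function; the scan_info columns used here (rtime, tic) come from add_scan_info(), the cutoffs are made up, and whether to modify the object in place or return it should be checked against filter_scans_builtin itself:
my_filter = function(sc_mzml) {
  scan_info = sc_mzml$scan_info
  # keep scans inside an assumed injection-time window that have reasonable signal
  scan_info$keep = scan_info$rtime <= 450 & scan_info$tic > 1e7
  sc_mzml$scan_info = scan_info
  sc_mzml
}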
Given some regions and point_regions, find the regions that actually should contain real data. See details for an explanation of what is considered real.
find_signal_regions( regions, point_regions_list, region_percentile = 0.99, multiplier = 1.5, n_point_region = 2000 )
regions |
the regions we want to query |
point_regions_list |
the individual points |
region_percentile |
the cumulative percentile cutoff to use |
multiplier |
how much above base quantiles to use (default = 1.5) |
n_point_region |
how many points make up a large segment to do percentile on? |
IRanges
Given a set of frequency points in a data.frame, create IRanges based point "regions" of width 1, using the multiplier to convert from a floating point double to an integer.
frequency_points_to_frequency_regions( frequency_data, frequency_variable = "frequency", multiplier = 400 )
frequency_data |
a data.frame of frequency points |
frequency_variable |
which column is the frequency column |
multiplier |
value used to convert to integers |
given the peak, returns the location and intensity
get_fitted_peak_info( possible_peak, use_loc = "mz", w = NULL, addend = 1e-08, calculate_peak_area = FALSE )
possible_peak |
data.frame of mz, intensity and log intensity |
use_loc |
which field to use for locations, default is "mz" |
w |
the weights to use for the points |
addend |
how much was added to the peak intensity |
calculate_peak_area |
should the area of the peak be calculated too? |
list
extract mzML header
get_mzml_header(mzml_file)
mzml_file |
the mzML file to get the header from |
get mzML metadata
get_mzml_metadata(mzml_file)
mzml_file |
the mzML file to get metadata from |
figures out which metadata function to run, and returns back the metadata generated by it.
get_raw_ms_metadata(in_file)
in_file |
the file to use |
list
import JSON from a file correctly, handling cases where things get written differently
import_json(json_file)
json_file |
the json file to read |
list
function to import mzml mass spec data in a way that provides what we need to work with it. mzml_data should be the full path to the data.
import_sc_mzml(mzml_data, ms_level = 1)
mzml_data |
the mzml mass spec file to import |
ms_level |
which MS-level data to import |
MSnbase
Given a directory of characterized samples, attempts to determine which peaks may be standards or contaminants that should be removed after assignment.
indicate_standards_contaminents( zip_dir, file_pattern = ".zip", blank_pattern = "^blank", save_dir = NULL, conversion_factor = 400, progress = TRUE )
zip_dir |
which directories to look for files within |
file_pattern |
what files are we actually using |
blank_pattern |
regex indicating that a sample may be a blank |
save_dir |
where to save the files (default is to overwrite originals) |
conversion_factor |
how much to multiply frequencies by |
progress |
should progress messages be displayed? |
For each sample, the scan level frequencies are read in and converted to ranges, and then compared with tiled ranges over the whole frequency range. Ranges that have 90 to 110% of scan level peaks in ALL blanks, and 10 to 110% of scan level peaks in at least N-sample - 1 samples, are considered possible standards or contaminants. The peak is marked so that it can be removed by filtering out its assignments later.
NULL, nothing is returned; files are overwritten
initialize metadata from mzML
initialize_metadata_from_mzml(zip_dir, mzml_file)
zip_dir |
the directory containing unzipped data |
mzml_file |
the mzML file to extract metadata from |
initialize metadata
initialize_zip_metadata(zip_dir)
zip_dir |
the temp directory that represents the final zip |
provides the area integration for the peak that fits the parabolic model
integrate_model(model_mz, model_coeff, n_point = 100, log_transform = "log")
model_mz |
the mz values for the model peak |
model_coeff |
the model of the peak |
n_point |
how many points to use for integration |
log_transform |
what kind of transform was applied |
numeric
provides the ability to calculate the area on the sides of a peak that is not captured by the parabolic model, assuming a triangle on each side of the parabola
integrate_sides(peak_mz, peak_int, full_peak_loc, model_peak_loc)
peak_mz |
the mz in the peak |
peak_int |
the intensity in the peak |
full_peak_loc |
what defines all of the peak |
model_peak_loc |
what defined the peak fitting the parabolic model |
numeric
gives the area of the peak based on integrating the model bits and the sides
integration_based_area( mz_data, int_data, full_peak_loc, model_peak_loc, model_coeff, n_point = 100, log_transform = "log" )
mz_data |
peak mz values |
int_data |
peak intensity values |
full_peak_loc |
indices defining the full peak |
model_peak_loc |
indices defining the model peak |
model_coeff |
the model of the peak |
n_point |
number of points for integration of the model section |
log_transform |
which log transformation was used |
takes json representing a PeakList object, and generates the data.frame version
json_2_peak_list(json_string, in_var = "Peaks")
json_string |
the json to convert |
in_var |
the top level variable containing the "Peaks" |
tbl_df
Given a json file or list of lists, return a data.frame with the most important bits of the data.
json_mzML_2_df(in_file)
in_file |
the file to read from |
data.frame
lists_2_json
lists_2_json( lists_to_save, zip_file = NULL, digits = 8, temp_dir = tempfile(pattern = "json") )
lists_to_save |
the set of lists to create the json from |
zip_file |
should the JSON files be zipped into a zip file? Provide the zip file name |
digits |
how many digits to use for the JSON representation |
temp_dir |
temp directory to write the JSON files to |
character
given a zip and a metadata file, load it and return it
load_metadata(zip_dir, metadata_file)
zip_dir |
the directory of the unzipped data |
metadata_file |
the metadata file |
list
Given a loess model, creates a data.frame suitable for plotting via ggplot2
loess_to_df(loess_model)
loess_model |
the model object generated by loess |
data.frame
Logs the amount of memory being used to a log file if it is available, and generates warnings if the amount of RAM hits zero.
log_memory()
If a log_appender is available, logs the given message at the info level.
log_message(message_string)
message_string |
the string to put in the message |
performs a log-transform while adding a small value to the data based on finding the smallest non-zero value in the data
log_with_min(data, min_value = NULL, order_mag = 3, log_fun = log)
data |
the data to work with |
min_value |
the minimum value |
order_mag |
how many orders of magnitude smaller should the min value be? |
log_fun |
what log function to use for the transformation |
matrix
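Conceptually, the transform looks something like this (a sketch, not the package internals):
x = c(0, 5, 120, 3400)
small = min(x[x > 0]) / 10^3   # smallest non-zero value, 3 orders of magnitude smaller
log_x = log(x + small)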
export the list metadata to a json string
meta_export_json(meta_list)
meta_list |
a list of metadata |
use the derivative of the parabolic equation to find the peak center, and then put the center into the equation to find the intensity at that point.
model_peak_center_intensity(x, coefficients)
x |
the x-values to use (non-centered) |
coefficients |
the model coefficients generated from centered model |
The coefficients are generated using the linear model y = a + b*x + c*x^2.
The derivative of this is dy/dx = b + 2*c*x.
The peak of a parabola is defined where the derivative is zero, i.e. at x = -b / (2*c).
We can use this to derive where the center of the peak is, and then put the center value back into the equation to get the intensity.
numeric
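A rough sketch of the idea, assuming the coefficients are ordered (intercept, linear, quadratic) and that the model was fit on mean-centered x:
peak_center_intensity = function(x, coefficients) {
  intercept = coefficients[1]
  linear = coefficients[2]
  quadratic = coefficients[3]
  center_centered = -linear / (2 * quadratic)   # where the derivative is zero
  height = intercept + linear * center_centered + quadratic * center_centered^2
  c(center = unname(center_centered + mean(x)), height = unname(height))
}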
Given a query, and either two values of M/Z and two values of frequency or a previously generated model, return a data.frame with the predicted value, and the slope and the intercept so the model can be re-used later for other points when needed.
mz_frequency_interpolation( mz_query, mz_values = NULL, frequency_values = NULL, model = NULL )
mz_query |
the M/Z value to fit |
mz_values |
two M/Z values |
frequency_values |
two frequency values |
model |
a model to use instead of actual values |
data.frame with predicted_value, intercept, and slope
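A usage sketch with made-up m/z and frequency pairs:
interp = mz_frequency_interpolation(mz_query = 200.0008,
                                    mz_values = c(200.0000, 200.0015),
                                    frequency_values = c(66733.8, 66733.3))
interp$predicted_value   # interpolated frequency for the query m/z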
Given a multi-scan data.frame of m/z, generate frequency values for the data.
mz_scans_to_frequency( mz_df_list, frequency_fit_description, mz_fit_description, ... )
mz_df_list |
a list of data.frames with at least an mz column |
frequency_fit_description |
the exponentials to use in fitting the frequency ~ mz model |
mz_fit_description |
the exponentials to use in fitting the mz ~ frequency model |
... |
other parameters for convert_mz_frequency |
list
convert_mz_frequency
given an mzML file, create the initial zip file containing the zipped mzML, metadata.json, and mzml_metadata.json. This zip file is what will be operated on by anything that accesses files, so that our interface is consistent.
mzml_to_zip(mzml_file, out_file)
mzml_file |
the mzML file to zip up |
out_file |
the directory to save the zip file |
calculates the coefficients of a parabolic fit (y ~ x + x^2) of y to x
parabolic_fit(x, y, w = NULL)
x |
the x-values, independent |
y |
the y-values, dependent |
w |
weights |
list
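For intuition, the described fit is equivalent to something like this lm() call (a sketch; the package's weighted implementation may differ):
parabola_sketch = function(x, y, w = NULL) {
  fit = lm(y ~ x + I(x^2), weights = w)
  coef(fit)   # intercept, linear, and quadratic coefficients
}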
takes a PeakList object, and generates a json version
peak_list_2_json(peak_list)
peak_list |
a data.frame or tbl_df to convert |
json_string
calculate r2
predicted_frequency_r2(mz_frequency_df)
mz_frequency_df |
the data.frame with predicted frequencies |
When raw files are copied, we also generate metadata about their original locations and new locations, and some other useful info. We would like to capture it, and keep it along with the metadata from the mzml file. So, given a list of mzml files, and a location for the raw files, this function creates metadata json files for the mzml files.
raw_metadata_mzml(mzml_files, raw_file_loc, recursive = TRUE)
mzml_files |
the paths to the mzml files |
raw_file_loc |
the directory holding raw files and json metadata files |
recursive |
should we go recursively down the directories or not (default = TRUE) |
Given a previously generated zip file of characterized peaks, now we've realized that the offsets on each peak should be somehow different. This function takes a zip file, adjusts the offsets, and writes the file back out.
recalculate_offsets(in_zip, offset = 2, out_file = in_zip)
in_zip |
the zip file to work with |
offset |
the offset to use |
out_file |
the file to write to (optional) |
Given a data.frame or character vector of files to run characterization on, processes them in sequence and saves the results to a particular location.
run_mzml_list( mzml_files, json_files = NULL, progress = TRUE, save_loc = ".", ... )
mzml_files |
the list of mzML files to use |
json_files |
the list of corresponding json meta-data files |
progress |
whether to give messages about the progress of things |
save_loc |
where should the output files be saved |
... |
other parameters for |
list
determine sample run time
sample_run_time(zip, units = "m")
zip |
the zip object you want to use |
units |
what units should the run time be in? (s, m, h) |
data.frame with sample, start and end time
make a new SCZip
sc_zip( in_file, mzml_meta_file = NULL, out_file = NULL, load_raw = TRUE, load_peak_list = TRUE )
in_file |
the file to use (either .zip or .mzML) |
mzml_meta_file |
metadata file (.json) |
out_file |
the file to save to at the end |
load_raw |
logical to load the raw data |
load_peak_list |
to load the peak list if it exists |
SCZip
The ScanCentricPeakCharacterization package provides several classes and functions for working with direct injection, high-resolution mass spectrometry data.
Peak characterization control
Peak characterization associates data with the SCZip and SCPeakRegionFinder, and controls their execution.
found_peaks
peaks found by a function
id
a holder for the ID of the sample
frequency_fit_description
the model for conversion to frequency
mz_fit_description
the model for converting back to m/z
calculate_peak_area
whether to calculate peak area or not
sc_peak_region_finder
the peak finder object
sc_zip
the SCZip that represents the final file
in_file
the input file
metadata_file
the metadata file
out_file
where everything gets saved
temp_loc
where intermediates get saved
load_file()
Loads the mzml data into the SCZip
SCCharacterizePeaks$load_file()
filter_scans()
Filter the scans in data.
SCCharacterizePeaks$filter_scans()
choose_frequency_model()
Choose the single frequency model.
SCCharacterizePeaks$choose_frequency_model()
prepare_mzml_data()
Prepare the mzml data.
SCCharacterizePeaks$prepare_mzml_data()
set_frequency_fit_description()
Set the frequency fit description
SCCharacterizePeaks$set_frequency_fit_description(frequency_fit_description)
frequency_fit_description
the frequency model description
set_mz_fit_description()
Set the mz fit description
SCCharacterizePeaks$set_mz_fit_description(mz_fit_description)
mz_fit_description
the m/z model description
generate_filter_scan_function()
Sets the scan filtering and check for outlier function.
SCCharacterizePeaks$generate_filter_scan_function( rtime = NA, y.freq = NA, f_function = NULL )
rtime
retention time limits of scans to keep
y.freq
y-frequency coefficient limits of scans to keep (NA)
f_function
a full function to set as the filtering function
generate_choose_frequency_model_function()
Sets the function for choosing a single frequency model
SCCharacterizePeaks$generate_choose_frequency_model_function(f_function = NULL)
f_function
the function for choosing a single model
predict_frequency()
Run frequency prediction
SCCharacterizePeaks$predict_frequency()
check_frequency_model()
Check the frequency model
SCCharacterizePeaks$check_frequency_model()
get_frequency_data()
Get the frequency data from the SCMzml bits
SCCharacterizePeaks$get_frequency_data()
scan_info()
Get the SCMzml$scan_info out
SCCharacterizePeaks$scan_info()
find_peaks()
Do the peak characterization without saving
SCCharacterizePeaks$find_peaks(stop_after_initial_detection = FALSE)
stop_after_initial_detection
should it stop after the initial peak finding
summarize()
Generates the JSON output summary.
SCCharacterizePeaks$summarize()
save_peaks()
Saves the peaks and JSON to the temp file
SCCharacterizePeaks$save_peaks()
write_zip()
Write the zip file
SCCharacterizePeaks$write_zip()
run_all()
Runs all of the pieces for peak characterization in order
SCCharacterizePeaks$run_all( filter_scan_function = NULL, choose_frequency_model_function = NULL )
filter_scan_function
the scan filtering function
choose_frequency_model_function
the function for choosing a frequency model
prep_data()
Loads and preps the data for characterization
SCCharacterizePeaks$prep_data()
add_regions()
Adds initial regions for finding real peak containing regions
SCCharacterizePeaks$add_regions()
run_splitting()
Does initial region splitting and peak finding in scans
SCCharacterizePeaks$run_splitting()
new()
Creates a new SCCharacterizePeaks class
SCCharacterizePeaks$new( in_file, metadata_file = NULL, out_file = NULL, temp_loc = tempfile("scpcms"), frequency_fit_description = NULL, mz_fit_description = NULL, filter_remove_outlier_scans = NULL, choose_single_frequency_model = NULL, sc_peak_region_finder = NULL, calculate_peak_area = FALSE )
in_file
the mass spec data file to use (required)
metadata_file
a json metadata file (optional)
out_file
where to save the final zip container
temp_loc
a specified temporary location
frequency_fit_description
mz -> frequency model
mz_fit_description
frequency -> mz model
filter_remove_outlier_scans
function for scan filtering
choose_single_frequency_model
function to choose a single frequency model
sc_peak_region_finder
a blank SCPeakRegionFinder to use instead of the default
calculate_peak_area
should peak areas be returned as well as height?
clone()
The objects of this class are cloneable with this method.
SCCharacterizePeaks$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run:
lipid_sample = system.file("extdata", "lipid_example.mzML", package = "ScanCentricPeakCharacterization")
sc_char = SCCharacterizePeaks$new(lipid_sample)
# prep data and check model
library(ggplot2)
library(patchwork)
sc_char$load_file()
sc_char$generate_filter_scan_function()
sc_char$generate_choose_frequency_model_function()
sc_char$prepare_mzml_data()
sc_char$check_frequency_model()
# run characterization
save_loc = "test.zip"
sc_char = SCCharacterizePeaks$new(lipid_sample, out_file = save_loc)
sc_char$run_all()
## End(Not run)
mzML mass spectrometry data container with some useful methods.
Provides our own container for mzML data, and does conversion to frequency, filtering scans, choosing a single frequency regression model, and generating the frequency data for use in the peak characterization.
mzml_file
the mzml file location
mzml_metadata
metadata from an external json file
mzml_data
the actual mzml data from MSnbase
mzml_df_data
a list of data.frames of the data
scan_range
the range of scans to be used
rtime_range
the range of retention times to keep
mz_range
the mz range to use
scan_info
data.frame of scan information
remove_zero
should zero intensity data points be removed?
frequency_fit_description
the model for m/z -> frequency
mz_fit_description
the model for going from frequency -> m/z
frequency_coefficients
the coefficients for the frequency model
mz_coefficients
the coefficients for the m/z model
ms_level
which MS level will we be using from the mzml file?
memory_mode
how will the mzml data be worked with to start, inMemory or onDisk?
difference_range
how wide to consider adjacent frequency points as good
choose_frequency_model_function
where the added model selection function will live
filter_scan_function
where the added filter scan function will live.
choose_single_frequency_model
function to choose a single frequency model
import_mzml()
import the mzml file defined
SCMzml$import_mzml( mzml_file = self$mzml_file, ms_level = self$ms_level, memory_mode = self$memory_mode )
mzml_file
what file are we reading in?
ms_level
which ms level to import (default is 1)
memory_mode
use inMemory or onDisk mode
extract_mzml_data()
get the mzml data into data.frame form so we can use it
SCMzml$extract_mzml_data(remove_zero = self$remove_zero)
remove_zero
whether to remove zero intensity points or not
predict_frequency()
predict frequency and generate some summary information. This does regression of frequency ~ m/z for each scan separately.
SCMzml$predict_frequency( frequency_fit_description = self$frequency_fit_description, mz_fit_description = self$mz_fit_description )
frequency_fit_description
the regression model definition
mz_fit_description
the regression model definition
convert_to_frequency()
actually do the conversion of m/z to frequency
SCMzml$convert_to_frequency()
choose_frequency_model()
choose a frequency model using the previously added function
SCMzml$choose_frequency_model()
generate_choose_frequency_model_function()
generate a frequency model choosing function and attach it
SCMzml$generate_choose_frequency_model_function(f_function = NULL)
f_function
the function you want to pass in
Creates a new function that accesses the scan_info slot of an SCMzml object after conversion to frequency space, and chooses a single model based on the information there.
filter_scans()
filter the scans using the previously added function
SCMzml$filter_scans()
generate_filter_scan_function()
generate a filter function and attach it
SCMzml$generate_filter_scan_function( rtime = NA, y.freq = NA, f_function = NULL )
rtime
retention time limits of scans to keep (NA)
y.freq
y-frequency coefficient limits of scans to keep (NA)
f_function
a full function to set as the filtering function
Creates a new function that accesses the scan_info slot of an SCMzml object, filters the scans by their retention-time and y-frequency coefficients, tests for outliers in the y-frequency coefficients, and denotes which scans will be kept for further processing.
NA means no filtering will be done; one-sided limits, e.g. (NA, 10) or (10, NA), imply filtering <= or >=, respectively.
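For example, to keep scans with retention time at or below 450 while leaving the y-frequency coefficients unfiltered (the cutoff is made up, and sc_mzml is an existing SCMzml object):
sc_mzml$generate_filter_scan_function(rtime = c(NA, 450), y.freq = NA)
sc_mzml$filter_scans()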
check_frequency_model()
check how well a given frequency model works for this data
SCMzml$check_frequency_model(scan = NULL, as_list = FALSE)
scan
which scan to show predictions for
as_list
whether plots should be returned as a single plot or a list of plots
get_instrument()
get instrument data from associated mzml file metadata
SCMzml$get_instrument()
get_frequency_data()
get the frequency data to go into the next steps of analysis.
SCMzml$get_frequency_data()
new()
SCMzml$new( mzml_file, frequency_fit_description = c(a.freq = 0, x.freq = -1, y.freq = -1/2, z.freq = -1/3), mz_fit_description = c(a.mz = 0, x.mz = -1, y.mz = -2, z.mz = -3), metadata_file = NULL, scan_range = NULL, rtime_range = NULL, mz_range = NULL, remove_zero = FALSE, ms_level = 1, memory_mode = "inMemory" )
mzml_file
the file to load and use
frequency_fit_description
a description of the regression model for frequency ~ m/z
mz_fit_description
a description of the regression model for m/z ~ frequency
metadata_file
a metadata file generated by ...
scan_range
which scans can be used for analysis
rtime_range
the retention time to use for scans
mz_range
what m/z range to use
remove_zero
should zero intensity data be removed?
ms_level
what MS level should be extracted (default is 1)
memory_mode
what memory mode should MSnbase be using (inMemory or onDisk)
clone()
The objects of this class are cloneable with this method.
SCMzml$clone(deep = FALSE)
deep
Whether to make a deep clone.
choose_single_frequency_model_default()
## Not run:
lipid_sample = system.file("extdata", "lipid_example.mzML", package = "ScanCentricPeakCharacterization")
## End(Not run)
R6 Peak Region Finder
R6 Peak Region Finder
Think of it like managing all the stuff that needs to happen to find the peaks in the regions.
run_time
how long did the process take
start_time
when did we start
stop_time
when did we stop
peak_regions
SCPeakRegions object
sliding_region_size
how big are the sliding regions in data points
sliding_region_delta
how much space between sliding region starts
quantile_multiplier
how much to multiply quantile based cutoff by
n_point_region
how many points are there in the big tiled regions for quantile based cutoff
tiled_region_size
how wide are the tiled regions in data points
tiled_region_delta
how far in between each tiled region
region_percentile
??
peak_method
what method to extract peak center, height, area, etc
min_points
how many points wide does a peak have to be to get characterized
sample_id
what sample are we processing
n_zero_tiles
how many zero count tiled regions split up a region into multiple peaks?
zero_normalization
do we want to pretend to do normalization
calculate_peak_area
should peak area be calculated as well?
add_regions()
Add the sliding and tiled regions
SCPeakRegionFinder$add_regions()
reduce_sliding_regions()
Find the regions most likely to contain real signal
SCPeakRegionFinder$reduce_sliding_regions()
split_peak_regions()
Split up signal regions by peaks found
SCPeakRegionFinder$split_peak_regions( use_regions = NULL, stop_after_initial_detection = FALSE )
use_regions
an index of the regions we want to split up
stop_after_initial_detection
should it do full characterization or stop
remove_double_peaks_in_scans()
Check for the presence of two peaks with the same scan number in each region and remove them. Any region with zero peaks left is removed.
SCPeakRegionFinder$remove_double_peaks_in_scans()
normalize_data()
Normalize the intensity data
SCPeakRegionFinder$normalize_data(which_data = "both")
which_data
raw, characterized, or both (default)
find_peaks_in_regions()
Find the peaks in the regions.
SCPeakRegionFinder$find_peaks_in_regions()
model_mzsd()
Model the m/z standard deviation.
SCPeakRegionFinder$model_mzsd()
model_heightsd()
Model the intensity height standard deviation.
SCPeakRegionFinder$model_heightsd()
indicate_high_frequency_sd()
Look for peaks with higher than expected frequency standard deviation.
SCPeakRegionFinder$indicate_high_frequency_sd()
add_data()
Add the data from an SCMzml object to the underlying SCPeakRegions object.
SCPeakRegionFinder$add_data(sc_mzml)
sc_mzml
the SCMzml object being passed in
summarize_peaks()
Summarize the peaks to go into JSON form.
SCPeakRegionFinder$summarize_peaks()
add_offset()
Add an offset based on width in frequency space to m/z to describe how wide the peak is.
SCPeakRegionFinder$add_offset()
sort_ascending_mz()
Sort the data in m/z order, as the default is frequency order
SCPeakRegionFinder$sort_ascending_mz()
characterize_peaks()
Run the overall peak characterization from start to finish.
SCPeakRegionFinder$characterize_peaks(stop_after_initial_detection = FALSE)
stop_after_initial_detection
do we stop the whole process after finding initial peaks in each scan?
summarize()
Summarize everything for output to the zip file after completion.
SCPeakRegionFinder$summarize( package_used = "package:ScanCentricPeakCharacterization" )
package_used
which package is being used for this work.
peak_meta()
Generate the meta data that goes into the accompanying JSON file.
SCPeakRegionFinder$peak_meta()
new()
Make a new SCPeakRegionFinder object.
SCPeakRegionFinder$new( sc_mzml = NULL, sliding_region_size = 10, sliding_region_delta = 1, tiled_region_size = 1, tiled_region_delta = 1, region_percentile = 0.99, offset_multiplier = 1, frequency_multiplier = 400, quantile_multiplier = 1.5, n_point_region = 2000, peak_method = "lm_weighted", min_points = 4, n_zero_tiles = 1, zero_normalization = FALSE, calculate_peak_area = FALSE )
sc_mzml
the SCMzml object to use (can be missing)
sliding_region_size
how wide to make the sliding regions in data points
sliding_region_delta
how far apart are the starting locations of the sliding regions
tiled_region_size
how wide are the tiled regions
tiled_region_delta
how far apart are the tiled regions
region_percentile
cumulative percentile cutoff to use
offset_multiplier
what offset multiplier should be used
frequency_multiplier
how much to multiply frequency points to interval ranges
quantile_multiplier
how much to adjust the quantile cutoff by
n_point_region
how many points in the large tiled regions
peak_method
the peak characterization method to use (lm_weighted)
min_points
how many points to say there is a peak (4)
n_zero_tiles
how many tiles in a row do there need to be to split things up? (1)
zero_normalization
don't actually do normalization (FALSE)
calculate_peak_area
should peak area as well as peak height be returned? (FALSE)
clone()
The objects of this class are cloneable with this method.
SCPeakRegionFinder$clone(deep = FALSE)
deep
Whether to make a deep clone.
Holds all the peak region data
Holds all the peak region data
This reference class represents the peak region data.
frequency_point_regions
the frequency data
frequency_fit_description
the model of frequency ~ m/z
mz_fit_description
the model of m/z ~ frequency
peak_regions
the peak regions
sliding_regions
the sliding regions used for density calculations
tiled_regions
the tiled regions used for grouping and splitting peak regions
peak_region_list
list of regions
frequency_multiplier
how much to multiply frequency by to make interval points
scan_peaks
the peaks by scans
peak_data
the data.frame of final peak data
scan_level_arrays
scan level peak data as matrices
is_normalized
are the peak intensities normalized
normalization_factors
the normalization factors calculated
n_scan
how many scans are we working with
scans_per_peak
??
scan_perc
what percentage of scans is a minimum
min_scan
based on scan_perc, how many scans minimum does a peak need to be in
max_subsets
??
scan_subsets
??
frequency_range
what is the range in frequency space
scan_correlation
??
keep_peaks
which peaks are we keeping out of all the peaks we had
peak_index
the indices for the peaks
scan_indices
the names of the scans
instrument
the instrument serial number if available
set_min_scan()
sets the minimum number of scans to use
SCPeakRegions$set_min_scan()
add_data()
Adds the data from an SCMzml object to the SCPeakRegions object.
SCPeakRegions$add_data(sc_mzml)
sc_mzml
the SCMzml object being passed in
new()
Creates a new SCPeakRegions object
SCPeakRegions$new( sc_mzml = NULL, frequency_multiplier = 400, scan_perc = 0.1, max_subsets = 100 )
sc_mzml
the SCMzml object to get data from
frequency_multiplier
how much to multiply frequency by
scan_perc
what fraction of scans a peak must be in to count as a "peak"
max_subsets
??
clone()
The objects of this class are cloneable with this method.
SCPeakRegions$clone(deep = FALSE)
deep
Whether to make a deep clone.
Represents the zip mass spec file
Represents the zip mass spec file
This reference class represents the zip mass spec file. It does this by providing objects for the zip file, the metadata, as well as various bits underneath such as the mzml data and peak lists, and their associated metadata. Although it is possible to work with the SCZip object directly, it is heavily recommended to use the SCCharacterizePeaks object for carrying out the various steps of an analysis, including peak finding.
zip_file
the actual zip file
zip_metadata
the metadata about the zip file
metadata
the metadata itself
metadata_file
the metadata file
sc_mzml
the mzML data object.
peaks
??
sc_peak_region_finder
the peak finder object
json_summary
jsonized summary of the peak characterization
id
the identifier of the sample
out_file
where to put the final file
temp_directory
where we keep everything until peak characterization is done
load_mzml()
Loads the mzML file
SCZip$load_mzml()
load_sc_peak_region_finder()
Loads the SCPeakRegionFinder object
SCZip$load_sc_peak_region_finder()
save_json()
Save the jsonized summary out to actual json files
SCZip$save_json()
save_sc_peak_region_finder()
Saves the SCPeakRegionFinder binary object
SCZip$save_sc_peak_region_finder()
load_peak_list()
loads just the peak list data-frame instead of peak region finder
SCZip$load_peak_list()
compare_mzml_corresponded_densities()
compare peak densities
SCZip$compare_mzml_corresponded_densities( mz_range = c(150, 1600), window = 1, delta = 0.1 )
mz_range
the mz range to work over
window
the window size in m/z
delta
how much to move the window
new()
Create a new SCZip object.
SCZip$new( in_file, mzml_meta_file = NULL, out_file = NULL, load_mzml = TRUE, load_peak_list = TRUE, temp_loc = tempfile("scpcms") )
in_file
the mzML file to load
mzml_meta_file
an optional metadata file
out_file
where to save the final file
load_mzml
should the mzML file actually be loaded into an SCMzml object?
load_peak_list
should the peak list be loaded if this is previously characterized?
temp_loc
where to make the temp file while working with the data
show_temp_dir()
Show the temp directory where everything is being worked with
SCZip$show_temp_dir()
write_zip()
Write the zip file
SCZip$write_zip(out_file = NULL)
out_file
where to save the zip file
cleanup()
delete the temp directory
SCZip$cleanup()
finalize()
delete when things are done
SCZip$finalize()
add_peak_list()
Add peak list data to the temp directory
SCZip$add_peak_list(peak_list_data)
peak_list_data
the peak list data
clone()
The objects of this class are cloneable with this method.
SCZip$clone(deep = FALSE)
deep
Whether to make a deep clone.
Allows the user to set which mapping function is being used internally in the functions.
set_internal_map(map_function = NULL)
map_function |
which function to use, assigns it to an internal object |
by default, the package uses purrr::map to iterate over things. However, if you have the furrr package installed, you could switch it to use furrr::future_map instead.
## Not run:
library(furrr)
future::plan(multicore)
set_internal_map(furrr::future_map)
## End(Not run)
Allows the user to turn progress messages to the console on and off. Default is to provide messages to the console.
show_progress(progress = TRUE)
progress |
logical to have it on or off |
Does a single pass of normalizing scans to each other.
single_pass_normalization( scan_peaks, intensity_measure = c("RawHeight", "Height"), summary_function = median, use_peaks = NULL, min_ratio = 0.7 )
scan_peaks |
the scan peaks to normalize |
intensity_measure |
which intensities to normalize |
summary_function |
which function to use to calculate summaries (median) |
use_peaks |
which peaks to use for normalization |
min_ratio |
what ratio of maximum intensity of peaks should we use for normalization |
scan_peaks list
Given a region that should contain signal, and the point data within it, find the peaks, and return the region, and the set of points that make up each peak from each scan.
split_region_by_peaks( region_list, min_points = 4, metadata = NULL, calculate_peak_area = FALSE )
region_list |
a list with points and tiles IRanges objects |
min_points |
how many points are needed for a peak |
metadata |
metadata that tells how things should be processed |
list
returns the sum of squared residuals from an lm object
ssr(object)
object |
the lm object |
numeric
given a set of original and fitted values and a transform, return a set of transformed residuals.
transform_residuals(original, fitted, transform = exp)
original |
the original points |
fitted |
the fitted points |
transform |
the function that should be used to transform the values |
numeric
given a zip file, list the contents
zip_list_contents(zip_file)
zip_file |
the zip file |