from kendrick import read_raw
Exploring half a million molecules
Reading our first .raw file
The first step is to read a .raw file containing (already centroided) ASAP-HRMS data. The read_raw() function loads the data and returns two dataframes, one for positive mode and one for negative mode. This function is based on the pyRawTools Python package.
raw_file = '/home/frank/Work/DATA/kendrick-data/Ref0443_casein_asap01.RAW'
df_pos, df_neg = read_raw(raw_file)
Let’s focus on the positive mode data for now. Here is what the first and last rows of the dataframe look like.
df_pos
Scan | RT | mz | inty |
---|---|---|---|
1 | 0.003945 | 125.023285 | 10221.680664 |
1 | 0.003945 | 125.059830 | 81648.109375 |
1 | 0.003945 | 125.132690 | 35949.582031 |
1 | 0.003945 | 126.062935 | 6891.418945 |
1 | 0.003945 | 126.099777 | 18121.056641 |
... | ... | ... | ... |
390 | 3.005163 | 546.237732 | 35105.105469 |
390 | 3.005163 | 548.254578 | 52922.070312 |
390 | 3.005163 | 596.266724 | 100884.796875 |
390 | 3.005163 | 610.248840 | 50182.472656 |
390 | 3.005163 | 722.359741 | 28877.074219 |
271982 rows × 3 columns
Inspecting the df_pos dataframe, we find 271982 rows and three columns: 1) RT, the retention time; 2) mz, the mass per electrical charge; and 3) inty, the number of ions. The RT column shows that this experiment lasted about 3 minutes.
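A quick way to verify these observations is to query the dataframe directly. The sketch below rebuilds a tiny stand-in dataframe from four of the rows shown above (the name `toy` is hypothetical and not part of the kendrick package) and assumes only plain pandas. The same expressions should work on df_pos if it has the layout shown in the table.

```python
import pandas as pd

# Tiny stand-in for df_pos, built from four rows of the table above.
# The real dataframe is indexed by scan number in the same way.
toy = pd.DataFrame(
    {
        "RT": [0.003945, 0.003945, 3.005163, 3.005163],
        "mz": [125.023285, 125.059830, 610.248840, 722.359741],
        "inty": [10221.680664, 81648.109375, 50182.472656, 28877.074219],
    },
    index=pd.Index([1, 1, 390, 390], name="Scan"),
)

# Duration of the experiment in minutes (last minus first retention time).
duration_min = toy["RT"].max() - toy["RT"].min()

# Number of distinct scans present in the data.
n_scans = toy.index.nunique()

# Summed intensity of all ions recorded in each scan.
intensity_per_scan = toy.groupby(level="Scan")["inty"].sum()

print(f"{duration_min:.2f} min, {n_scans} scans")
print(intensity_per_scan)
```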
As we will see, m/z values for identical molecules are slightly jittered due to limited instrumental precision. To determine the abundance of the different molecules present in the sample, we need to create time-averaged centroided m/z values. This can be achieved by 1) binning the data in a histogram, and 2) Gaussian smoothing the histogram and locating the peaks. These steps are implemented in the functions histogram() and get_time_averaged_centroids().
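To make the two steps concrete, here is a minimal NumPy-only sketch of the general technique: an intensity-weighted histogram, Gaussian smoothing by convolution, and a simple local-maximum search. It is not the kendrick implementation. The helper names, the bin width, the smoothing width and the 5% height threshold are all assumptions chosen for illustration, and the input is synthetic data for two molecules with jittered m/z values.

```python
import numpy as np

def weighted_histogram(mz, inty, bin_width=0.002):
    """Sum intensities into fixed-width m/z bins; return bin centers and sums."""
    edges = np.arange(mz.min() - 10 * bin_width, mz.max() + 11 * bin_width, bin_width)
    sums, edges = np.histogram(mz, bins=edges, weights=inty)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, sums

def gaussian_smooth(y, sigma_bins=2.0):
    """Convolve y with a normalized Gaussian kernel (sigma given in bins)."""
    radius = int(4 * sigma_bins)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma_bins) ** 2)
    kernel /= kernel.sum()
    return np.convolve(y, kernel, mode="same")

def find_local_maxima(y, min_height):
    """Indices of interior points higher than both neighbours and min_height."""
    inner = y[1:-1]
    is_peak = (inner > y[:-2]) & (inner >= y[2:]) & (inner > min_height)
    return np.flatnonzero(is_peak) + 1

# Synthetic data: two molecules whose m/z readings jitter around a true value.
rng = np.random.default_rng(0)
mz = np.concatenate([rng.normal(125.02, 0.002, 500), rng.normal(125.06, 0.002, 500)])
inty = rng.uniform(1e4, 1e5, mz.size)

centers, sums = weighted_histogram(mz, inty)
smooth = gaussian_smooth(sums)
peaks = find_local_maxima(smooth, min_height=0.05 * smooth.max())
peak_mz = centers[peaks]  # time-averaged centroid estimates
print(peak_mz)
```

The smoothing step matters because raw bin sums are noisy: without it, a single molecule can produce several neighbouring local maxima instead of one centroid.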
The next step is to explore the data in an interactive visualization. To plot half a million data points in a single plot, we need to import a special function, interactive_plot(). This function makes heavy use of datashader, a powerful Python package designed for fast plotting of huge numbers of data points.
Note that in order to activate interactive plotting in a Jupyter notebook you need to execute the following notebook magic command in a code cell: %matplotlib widget
from kendrick import histogram, get_time_averaged_centroids, interactive_plot
mz_hist = histogram(df_pos)
mz_centroids = get_time_averaged_centroids(mz_hist)
interactive_plot(df_pos, mz_hist, mz_centroids)
interactive_plot
interactive_plot (df, mz_hist, mz_centroids)
Create interactive plot for dataframe df.
get_time_averaged_centroids
get_time_averaged_centroids (mz_hist_w_xy)
Get peaks (centroids) from histogram.
histogram
histogram (df)
Create intensity-weighted histogram.
read_mzml
read_mzml (mzml_file)
Read mzml_file.
Returns positive and negative mode dataframes df_pos and df_min.
read_raw
read_raw (raw_file)
Read raw_file into positive and negative mode data frames.