Exploring half a million molecules

Zooming into high resolution mass spectrometry data

Reading our first .raw file

The first step is to read a .raw file containing (already centroided) ASAP-HRMS data. The read_raw() function loads the file and returns two dataframes, one for positive mode and one for negative mode. This function is based on the pyRawTools Python package.

from kendrick import read_raw
raw_file = '/home/frank/Work/DATA/kendrick-data/Ref0443_casein_asap01.RAW'
df_pos, df_neg = read_raw(raw_file)

Let’s focus on the positive mode data for now. Here is what the first and last rows of the dataframe look like.

df_pos
RT mz inty
Scan
1 0.003945 125.023285 10221.680664
1 0.003945 125.059830 81648.109375
1 0.003945 125.132690 35949.582031
1 0.003945 126.062935 6891.418945
1 0.003945 126.099777 18121.056641
... ... ... ...
390 3.005163 546.237732 35105.105469
390 3.005163 548.254578 52922.070312
390 3.005163 596.266724 100884.796875
390 3.005163 610.248840 50182.472656
390 3.005163 722.359741 28877.074219

271982 rows × 3 columns

Inspecting the df_pos dataframe, we find 271982 rows and three columns: 1) RT, the retention time; 2) mz, the mass-to-charge ratio (m/z); and 3) inty, the intensity (number of ions). From the RT column one can see that this experiment lasted 3 minutes.
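As a minimal sketch of this kind of inspection, the snippet below rebuilds a tiny dataframe from four of the rows shown above (it does not read the .raw file) and derives the experiment duration, the number of scans, and the summed intensity per scan with plain pandas. The variable names duration, n_scans and inty_per_scan are illustrative and not part of the kendrick package.

```python
import pandas as pd

# Four rows copied from the df_pos printout above; the index is the scan number.
df = pd.DataFrame(
    {
        "RT": [0.003945, 0.003945, 3.005163, 3.005163],
        "mz": [125.023285, 125.059830, 610.248840, 722.359741],
        "inty": [10221.680664, 81648.109375, 50182.472656, 28877.074219],
    },
    index=pd.Index([1, 1, 390, 390], name="Scan"),
)

# Duration of the experiment: last retention time minus first retention time.
duration = df["RT"].max() - df["RT"].min()

# Number of distinct scans present in this (tiny) excerpt.
n_scans = df.index.nunique()

# Summed intensity per scan.
inty_per_scan = df.groupby(level="Scan")["inty"].sum()

print(f"duration: {duration:.2f}")
print(f"scans in excerpt: {n_scans}")
print(inty_per_scan)
```

Applied to the full df_pos, the same three expressions would cover all scans rather than the two in this excerpt.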

As we will see, m/z values for identical molecules are slightly jittered due to limited instrumental precision. To determine the abundance of the different molecules present in the sample, we therefore need to create time-averaged, centroided m/z values. This can be achieved by 1) binning the data into an intensity-weighted histogram, 2) Gaussian smoothing the histogram, and 3) locating the peaks. These steps are implemented in the functions histogram() and get_time_averaged_centroids().
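The idea behind these steps can be sketched with NumPy alone. The code below is not the kendrick implementation: it uses synthetic m/z values for two made-up ion species, and the bin width, smoothing width and peak threshold are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in for df_pos: two ion species whose m/z values jitter
# slightly from measurement to measurement (all numbers are made up).
rng = np.random.default_rng(0)
mz = np.concatenate([
    rng.normal(125.0600, 0.0005, size=5000),
    rng.normal(125.1330, 0.0005, size=3000),
])
inty = np.concatenate([
    np.full(5000, 8.0e4),
    np.full(3000, 3.5e4),
])

# 1) Intensity-weighted histogram over all scans (assumed bin width 0.0001).
edges = np.linspace(125.0, 125.2, 2001)
hist, edges = np.histogram(mz, bins=edges, weights=inty)
centers = 0.5 * (edges[:-1] + edges[1:])

# 2) Gaussian smoothing by convolution with a normalized Gaussian kernel.
sigma_bins = 5  # assumed smoothing width, in bins
k = np.arange(-4 * sigma_bins, 4 * sigma_bins + 1)
kernel = np.exp(-0.5 * (k / sigma_bins) ** 2)
kernel /= kernel.sum()
smooth = np.convolve(hist, kernel, mode="same")

# 3) Peaks: local maxima of the smoothed histogram above an assumed threshold.
threshold = 0.10 * smooth.max()
is_peak = (
    (smooth[1:-1] > smooth[:-2])
    & (smooth[1:-1] >= smooth[2:])
    & (smooth[1:-1] > threshold)
)
centroids = centers[1:-1][is_peak]
print(centroids)
```

For real data, scipy.ndimage.gaussian_filter1d and scipy.signal.find_peaks offer ready-made versions of steps 2 and 3.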

The next step is to explore the data in an interactive visualization. To plot half a million data points in a single plot, we import a dedicated function, interactive_plot(). This function makes heavy use of datashader, a powerful Python package designed for fast plotting of huge numbers of data points.

Note

To activate interactive plotting in a Jupyter notebook, execute the following notebook magic command in a code cell: %matplotlib widget

from kendrick import histogram, get_time_averaged_centroids, interactive_plot
mz_hist = histogram(df_pos)
mz_centroids = get_time_averaged_centroids(mz_hist)

interactive_plot(df_pos, mz_hist, mz_centroids)

Zooming-in
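Plotting this many points stays fast because the points are first aggregated onto a fixed pixel grid, so the drawing cost depends on the image size rather than on the number of points. The sketch below mimics only that aggregation step, using NumPy's histogram2d as a stand-in for datashader. The data are random, and the grid size of 800 by 400 is an arbitrary assumption.

```python
import numpy as np

# Random stand-in for half a million (RT, m/z, intensity) points.
rng = np.random.default_rng(1)
n_points = 500_000
rt = rng.uniform(0.0, 3.0, n_points)
mz = rng.uniform(125.0, 750.0, n_points)
inty = rng.exponential(1.0e4, n_points)

# Aggregate all points onto a fixed grid: each cell holds the summed
# intensity of the points that fall into it.
width, height = 800, 400  # assumed image size in pixels
image, mz_edges, rt_edges = np.histogram2d(
    mz, rt, bins=(width, height), weights=inty
)

# Log scaling keeps weak signals visible next to strong ones.
log_image = np.log1p(image)

print(image.shape)
```

The resulting array can be shown with any image function, for example matplotlib's imshow. When zooming in, the aggregation is repeated for the visible range only, so fine structure appears as the bins shrink.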

interactive_plot

 interactive_plot (df, mz_hist, mz_centroids)

Create interactive plot for dataframe df.


get_time_averaged_centroids

 get_time_averaged_centroids (mz_hist_w_xy)

Get peaks (centroids) from histogram.


histogram

 histogram (df)

Create intensity-weighted histogram.


read_mzml

 read_mzml (mzml_file)

Read mzml_file.

Returns positive and negative mode dataframes df_pos and df_min.


read_raw

 read_raw (raw_file)

Read raw_file into positive and negative mode data frames.