Exploring half a million molecules

Zooming into high resolution mass spectrometry data

Exploring our Nextcloud data

For our ASAP project research team we created a Nextcloud folder that lets us easily share our data using the fairdatanow Python package. Let’s take a look at what Wim has uploaded so far.

from fairdatanow import RemoteData
import os


configuration = {
    'url': "https://laboppad.nl/asap-data",
    'user':    os.getenv('NC_AUTH_USER'),
    'password': os.getenv('NC_AUTH_PASS')
}

remote_data = RemoteData(configuration)

Please wait while scanning all file paths in remote folder...
Ready building file table for 'asap-data'
Total number of files and directories: 1050
Total size of the files: 825.6 MiB
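The configuration above reads the credentials from the environment variables NC_AUTH_USER and NC_AUTH_PASS. Since os.getenv() returns None for a variable that is not set, it can help to check both before connecting. The helper below is a minimal sketch; load_credentials() is a hypothetical name and not part of fairdatanow.

```python
import os


def load_credentials(user_var="NC_AUTH_USER", pass_var="NC_AUTH_PASS"):
    """Return (user, password) from the environment, or raise if either is unset.

    Hypothetical helper for illustration only; not part of fairdatanow.
    """
    user = os.getenv(user_var)
    password = os.getenv(pass_var)

    # Collect the names of variables that are unset or empty
    missing = [name for name, value in ((user_var, user), (pass_var, password)) if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    return user, password
```

The returned values could then be placed in the 'user' and 'password' entries of the configuration dictionary.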

remote_data.itable


For this example we want to download the demo data file Ref0443_casein_asap01.RAW from the Nextcloud server to our local computer. To do so, type the filename in the search bar of the table and click the row to select it. The selected files are then downloaded with the .download_selected() method.

raw_file = str(remote_data.download_selected()[0])

Ready with downloading 1 selected remote files to local cache: /home/frank/.cache/fairdatanow/asap-data/demo-data/Ref0443_casein_asap01.RAW

Note

Due to security restrictions, at this moment you cannot download from our Nextcloud server unless you are a member of our team. We are working on opening up our data for you soon!

Reading our first .raw file

The first step is to read a .raw file containing (already centroided) ASAP-HRMS data. The read_raw() function loads the data and returns two dataframes, one for positive mode and one for negative mode. This function is based on the pyRawTools Python package.

from kendrick import read_raw

raw_file = '../downloads/kendrick-data/Ref0443_casein_asap01.RAW'

df_pos, df_neg = read_raw(raw_file)

Let’s focus on the positive mode data for now. Here is what the first and last rows of the dataframe look like.

df_pos

	RT	mz	inty
Scan
1	0.003945	125.023285	10221.680664
1	0.003945	125.059830	81648.109375
1	0.003945	125.132690	35949.582031
1	0.003945	126.062935	6891.418945
1	0.003945	126.099777	18121.056641
...	...	...	...
390	3.005163	546.237732	35105.105469
390	3.005163	548.254578	52922.070312
390	3.005163	596.266724	100884.796875
390	3.005163	610.248840	50182.472656
390	3.005163	722.359741	28877.074219

271982 rows × 3 columns

Inspecting the df_pos dataframe we find 271982 rows with three columns: 1) RT, the retention time, 2) mz, the mass-to-charge ratio (m/z), and 3) inty, the intensity (number of ions). The RT column runs from about 0 to 3.005, which shows that this experiment lasted 3 minutes.
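These observations can be checked directly with pandas. The sketch below rebuilds a tiny dataframe from four rows copied from the table above (the real df_pos is much larger) and assumes, as the text suggests, that RT is expressed in minutes.

```python
import pandas as pd

# Four rows copied from the table above; the real df_pos has 271982 rows
df_demo = pd.DataFrame(
    {
        "RT": [0.003945, 0.003945, 3.005163, 3.005163],
        "mz": [125.023285, 125.059830, 610.248840, 722.359741],
        "inty": [10221.680664, 81648.109375, 50182.472656, 28877.074219],
    },
    index=pd.Index([1, 1, 390, 390], name="Scan"),
)

# Duration of the experiment: difference between last and first retention time
duration_min = df_demo["RT"].max() - df_demo["RT"].min()

# Number of distinct scans (the index repeats once per measured m/z value)
n_scans = df_demo.index.nunique()

print(f"{n_scans} distinct scans in this excerpt, spanning {duration_min:.2f} minutes")
```

The same three lines of analysis apply unchanged to the full df_pos dataframe.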

As we will see, m/z values for identical molecules are slightly jittered due to limited instrumental precision. In order to determine the abundance of the different molecules present in the sample, we need to compute time-averaged centroided m/z values. This can be achieved by 1) binning the data in a histogram, and 2) Gaussian smoothing the histogram and locating its peaks. These steps are implemented in the functions histogram() and get_time_averaged_centroids().
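To make the two steps concrete, here is a minimal sketch on synthetic data using NumPy and SciPy. It is not the implementation of histogram() or get_time_averaged_centroids(); the bin width, smoothing width, and peak thresholds are assumptions chosen for this toy example.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

rng = np.random.default_rng(0)

# Synthetic data: two species whose measured m/z values jitter around their true values
true_mz = np.array([125.06, 126.10])
mz = np.concatenate([rng.normal(m, 0.002, size=500) for m in true_mz])
inty = np.ones_like(mz)  # equal intensities, for simplicity

# 1) Intensity-weighted histogram on a fine m/z grid (assumed bin width: 0.001)
bin_width = 0.001
edges = np.arange(124.5, 126.5 + bin_width, bin_width)
counts, _ = np.histogram(mz, bins=edges, weights=inty)
centers = 0.5 * (edges[:-1] + edges[1:])

# 2) Gaussian smoothing (sigma in bins), then peak detection.
#    'height' discards low bumps; 'distance' merges maxima closer than 20 bins.
smoothed = gaussian_filter1d(counts, sigma=3)
peak_idx, _ = find_peaks(smoothed, height=0.1 * smoothed.max(), distance=20)
centroids = centers[peak_idx]

print(centroids)
```

Each detected centroid lies close to one of the two true m/z values, even though no single measurement hits them exactly.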

The next step is to explore the data in an interactive visualization. In order to plot half a million data points in a single plot we need to import a special function, interactive_plot(). This function makes heavy use of datashader, a powerful Python package designed for fast plotting of huge numbers of data points.

Note

To activate interactive plotting in a Jupyter notebook, you first need to execute the following notebook magic command in a code cell: %matplotlib widget
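Plotting hundreds of thousands of points stays fast when the points are first aggregated onto a fixed grid of pixels, which is the general idea behind rasterizing plotters such as datashader. The sketch below illustrates that aggregation step with plain NumPy on synthetic data. It does not use the datashader API and is not how interactive_plot() is implemented; the grid size and value ranges are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points = 500_000

# Synthetic stand-ins for the RT, mz and inty columns
rt = rng.uniform(0.0, 3.0, n_points)
mz = rng.uniform(125.0, 725.0, n_points)
inty = rng.exponential(1e4, n_points)

# Aggregate all points onto a fixed 400 x 300 grid, summing intensity per pixel
width, height = 400, 300
image, mz_edges, rt_edges = np.histogram2d(
    mz, rt, bins=(width, height), weights=inty
)

# The image size depends only on the grid, not on the number of input points
print(image.shape)
```

Only the resulting image has to be drawn, so the rendering cost no longer grows with the number of data points; zooming in simply repeats the aggregation on the visible range.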

from kendrick import histogram, get_time_averaged_centroids, interactive_plot

mz_hist = histogram(df_pos)
mz_centroids = get_time_averaged_centroids(mz_hist)

interactive_plot(df_pos, mz_hist, mz_centroids)


interactive_plot

 interactive_plot (df, mz_hist, mz_centroids)

Create interactive plot for dataframe df.


get_time_averaged_centroids

 get_time_averaged_centroids (mz_hist_w_xy)

Get peaks (centroids) from histogram.


histogram

 histogram (df)

Create intensity-weighted histogram.


read_mzml

 read_mzml (mzml_file)

Read mzml_file.

Returns positive and negative mode dataframes df_pos and df_neg.


read_raw

 read_raw (raw_file)

Read raw_file into positive and negative mode data frames.