Getting Started
Installation
Install from PyPI:
pip install mzmlpy
If you need MS-Numpress decoding support, install the optional extra:
pip install mzmlpy[numpress]
Basic Usage
Open an mzML file with the context manager to ensure proper cleanup:
from mzmlpy import Mzml
with Mzml("tests/data/example.mzML") as reader:
    print(f"File ID: {reader.id}")
    print(f"mzML version: {reader.version}")
Both .mzML and .mzML.gz files are supported. The reader lazily parses the file, so metadata is available immediately while binary data is decoded only on access.
Reading Gzipped Files
When working with .mzML.gz files, the gzip_mode parameter controls how the compressed file is accessed:
- "extract" (default): Decompress to a cached file under the OS temp directory (<tmpdir>/mzmlpy/), then read with full random access. The cache persists across Python sessions, so subsequent opens of the same file skip decompression entirely. The OS clears the temp directory on reboot; call clear_cache() to reclaim space sooner.
- "indexed": Use the rapidgzip library for seekable access to the compressed file without extracting to disk. Requires pip install mzmlpy[rapidgzip]. Builds a gzip seek index (.gzidx) and an mzML offset index (.mzMLidx) on first open; both are cached alongside the file for instant startup on subsequent opens.
- "stream": Stream the file sequentially with no index. Lowest startup cost, but random access (e.g. reader.spectra[0]) scans from the beginning each time, and a warning is emitted.
Performance comparison (33,535-spectrum DDA file, cold start, with rapidgzip):
| Mode | Startup | Iterate (500 spectra) | Random access (5 reads) |
|---|---|---|---|
| plain .mzML | 0.042s | 0.087s | 0.001s |
| in_memory=True | 1.499s | 0.362s | 0.002s |
| gzip_mode="extract" | 0.957s | 0.083s | 0.001s |
| gzip_mode="indexed" ¹ | 6.850s | 0.135s | 0.074s |
| gzip_mode="stream" | 0.089s | 0.155s | 22.8s |
¹ Startup includes building the gzip seek index (.gzidx) and mzML offset index (.mzMLidx) on first open. Both are cached alongside the file — subsequent opens are fast.
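The figures above can be combined into a rough estimate of total cost for a given number of random reads. The sketch below copies the startup and five-read timings from the table and assumes, as a simplification, that random-access cost scales linearly with the number of reads. The function name estimated_seconds is made up for illustration, and real timings will vary by file and machine.

```python
# Timings copied from the table above:
# mode -> (startup seconds, seconds for 5 random reads)
TIMINGS = {
    "extract": (0.957, 0.001),
    "indexed": (6.850, 0.074),
    "stream": (0.089, 22.8),
}


def estimated_seconds(mode: str, n_reads: int) -> float:
    """Cold-start startup plus n_reads random reads, assuming linear scaling."""
    startup, five_reads = TIMINGS[mode]
    return startup + n_reads * (five_reads / 5)


for mode in TIMINGS:
    print(f"{mode}: {estimated_seconds(mode, 100):.2f}s for 100 random reads")
```

Under that linear assumption, "stream" is the cheapest of the three gzip modes only when no random reads are made; a single random read already costs more than the whole "extract" startup.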
For best performance with .mzML.gz files, use "extract" or "indexed":
from mzmlpy import Mzml
# Indexed mode — no extraction, seekable (requires rapidgzip)
with Mzml("tests/data/example.mzML.gz", gzip_mode="indexed", in_memory=False) as reader:
    print(f"Spectra: {len(reader.spectra)}")
    spec = reader.spectra[0]
    print(spec.id)
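Because "indexed" depends on the optional rapidgzip package, a script that must run in both kinds of environment could pick the mode at runtime. This is a minimal sketch using only the standard library; the helper name pick_gzip_mode is hypothetical and not part of mzmlpy.

```python
import importlib.util


def pick_gzip_mode(prefer_indexed: bool = True) -> str:
    """Return "indexed" if rapidgzip is importable, otherwise "extract".

    Hypothetical helper for illustration; not part of mzmlpy.
    """
    has_rapidgzip = importlib.util.find_spec("rapidgzip") is not None
    if prefer_indexed and has_rapidgzip:
        return "indexed"
    return "extract"


mode = pick_gzip_mode()
print(mode)
```

The returned string would then be passed as the gzip_mode argument, for example Mzml(path, gzip_mode=mode).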
To reclaim disk space before the OS clears the temp directory on reboot:
from mzmlpy import clear_cache
clear_cache()
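Before clearing, it can be useful to see how much space the cache occupies. This is a minimal sketch, assuming the cache lives at tempfile.gettempdir()/mzmlpy as the <tmpdir>/mzmlpy/ path above suggests. The helper name cache_size_bytes is hypothetical and not part of mzmlpy.

```python
import tempfile
from pathlib import Path
from typing import Optional


def cache_size_bytes(cache_dir: Optional[Path] = None) -> int:
    """Sum the sizes of all regular files under the cache directory.

    Assumes the default location is <tmpdir>/mzmlpy/; returns 0 if it is absent.
    """
    if cache_dir is None:
        cache_dir = Path(tempfile.gettempdir()) / "mzmlpy"
    if not cache_dir.is_dir():
        return 0
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


print(f"Cache size: {cache_size_bytes() / 1e6:.1f} MB")
```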
Iterating Spectra
The reader.spectra property returns a lookup object that supports iteration, integer indexing, slicing, and string ID lookup:
from mzmlpy import Mzml
with Mzml("tests/data/example.mzML") as reader:
    # Iterate all spectra
    for spectrum in reader.spectra:
        print(f"Scan {spectrum.id} (MS{spectrum.ms_level}) - TIC: {spectrum.TIC}")

    # Access by index
    first = reader.spectra[0]

    # Access by slice
    batch = reader.spectra[0:2]

    # Access by string ID
    scan = reader.spectra["scan=19"]

    # Filter with a list comprehension
    ms2_spectra = [s for s in reader.spectra if s.ms_level == 2]
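When only summary counts are needed, a single pass over the iterable avoids building intermediate lists. The sketch below uses stand-in objects so it runs without a data file. It relies only on the ms_level attribute shown above, and the helper name count_ms_levels is hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def count_ms_levels(spectra) -> Counter:
    """Count spectra per MS level in a single pass over any iterable."""
    return Counter(s.ms_level for s in spectra)


# Stand-in objects for illustration; with mzmlpy this would be reader.spectra.
demo = [
    SimpleNamespace(ms_level=1),
    SimpleNamespace(ms_level=2),
    SimpleNamespace(ms_level=2),
]
print(count_ms_levels(demo))
```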
Accessing Binary Data
Spectra expose mz and intensity as convenience properties. Access is lazy -- the binary data is decoded on every call, so save the result to a local variable when you need it more than once:
from mzmlpy import Mzml
from mzmlpy import constants as c
with Mzml("tests/data/example.mzML") as reader:
    spec = reader.spectra[0]

    mz = spec.mz  # NDArray[float64] | None
    intensity = spec.intensity  # NDArray[float64] | None
    charge = spec.charge  # NDArray[float64] | None

    # For less common array types, use get_binary_array with a CV accession
    barr = spec.get_binary_array(c.BinaryDataArrayAccession.RAW_ION_MOBILITY)
    if barr is not None:
        values = barr.data

    # Iterate all binary arrays on a spectrum
    for ba in spec.binary_arrays:
        print(ba.binary_array_type, ba.compression, ba.encoding)
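To illustrate why saving the decoded arrays to local variables matters, the sketch below finds the most intense peak while reading each property exactly once. FakeSpectrum is a stand-in that counts property accesses and is not an mzmlpy class; base_peak is likewise a hypothetical helper.

```python
class FakeSpectrum:
    """Stand-in that counts how often its arrays are 'decoded' (not an mzmlpy class)."""

    def __init__(self, mz, intensity):
        self._mz, self._intensity = mz, intensity
        self.decodes = 0

    @property
    def mz(self):
        self.decodes += 1
        return self._mz

    @property
    def intensity(self):
        self.decodes += 1
        return self._intensity


def base_peak(spec):
    """Return (m/z, intensity) of the most intense peak, or None if data is missing."""
    mz, intensity = spec.mz, spec.intensity  # read each property exactly once
    if mz is None or intensity is None or len(intensity) == 0:
        return None
    i = max(range(len(intensity)), key=lambda k: intensity[k])
    return mz[i], intensity[i]


spec = FakeSpectrum([100.0, 200.5, 300.25], [10.0, 50.0, 20.0])
print(base_peak(spec), spec.decodes)
```

Writing spec.mz[i] and spec.intensity[i] inside the function instead would trigger additional decodes on every access.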
Working with Scan Timing
Retention time and ion injection time are accessible as timedelta objects through the spectrum, which delegates to the first scan:
from mzmlpy import Mzml
with Mzml("tests/data/example.mzML") as reader:
    spec = reader.spectra[0]

    if spec.scan_start_time is not None:
        rt_seconds = spec.scan_start_time.total_seconds()
        rt_minutes = rt_seconds / 60
        print(f"RT: {rt_minutes:.4f} min")

    if spec.ion_injection_time is not None:
        iit_ms = spec.ion_injection_time.total_seconds() * 1000
        print(f"Ion injection time: {iit_ms:.2f} ms")

    print(f"Lower m/z: {spec.lower_mz}")
    print(f"Upper m/z: {spec.upper_mz}")
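Because scan_start_time is a timedelta, retention-time windows can be compared directly against other timedelta values without manual unit conversion. This is a minimal sketch with stand-in objects; the helper name in_rt_window is hypothetical, and spectra without a start time are treated as outside the window.

```python
from datetime import timedelta
from types import SimpleNamespace


def in_rt_window(spec, start_min: float, end_min: float) -> bool:
    """True if the scan start time lies within [start_min, end_min] minutes."""
    rt = spec.scan_start_time
    if rt is None:
        return False
    return timedelta(minutes=start_min) <= rt <= timedelta(minutes=end_min)


# Stand-in objects for illustration; with mzmlpy these would come from reader.spectra.
demo = [
    SimpleNamespace(id="a", scan_start_time=timedelta(minutes=4.5)),
    SimpleNamespace(id="b", scan_start_time=timedelta(minutes=12.0)),
    SimpleNamespace(id="c", scan_start_time=None),
]
print([s.id for s in demo if in_rt_window(s, 10, 15)])
```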
Working with Ion Mobility
Check whether a spectrum carries ion mobility data and retrieve the relevant arrays:
from mzmlpy import Mzml
from mzmlpy.constants import BinaryDataArrayAccession
with Mzml("tests/data/example.mzML") as reader:
    spec = reader.spectra[0]

    if spec.has_im:
        print(f"IM types: {spec.im_types}")

        im_array = spec.get_binary_array(
            BinaryDataArrayAccession.MEAN_INVERSE_REDUCED_ION_MOBILITY
        )
        if im_array is not None:
            values = im_array.data
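Since get_binary_array returns None when an array type is absent, several candidate accessions can be tried in order of preference. The sketch below uses a stand-in class and plain strings in place of BinaryDataArrayAccession members so that it runs without a data file. Both first_available_array and FakeSpectrum are hypothetical names.

```python
def first_available_array(spec, accessions):
    """Return (accession, array) for the first accession present, else None."""
    for accession in accessions:
        arr = spec.get_binary_array(accession)
        if arr is not None:
            return accession, arr
    return None


class FakeSpectrum:
    """Stand-in exposing get_binary_array (not an mzmlpy class)."""

    def __init__(self, arrays):
        self._arrays = arrays

    def get_binary_array(self, accession):
        # Mirrors the behavior shown above: None when the array is absent.
        return self._arrays.get(accession)


spec = FakeSpectrum({"mean_inverse_reduced": [0.8, 0.9, 1.1]})
print(first_available_array(spec, ["raw", "mean_inverse_reduced"]))
```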
Working with Chromatograms
Chromatograms work similarly to spectra -- access by index, ID, or iteration:
from mzmlpy import Mzml
with Mzml("tests/data/example.mzML") as reader:
    tic = reader.chromatograms["tic"]

    time = tic.time  # NDArray[float64] | None
    intensity = tic.intensity  # NDArray[float64] | None

    # Precursor and product info (SRM chromatograms)
    print(tic.precursor)
    print(tic.product)
    print(tic.chromatogram_type)  # "tic", "basepeak", "srm", etc.
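A common next step with the time and intensity arrays is integrating the trace. This is a minimal sketch of the trapezoidal rule in plain Python, assuming both arrays have the same length. The helper name trapezoid_area is hypothetical, and the result is in whatever time and intensity units the file uses.

```python
def trapezoid_area(time, intensity) -> float:
    """Integrate intensity over time with the trapezoidal rule."""
    if time is None or intensity is None or len(time) < 2:
        return 0.0
    area = 0.0
    for k in range(1, len(time)):
        width = time[k] - time[k - 1]
        mean_height = (intensity[k] + intensity[k - 1]) / 2
        area += width * mean_height
    return float(area)


# Made-up values for illustration; with mzmlpy these would be tic.time and tic.intensity.
print(trapezoid_area([0.0, 1.0, 2.0], [0.0, 2.0, 0.0]))
```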
Accessing File Metadata
The reader exposes instrument configuration, software, and other file-level metadata:
from mzmlpy import Mzml
with Mzml("tests/data/example.mzML") as reader:
    # Instrument configurations
    for config_id, config in reader.instrument_configurations.items():
        print(f"Instrument: {config_id}")
        print(f"  Sources: {len(config.source_components)}")
        print(f"  Analyzers: {len(config.analyzer_components)}")
        print(f"  Detectors: {len(config.detector_components)}")

    # Software
    for sw in reader.softwares.values():
        print(f"{sw.id} v{sw.version}")

    # Other metadata
    _ = reader.cvs
    _ = reader.file_description
    _ = reader.referenceable_param_groups
    _ = reader.data_processes
    _ = reader.samples
    _ = reader.scan_settings
    _ = reader.run
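For logging or comparison across files, the instrument configuration counts printed above can be collected into a plain dictionary instead. The sketch below uses stand-in data and relies only on the attribute names shown in the example; the helper name summarize_instruments is hypothetical.

```python
from types import SimpleNamespace


def summarize_instruments(instrument_configurations) -> dict:
    """Map each configuration ID to its source/analyzer/detector counts."""
    return {
        config_id: {
            "sources": len(config.source_components),
            "analyzers": len(config.analyzer_components),
            "detectors": len(config.detector_components),
        }
        for config_id, config in instrument_configurations.items()
    }


# Stand-in data for illustration; with mzmlpy this would be
# reader.instrument_configurations.
demo = {
    "IC1": SimpleNamespace(
        source_components=["source"],
        analyzer_components=["analyzer-1", "analyzer-2"],
        detector_components=["detector"],
    )
}
print(summarize_instruments(demo))
```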