Welcome to pdbx2df’s documentation!
pdbx2df reads biological structure files such as PDB, PDBx/mmCIF, and MOL2 into dictionaries of Pandas DataFrames. With this data structure, loosely coupled data are separated into different DataFrame objects that remain linked to each other in the same Python dict. Cheminformaticians, bioinformaticians, and machine learning researchers should feel comfortable working with DataFrame objects: they are easy to visualize, group, filter, manipulate, and export to other formats. Moreover, most machine learning frameworks accept DataFrames as inputs. This library makes it easy, intuitive, and fast to read those files into DataFrames.
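To make the dict-of-DataFrames idea concrete, here is a minimal sketch that uses only Pandas. The data are written by hand and merely stand in for a parsed file; the dict keys and column names borrow PDBx/mmCIF naming for illustration and are not output produced by pdbx2df.

```python
import pandas as pd

# Hypothetical, hand-written data standing in for a parsed structure file.
structure = {
    "_atom_site": pd.DataFrame(
        {
            "label_atom_id": ["N", "CA", "C", "O", "O"],
            "label_comp_id": ["ALA", "ALA", "ALA", "ALA", "HOH"],
            "label_asym_id": ["A", "A", "A", "A", "B"],
            "Cartn_x": [11.1, 12.5, 13.2, 12.8, 20.0],
        }
    ),
    "_entity": pd.DataFrame({"id": ["1", "2"], "type": ["polymer", "water"]}),
}

# Each category is an ordinary DataFrame, so the usual Pandas tools apply.
atoms = structure["_atom_site"]
protein_atoms = atoms[atoms["label_comp_id"] != "HOH"]   # filter out water
atoms_per_chain = atoms.groupby("label_asym_id").size()  # group by chain
csv_text = protein_atoms.to_csv(index=False)             # export to CSV text
```

Because the categories live in one dict, related tables stay together while each can still be processed on its own.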
The PDBx/mmCIF format is the easiest to parse into a dict of DataFrames because we can use the provided category names as dict keys and the provided attribute names as DataFrame column names.
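The mapping from categories to dict keys and from attributes to column names can be sketched as follows. This is an illustration only, not the parser pdbx2df ships: the helper read_loops is hypothetical, and it handles only simple loop_ blocks whose values are whitespace-separated on a single line (no quoted or multi-line values).

```python
import pandas as pd

CIF_TEXT = """\
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
ATOM 1 N 11.104
ATOM 2 CA 12.560
#
loop_
_entity.id
_entity.type
1 polymer
2 water
#
"""


def read_loops(text):
    """Parse simple loop_ blocks into a {category: DataFrame} dict (hypothetical helper)."""
    result = {}
    lines = iter(text.splitlines())
    for line in lines:
        if line.strip() != "loop_":
            continue
        category, columns, rows = None, [], []
        # Keep consuming the same iterator until the block ends.
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                break
            if line.startswith("_"):
                # "_atom_site.Cartn_x" -> category "_atom_site", attribute "Cartn_x"
                category, attribute = line.split(".", 1)
                columns.append(attribute)
            else:
                rows.append(line.split())
        result[category] = pd.DataFrame(rows, columns=columns)
    return result


tables = read_loops(CIF_TEXT)
```

All values are kept as strings here; a real parser would also convert numeric columns and deal with quoting.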
The MOL2 format is also quite straightforward because different categories of data are well separated by definition. The category names and column names are also provided by the Tripos document. The minor difficulty is that many categories contain unstructured data.
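As a rough sketch of why MOL2 is easy to split, the snippet below cuts a small MOL2 text at its @<TRIPOS> record type indicators and builds one DataFrame per section. The helper split_sections is hypothetical and not part of pdbx2df; the column names follow the required fields of the Tripos ATOM and BOND records, and the free-form MOLECULE section is kept as raw lines to show the "unstructured data" difficulty.

```python
import pandas as pd

MOL2_TEXT = """\
@<TRIPOS>MOLECULE
water
3 2 1
SMALL
NO_CHARGES

@<TRIPOS>ATOM
1 O 0.000 0.000 0.000 O.3 1 HOH 0.000
2 H1 0.957 0.000 0.000 H 1 HOH 0.000
3 H2 -0.240 0.927 0.000 H 1 HOH 0.000
@<TRIPOS>BOND
1 1 2 1
2 1 3 1
"""

ATOM_COLUMNS = ["atom_id", "atom_name", "x", "y", "z",
                "atom_type", "subst_id", "subst_name", "charge"]
BOND_COLUMNS = ["bond_id", "origin_atom_id", "target_atom_id", "bond_type"]


def split_sections(text):
    """Group non-blank lines under the @<TRIPOS> header that precedes them."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            current = line[len("@<TRIPOS>"):].strip()
            sections[current] = []
        elif current is not None and line.strip():
            # Blank lines are dropped in this sketch.
            sections[current].append(line)
    return sections


sections = split_sections(MOL2_TEXT)
mol2 = {
    "ATOM": pd.DataFrame([row.split() for row in sections["ATOM"]], columns=ATOM_COLUMNS),
    "BOND": pd.DataFrame([row.split() for row in sections["BOND"]], columns=BOND_COLUMNS),
    # MOLECULE has no tabular layout, so it is stored line by line.
    "MOLECULE": pd.DataFrame({"line": sections["MOLECULE"]}),
}
# Convert the numeric ATOM fields from strings to floats.
mol2["ATOM"] = mol2["ATOM"].astype({"x": float, "y": float, "z": float, "charge": float})
```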
The PDB format is harder to parse. Except for a few self-contained categories like SEQRES, many categories can be misleading if parsed into separate DataFrames. As such, I arbitrarily created some coarse-grained category names that group several categories together. As a result, the _atom_site category, which mimics the PDBx/mmCIF _atom_site category, is handy to work with for most use cases.
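One way to picture such a coarse-grained grouping is to collect the fixed-width ATOM and HETATM records into a single table. The sketch below does that with plain Pandas under stated assumptions: the helper read_atom_site and the column names are chosen for illustration and are not confirmed to match what pdbx2df produces, and only the two coordinate record types are handled.

```python
import pandas as pd

PDB_TEXT = """\
SEQRES   1 A    1  ALA
ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N
ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C
HETATM    3  O   HOH B 101      20.000  10.000   5.000  1.00  0.00           O
END
"""

# (column name, start, stop) as 0-based, end-exclusive slices of the
# fixed-width ATOM/HETATM record layout. Column names are illustrative.
ATOM_FIELDS = [
    ("record_name", 0, 6),
    ("atom_number", 6, 11),
    ("atom_name", 12, 16),
    ("alt_loc", 16, 17),
    ("residue_name", 17, 20),
    ("chain_id", 21, 22),
    ("residue_number", 22, 26),
    ("insertion", 26, 27),
    ("x_coord", 30, 38),
    ("y_coord", 38, 46),
    ("z_coord", 46, 54),
    ("occupancy", 54, 60),
    ("b_factor", 60, 66),
    ("element_symbol", 76, 78),
]


def read_atom_site(text):
    """Gather ATOM and HETATM records into one DataFrame (hypothetical helper)."""
    records = [
        {name: line[start:stop].strip() for name, start, stop in ATOM_FIELDS}
        for line in text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]
    df = pd.DataFrame(records)
    return df.astype(
        {
            "atom_number": int,
            "residue_number": int,
            "x_coord": float,
            "y_coord": float,
            "z_coord": float,
            "occupancy": float,
            "b_factor": float,
        }
    )


atom_site = read_atom_site(PDB_TEXT)
```

Keeping record_name as a column preserves the ATOM/HETATM distinction even though both record types share one table.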
There are many other PDBx/PDB/MOL2 parsers, like Biopython PDBParser and OpenMM PDBFile, but most of them mainly parse the coordinates and turn the whole molecule into a Python object. That can be convenient in several use cases, but it is less intuitive for visualizing individual entries, selecting atoms, merging molecules, and exporting to other formats. And since they may need to build many Python objects and do not take advantage of the underlying structure of the data, they can be slow, which makes large-scale processing of those files impractical.
There are other Python packages that can parse PDB files into DataFrames. CPDB, which uses Cython, is the fastest according to its author’s benchmarks, but it can only parse PDB files, not the other formats, and cannot write back to PDB files. BioPandas can parse PDBx, PDB, and MOL2 files, but it is slow by the same benchmarks. According to my benchmark (coming soon!), pdbx2df is also much faster than BioPandas and only slightly slower than CPDB.
Beyond being lightweight and fast, perhaps the most useful feature is the provided PDBDataFrame class, a Pandas DataFrame subclass, for when we need to access common atom groups or make fine-grained atom selections. The PDBDataFrame class provides an easy-to-use dot (.) syntax to access common atom groups like backbone, side_chain, water, and heavy_atoms. It also implements an atom selection language in a Pythonic way, so we can select by atom_numbers, atom_names, chain_ids, residue_names, residue_numbers, x_coord, y_coord, z_coord, b_factor, and others. We can also select by distances in a very flexible way. Check the documentation for detailed information.
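For readers who want a feel for what such selections compute, here is a sketch in plain Pandas and NumPy. It deliberately does not use PDBDataFrame, whose exact API is covered in the documentation; the column names, the BACKBONE_NAMES set, and the helper within are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Hand-written toy coordinates; column names are illustrative.
atoms = pd.DataFrame(
    {
        "atom_name": ["N", "CA", "C", "O", "CB", "H", "O"],
        "residue_name": ["ALA"] * 6 + ["HOH"],
        "chain_id": ["A"] * 6 + ["B"],
        "element_symbol": ["N", "C", "C", "O", "C", "H", "O"],
        "x_coord": [0.0, 1.5, 2.0, 3.2, 1.9, -0.5, 8.0],
        "y_coord": [0.0] * 7,
        "z_coord": [0.0] * 7,
    }
)

BACKBONE_NAMES = {"N", "CA", "C", "O"}  # assumed definition of backbone atoms

# Common atom groups expressed as boolean masks.
is_water = atoms["residue_name"] == "HOH"
water = atoms[is_water]
backbone = atoms[atoms["atom_name"].isin(BACKBONE_NAMES) & ~is_water]
heavy_atoms = atoms[atoms["element_symbol"] != "H"]


def within(df, center, cutoff):
    """Return the rows whose coordinates lie within cutoff of center."""
    coords = df[["x_coord", "y_coord", "z_coord"]].to_numpy(dtype=float)
    distances = np.linalg.norm(coords - np.asarray(center, dtype=float), axis=1)
    return df[distances <= cutoff]


# Atoms within 1.0 of the CA position at (1.5, 0, 0).
near_ca = within(atoms, center=(1.5, 0.0, 0.0), cutoff=1.0)
```

A DataFrame subclass can wrap masks like these behind attributes and methods, which is the convenience the dot syntax described above is meant to provide.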