Welcome to MolDF’s documentation!

Important

This project was renamed from pdbx2df. Please see the pdbx2df documentation for historical features.

MolDF reads structure files used in biology and chemistry, such as PDB, PDBx/mmCIF, and MOL2, into dictionaries of Pandas DataFrames. With this data structure, loosely coupled data are separated into different DataFrame objects but remain linked to each other in the same Python dict. Cheminformaticians, bioinformaticians, and machine learning researchers should feel comfortable working with DataFrame objects: they are easy to inspect, visualize, group, filter, manipulate, and export to other portable formats. Moreover, many machine learning frameworks accept DataFrames as inputs directly. This library makes it easy, intuitive, and fast to read those files into DataFrames.
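To make the "dict of DataFrames" idea concrete, here is a minimal sketch that uses only Pandas. The dict is built by hand rather than by MolDF, and the keys and column names are illustrative assumptions, not necessarily the ones the library produces.

```python
import pandas as pd

# Hand-built stand-in for a parsed structure: one DataFrame per category,
# all held in a single dict. Keys and columns are illustrative assumptions.
structure = {
    "_atom_site": pd.DataFrame(
        {
            "atom_name": ["N", "CA", "C", "O"],
            "residue_name": ["MET", "MET", "MET", "MET"],
            "chain_id": ["A", "A", "A", "A"],
            "x_coord": [1.0, 2.0, 3.0, 4.0],
        }
    ),
    "_seq_res": pd.DataFrame(
        {"chain_id": ["A", "A"], "residue_name": ["MET", "LEU"]}
    ),
}

atoms = structure["_atom_site"]

# Filter: keep only the alpha carbons.
alpha_carbons = atoms[atoms["atom_name"] == "CA"]

# Group: count atoms per chain.
atoms_per_chain = atoms.groupby("chain_id").size()

# Export: write the table to a portable text format.
csv_text = atoms.to_csv(index=False)
```

Each category stays an ordinary DataFrame, so the usual Pandas tools apply to it without any extra wrapper.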

The PDBx/mmCIF format is the easiest to parse into a dict of DataFrames because the provided category names can be used as dict keys and the provided attribute names as DataFrame column names. Indeed, many mmCIF parsers simply parse the files into dicts.
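The mapping from categories to dict keys and from attributes to columns can be sketched as follows. This is a deliberately simplified toy parser, not MolDF's implementation: the function name `parse_simple_loops` is hypothetical, and it assumes `loop_` blocks whose values are whitespace-separated, with no quoted strings or multi-line values.

```python
import pandas as pd

CIF_TEXT = """\
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
ATOM 1 N 1.000
ATOM 2 CA 2.500
#
"""


def parse_simple_loops(text):
    """Parse simple loop_ blocks into a dict of DataFrames (toy sketch)."""
    columns = {}  # category name -> list of attribute names
    rows = {}     # category name -> list of row value lists
    current = None
    in_header = False
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines, '#' separators, and data_ headers end the current block.
        if not line or line.startswith("#") or line.startswith("data_"):
            current, in_header = None, False
            continue
        if line == "loop_":
            current, in_header = None, True
            continue
        if in_header and line.startswith("_"):
            # "_category.attribute" gives the dict key and the column name.
            category, attribute = line.split(".", 1)
            current = category
            columns.setdefault(category, []).append(attribute)
            rows.setdefault(category, [])
            continue
        if current is not None:
            in_header = False
            rows[current].append(line.split())
    return {cat: pd.DataFrame(rows[cat], columns=columns[cat]) for cat in columns}


tables = parse_simple_loops(CIF_TEXT)
atom_site = tables["_atom_site"]
```

All values are kept as strings here; a real parser would also convert numeric columns and handle the quoting rules of the format.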

The MOL2 format is also quite straightforward to parse because different categories of data are well separated by definition. The category names and column names are also provided by the Tripos documentation. The minor difficulty is that many categories have unstructured and/or optional data.
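As a rough illustration of why MOL2 separates cleanly, the sketch below splits a file on its `@<TRIPOS>` record type indicators and turns two sections into DataFrames. It is not MolDF's code: the helper names are hypothetical, the column lists follow the ATOM and BOND field order of the Tripos format, and optional trailing fields that are absent are padded with missing values.

```python
import pandas as pd

MOL2_TEXT = """\
@<TRIPOS>MOLECULE
water
3 2 1
SMALL
NO_CHARGES

@<TRIPOS>ATOM
1 O 0.000 0.000 0.000 O.3 1 HOH 0.0
2 H1 0.957 0.000 0.000 H 1 HOH 0.0
3 H2 -0.240 0.927 0.000 H 1 HOH 0.0
@<TRIPOS>BOND
1 1 2 1
2 1 3 1
"""

ATOM_COLUMNS = ["atom_id", "atom_name", "x", "y", "z",
                "atom_type", "subst_id", "subst_name", "charge"]
BOND_COLUMNS = ["bond_id", "origin_atom_id", "target_atom_id", "bond_type"]


def split_sections(text):
    """Group non-blank lines under the @<TRIPOS> indicator preceding them."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("@<TRIPOS>"):
            current = line[len("@<TRIPOS>"):].strip()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)
    return sections


def to_table(lines, columns):
    """Build a DataFrame, padding rows that omit optional trailing fields."""
    rows = []
    for line in lines:
        fields = line.split()[:len(columns)]
        rows.append(fields + [None] * (len(columns) - len(fields)))
    return pd.DataFrame(rows, columns=columns)


sections = split_sections(MOL2_TEXT)
mol2 = {
    "ATOM": to_table(sections["ATOM"], ATOM_COLUMNS),
    "BOND": to_table(sections["BOND"], BOND_COLUMNS),
}
# Coordinates and charges arrive as text; convert them to floats.
mol2["ATOM"] = mol2["ATOM"].astype(
    {"x": float, "y": float, "z": float, "charge": float}
)
```

The MOLECULE section is left as raw lines on purpose: it is one of the categories whose content is positional and partly optional, which is where the difficulty mentioned above comes from.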

The PDB format is harder to parse than the other two. Except for a few self-contained categories like SEQRES, many categories can be misleading if parsed into separate DataFrames. As such, I arbitrarily created some coarse-grained category names to group several categories together. As a result, the _atom_site category, which mimics the PDBx/mmCIF _atom_site category, is handy to work with for most use cases.
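Part of what makes PDB harder is that its records are fixed-width rather than delimited. The sketch below reads ATOM and HETATM records by column position into a single atom table. It is an illustration under stated assumptions, not MolDF's parser: the function name and the column names are hypothetical, and only a subset of the fields in the record is extracted.

```python
import pandas as pd

PDB_TEXT = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
HETATM    3  O   HOH A 101      10.000  11.000  12.000  1.00 20.00           O
END
"""

# (column name, start, end, type): 0-based, end-exclusive slices of the
# fixed-width ATOM/HETATM record. Column names are illustrative assumptions.
ATOM_FIELDS = [
    ("record_name", 0, 6, str),
    ("atom_number", 6, 11, int),
    ("atom_name", 12, 16, str),
    ("residue_name", 17, 20, str),
    ("chain_id", 21, 22, str),
    ("residue_number", 22, 26, int),
    ("x_coord", 30, 38, float),
    ("y_coord", 38, 46, float),
    ("z_coord", 46, 54, float),
    ("b_factor", 60, 66, float),
    ("element_symbol", 76, 78, str),
]


def parse_atom_records(text):
    """Collect ATOM and HETATM lines into one DataFrame (toy sketch)."""
    rows = []
    for line in text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        line = line.ljust(80)  # pad short lines so every slice is valid
        rows.append(
            {name: kind(line[start:end].strip())
             for name, start, end, kind in ATOM_FIELDS}
        )
    return pd.DataFrame(rows)


atom_site = parse_atom_records(PDB_TEXT)
```

Putting ATOM and HETATM rows into one table is the same kind of coarse-grained grouping described above: the two record types share a layout and are usually needed together.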

There are many other PDBx/PDB/MOL2 parsers, like Biopython PDBParser and OpenMM PDBFile, but most of them mainly parse the coordinates and turn the whole molecule into a Python object of objects. This can be convenient in several use cases, but it is not so intuitive for visualizing individual entries, selecting atoms, merging molecules, or exporting to other formats. And since they may need to build many Python objects and do not take advantage of the underlying structure of the data, they can be slow in large-scale data processing. Moreover, those Python objects are not so convenient to transfer to other platforms or programming languages.

There are other Python packages that can parse PDB files into DataFrames. CPDB, which uses Cython, is the fastest according to its author’s benchmarks, but it can only parse PDB files, not the other formats, and it cannot write back to PDB files. BioPandas can parse PDBx, PDB, and MOL2 files, but it is slow by the same benchmarks. According to my benchmark (coming soon!), moldf is much faster than BioPandas and only slightly slower than CPDB.

Beyond being lightweight and fast, perhaps the most useful feature is the provided PDBDataFrame class, a Pandas DataFrame subclass, which helps when we need to access common atom groups or select atoms finely. The PDBDataFrame class provides an easy-to-use . syntax to access common atom groups like backbone, side_chain, water, and heavy_atoms. It also implements an atom selection language in a Pythonic way, so we can select by atom_numbers, atom_names, chain_ids, residue_names, residue_numbers, x_coord, y_coord, z_coord, b_factor, and others. We can even select by distances in a very flexible way. Check the documentation for detailed information.
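The general idea of a DataFrame subclass with attribute-style atom groups can be sketched with plain Pandas. The class below is hypothetical and is not the actual PDBDataFrame: the name `AtomFrame`, the `within` method, and the column names are assumptions made for illustration, and the real class may define its groups and selections differently.

```python
import numpy as np
import pandas as pd


class AtomFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass; not the actual PDBDataFrame."""

    @property
    def _constructor(self):
        # Keep the subclass type when pandas builds new frames from this one.
        return AtomFrame

    @property
    def backbone(self):
        # Backbone atoms of polymer residues only (ATOM records).
        is_polymer = self["record_name"] == "ATOM"
        return self[is_polymer & self["atom_name"].isin(["N", "CA", "C", "O"])]

    @property
    def water(self):
        return self[self["residue_name"] == "HOH"]

    def within(self, center, radius):
        """Atoms no farther than `radius` from the point `center` (x, y, z)."""
        coords = self[["x_coord", "y_coord", "z_coord"]].to_numpy(dtype=float)
        offsets = coords - np.asarray(center, dtype=float)
        distances = np.linalg.norm(offsets, axis=1)
        return self[distances <= radius]


atoms = AtomFrame(
    {
        "record_name": ["ATOM", "ATOM", "ATOM", "HETATM"],
        "atom_name": ["N", "CA", "CB", "O"],
        "residue_name": ["MET", "MET", "MET", "HOH"],
        "x_coord": [0.0, 1.5, 2.0, 9.0],
        "y_coord": [0.0, 0.0, 1.0, 9.0],
        "z_coord": [0.0, 0.0, 0.0, 9.0],
    }
)

backbone = atoms.backbone
near_origin = atoms.within(center=(0.0, 0.0, 0.0), radius=2.0)
# Selections return the subclass, so they can be chained.
backbone_near_origin = atoms.backbone.within((0.0, 0.0, 0.0), 1.0)
```

Because every selection returns the same subclass, group access and distance filters compose naturally, which is what makes this style convenient for fine-grained atom selection.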
