Quick Start#

Important

This project is renamed from pdbx2df. Please go to its documentation for historical features.

This quick start tutorial will guide you to use the MolDF functions and classes to read, manipulate, and write PDBx, PDB, and MOL2 files. You will learn by going through some basic but useful examples.

1. Read a PDB file#

To read a PDB file, you can use the moldf.read_pdb function:

One of the pdb_file and pdb_id parameters should be given. Otherwise, moldf.read_pdb will raise an exception. If pdb_file is given, pdb_id is ignored.

For example:

>>> from moldf import read_pdb
>>> pdb = read_pdb(pdb_id='1vii')
>>> pdb['_atom_site'].columns
Index(['record_name', 'atom_number', 'atom_name', 'alt_loc', 'residue_name',
    'chain_id', 'residue_number', 'insertion', 'x_coord', 'y_coord',
    'z_coord', 'occupancy', 'b_factor', 'segment_id', 'element_symbol',
    'charge'],
   dtype='object')
>>> pdb.keys()
dict_keys(['_atom_site'])

By default, a 1vii.pdb file is downloaded to the ./PDB_files directory from RCSB 1VII.

If you have a local PDB file test.pdb under your current directory. You can read it as:

>>> pdb = read_pdb(pdb_file='test.pdb')

2. Select atoms using PDBDataFrame#

To select rows in the _atom_site DataFrame, you can of course just use standard filter operations in Pandas. For example, you might have thought of something as below:

>>> pdb_df = pdb['_atom_site']
>>> ca_atoms = pdb_df[(pdb_df.atom_name.str.strip().isin(['CA'])) & (pdb_df.element_symbol.str.strip().isin(['C']))]

But it is obvious to see the cumbersomeness and error-proneness if you need more complex selections. And it should be noted that the condition as to element_symbol is necessary because only using the condition as to atom_name could give out a DataFrame containing calcium atoms if there is any, because calcium atoms also have atom_name as CA.

Instead, you could use the selection language implemented in the PDBDataFrame class. For the same selection, it is simply:

>>> from moldf import PDBDataFrame
>>> pdb_df = PDBDataFrame(pdb_df)  # Just adding a few methods to the standard Pandas DataFrame
>>> ca_atoms = pdb_df.ca_atoms

Or equally,

>>> ca_atoms = pdb_df.atom_names(['CA'])  # If you want calcium atoms, you have to add 'CA' to the 'names_2c' keyword herej.

The first method uses the build-in ca_atoms python property so that you can use the familiar . syntax. Check the PDBDataFrame class documentation for other convenient properties.

The build-in properties are handy but not so flexible nor powerful. The second method is much more flexible in that you can select atoms providing a list of atom_name s to the atom_names method and optionally specifying metal atoms in the names_2c keyword. You can also invert the selection by:

>>> not_ca_atoms = pdb_df.atom_names(['CA'], invert=True)

All columns in the PDBDataFrame are supported for such an atom selection language, simply by use the plural forms of the column names as methods for selecting the corresponding columns. Another example:

>>> x_coord_larger_than_zero = pdb_df.x_coords(0, relation='>') # all atoms whose 'x_coord' > 0

Here it shows you can use the relation keywords to control the relationship between the target variable and the reference value if it is a numerical column like x_coord or atom_number etc.

Selection based on distance can be done easily through the distances method, e.g.:

>>> close_to_origin = pdb_df.distances([0.0, 0.0, 0.0], cut_off=10.0, relation='<=')

which gives you all atoms within 10.0 Å of the point [0.0, 0.0, 0.0].

Even more, you can chain and make arbitrary combinations of them to get very complex selections.

>>> complex_selection = pdb_df.chain_ids(['A']).backbone.atom_names(['N']).residue_names(['Lys', 'His', 'Arg']).distances([0.0, 0.0, 0.0], cut_off=10.0, relation='<=')

which gives you all the nitrogen atoms in the backbone of Lys, His, and Arg residues of 1vii’s chain A that are within 10.0 Å of the origin point. For such a selection, using vanilla Pandas filter language can be very time-consuming, error-prone, and thus frustrating. Fortunately, moldf can help you save a lot of effort.

3. Write DataFrames back to a PDB file#

Writing back to a PDB file is simply:

>>> from moldf import write_pdb
>>> write_pdb(pdb, 'output.pdb')

Remember to use the pdb object, not the pdb_df, or it will error out. An output.pdb file is saved to your working directory.

If you want to save the selected atoms (e.g. the complex_selection example above) only, you can:

>>> pdb_out = {'_atom_site': complex_selection}
>>> write_pdb(pdb_out, 'complex_selection.pdb')

and the complex_selection.pdb has all and only the atoms in the complex_selection.

4. Read a mmCIF/PDBx file#

To read a PDBx file, you can use the moldf.read_pdbx function:

One of the pdbx_file and pdb_id parameters should be given. Otherwise, moldf.read_pdbx will raise an exception. If pdbx_file is given, pdb_id is ignored.

For example:

>>> from moldf import read_pdbx
>>> pdbx = read_pdbx(pdb_id='1vii')
>>> pdbx['_atom_site'].columns
Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id',
       'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id',
       'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',
       'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id',
       'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num'],
      dtype='object')
>>> pdbx.keys()
dict_keys(['_entry', '_audit_conform', '_database_2', '_pdbx_database_status', '_audit_author', '_citation',
           '_citation_author', '_cell', '_symmetry', '_entity', '_entity_name_com', '_entity_poly',
           '_entity_poly_seq', '_entity_src_gen', '_struct_ref', '_struct_ref_seq', '_chem_comp', '_pdbx_nmr_exptl',
           '_pdbx_nmr_exptl_sample_conditions', '_pdbx_nmr_spectrometer', '_pdbx_nmr_refine', '_pdbx_nmr_ensemble',
           '_pdbx_nmr_software', '_exptl', '_struct', '_struct_keywords', '_struct_asym', '_struct_biol',
           '_struct_conf', '_struct_conf_type', '_struct_site', '_struct_site_gen', '_database_PDB_matrix',
           '_atom_sites', '_atom_type', '_atom_site', '_pdbx_poly_seq_scheme', '_pdbx_struct_assembly',
           '_pdbx_struct_assembly_gen', '_pdbx_struct_oper_list', '_pdbx_audit_revision_history',
           '_pdbx_audit_revision_details', '_pdbx_audit_revision_group', '_pdbx_audit_revision_category',
           '_pdbx_audit_revision_item', '_software', '_pdbx_validate_close_contact', '_pdbx_validate_torsion'])

By default, a 1vii.cif file is downloaded to the ./PDBx_files from RCSB 1VII.

Similarly to the read_pdb case, you can read a local test.cif file as well:

>>> pdbx = read_pdbx(pdbx_file='test.cif')

5. Write DataFrames back to a PDBx file#

Similar to the above writing back to PDB file example, you can write back to a PDBx file like:

>>> from moldf import write_pdbx
>>> write_pdbx(pdbx, 'output.cif')

Here the pdbx object is the one generated in the PDBx reading example. An output.cif file is saved to your working directory.

Perhaps a useful case is that you want to keep only some categories but removing the other redundant ones:

>>> to_keep = ['_atom_site', '_entity_poly']
>>> pdbx_keep = {k: v for k, v in pdbx.items() if k in keep}
>>> write_pdbx(pdbx_keep, 'to_keep.cif')

And thus only the _atom_site and _entity_poly categories are saved to your working directory as to_keep.cif.

6. Read a MOL2 file#

To read a Tripos MOL2 file, you can use the moldf.read_mol2 function:

Let’s download an example MOL2 file from LigandBox first. The example ligand is D00217 or Tylenol.

You can read it as:

>>> from moldf import read_mol2
>>> mol2 = read_mol2(mol2_file='./D00217-01.mol2')
>>> mol2['ATOM'].columns
Index(['atom_id', 'atom_name', 'x', 'y', 'z', 'atom_type', 'subst_id',
       'subst_name', 'charge'],
      dtype='object')
>>> mol2.keys()
dict_keys(['ATOM', 'MOLECULE', 'BOND'])

7. Write a MOL2 file#

You might need to do some manipulation to a mol2 file and then write back. One example is ParmEd needs the input mol2 file grouping the atoms in a same residue (can be accessed by the subst_name column) together if there are many, so that it can build the correct topology of the system. One solution is to read the mol2 file, group the residues by subst_name, and then write back.

>>> from moldf import read_mol2, write_mol2
>>> mol2 = read_mol2(mol2_file='glutathione.mol2')
>>> mol2['ATOM'].sort_values(by=['subst_name', 'atom_id'], inplace=True)
>>> write_mol2(mol2, file_name='glutathione_moldf.mol2')

In the glutathione_moldf.mol2 file, the atoms belonging to the same residue are together.

8. RMSD, radius of gyration, and distance matrix#

In moldf, it is very intuitive and convenient to do atom selection as shown above, thanks to the PDBDataFrame class. In fact, the class has more than that. We can use it to calculate RMSD, radius of gyration, and distance matrix easily.

>>> from moldf import read_pdb, PDBDataFrame
>>> pdb = read_pdb(pdb_id='1g03')
>>> df = pdb['_atom_site']
>>> df = PDBDataFrame(df)
>>> all_rmsd = df.rmsd()  # all_rmsd contains all RMSDs between NMR models 2-20 and 1
>>> model_1 = df.nmr_models(1)
>>> m1_rgyr = model_1.radius_of_gyration  # model 1's radius of gyration
>>> m1_dis_mat = model_1.distance_matrix  # model 1's distance matrix in condensed form

Check the API reference for PDBDataFrame for more options in the rmsd method. For example, align can be set as False so that the calculated RMSD values are based on the original coordinates.

For distance_matrix, we can set df.use_squared_distance=False and df.use_square_form=True so that the returned distance matrix is a truly squared matrix whose elements are distances, not distance squared values. The default settings can save computation time and RAM usage, recommended for large scale processing where squared distances are not required.