Quick Start#
Important
This project is renamed from pdbx2df. Please go to its documentation for historical features.
This quick start tutorial will guide you to use the MolDF functions and classes to read, manipulate, and write PDBx, PDB, and MOL2 files. You will learn by going through some basic but useful examples.
1. Read a PDB file#
To read a PDB file, you can use the moldf.read_pdb
function:
One of the pdb_file
and pdb_id
parameters should be given. Otherwise, moldf.read_pdb
will raise an exception. If pdb_file
is given, pdb_id
is ignored.
For example:
>>> from moldf import read_pdb
>>> pdb = read_pdb(pdb_id='1vii')
>>> pdb['_atom_site'].columns
Index(['record_name', 'atom_number', 'atom_name', 'alt_loc', 'residue_name',
'chain_id', 'residue_number', 'insertion', 'x_coord', 'y_coord',
'z_coord', 'occupancy', 'b_factor', 'segment_id', 'element_symbol',
'charge'],
dtype='object')
>>> pdb.keys()
dict_keys(['_atom_site'])
By default, a 1vii.pdb
file is downloaded to the ./PDB_files
directory from RCSB 1VII.
If you have a local PDB file test.pdb
under your current directory. You can read it as:
>>> pdb = read_pdb(pdb_file='test.pdb')
2. Select atoms using PDBDataFrame#
To select rows in the _atom_site
DataFrame
, you can of course just use standard filter operations in Pandas
.
For example, you might have thought of something as below:
>>> pdb_df = pdb['_atom_site']
>>> ca_atoms = pdb_df[(pdb_df.atom_name.str.strip().isin(['CA'])) & (pdb_df.element_symbol.str.strip().isin(['C']))]
But it is obvious to see the cumbersomeness and error-proneness if you need more complex selections. And it should be noted
that the condition as to element_symbol
is necessary because only using the condition as to atom_name
could give out
a DataFrame
containing calcium atoms if there is any, because calcium atoms also have atom_name
as CA
.
Instead, you could use the selection language implemented in the PDBDataFrame class
. For the same selection,
it is simply:
>>> from moldf import PDBDataFrame
>>> pdb_df = PDBDataFrame(pdb_df) # Just adding a few methods to the standard Pandas DataFrame
>>> ca_atoms = pdb_df.ca_atoms
Or equally,
>>> ca_atoms = pdb_df.atom_names(['CA']) # If you want calcium atoms, you have to add 'CA' to the 'names_2c' keyword herej.
The first method uses the build-in ca_atoms
python property
so that you can use the familiar .
syntax.
Check the PDBDataFrame class
documentation for other convenient properties.
The build-in properties are handy but not so flexible nor powerful. The second method is much more flexible in that
you can select atoms providing a list of atom_name
s to the atom_names
method and optionally specifying metal atoms in
the names_2c
keyword. You can also invert the selection by:
>>> not_ca_atoms = pdb_df.atom_names(['CA'], invert=True)
All columns in the PDBDataFrame
are supported for such an atom selection language, simply by use the plural forms of the
column names as methods for selecting the corresponding columns. Another example:
>>> x_coord_larger_than_zero = pdb_df.x_coords(0, relation='>') # all atoms whose 'x_coord' > 0
Here it shows you can use the relation
keywords to control the relationship between the target variable and the reference value
if it is a numerical column like x_coord or atom_number etc.
Selection based on distance
can be done easily through the distances
method, e.g.:
>>> close_to_origin = pdb_df.distances([0.0, 0.0, 0.0], cut_off=10.0, relation='<=')
which gives you all atoms within 10.0 Å of the point [0.0, 0.0, 0.0].
Even more, you can chain and make arbitrary combinations of them to get very complex selections.
>>> complex_selection = pdb_df.chain_ids(['A']).backbone.atom_names(['N']).residue_names(['Lys', 'His', 'Arg']).distances([0.0, 0.0, 0.0], cut_off=10.0, relation='<=')
which gives you all the nitrogen atoms in the backbone of Lys, His, and Arg residues of 1vii’s chain A that are within 10.0 Å of the origin point.
For such a selection, using vanilla Pandas
filter language can be very time-consuming, error-prone, and thus frustrating.
Fortunately, moldf
can help you save a lot of effort.
3. Write DataFrames back to a PDB file#
Writing back to a PDB file is simply:
>>> from moldf import write_pdb
>>> write_pdb(pdb, 'output.pdb')
Remember to use the pdb
object, not the pdb_df
, or it will error out. An output.pdb
file is saved to your working directory.
If you want to save the selected atoms (e.g. the complex_selection
example above) only, you can:
>>> pdb_out = {'_atom_site': complex_selection}
>>> write_pdb(pdb_out, 'complex_selection.pdb')
and the complex_selection.pdb
has all and only the atoms in the complex_selection
.
4. Read a mmCIF/PDBx file#
To read a PDBx file, you can use the moldf.read_pdbx
function:
One of the pdbx_file
and pdb_id
parameters should be given. Otherwise, moldf.read_pdbx
will raise an exception. If pdbx_file
is given, pdb_id
is ignored.
For example:
>>> from moldf import read_pdbx
>>> pdbx = read_pdbx(pdb_id='1vii')
>>> pdbx['_atom_site'].columns
Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id',
'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id',
'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',
'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id',
'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num'],
dtype='object')
>>> pdbx.keys()
dict_keys(['_entry', '_audit_conform', '_database_2', '_pdbx_database_status', '_audit_author', '_citation',
'_citation_author', '_cell', '_symmetry', '_entity', '_entity_name_com', '_entity_poly',
'_entity_poly_seq', '_entity_src_gen', '_struct_ref', '_struct_ref_seq', '_chem_comp', '_pdbx_nmr_exptl',
'_pdbx_nmr_exptl_sample_conditions', '_pdbx_nmr_spectrometer', '_pdbx_nmr_refine', '_pdbx_nmr_ensemble',
'_pdbx_nmr_software', '_exptl', '_struct', '_struct_keywords', '_struct_asym', '_struct_biol',
'_struct_conf', '_struct_conf_type', '_struct_site', '_struct_site_gen', '_database_PDB_matrix',
'_atom_sites', '_atom_type', '_atom_site', '_pdbx_poly_seq_scheme', '_pdbx_struct_assembly',
'_pdbx_struct_assembly_gen', '_pdbx_struct_oper_list', '_pdbx_audit_revision_history',
'_pdbx_audit_revision_details', '_pdbx_audit_revision_group', '_pdbx_audit_revision_category',
'_pdbx_audit_revision_item', '_software', '_pdbx_validate_close_contact', '_pdbx_validate_torsion'])
By default, a 1vii.cif
file is downloaded to the ./PDBx_files
from RCSB 1VII.
Similarly to the read_pdb
case, you can read a local test.cif
file as well:
>>> pdbx = read_pdbx(pdbx_file='test.cif')
5. Write DataFrames back to a PDBx file#
Similar to the above writing back to PDB file example, you can write back to a PDBx file like:
>>> from moldf import write_pdbx
>>> write_pdbx(pdbx, 'output.cif')
Here the pdbx
object is the one generated in the PDBx reading example.
An output.cif
file is saved to your working directory.
Perhaps a useful case is that you want to keep only some categories but removing the other redundant ones:
>>> to_keep = ['_atom_site', '_entity_poly']
>>> pdbx_keep = {k: v for k, v in pdbx.items() if k in keep}
>>> write_pdbx(pdbx_keep, 'to_keep.cif')
And thus only the _atom_site
and _entity_poly
categories are saved to your working directory as to_keep.cif
.
6. Read a MOL2 file#
To read a Tripos MOL2 file, you can use the moldf.read_mol2
function:
Let’s download an example MOL2 file from LigandBox first. The example ligand is D00217 or Tylenol.
You can read it as:
>>> from moldf import read_mol2
>>> mol2 = read_mol2(mol2_file='./D00217-01.mol2')
>>> mol2['ATOM'].columns
Index(['atom_id', 'atom_name', 'x', 'y', 'z', 'atom_type', 'subst_id',
'subst_name', 'charge'],
dtype='object')
>>> mol2.keys()
dict_keys(['ATOM', 'MOLECULE', 'BOND'])
7. Write a MOL2 file#
You might need to do some manipulation to a mol2 file and then write back. One example is ParmEd needs the input mol2 file grouping the atoms in a same residue (can be accessed by the subst_name column) together if there are many, so that it can build the correct topology of the system. One solution is to read the mol2 file, group the residues by subst_name, and then write back.
>>> from moldf import read_mol2, write_mol2
>>> mol2 = read_mol2(mol2_file='glutathione.mol2')
>>> mol2['ATOM'].sort_values(by=['subst_name', 'atom_id'], inplace=True)
>>> write_mol2(mol2, file_name='glutathione_moldf.mol2')
In the glutathione_moldf.mol2 file, the atoms belonging to the same residue are together.
8. RMSD, radius of gyration, and distance matrix#
In moldf, it is very intuitive and convenient to do atom selection as shown above, thanks to the PDBDataFrame class. In fact, the class has more than that. We can use it to calculate RMSD, radius of gyration, and distance matrix easily.
>>> from moldf import read_pdb, PDBDataFrame
>>> pdb = read_pdb(pdb_id='1g03')
>>> df = pdb['_atom_site']
>>> df = PDBDataFrame(df)
>>> all_rmsd = df.rmsd() # all_rmsd contains all RMSDs between NMR models 2-20 and 1
>>> model_1 = df.nmr_models(1)
>>> m1_rgyr = model_1.radius_of_gyration # model 1's radius of gyration
>>> m1_dis_mat = model_1.distance_matrix # model 1's distance matrix in condensed form
Check the API reference for PDBDataFrame for more options in the rmsd method. For example, align can be set as False so that the calculated RMSD values are based on the original coordinates.
For distance_matrix, we can set df.use_squared_distance=False and df.use_square_form=True so that the returned distance matrix is a truly squared matrix whose elements are distances, not distance squared values. The default settings can save computation time and RAM usage, recommended for large scale processing where squared distances are not required.