API References#
Important
This project is renamed from pdbx2df. Please go to its documentation for historical features.
PDBx Reader#
PDBx/mmCIF format reading.
Reads a PDBx file into a dict of Pandas DataFrame
s.
The dict keys are read from the mmCIF category names automatically.
For each category, the column names of the corresponding DataFrame
are read form the category attributes automatically.
For example:
_audit_conform.dict_name mmcif_pdbx.dic
_audit_conform.dict_version 5.355
_audit_conform.dict_location http://mmcif.pdb.org/dictionaries/ascii/mmcif_pdbx.dic
In this category, the category name is _audit_conform
, so the returned dict
key is _audit_conform
. The category attributes are dict_name
,
dict_version
, and dict_location
, so the returned dict value as a
DataFrame
has the exact column names.
- moldf.read_pdbx.AF2_MODEL = 4#
For AlphaFold structures, the version to use.
- moldf.read_pdbx.read_pdbx(pdbx_file: str | PathLike[str] | None = None, pdb_id: str | None = None, save_pdbx_file: bool = True, pdbx_file_dir: str | PathLike | None = None, category_names: list | None = None, convert_dtype: bool = False) dict[str, DataFrame] [source]#
Reads a
.cif
file’s categories into adict
ofPandas DataFrame
s.- Parameters:
pdb_id (optional) – PDB/Uniprot ID. Required if
pdbx_file
isNone
. Defaults to None.pdbx_file (optional) – file name for a PDBx/mmCIF file. Used over pdb_id. Defaults to None.
category_names (optional) – a list of categories in the mmCIF file format. If
None
,all
is used and all categories will be processed. Defaults to None.save_pdbx_file (optional) – whether to save the fetched PDBx file from RCSB to
pdbx_file_dir
. Defaults to False.pdbx_file_dir (optional) – directory to save fetched PDBx files. If
None
butsave_pdbx_file
isTrue
, ‘./PDBx_files’ is used. Defaults to None.convert_dtype (optional) – whether to convert the data types according to the RCSB mmcif specifications. Defaults to False.
- Returns:
A dict of
Pandas DataFrame
s corresponding to required categories.- Raises:
ValueError – if none of
pdb_id
orpdbx_file
is provided, or ifpdb_id
is given but cannot the PDB file cannot be downloaded from RCSB, or the PDB file is corrupted like no end-line symbol, or some content is irregular.FileNotFoundError – if
pdbx_file
cannot be found.
PDB Reader#
PDB format reading.
Reads a PDB file, including Chimera compatible ones, into a dict of
Pandas DataFrame
s.
Atom coordinates and sequences are currently supported by the following category names:
_atom_site
:ATOM
,HETATM
,TER
,NUMMDL
,MODEL
, andENDMDL
lines.
_seq_res
:SEQRES
lines.
- moldf.read_pdb.IMPLEMENTED_PDB_CATS = ['_atom_site', '_seq_res', '_chem_comp']#
PDB categories that are currently implemented.
- moldf.read_pdb.ATOM_SITE = ('ATOM', 'HETATM', 'TER')#
_atom_site
primary lines.
- moldf.read_pdb.NMR_MDL = ('NUMMDL', 'MODEL', 'ENDMDL')#
_atom_site
additional lines.
- moldf.read_pdb.CHEM_COMP = ('HET ', 'HETNAM', 'FORMUL', 'HETSYN')#
_chem_comp
lines.
- moldf.read_pdb.AF2_MODEL = 4#
For AlphaFold structures, the version to use.
- moldf.read_pdb.read_pdb(pdb_file: str | PathLike[str] | None = None, pdb_id: str | None = None, category_names: list | None = None, save_pdb_file: bool = True, pdb_file_dir: str | PathLike | None = None, allow_chimera: bool = True, need_ter_lines: bool = True) dict[str, DataFrame] [source]#
Reads a
.pdb
file’s categories into adict
ofPandas DataFrame
s.- Parameters:
pdb_id (optional) – PDB/Uniprot ID. Required if
pdb_file
isNone
. Defaults to None.pdb_file (optional) – file name for a PDB file. Used over pdb_id. Defaults to None.
category_names (optional) –
a list of categories similar to the mmCIF format. If
None
,_atom_site
is used. To be consistent with the PDBx format, the following category names are used to refer to block(s) in a PDB file and only they are supported:1.
_atom_site
:ATOM
,HETATM
, andTER
lines and possibleNUMMDL
,MODEL
, andENDMDL
lines._seq_res
:SEQRES
lines._chem_comp
: all lines inCHEM_COMP
above. However, onlycompounds included in the ``HET `` lines are extracted which usually means the water molecule is not included.
Defaults to None.
save_pdb_file (optional) – whether to save the fetched PDB file from RCSB to
pdb_file_dir
. Defaults to True.pdb_file_dir (optional) – directory to save fetched PDB files. If
None
butsave_pdb_file
isTrue
, ‘./PDB_files’ is used. Defaults to None.allow_chimera (optional) – whether to allow Chimera-formatted PDB files. Defaults to True.
need_ter_lines (optional) – whether to read the
TER
lines into theDataFrame
. Defaults to True.
- Returns:
A dict of
Pandas DataFrame
s corresponding to required categories.- Raises:
ValueError – if none of
pdb_id
orpdbx_file
is provided, or ifpdb_id
is given but cannot the PDB file cannot be downloaded from RCSB,NotImplementedError – if category_names not a subset of allowed names.
FileNotFoundError – if
pdbx_file
cannot be found.
- moldf.read_pdb._split_atom_line(line: str, nmr_model: int = -1, allow_chimera: bool = True, is_ter_line: bool = False) tuple [source]#
Internal function to parse a single line belonging to
ATOM
,HETATM
, orTER
lines.- Parameters:
line (required) – A
ATOM
,HETATM
, orTER
line.nmr_model (optional) – the NMR model number for the line;
-1
means not an NMR model. Defaults to -1.allow_chimera (optional) – try to parse as a Chimera-formatted PDB file. Defaults to True
is_ter_line (optional) – whether the line starts with
TER
. Defaults to False.
- Returns:
parsed values as a tuple.
MOL2 Reader#
Mol2 format reading.
Read a Tripos .mol2
file into a dictionary of pandas DataFrames
. Different
categories like ‘ATOM’ and ‘BOND’ are read into different DataFrame objects.
- moldf.read_mol2.IMPLEMENTED_MOL2_CATS = ['ATOM', 'MOLECULE', 'BOND', 'HEADER']#
MOL2 categories that are currently implemented.
- moldf.read_mol2.ATOM_COL_NAMES = ('atom_id', 'atom_name', 'x', 'y', 'z', 'atom_type', 'subst_id', 'subst_name', 'charge', 'status_bit')#
MOL2
ATOM
column names.
- moldf.read_mol2.BOND_COL_NAMES = ('bond_id', 'origin_atom_id', 'target_atom_id', 'bond_type', 'status_bit')#
MOL2
BOND
column names.
- moldf.read_mol2.read_mol2(mol2_file: str | PathLike, category_names: list | None = None) dict[str, DataFrame] [source]#
Reads a
.mol2
file’s categories into adict
ofPandas DataFrame
s.- Parameters:
mol2_file (required) – file name for a PDB file.
category_names (optional) – a list of categories as to the
.mol2
file format. IfNone
, ['ATOM'
,'MOLECULE'
,'BOND'
,'HEADER'
] is used. Defaults to None.
- Returns:
A dict of
category_name
as keys(s) andpd.DataFrame
as values.- Raises:
NotImplementedError – if
category_names
not a subset of ['ATOM'
,'MOLECULE'
,'BOND'
,'HEADER'
]
- moldf.read_mol2._get_header_df(header_lines: list[tuple]) DataFrame [source]#
Turns the
HEADER
lines into aPandas DataFrame
. TheHEADER
lines are those starting with one or multiple#
symbols.- Parameters:
header_lines (required) – a list of tuples corresponding to each line’s content. Tuples are generate by splitting the lines by
:
.- Returns:
Pandas DataFrame
of TheHEADER
category
- moldf.read_mol2._get_molecule_df(molecule_lines: list[tuple]) DataFrame [source]#
Turns the
MOLECULE
lines into aPandas DataFrame
.- Parameters:
molecule_lines (required) – a list of tuples corresponding to each line’s content.
- Returns:
Pandas DataFrame
of TheMOLECULE
category
JCSV Reader#
JCSV format reading.
Reads a JCSV file into a dict of Pandas DataFrame
s.
It is not limited to any molecular format.
- moldf.read_jcsv.read_jcsv(jcsv_file: str | PathLike, category_names: list | None = None) dict[str, DataFrame] [source]#
Reads a JCSV file by name.
Currently no molecular file repository has JCSV files so we can only read from a file name/path.
- Parameters:
jcsv_file (required) – JCSV file name/path.
category_names (optional) – a list of category names. If
None
, all categories are read. Defaults to None.
- Returns:
a dict of Pandas DataFrames for each category.
- Raises:
TypeError – if
category_names
is not a list of strings.ValueError – if any of the
category_names
has double quotes or if the number of items in any line does not match the number of column names in the same category.
- moldf.read_jcsv._read_jcsv_by_line(jcsv_file: str | PathLike, category_names: list | None = None) dict[str, DataFrame] [source]#
Reads JCSV file line by line when the file has no meta data to select blocks.
- Parameters:
jcsv_file (required) – JCSV file name/path.
category_names (optional) – a list of category names. If
None
, all categories are read. Defaults to None. It is passed by theread_jcsv
caller, so it is not sanitized here.
- Returns:
a dict of Pandas DataFrames for each category.
- Raises:
ValueError – if the number of items in any line does not match the number of column names in the same category.
- moldf.read_jcsv._count_n_lines(file_name: str | PathLike)[source]#
Gets the number of lines in a file. From https://stackoverflow.com/a/68385697/10094189
- Parameters:
file_name (required) – file name or path.
PDBDataFrame Class#
PDBDataFrame
as a subclass of Pandas DataFrame
.
Several features are added to make PDB data more accessible and selectable:
1. Properties like sequences
, heavy_atoms
, backbone
, and water
are
directly accessed by .
operation.
2. Atom selection by using methods whose names are just the column names plus s
(plural form). For example, selecting atoms by names is simply
df.atom_names([names])
where atom_name
is the column name
and atom_names
is the selection function. Each selection returns a
PDBDataFrame
object as well, which means we can chain selections one by one
like df.atom_names([names]).residue_numbers([numbers])
.
Distance matrix as a
@property
and@classmethod
.
- moldf.pdb_dataframe.RESIDUE_CODES = {'ALA': 'A', 'ARG': 'R', 'ASH': 'D', 'ASN': 'N', 'ASP': 'D', 'CYM': 'C', 'CYS': 'C', 'CYX': 'C', 'GLH': 'E', 'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HID': 'H', 'HIE': 'H', 'HIP': 'H', 'HIS': 'H', 'ILE': 'I', 'LEU': 'L', 'LYN': 'K', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P', 'PYL': 'O', 'SEC': 'U', 'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V'}#
dict[str, str], turn 3-, 2-, and 1-letter residue codes to 1-letter codes.
- moldf.pdb_dataframe.PDBX_COLS = {'alt_loc': 'label_alt_id', 'atom_name': 'label_atom_id', 'atom_number': 'id', 'b_factor': 'B_iso_or_equiv', 'chain_id': 'label_asym_id', 'charge': 'pdbx_formal_charge', 'element_symbol': 'type_symbol', 'insertion': 'pdbx_PDB_ins_code', 'nmr_model': 'pdbx_PDB_model_num', 'occupancy': 'occupancy', 'record_name': 'group_PDB', 'residue_name': 'label_comp_id', 'residue_number': 'label_seq_id', 'segment_id': 'label_entity_id', 'x_coord': 'Cartn_x', 'y_coord': 'Cartn_y', 'z_coord': 'Cartn_z'}#
dict[str, str], PDB and mmCIF column name dictionary.
- class moldf.pdb_dataframe.PDBDataFrame(*args, pdb_format: str | None = None, use_squared_distance: bool = True, use_square_form: bool = False, **kwargs)[source]#
Bases:
DataFrame
Pandas DataFrame with extended attributes and methods for PDB data.
It enables Pythonic atom selection methods and convenient
.
accessing to common PDB structure properties.- Parameters:
*args – all
pd.DataFrame
positional arguments. For example, the_atom_site
dataframe returned by reading a PDB file.pdb_format (optional) – PDB format in the underlying provided data. If
None
,PDB
is assumed. Defaults to None.use_squared_distance (optional) – whether to use squared distance when calculating distance matrix. Defaults to True.
use_square_form (optional) – whether to use a square matrix for the distance matrix. Defaults to False.
**kwargs – all
pd.DataFrame
acceptable keyword arguments.
- Returns:
A
PDBDataFrame
instance.
Example
>>> from moldf import read_pdb, PDBDataFrame >>> pdb = read_pdb(pdb_id='1vii') >>> pdb_df = pdb['_atom_site'] >>> pdb_df = PDBDataFrame(pdb_df)
Warning
This subclass uses a custom
__hash__
function for caching some calculations. And thus a custom__eq__
function is also implemented. For other typicalDataFrame
operations, use those.all()
,.any()
,.bool()
functions to do comparison.- _metadata: list[str] = ['_use_squared_distance', '_use_square_form', '_is_chimera', '_RESIDUE_CODES', '_ELEMENT_MASSES', '_pdb_format']#
- property _constructor#
Used when a manipulation result has the same dimensions as the original.
- _pdbx_to_pdb(keep_original: bool = False)[source]#
Converts PDBx ‘_atom_site’ DataFrame to PDB format.
- Parameters:
keep_original (optional) – whether to keep the original columns in the PDBx ‘_atom_site’ DataFrame. Defaults to False.
- property RESIDUE_CODES: dict[str, str]#
A dict of
residue_name
as keys andresidue_code
as values, whereresidue_code
is a 1-character code used in sequences. Settable.
- property ELEMENT_MASSES: dict[str, float]#
A dict of
element_symbol
as keys andelement_mass
as values, whereelement_mass
is taken from NIST. Settable.
- property is_chimera: bool#
Whether the original read-in PDB was Chimera compatible format. The main effect is the
residue_name
str width is 4 in the Chimera compatible format instead of 3 as in the standard PDB format. Not settable.
- property use_squared_distance: bool#
Whether R or R^2 is used in distance matrix calculations. Using R^2 saves computation time. Settable.
- property use_square_form: bool#
Whether the distance matrix will be in a square form. Using square form consumes less memory. Settable.
- property atoms: Self#
Gets atoms in the
ATOM
andHETATM
entries. In other words, removing ‘TER’ lines.- Returns:
sub
PDBDataFrame
.
- property coords: Self#
Gets the
x_coord
,y_coord
, andz_coord
columns only. Usepdb_df.coords.values
to get the underlying Numpy array of the coordinates.
- property bonds: dict#
- Gets the list of bonds. Each bond is represented as a pair of
atom_number
values.
- Raises:
ValueError – if the list of
atom_number
is not a set.
- get_bonds_by_distance(single_radii_set: str | None = None, need_non_covalent: bool = False, non_covalent_cutoff: float = 4.5) dict [source]#
Gets all the bonds purely by covalent radii constraints.
- Parameters:
single_radii_set (optional) – radii sets to use. If
None
,single_C
is used as to Cordero (PMID 18478144). Another option issingle_PA
which refers to Pyykkö’s studies (PMID 19058281;19856342;15832398, and doi:10.1103/PhysRevB.85.024115). Defaults to None.need_non_covalent (optional) – whether non-covalent ‘bonding’ should be included. Defaults to False.
non_covalent_cutoff (optional) – distance cutoff for non-covalent ‘bonding’.
- Raises:
ValueError – if the list of
atom_number
is not unique or if thesingle_radii_set
is not valid.- Returns:
- a dictionary of bonds with tuple of
atom_number
as keys and bond types as values.
- a dictionary of bonds with tuple of
- property sequences: dict[str, str]#
Gets the sequences for each chain as a dict of
chain_id
as key(s) andchain_sequence
as value(s).
- property residue_list: list[tuple]#
Gets all residues as a list of tuple (
chain_id
,residue_name
,residue_number
).
- property center_of_geometry: ndarray#
Gets the center of geometry as a
(3, )
np.ndarray
.
- property center_of_mass: ndarray#
Gets the center of mass as a
(3, )
np.ndarray
.
- property distance_matrix: ndarray#
Gets the distance matrix.
- rmsd(other: Self | ndarray | None = None, align: bool = True, weights: list | None = None, selection: Self | list | None = None) list | float [source]#
Calculates RMSD 1) among sets of coordinates in one
PDBDataFrame
with multiplenmr_model
s or 2) two sets of coordinates in twoPDBDataFrames
.- Parameters:
other (optional) – the other
PDBDataFrame
or (N, 3)numpy.ndarray
to calculate RMSD against. IfNone
,self
should contain at least two sets of coordinates (nmr_model
has >= 2 unique values). Defaults to None.align (optional) – whether to align the structures before calculating RMSD. If
False
, theweights
andselection
keywords are ignored. Defaults to True.weights (optional) – a list of weights for all the atoms in
selection
to do structure alignment. IfNone
, all coordinates in theselection
orself
have the same weights. Defaults to None.selection (optional) – a list of
atom_number
s inself
or a PDBDataFrame after the filtering methods. IfNone
, all coordinates inself
are used for structure alignment. Defaults to None.
- Returns:
RMSD or a list of RMSD’s.
- Raises:
ValueError – if dimensionalities mismatch among
self
,other
,weights
, andselection
if they are notNone
; oratom_number
s inself
are not unique.TypeError – if
other
,weights
, andselection
have unsupported types.
- record_names(names: list[str], invert: bool = False) Self [source]#
Filter by
record_name
.- Parameters:
names (required) – a list of
record_name
s.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- atom_numbers(numbers: list[int] | int, relation: str | None = None, invert: bool = False) Self [source]#
Filter by
atom_number
.- Parameters:
numbers (required) – one or a list of
atom_number
s.relation (optional) –
atom_number
relationship tonumbers
. Ifnumbers
is an integer, it has to be one of<
,<=
,=
,>=
, and>
. IfNone
,<=
is used. Ignored if a list of integers are provided tonumbers
. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- atom_names(names: list[str], names_2c: list[str] | None = None, invert: bool = False, suppress_warning: bool = False) Self [source]#
Filter by
atom_name
.- Parameters:
names (required) – a list of
atom_name
s whoseelement_symbols
have only one character. Atoms in common residues and ligands should be provide here likeC, H, O, N, S, P, F
.names_2c (optional) – a list of
atom_name
s whoseelement_symbols
have two characters like ion (FE
) and chloride (CL
). Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
suppress_warning – whether to suppress the warning message about possible conflicts between
names
andnames_2c
. Defaults to False.
- Returns:
sub
PDBDataFrame
- alt_locs(locs: list[str], invert: bool = False) Self [source]#
Filter by
alt_loc
.- Parameters:
locs (required) – a list of
alt_loc
s.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub PDBDataFrame
- residue_names(names: list[str], invert: bool = False) Self [source]#
Filter by
residue_names
.- Parameters:
names (required) – a list of
residue_name
sinvert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- chain_ids(ids: list[str], invert: bool = False) Self [source]#
Filter by
chain_id
.- Parameters:
ids (required) – a list of
chain_id
s.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- residue_numbers(numbers: list[int] | int, relation: str | None = None, invert: bool = False) Self [source]#
Filter by
residue_number
.- Parameters:
numbers (required) – one or a list of
residue_number
s.relation (optional) –
residue_number
relationship tonumbers
. Ifnumbers
is an integer, it has to be one of<
,<=
,=
,>=
, and>
. IfNone
, ‘<=’ is used. Ignored if a list of integers are provided tonumbers
. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- insertions(codes: list[str], invert: bool = False) Self [source]#
Filter by
insertion
.- Parameters:
codes (required) – a list of
insertion
codes.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- x_coords(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self [source]#
Filter by
x_coord
.- Parameters:
value (required) – value to select
x_coord
s.relation (optional) –
x_coord
relationship tovalue
. It has to be one of'<'
,'<='
,'='
,'>='
, and'>'
. IfNone
,'<='
is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(x_coord - value)
<=epsilon
are selected wheninvert
=False
. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- y_coords(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self [source]#
Filter by
y_coord
.- Parameters:
value (required) – value to select
y_coord
s.relation (optional) –
y_coord
relationship tovalue
. It has to be one of'<'
,'<='
,'='
,'>='
, and'>'
. IfNone
,'<='
is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(y_coord - value)
<=epsilon
are selected wheninvert
=False
. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- z_coords(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self [source]#
Filter by
z_coord
.- Parameters:
value (required) – value to select
z_coord
s.relation (optional) –
z_coord
relationship tovalue
. It has to be one of'<'
,'<='
,'='
,'>='
, and'>'
. IfNone
,'<='
is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(z_coord - value)
<=epsilon
are selected wheninvert
=False
. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- occupancies(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self [source]#
Filter by
occupancy
.- Parameters:
value (required) – value to select
occupancy
s.relation (optional) –
occupancy
relationship tovalue
. It has to be one of'<'
,'<='
,'='
,'>='
, and'>'
. IfNone
,'<='
is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(occupancy - value)
<=epsilon
are selected wheninvert
=False
. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- b_factors(value: float, relation: str | None = None, invert: bool = False, epsilon: float = 0.01) Self [source]#
Filter by
b_factor
.- Parameters:
value (required) – value to select
b_factor
s.relation (optional) –
b_factor
relationship tovalue
. It has to be one of'<'
,'<='
,'='
,'>='
, and'>'
. IfNone
,'<='
is used. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs(b_factor - value)
<=epsilon
are selected wheninvert
=False
. Defaults to 0.01.
- Returns:
sub
PDBDataFrame
- segment_ids(ids: list[str], invert: bool = False) Self [source]#
Filter by
segment_id
.- Parameters:
ids (required) – a list of
segment_id
s.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- element_symbols(symbols: list[str], invert: bool = False) Self [source]#
Filter by
element_symbol
.- Parameters:
symbols (required) – a list of
element_symbol
s.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- charges(charges: list[str], invert: bool = False) Self [source]#
Filter by
charge
.- Parameters:
charges (required) – a list of
charge
s.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
Notes:
charge
is2-char
string in the PDB specifications.
- nmr_models(models: list[int] | int, relation: str | None = None, invert: bool = False) Self [source]#
Filter by
nmr_model
.- Parameters:
models (required) – one or a list of
nmr_model
ids.relation (optional) –
nmr_model
relationship tomodels
. Ifmodels
is an integer, it has to be one of'<'
,'<='
,'='
,'>='
, and'>'
. IfNone
,'<='
is used. Ignored if a list of integers are provided tomodels
. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- distances(other: ndarray | Self | Iterable, cut_off: float = inf, to: str | None = None, invert: bool = False) Self [source]#
Filter by
distance
to a reference point or group of atoms.- Parameters:
other – the other group’s coordinate(s).
cut_off – the distance cutoff to filter.
to –
if
other
is a group atoms, using which method to determine whether the atoms meet the cut_off distance. If None,COM
or center of mass is used ifother
isPDBDataFrame
, andCOG
or center of geometry is used ifother
isnp.ndarray
orIterable
. The following are allowed:com
,center of mass
,center_of_mass
:use the center of mass for the
other
.cog
,center of geometry
,center_of_geometry
:use the center of geometry for the
other
.all
:whether all the pair-distances meet the
cut_off
distance criteria.any
:whether any of the pair-distances meets the
cut_off
distance criteria.
invert – whether to invert the selection. Defaults to False.
- Returns:
sub
PDBDataFrame
- _filter_num_col(value: int | float | list[int], num_col_name: str, relation: str | None = None, invert: bool = False, epsilon: float | int = 0.01, suppress_warning: bool = False) Self [source]#
Generic function to do filter by a numerical column.
- Parameters:
value (required) – value(s) to select by the column given by the
num_col_name
input.num_col_name (required) – one of
atom_number
,residue_number
,x_coord
,y_coord
, orz_coord
,occupancy
,b_factor
, andnmr_model
. Note: thecharge
column is not numerical byPDB
format.relation (optional) –
x/y/z_coord
relationship tovalue
. It has to be one of'<'
,'<='
,'='
,'>='
, and'>'
. IfNone
,'<='
is used. Ignored if a list of integers are provided tovalue
. Defaults to None.invert (optional) – whether to invert the selection. Defaults to False.
epsilon (optional) – atoms
abs``(``num_col_value
-value
) <=epsilon
are selected wheninvert
=False
andrelation
='='
. Ignored if a list of integers are provided tovalue
. Defaults to 0.01.suppress_warning (optional) – whether to suppress warnings. Defaults to False.
- Returns:
sub
PDBDataFrame
- Raises:
ValueError – if xyz not in [
atom_number
,residue_number
,x_coord
,y_coord
, orz_coord
,occupancy
,b_factor
,nmr_model
] orrelation
not in ['<'
,'<='
,'='
,'>='
,'>'
] when selecting on float cols.
- classmethod get_chain_list(pdb_df: Self) list [source]#
Gets the list of chain ids given a
PDBDataFrame
object.- Parameters:
pdb_df (required) – a
PDBDataFrame
object.- Returns:
a list of chain ids.
Warning
- This method relies on the original and standard numbering of the atoms and
residues in the DataFrame. Therefore, any selecting or sorting of the it can lead to wrong results.
- classmethod get_residue_list(pdb_df: Self, include_heteros: bool = False) list[tuple] [source]#
Gets the list of residues given a
PDBDataFrame
object.- Parameters:
pdb_df (required) – a
PDBDataFrame
object.include_heteros (optional) – whether to include hetero ligands. Defaults to False.
- Returns:
a list of residues as (
chain_id
,residue_name
,residue_number
).
Warning
- This method relies on the original and standard numbering of the atoms and
residues in the DataFrame. Therefore, any selecting or sorting of the it can lead to wrong results.
- classmethod get_distance_matrix(pdb_df: Self, other_data: Self | tuple | None = None, use_r2: bool = True, square_form: bool = False) ndarray [source]#
Calculates the distance matrix given a
PDBDataFrame
object and (optional) reference data.- Parameters:
pdb_df (required) – a
PDBDataFrame
object.other_data (optional) – the coordinates of to calculate the distances against. Defaults to None.
use_r2 (optional) – whether to use r^2 or r for distance matrix. Defaults to True.
square_form (optional) – whether to output a square form of the density matrix. If two
PDBDataFrame``s are different or ``other_data
is not aPDBDataFrame
,square_form
is ignored. Defaults to False.
- Returns:
distance matrix (squared or condensed form)
- Raises:
ValueError – if
other_data
is not ofPDBDataFrame|tuple|None
type or wrong shape if it is atuple
.
PDBx Writer#
PDBx/mmCIF format writing.
Write a dict of Pandas DataFrame
back to a PDBx file.
- moldf.write_pdbx.write_pdbx(pdbx: dict[str, DataFrame], file_name: str | PathLike | None = None) None [source]#
Writes a
dict
ofPandas DataFrame
s into a PDBx file.- Parameters:
pdbx (required) – a
dict
ofPandas DataFrame
s to write.file_name (optional) – file name to write a PDBx file. If
None
,moldf_output.cif
will be used as the file name. Defaults to None.
- Raises:
TypeError – if
pdbx
is not a valid dict ofDataFrame
.
PDB Writer#
PDB format writing.
Write a dict of Pandas DataFrame
back to a PDB file.
Currently, only the _atom_site
category can be written back.
- moldf.write_pdb.IMPLEMENTED_PDB_CATS = ['_atom_site']#
PDB categories that are currently implemented.
- moldf.write_pdb.write_pdb(pdb: dict[str, DataFrame], file_name: str | PathLike | None = None, allow_chimera: bool = False) None [source]#
Write a dict of
Pandas DataFrame
s into a PDB file.- Parameters:
pdb (required) – a
dict
ofPandas DataFrame
s to write.file_name (optional) – file name to write a PDB file. If
None
,moldf_output.pdb
will be used as the file name. Defaults to None.allow_chimera (optional) – whether to allow writing to Chimera-formatted PDB files. Defaults to False.
- Raises:
TypeError – if
pdb
is not a valid dict ofDataFrame
.ValueError – if the
pdb
contains other than supported categories.
MOL2 Writer#
MOL2 format writing.
Write a dict of Pandas DataFrame
back to a MOL2 file.
Currently, only the MOLECULE
, ATOM
, and BOND
categories can be written back.
- moldf.write_mol2.IMPLEMENTED_MOL2_CATS = ['MOLECULE', 'ATOM', 'BOND', 'HEADER']#
MOL2 categories that are currently implemented.
- moldf.write_mol2.write_mol2(mol2: dict[str, DataFrame], file_name: str | PathLike | None = None) None [source]#
Write a dict of
Pandas DataFrame
s into a MOL2 file. See https://is.muni.cz/th/fzk5s/dp_jakub_Vana.pdf p19 for column definitions.- Parameters:
mol2 (required) – a
dict
ofPandas DataFrame
s to write.file_name (optional) – file name to write a MOL2 file. If
None
,moldf_output.mol2
will be used as the file name. Defaults to None.
- Raises:
TypeError – if
mol2
is not a valid dict ofDataFrame
.ValueError – if the
mol2
contains other than supported categories.
JCSV Writer#
Write any dict of Pandas DataFrame to JCSV.
- moldf.write_jcsv.write_jcsv(data: dict[str, DataFrame], file_name: str | PathLike | None = None, write_meta: bool = True, **kwargs) None [source]#
Write a dict of
Pandas DataFrame
s into a JCSV file. See https://github.com/Ruibin-Liu/JCSV for definitions.- Parameters:
data (required) – a
dict
ofPandas DataFrame
s to write.file_name (optional) – file name to write a JCSV file. If
None
,moldf_output.jcsv
will be used as the file name ifpath_or_buf
is not specified in**kwargs
. Defaults to None.write_meta (optional) – whether to write meta data into the first category. Currently, only the first line number for each category is recorded. Defaults to True.
**kwargs – keyword arguments for
pd.DataFrame.to_csv
. Invalid ones are ignored. Check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
- Raises:
TypeError – if
data
is not a valid dict ofDataFrame
.ValueError – if
"
in any of the column names.
Covalent Bond#
Gets covalent bonds by covalent radii cutoff or ligand templates.
- moldf.covalent_bond.get_covalent_bond_cutoffs(element_symbols: list | set, single_radii_set: str | None = None) tuple[dict, dict, dict] [source]#
Gets the DataFrame for additive covalent radii of relevant elements.
- Parameters:
element_symbols (required) – a list or set of element symbols whose covalent radii cutoffs are returned.
single_radii_set (optional) – radii sets to use. If
None
,single_C
is used as to Cordero (PMID 18478144). Another option issingle_PA
which refers to Pyykkö’s studies (PMID 19058281;19856342;15832398, and doi:10.1103/PhysRevB.85.024115). Defaults to None.
- Raises:
ValueError – if the
single_radii_set
is not valid.- Returns:
- a tuple of the three dictionaries of the distance cutoffs for the pairs of
the queried element symbols: (single_bonds, double_bonds, triple_bonds).
- moldf.covalent_bond.get_residue_template(residue_name: str, parent_name: str | None = None, residue_template_file: str | PathLike | None = None, save_template_file: bool = True, template_file_dir: str | PathLike | None = None) dict [source]#
Gets the intra- covalent bonds of a residue or ligand.
- Parameters:
residue_name (required) – residue or ligand name to query.
parent_name (optional) – parent name of the residue or ligand. This is useful if the
residue_name
in the PDB file is used by RCSB for a different residue or ligand. IfNone
, it is the same asresidue_name
. Defaults to None.residue_template_file (optional) – residue or ligand template file name/path. If
None
, this function queries RCSB by theparent_name
. If it’s not a path, this function looks at the current working directory first and thentemplate_file_dir
if it is notNone
. Defaults to None.save_template_file (optional) – whether to save the downloaded (from RCSB) template file. Defaults to True.
template_file_dir (optional) – directory to save fetched template files. If
None
butsave_template_file
isTrue
, ‘./template_files’ directory is used. Defaults to None.
- Returns:
- a dictionary of bonds for the residue. The dictionary key format is
(atom_id_1: str, atom_id_2: str), and the dictionary value format is (bond_order: str, is_aromatic: bool, stereo_flag: str)
- Raises:
ValueError – if a
.cif
file for the residue/ligand cannot be downloaded from RCSB for any reason ifresidue_template_file
isNone
. Check the url provided in the error message to download manually if possible.FileNotFoundError – if
residue_template_file
cannot be found if notNone
.
Warning
RuntimeWarning: if the template file contains other residue/ligands.
PDBx Line Splitter#
- moldf.split_line(line: str, delimiter: str | None = None) list [source]#
Splits a string line into tokens separated by
delimiter
s, assuming all'
and"
in the start character or following a delimiter are paired to quote a token.- Parameters:
line (required) – line as a string
delimiter (optional) –
delimiter
to split the line. IfNone
,' '
(one space) is used. Defaults to None.
- Returns:
A list of tokens