Reputation: 492
I need to convert thousands of SMILES strings into an SDF file using pybel in a Python script (a few lines of code are shown below). Because the SMILES are stored in a CSV file and I want to use pandas (or dask for parallelism in the future), I chose to load them into a pandas DataFrame.
df['ROMol'] = None  # Create a new column
for ind, row in df.iterrows():
    mol = pybel.readstring("smi", row[args.smi_column])
    df.loc[ind, 'ROMol'] = mol
Sample CSV file:
SMILES
C1=CC=CC=C1
CC(=O)Oc1ccccc1C(=O)O
C1CCCCC1
Pandas threw an error:
/pandas/core/indexing.py", line 1984, in _setitem_with_indexer_split_path
    elif len(ilocs) == 1 and lplane_indexer == len(value) and not is_scalar(pi):
TypeError: object of type 'Molecule' has no len()
I worked around this issue by creating a list (mol_list) and appending each pybel.readstring result to it. Finally, I assign the list as a new column in the existing DataFrame:
df['ROMol'] = mol_list
However, I would like to use loc. How can I prevent pandas from checking the length of the object? I also converted the dtype of the new column and of the DataFrame to object, following this guide.
I also tried some other approaches. They work, but I do not think they address the root cause.
df.loc[ind, 'ROMol'] = [mol]
df["ROMol"][ind] = mol
The chained assignment (the second line) raises a warning:
Use df.loc[row_indexer, "col"] = values instead, to perform the assignment in a single step and ensure this keeps updating the original df.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Besides, there are some related references:
BUG: TypeError: object of type 'int' has no len() when saving DataFrame with object dtype column
Pandas dataframe - TypeError: object of type '_io.TextIOWrapper' has no len()
Upvotes: 0
Views: 29
Reputation: 1783
Why not just use the apply function?
df["ROMol"] = df[args.smi_column].apply(lambda x: pybel.readstring("smi", x))
Not only is it cleaner, but it should also be more efficient.
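To try this without pybel installed, here is a minimal sketch that uses a hypothetical stand-in class, FakeMol, and a hypothetical helper, read_smiles, in place of pybel.readstring. The stand-in is iterable but has no __len__, which I assume is the property of pybel's Molecule that trips up loc. The sketch also shows df.at, which as far as I know sets a single cell as a scalar and so may be the closest thing to the cell-by-cell loc assignment you asked about. The column names are just for illustration.

```python
import pandas as pd


class FakeMol:
    """Hypothetical stand-in for pybel.Molecule: iterable, but no __len__."""

    def __init__(self, smiles):
        self.smiles = smiles

    def __iter__(self):
        return iter(self.smiles)


def read_smiles(smiles):
    # Assumption: in the real script this would be pybel.readstring("smi", smiles).
    return FakeMol(smiles)


df = pd.DataFrame(
    {"SMILES": ["C1=CC=CC=C1", "CC(=O)Oc1ccccc1C(=O)O", "C1CCCCC1"]}
)

# Option 1: build the whole column in one step with apply.
df["ROMol"] = df["SMILES"].apply(read_smiles)

# Option 2: cell-by-cell assignment with .at instead of .loc.
# The column is created with an explicit object dtype first.
df["ROMol_at"] = pd.Series([None] * len(df), index=df.index, dtype=object)
for ind, smiles in df["SMILES"].items():
    df.at[ind, "ROMol_at"] = read_smiles(smiles)
```

Both columns end up holding one object per row. I would still prefer the apply version unless you really need to fill the cells one at a time.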
If you are not constrained to using pybel, RDKit has utility functions to do exactly what you want.
from rdkit.Chem import PandasTools
PandasTools.AddMoleculeColumnToFrame(df, smilesCol=args.smi_column, molCol='ROMol')
PandasTools.WriteSDF(df, args.sdf_out, molColName='ROMol', idName=None, properties=list(df.columns))
Upvotes: 0