M.Vu

Reputation: 492

Pandas Dataframe loc assign a value (pybel object) to a cell: TypeError: object of type 'Molecule' has no len()

Context

I need to convert thousands of SMILES strings into an SDF file using pybel from a Python script (the relevant lines are below). Because the SMILES are stored in a CSV file and I want to use pandas (or dask for parallelism in the future), I load them into a pandas DataFrame.

df['ROMol'] = None  # Create a new column
for ind, row in df.iterrows():
    mol = pybel.readstring("smi", row[args.smi_column])
    df.loc[ind, 'ROMol'] = mol

Sample CSV file:

SMILES
C1=CC=CC=C1
CC(=O)Oc1ccccc1C(=O)O
C1CCCCC1

Problem

Pandas threw an error:

/pandas/core/indexing.py", line 1984, in _setitem_with_indexer_split_path
    elif len(ilocs) == 1 and lplane_indexer == len(value) and not is_scalar(pi):
TypeError: object of type 'Molecule' has no len()

I worked around this issue by creating a list (mol_list) and appending each pybel.readstring result to it. Finally, I assign the list as a new column of the existing pandas DataFrame:

df['ROMol'] = mol_list

However, I would like to use loc. How can I prevent pandas from checking the length of the object? I also converted the dtype of the new column and of the DataFrame to object, following this guide.
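One possible way to set a single cell by label is the scalar accessor DataFrame.at, sketched below. The traceback suggests that loc treats the Molecule as list-like and then calls len() on it, whereas at is meant for single scalar values. This is a minimal sketch under assumptions: FakeMolecule is a hypothetical stand-in for pybel.Molecule so the snippet runs without Open Babel, and the column name "SMILES" is taken from the sample CSV.

```python
import pandas as pd


class FakeMolecule:
    """Hypothetical stand-in for pybel.Molecule (not part of any library)."""

    def __init__(self, smiles):
        self.smiles = smiles


df = pd.DataFrame(
    {"SMILES": ["C1=CC=CC=C1", "CC(=O)Oc1ccccc1C(=O)O", "C1CCCCC1"]}
)
df["ROMol"] = None  # new column that can hold arbitrary Python objects

for ind, row in df.iterrows():
    # In the real script this would be pybel.readstring("smi", row[...]).
    mol = FakeMolecule(row["SMILES"])
    # .at addresses exactly one cell by row label and column label.
    df.at[ind, "ROMol"] = mol
```

Whether this behaves the same with real pybel objects on a given pandas version is worth checking on a small sample first.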

Furthermore, I tried some other approaches. They work, but I do not think they address the root cause.

  1. Assign a one-element list instead of a single value
df.loc[ind, 'ROMol'] = [mol]
  2. Use chained indexing (view vs. copy)
df["ROMol"][ind] = mol

The second approach produces a warning:

Use df.loc[row_indexer, "col"] = values instead, to perform the assignment in a single step and ensure this keeps updating the original df.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Besides, there are some related references:

BUG: TypeError: object of type 'int' has no len() when saving DataFrame with object dtype column

Pandas dataframe - TypeError: object of type '_io.TextIOWrapper' has no len()

Upvotes: 0

Views: 29

Answers (1)

Oliver Scott

Reputation: 1783

Why not just use the apply function?

df["ROMol"] = df[args.smi_column].apply(lambda x: pybel.readstring("smi", x))

Not only is it cleaner, but it should also be more efficient.
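To make the apply approach concrete, here is a small runnable sketch on the sample data. It is only an illustration under assumptions: FakeMolecule and read_smiles are hypothetical stand-ins so the snippet runs without Open Babel, and in the real script read_smiles would simply return pybel.readstring("smi", smiles).

```python
import pandas as pd


class FakeMolecule:
    """Hypothetical stand-in for pybel.Molecule (not part of any library)."""

    def __init__(self, smiles):
        self.smiles = smiles


def read_smiles(smiles):
    """Hypothetical wrapper; the real version would call pybel.readstring."""
    return FakeMolecule(smiles)


df = pd.DataFrame(
    {"SMILES": ["C1=CC=CC=C1", "CC(=O)Oc1ccccc1C(=O)O", "C1CCCCC1"]}
)

# apply calls read_smiles once per SMILES string and returns a new Series,
# which is assigned as a whole column, so no per-cell loc assignment occurs.
df["ROMol"] = df["SMILES"].apply(read_smiles)
```

Because the whole column is assigned in one step, the per-cell length check seen in the traceback does not come into play.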

If you are not constrained to using pybel, RDKit has utility functions to do exactly what you want.

from rdkit.Chem import PandasTools

PandasTools.AddMoleculeColumnToFrame(df, smilesCol=args.smi_column, molCol='ROMol')
PandasTools.WriteSDF(df, args.sdf_out, molColName='ROMol', idName=None, properties=list(df.columns))

Upvotes: 0
