Charlie Crown

Reputation: 1089

Pandas dataframe concatenation

I have two dataframes. The first has only two columns and N rows, where N is in the hundreds to thousands. Each column holds a molecule's name, so it is a dataframe of pairs of molecules.

The second dataframe has 1600 columns and M rows, with M < N. Each row is a molecule and each column is one descriptor of that molecule, so each molecule has 1600 descriptors.

Given these two dataframes, I want to create a third dataframe that has 3200 columns (1600*2) and N rows. For each pair of molecules, I want the 1600 descriptors of the first molecule, followed (concatenated) by the 1600 descriptors of the second molecule.

So, I will have a new dataframe with 3200 descriptors for each pair of molecules.

Is there a pandas way to combine columns from different DataFrames like this? My MWE only works for my small example.

I have a MWE; however, when I try the same approach on the real dataframes, I get this error (diclofenac is the name of a molecule, the equivalent of a, b, or c in the MWE):

Traceback (most recent call last):
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'diclofenac'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ml_script.py", line 232, in <module>
    matrix.append(pd.concat([cof_df.loc[row['cof1']], cof_df[row['cof2']]], axis=0))
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/apps/psi4conda/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'diclofenac'

Here is the MWE

import numpy as np
import pandas as pd
# Dataframe with each molecule's descriptors (real and boolean values allowed)
df1 = pd.DataFrame([['a',1,True,3,4], ['b',55,False,76,87],['c',9,True,11,12]], columns=["name", "d1", "d2", "d3", "d4"])
df1 = df1.set_index("name")

# dataframe of pairs of molecules
df2 = pd.DataFrame({'cof1':['a', 'a','c','b'], 'cof2':['c','b','a','c']})

matrix = []
for index, rows in df2.iterrows():
    matrix.append(pd.concat([df1.loc[rows['cof1']],  df1.loc[rows['cof2']]], axis=0))
    
matrix = np.asarray(matrix)
df3 = pd.DataFrame(matrix)

What I don't get is that df1.loc[rows['cof1']] prints to the screen successfully, so that call has no issue with the key.

Upvotes: 0

Views: 153

Answers (1)

Aleix Molla

Reputation: 75

I wish I could comment rather than write an answer here, but I will try to help.

Your example code works perfectly, so based on the error I can only recommend that you search for the key from KeyError: 'diclofenac' in both dataframes and check whether, in either of them, the name contains a blank space or a capital letter that is causing the error.
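A quick way to run that search, as a rough sketch on the MWE's frames: list every name used in the pairs that has no exact match in the descriptor index. The entry "Diclofenac " (capital letter plus trailing space) is made up for illustration and is not from your data.

```python
import pandas as pd

# Descriptor table indexed by molecule name, as in the MWE.
df1 = pd.DataFrame(
    [["a", 1, True, 3, 4], ["b", 55, False, 76, 87], ["c", 9, True, 11, 12]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")

# Pairs table; "Diclofenac " is a hypothetical bad entry (capital + trailing space).
df2 = pd.DataFrame({"cof1": ["a", "a", "c", "Diclofenac "], "cof2": ["c", "b", "a", "c"]})

# Every distinct name used in either pair column.
pair_names = pd.unique(df2[["cof1", "cof2"]].to_numpy().ravel())

# Names with no exact match in the descriptor index.
missing = [name for name in pair_names if name not in df1.index]

# repr() makes stray spaces visible.
print([repr(name) for name in missing])
```

If the list comes back empty on the real data, the names themselves are probably not the problem.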

In your example script, you can reproduce this error by changing a molecule name in df1 from a to A, or by making the same change in any molecule pair in df2, so the problem can just as well be in the molecule names of df2.
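For reference, this is roughly what that reproduction looks like on the MWE data, with only the first name in df1 changed from a to A:

```python
import pandas as pd

# Same descriptor table as the MWE, except "a" is now "A".
df1 = pd.DataFrame(
    [["A", 1, True, 3, 4], ["b", 55, False, 76, 87], ["c", 9, True, 11, 12]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")

# Pairs table unchanged from the MWE; it still refers to lowercase "a".
df2 = pd.DataFrame({"cof1": ["a", "a", "c", "b"], "cof2": ["c", "b", "a", "c"]})

try:
    # First pair, first molecule: "a" is no longer a label in df1's index.
    df1.loc[df2.loc[0, "cof1"]]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)
```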

If you know your data is otherwise correct but may contain capital letters or stray spaces, apply .strip() and .lower() to every molecule name in both dataframes.

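df1.index = df1.index.str.strip().str.lower()  # "name" is the index after set_index("name")
df2 = df2.apply(lambda col: col.str.strip().str.lower())  # same cleanup for both pair columns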
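One more thing, going only by the traceback in the question: the failing line in ml_script.py uses cof_df.loc[row['cof1']] for the first molecule but cof_df[row['cof2']] for the second, without .loc, and the KeyError is raised from __getitem__ via self.columns.get_loc(key). Plain square brackets on a DataFrame look the key up among the columns, not the row labels, which would explain why the .loc call prints fine while the second lookup fails. If that is not just a paste artifact, adding .loc there may be all that is needed.

As for a more pandas-like way to build the result, here is a minimal sketch on the MWE data that avoids the loop. The _1 and _2 suffixes are my own naming, and it assumes every name in df2 exists in df1's index.

```python
import pandas as pd

# Descriptor table indexed by molecule name, as in the MWE.
df1 = pd.DataFrame(
    [["a", 1, True, 3, 4], ["b", 55, False, 76, 87], ["c", 9, True, 11, 12]],
    columns=["name", "d1", "d2", "d3", "d4"],
).set_index("name")

# Pairs of molecules, as in the MWE.
df2 = pd.DataFrame({"cof1": ["a", "a", "c", "b"], "cof2": ["c", "b", "a", "c"]})

# Row lookups by label for each side of the pair; .loc raises KeyError on unknown names.
left = df1.loc[df2["cof1"].tolist()].reset_index(drop=True).add_suffix("_1")
right = df1.loc[df2["cof2"].tolist()].reset_index(drop=True).add_suffix("_2")

# Side by side: one row per pair, first molecule's descriptors then the second's.
df3 = pd.concat([left, right], axis=1)
print(df3)
```

Resetting the index on both halves before the concat matters here: otherwise pandas would try to align the two halves on molecule names instead of on pair position.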

Upvotes: 1
