Reputation: 53

Efficient way to extract column values from sorted data frame

Suppose I have a data frame (sorted by column1) df:

column1   column2
1           'Yes'
1           'No'
2           'Yes'
3           'Yes'
3           'Yes'
3           'Yes'

For every value in column 1, I want to know its possible values in column2, and store these key-value pair in a structure.

For example, given the above df, I want a structure like

{1: ['yes','No']; 2: ['Yes'], 3:['Yes']}

I did something like

for key in df[column1].unique.tolist():
   values = df.loc[df[column1] == key]][column2].unique().tolist()
   dict.add((key, values))

However, as my original df is really big, this method takes too long. As the df is sorted by column1, I wonder if there is a more efficient method to achieve this.

Thank you!

Upvotes: 1

Answers (3)

Yuca

Reputation: 6091

remove redundant data with duplicates, then groupby by one column, take unique values, and save to dict:

final_dict = df.drop_duplicates().groupby('column1')['column2'].unique().to_dict()

Upvotes: 3

tawab_shakeel

Reputation: 3739

First drop duplicates then apply to_dict()

df.drop_duplicates(subset=['col1','col2'],inplace=True)
df.to_dict()

Upvotes: 1

krewsayder

Reputation: 446

I would look at the df.to_dict() method.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html

You should be able to get something like what you want. The example on the documentation site is as such:

>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                   index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}

Looks like you updated since I posted, check this out with the 'list' arguement.

‘list’ : dict like {column -> [values]}

Upvotes: 1

Efficient way to extract column values from sorted data frame

Answers (3)

Related Questions