neversaint

Reputation: 63994

Convert a two-column pandas DataFrame to a dictionary of lists with the first column as keys

I have the following data frame:

import pandas as pd

df = pd.DataFrame({
    "ClusterID" : [1,2,2,1,3],
    "Genes" : ['foo','qux','bar','cux','fii'],
})

Which looks like this:

   ClusterID Genes
0          1   foo
1          2   qux
2          2   bar
3          1   cux
4          3   fii

I want to convert it into a dictionary of lists, with the ClusterID values as keys:

{ '1': ['foo','cux'],
  '2': ['qux','bar'],
  '3': ['fii']}

How can I do that?

Upvotes: 3

Views: 4794

Answers (2)

jezrael

Reputation: 862541

You can group by ClusterID, turn each group's Genes into a list with apply and tolist, and then convert the resulting Series with Series.to_dict:

import pandas as pd

df = pd.DataFrame({
    "ClusterID" : [1,2,2,1,3],
    "Genes" : ['foo','qux','bar','cux','fii'],
})
print(df)
   ClusterID Genes
0          1   foo
1          2   qux
2          2   bar
3          1   cux
4          3   fii

s = df.groupby('ClusterID')['Genes'].apply(lambda x: x.tolist())
print(s)
ClusterID
1    [foo, cux]
2    [qux, bar]
3         [fii]
Name: Genes, dtype: object

print(s.to_dict())
{1: ['foo', 'cux'], 2: ['qux', 'bar'], 3: ['fii']}
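Note that the keys above are integers, while the question asks for string keys. One possible way to get there, as a sketch only, is to group by the ClusterID column cast to str. The variable name result is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "ClusterID": [1, 2, 2, 1, 3],
    "Genes": ["foo", "qux", "bar", "cux", "fii"],
})

# Group the Genes column by the string form of ClusterID,
# so the resulting dictionary keys are '1', '2', '3' rather than 1, 2, 3.
result = df.groupby(df["ClusterID"].astype(str))["Genes"].apply(list).to_dict()
print(result)
```

Casting before grouping keeps everything in one expression; converting the keys of the finished dictionary afterwards would work just as well.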

Upvotes: 8

Matthew

Reputation: 7590

dct = {x:df.Genes[df.ClusterID == x].tolist() for x in set(df.ClusterID)}
# dct == {1: ['foo','cux'], 2: ['qux','bar'], 3: ['fii']}

Since your ClusterID column holds integers, the dictionary keys will be integers as well. If you want string keys as in your example, wrap each key in str:

dct = {str(x):df.Genes[df.ClusterID == x].tolist() for x in set(df.ClusterID)}

Here we are using a dictionary comprehension. The expression set(df.ClusterID) gives us the unique values in that column. A set has no guaranteed iteration order, so the order of the keys in the result is not guaranteed either; that is fine if you only look values up by key. df.Genes[df.ClusterID == x] selects the values in the Genes column from the rows whose ClusterID equals x. Calling tolist() converts the pandas.Series returned there to a plain list.

So the comprehension loops over each unique value in the ClusterID column and stores the list of matching Genes values in the dictionary under that key.
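To make the individual steps easier to follow, here is the same idea broken apart into a runnable sketch. The intermediate names keys, mask and genes_for_1 are only for illustration, and sorted() is an addition here to make the key order predictable:

```python
import pandas as pd

df = pd.DataFrame({
    "ClusterID": [1, 2, 2, 1, 3],
    "Genes": ["foo", "qux", "bar", "cux", "fii"],
})

# Step 1: unique ClusterID values; sorted() only fixes the iteration order.
keys = sorted(set(df.ClusterID))

# Step 2: a boolean mask selects the rows belonging to one cluster.
mask = df.ClusterID == 1
genes_for_1 = df.Genes[mask].tolist()

# Step 3: repeat the selection for every key inside a dict comprehension.
dct = {str(x): df.Genes[df.ClusterID == x].tolist() for x in keys}
print(dct)
```

Keep in mind that this builds one mask per unique key, so the column is scanned once for every cluster. For a frame with many distinct ClusterID values, the groupby approach is likely the better fit.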

Upvotes: 1
