Make42
Make42

Reputation: 13108

Efficiently turn a dictionary with tuple-keys and list-values into a DataFrame

I have a dictionary that has tuples of strings as keys and lists as values, like

mydict = {(('aa', 'bbbb'), ('c',)): [1,52,35,12], (('c', 'aa'), ('d',)): [4424,512]}

which I want to get into panas DataFrame where the tuple-keys are supposed to be one column and the values the other column. Also I need the length of the tuple saved in a column. Finally I need the length of the keys divided by the the length of the keys as another column.

Currently I am using the code

myDF = pd.DataFrame()
for key, value in mydict.items():
    myDF_temp = pd.DataFrame.from_dict({'value_count': [len(value) / len(key)],
                                        'key_count': [len(key)]})
    myDF_temp['key'] = 1
    myDF_temp['value'] = 1
    myDF_temp['key'] = myDF_temp['key'].astype(object)
    myDF_temp['value'] = myDF_temp['value'].astype(object)
    myDF_temp.set_value(0, 'key', tuple(key))
    myDF_temp.set_value(0, 'value', tuple(value))
    myDF = myDF.append(myDF_temp)

which is very slow due to the re-appending of the DataFrames.

For this example I expect

myDF
   key_count  value_count                 key            value
0          2            1     ((c, aa), (d,))      (4424, 512)
0          2            2  ((aa, bbbb), (c,))  (1, 52, 35, 12)

How can I do this efficiently?

Upvotes: 2

Views: 887

Answers (2)

Make42
Make42

Reputation: 13108

Turns out there is a surprisingly simple answer. The trick is to put the lists into a list (which is pretty fast), so that only the outer list is unpacked by .from_dict:

mydict2 = {}
for key, value in mydict.items():
    mydict2[key] = [value]

myDF = pd.DataFrame.from_dict(mydict2, orient='index'). \
    reset_index(). \
    rename(columns={'index': 'key', 0: 'value'})
myDF['key_count'] = myDF.key.str.len()
myDF['value_count'] = myDF.value.str.len() / myDF.key_count

Upvotes: 0

jezrael
jezrael

Reputation: 863291

You can use Series constructor with str.len for lengths of tuples and mask for convert to one item tuples with apply:

mydict = {('a', 'b'): [1,2,3], ('c'): [4,5]}

df = pd.Series(mydict).reset_index()
df.columns = ['key','value']
print (df)
      key      value
0       c     [4, 5]
1  (a, b)  [1, 2, 3]

l = df['key'].str.len()
df['key_count'] = l
df['value_count'] = df['value'].str.len() / l
df['key']  = df['key'].mask(l == 1, df['key'].apply(tuple))
df['value']  = df['value'].apply(tuple)
print (df)
      key      value  key_count  value_count
0    (c,)     (4, 5)          1          2.0
1  (a, b)  (1, 2, 3)          2          1.5

With you new data:

print (df)
           key            value  key_count  value_count
0        (c,)      (4424, 512)          1          2.0
1  (aa, bbbb)  (1, 52, 35, 12)          2          2.0

EDIT:

mydict = {(('aa', 'bbbb'), ('c',)): [1,52,35,12], (('c', 'aa'), ('d',)): [4424,512]}
s1 = pd.Series(mydict)
s = pd.Series(s1.index.values.tolist())
df = pd.concat([s,s1.reset_index(drop=True)], axis=1)
df.columns = ['key','value']
print (df)
                  key            value
0  ((aa, bbbb), (c,))  [1, 52, 35, 12]
1     ((c, aa), (d,))      [4424, 512]

l = df['key'].str.len()
df['key_count'] = l
df['value_count'] = df['value'].str.len() / l
df['key']  = df['key'].mask(l == 1, df['key'].apply(tuple))
df['value']  = df['value'].apply(tuple)
print (df)
                  key            value  key_count  value_count
0  ((aa, bbbb), (c,))  (1, 52, 35, 12)          2          2.0
1     ((c, aa), (d,))      (4424, 512)          2          1.0

Upvotes: 2

Related Questions