How to efficiently convert a subdictionary into matrix in python

Question

I have a dictionary like this:

{'test2':{'hi':4,'bye':3}, 'religion.christian_20674': {'path': 1, 'religious': 1, 'hi':1}}

Each value of this dictionary is itself a dictionary.

What my output should look like:

How can I do that efficiently?

I have read this post, but the shape of the matrix there is different from mine.

This one was closest to my case, but it had a set inside the dictionary rather than another dictionary.

What is different in my question is that I also want to use the values of the inner dictionaries as the values of the matrix.

I was thinking something like this:

doc_final =[[]]
for item in dic1:
    for item2, value in dic1[item]:
        doc_final[item][item2] = value

but it wasn't the correct way.

Thanks for your help :)

tel · Accepted Answer

There does not seem to be any built-in way in pandas or NumPy to split up your rows the way you want. Happily, you can do so with a single dictionary comprehension. The splitsubdicts function shown below contains this dict comprehension, and the todf function wraps up the whole conversion process:

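import pandas as pd

def splitsubdicts(d):
    # give every inner key/value pair its own entry, labelled '<outer key>_<position>'
    return {('%s_%d' % (k0, i + 1)): {k1: v1} for k0, v0 in d.items() for i, (k1, v1) in enumerate(v0.items())}

def todf(d):
    # splitsubdicts is applied twice here, so each row label gets a second suffix (e.g. Test2_1_1)
    # .fillna(0) replaces the missing data with 0 (by default NaN is assigned to missing data)
    return pd.DataFrame(splitsubdicts(splitsubdicts(d))).T.fillna(0)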

You can use todf like this:

d = {'Test2': {'hi':4, 'bye':3}, 'religion.christian_20674': {'path': 1, 'religious': 1, 'hi':1}}
df = todf(d)
print(df)

Output:

                              bye   hi  path  religious
Test2_1_1                     0.0  4.0   0.0        0.0
Test2_2_1                     3.0  0.0   0.0        0.0
religion.christian_20674_1_1  0.0  0.0   1.0        0.0
religion.christian_20674_2_1  0.0  0.0   0.0        1.0
religion.christian_20674_3_1  0.0  1.0   0.0        0.0
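To see where these row labels come from, it can help to look at what splitsubdicts returns on its own, without pandas. The following is only a small sketch: the names `once` and `twice` are mine, and it assumes Python 3.7+ so that dictionaries keep their insertion order.

```python
def splitsubdicts(d):
    # one entry per inner key/value pair, labelled '<outer key>_<position>'
    return {'%s_%d' % (k0, i + 1): {k1: v1}
            for k0, v0 in d.items()
            for i, (k1, v1) in enumerate(v0.items())}

d = {'Test2': {'hi': 4, 'bye': 3},
     'religion.christian_20674': {'path': 1, 'religious': 1, 'hi': 1}}

once = splitsubdicts(d)      # labels such as 'Test2_1', 'Test2_2'
twice = splitsubdicts(once)  # what todf passes to pd.DataFrame

print(once)
print(list(twice))
```

After the first pass every inner dictionary holds exactly one item, so the second pass cannot split anything further; it only appends `_1` to each label.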

If you actually want a NumPy array, you can easily convert the dataframe:

arr = df.values
print(arr)

Output:

[[0. 4. 0. 0.]
 [3. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]]
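If pandas is not available, roughly the same matrix can be built with plain Python lists. This is a sketch of an alternative, not part of the approach above: the names `columns`, `col_index`, `labels` and `matrix` are mine, the columns are sorted alphabetically as an assumption so that they line up with the output shown, and the row labels carry a single suffix (`Test2_1`) rather than the `Test2_1_1` form produced by todf.

```python
d = {'Test2': {'hi': 4, 'bye': 3},
     'religion.christian_20674': {'path': 1, 'religious': 1, 'hi': 1}}

# Column order: sorted union of every inner key (an assumption, see above).
columns = sorted({k1 for v0 in d.values() for k1 in v0})
col_index = {name: j for j, name in enumerate(columns)}

labels = []
matrix = []
for k0, v0 in d.items():
    for i, (k1, v1) in enumerate(v0.items()):
        row = [0.0] * len(columns)       # missing entries stay 0
        row[col_index[k1]] = float(v1)   # one non-zero value per row
        labels.append('%s_%d' % (k0, i + 1))
        matrix.append(row)

print(columns)
for label, row in zip(labels, matrix):
    print(label, row)
```

The resulting list of lists can be handed to `np.array(matrix)` if a NumPy array is needed.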

You can also convert the dataframe to a structured array instead, which lets you keep your row and column labels:

arr = df.to_records()
print(arr.dtype.names)
print(arr)

Output:

('index', 'bye', 'hi', 'path', 'religious')
[('Test2_1_1', 0., 4., 0., 0.)
 ('Test2_2_1', 3., 0., 0., 0.)
 ('religion.christian_20674_1_1', 0., 0., 1., 0.)
 ('religion.christian_20674_2_1', 0., 0., 0., 1.)
 ('religion.christian_20674_3_1', 0., 1., 0., 0.)]

Edit: explanation of `splitsubdicts`

The nested dictionary comprehension used in splitsubdicts might seem confusing at first. It is really just shorthand for writing nested loops. You can expand the comprehension into a pair of for loops like so:

def splitsubdicts(d):
    ret = {}

    for k0,v0 in d.items():
        for i,(k1,v1) in enumerate(v0.items()):
            ret['{}_{}'.format(k0, i + 1)] = {k1: v1}

    return ret

The values returned by this loop-based version of splitsubdicts will be identical to those returned by the comprehension-based version above. The comprehension-based version might be slightly faster than the loop-based version, but in practical terms it's not the kind of thing anyone should worry about.
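A quick way to check that the two versions agree is to run both on the same input and compare the results. This is just a sketch; the names `splitsubdicts_comp` and `splitsubdicts_loop` are made up here so that both versions can live side by side.

```python
def splitsubdicts_comp(d):
    # comprehension-based version
    return {'%s_%d' % (k0, i + 1): {k1: v1}
            for k0, v0 in d.items()
            for i, (k1, v1) in enumerate(v0.items())}

def splitsubdicts_loop(d):
    # loop-based version
    ret = {}
    for k0, v0 in d.items():
        for i, (k1, v1) in enumerate(v0.items()):
            ret['{}_{}'.format(k0, i + 1)] = {k1: v1}
    return ret

d = {'Test2': {'hi': 4, 'bye': 3},
     'religion.christian_20674': {'path': 1, 'religious': 1, 'hi': 1}}

same = splitsubdicts_comp(d) == splitsubdicts_loop(d)
print(same)
```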
