OK 400
OK 400

Reputation: 831

Create pd.DataFrame from dictionary with multi-dimensional array

I've the following dictionary:

dictA = {'A': [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
         'B': [[4, 4, 4], [4, 4, 4],],
         'C': [[4, 6, 0]]
        }

I want to convert it to a pd.DataFrame(), expecting this:

id       ColA        ColB        ColC
0         1           4           4
1         2           4           6
2         3           4           0
3         1           4           
4         2           4
5         3           4
6         1
7         2
8         3

How can I do that? I'm trying

pd.DataFrame(dictAll.items(), columns=['ColA', 'ColB', 'ColC'])

But it obviously doesn't work!

Upvotes: 1

Views: 1103

Answers (1)

Red
Red

Reputation: 27547

Here is how:

import pandas as pd
import numpy as np

dictA = {'A': [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
         'B': [[4, 4, 4], [4, 4, 4],],
         'C': [[4, 6, 0]]}

df = pd.DataFrame(dict([(f'Col{k}', pd.Series([a for b in v for a in b])) for k,v in dictA.items()])).replace(np.nan, '')
print(df)

Output:

   ColA ColB ColC
0     1    4    4
1     2    4    6
2     3    4    0
3     1    4     
4     2    4     
5     3    4     
6     1          
7     2          
8     3  

Now, let's have a look at the problem one step at a time.

  1. The first thing we might try is simply:

    df = pd.DataFrame(dictA)
    print(df)
    

    Which, of course, return this error:

     ValueError: arrays must all be same length
    
  2. So now we need a way to be able to create dataframes from a dict with arrays of different lengths. For that, we can:

    df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in dictA.items()]))
    print(df)
    

    Output:

               A          B          C
    0  [1, 2, 3]  [4, 4, 4]  [4, 6, 0]
    1  [1, 2, 3]  [4, 4, 4]        NaN
    2  [1, 2, 3]        NaN        NaN
    
  3. We want the dataframe to be vertical, so for each iteration, flatten out the lists with a list comprehension:

    df = pd.DataFrame(dict([(k, pd.Series([a for b in v for a in b])) for k, v in dictA.items()]))
    print(df)
    

    Output:

       A    B    C
    0  1  4.0  4.0
    1  2  4.0  6.0
    2  3  4.0  0.0
    3  1  4.0  NaN
    4  2  4.0  NaN
    5  3  4.0  NaN
    6  1  NaN  NaN
    7  2  NaN  NaN
    8  3  NaN  NaN
    
  4. Now we want to replace all the NaNs with blanks. For that, we need to import numpy as np, and do:

    df = pd.DataFrame(dict([(k, pd.Series([a for b in v for a in b])) for k, v in dictA.items()])).replace(np.nan, '')
    print(df)
    

    Output:

       A  B  C
    0  1  4  4
    1  2  4  6
    2  3  4  0
    3  1  4   
    4  2  4   
    5  3  4   
    6  1      
    7  2      
    8  3     
    
  5. Finally use formatted string to convert the letters into "Col" letters:

    df = pd.DataFrame(dict([(f'Col{k}', pd.Series([a for b in v for a in b])) for k,v in dictA.items()])).replace(np.nan, '')
    print(df)
    

    Output:

       ColA ColB ColC
    0     1    4    4
    1     2    4    6
    2     3    4    0
    3     1    4     
    4     2    4     
    5     3    4     
    6     1          
    7     2          
    8     3  
    

Upvotes: 3

Related Questions