Create Pandas Dataframe from Elements in List of Dicts

Question

I have to iterate through the rows of a column (in an existing dataframe) which contains a list of dicts, and then create two new dataframes out of the data from there. The general shape of one of these lists looks like this:

[
 {"a": 10, "type": "square"}, {"type": "square", "b":11}, 
 {"type": "square", "c": 12}, {"d": 13, "type": "square"},
 {"type": "square", "e": 14}, {"a": 15, "type": "circle"}, 
 {"type": "circle", "b": 16}, {"type": "circle", "c": 17}, 
 {"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]

I have thousands of rows of these guys and want to create two new dataframes, one for circles and one for squares, resulting in a dataframe whose first row looks roughly like this:

      type    a  b  c  d  e
0    square   10 11 12 13 14

So far, I've tried converting the whole thing to json, which worked ok but seemed to change the nature of the dataframe so that it wasn't manipulable anymore. The json also created a dataframe with multiple rows (one for each element) and I wasn't able to "flatten" the dataframe onto the one key (in this case it would be "type").

I've also tried DataFrame.from_records, DataFrame.from_dict, and various, similar other ways of reading in the data using pandas with no luck.

EDIT: Sorry for being unclear,the above example of a dictionary lives in a "cell" of an existing dataframe, and I think the first step I'm looking for involves extracting it out of that "cell." So far, I've tried various ways to transform the object into something usable (like the list above), but haven't been successful. I would need to create the variable to look something like this my_list = df.column[0] for example, so I could then iterate over the rows.

piRSquared · Accepted Answer

let l be your list of dictionaries

l = [
 {"a": 10, "type": "square"}, {"type": "square", "b":11}, 
 {"type": "square", "c": 12}, {"d": 13, "type": "square"},
 {"type": "square", "e": 14}, {"a": 15, "type": "circle"}, 
 {"type": "circle", "b": 16}, {"type": "circle", "c": 17}, 
 {"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]

Then let's define a series s to be 10 rows of this list

s = pd.Series([l] * 10)
print(s)

0    [{'type': 'square', 'a': 10}, {'type': 'square...
1    [{'type': 'square', 'a': 10}, {'type': 'square...
2    [{'type': 'square', 'a': 10}, {'type': 'square...
3    [{'type': 'square', 'a': 10}, {'type': 'square...
4    [{'type': 'square', 'a': 10}, {'type': 'square...
5    [{'type': 'square', 'a': 10}, {'type': 'square...
6    [{'type': 'square', 'a': 10}, {'type': 'square...
7    [{'type': 'square', 'a': 10}, {'type': 'square...
8    [{'type': 'square', 'a': 10}, {'type': 'square...
9    [{'type': 'square', 'a': 10}, {'type': 'square...
dtype: object

Now I'll define a function that uses a dictionary comprehension to rearrange the list to something more palatable for pd.Series. In fact, the keys of the dictionary will be tuples so that the index of the produced series with be a pd.MultiIndex. This will make it easier to break into 2 separate dataframes later.

def proc(l):
    return pd.Series(
        {(li['type'], k): v for li in l for k, v in li.items() if k != 'type'})

Now I use apply

df = s.apply(proc)
df

  circle                 square                
       a   b   c   d   e      a   b   c   d   e
0     15  16  17  18  19     10  11  12  13  14
1     15  16  17  18  19     10  11  12  13  14
2     15  16  17  18  19     10  11  12  13  14
3     15  16  17  18  19     10  11  12  13  14
4     15  16  17  18  19     10  11  12  13  14
5     15  16  17  18  19     10  11  12  13  14
6     15  16  17  18  19     10  11  12  13  14
7     15  16  17  18  19     10  11  12  13  14
8     15  16  17  18  19     10  11  12  13  14
9     15  16  17  18  19     10  11  12  13  14

I can assign my 2 dataframes quite easily from this point

circle = df.circle
square = df.square

alternative approach
Instead of using apply, we could use a set of comprehensions on s

df = pd.DataFrame(
    {k: {(li['type'], k): v
         for li in l
         for k, v in li.items() if k != 'type'}
     for k, l in s.iteritems()}
).T

timing
The multi-comprehension approach seems faster

Create Pandas Dataframe from Elements in List of Dicts

Answers (2)

Related Questions