shibby
shibby

Reputation: 33

Create Pandas Dataframe from Elements in List of Dicts

I have to iterate through the rows of a column (in an existing dataframe) which contains a list of dicts, and then create two new dataframes out of the data from there. The general shape of one of these lists looks like this:

[
 {"a": 10, "type": "square"}, {"type": "square", "b":11}, 
 {"type": "square", "c": 12}, {"d": 13, "type": "square"},
 {"type": "square", "e": 14}, {"a": 15, "type": "circle"}, 
 {"type": "circle", "b": 16}, {"type": "circle", "c": 17}, 
 {"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]

I have thousands of rows of these guys and want to create two new dataframes, one for circles and one for squares, resulting in a dataframe whose first row looks roughly like this:

      type    a  b  c  d  e
0    square   10 11 12 13 14

So far, I've tried converting the whole thing to json, which worked ok but seemed to change the nature of the dataframe so that it wasn't manipulable anymore. The json also created a dataframe with multiple rows (one for each element) and I wasn't able to "flatten" the dataframe onto the one key (in this case it would be "type").

I've also tried DataFrame.from_records, DataFrame.from_dict, and various, similar other ways of reading in the data using pandas with no luck.

EDIT: Sorry for being unclear,the above example of a dictionary lives in a "cell" of an existing dataframe, and I think the first step I'm looking for involves extracting it out of that "cell." So far, I've tried various ways to transform the object into something usable (like the list above), but haven't been successful. I would need to create the variable to look something like this my_list = df.column[0] for example, so I could then iterate over the rows.

Upvotes: 1

Views: 859

Answers (2)

lmo
lmo

Reputation: 38500

This works for your example:

pd.DataFrame(myList).groupby('type').agg(lambda x: x.dropna())

         a   b   c   d   e
type                      
circle  15  16  17  18  19
square  10  11  12  13  14

the idea is to read in the list of dicts, convert them into a single DataFrame with one row per dict, group them by type, and then use the agg method to drop all missing values in each variable.

data

 myList = [
 {"a": 10, "type": "square"}, {"type": "square", "b":11}, 
 {"type": "square", "c": 12}, {"d": 13, "type": "square"},
 {"type": "square", "e": 14}, {"a": 15, "type": "circle"}, 
 {"type": "circle", "b": 16}, {"type": "circle", "c": 17}, 
 {"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]

The above answer runs into a problem if the types are repeated as in the following list:

myList2 = [
 {"a": 10, "type": "square"}, {"type": "square", "b":11}, 
 {"type": "square", "c": 12}, {"d": 13, "type": "square"},
 {"type": "square", "e": 14}, {"a": 15, "type": "circle"}, 
 {"type": "circle", "b": 16}, {"type": "circle", "c": 17}, 
 {"d": 18, "type": "circle"}, {"type": "circle", "e": 19},
  {"a": 11, "type": "square"}, {"type": "square", "b":12}, 
 {"type": "square", "c": 13}, {"d": 14, "type": "square"},
 {"type": "square", "e": 15}, {"a": 16, "type": "circle"}, 
 {"type": "circle", "b": 17}, {"type": "circle", "c": 18}, 
 {"d": 20, "type": "circle"}, {"type": "circle", "e": 20}
]

As long as the list of dicts is regular, that is, there are 5 adjacent dicts for each type, then you can add a list comprehension to the groupby method as follows.

pd.DataFrame(myList2).groupby(['type',
                               [math.floor(i / 5) for i, _ in enumerate(myList)]])
                     .agg(lambda x: x.dropna())

           a   b   c   d   e
type                        
circle 1  15  16  17  18  19
       3  16  17  18  20  20
square 0  10  11  12  13  14
       2  11  12  13  14  15

Upvotes: 1

piRSquared
piRSquared

Reputation: 294218

let l be your list of dictionaries

l = [
 {"a": 10, "type": "square"}, {"type": "square", "b":11}, 
 {"type": "square", "c": 12}, {"d": 13, "type": "square"},
 {"type": "square", "e": 14}, {"a": 15, "type": "circle"}, 
 {"type": "circle", "b": 16}, {"type": "circle", "c": 17}, 
 {"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]

Then let's define a series s to be 10 rows of this list

s = pd.Series([l] * 10)
print(s)

0    [{'type': 'square', 'a': 10}, {'type': 'square...
1    [{'type': 'square', 'a': 10}, {'type': 'square...
2    [{'type': 'square', 'a': 10}, {'type': 'square...
3    [{'type': 'square', 'a': 10}, {'type': 'square...
4    [{'type': 'square', 'a': 10}, {'type': 'square...
5    [{'type': 'square', 'a': 10}, {'type': 'square...
6    [{'type': 'square', 'a': 10}, {'type': 'square...
7    [{'type': 'square', 'a': 10}, {'type': 'square...
8    [{'type': 'square', 'a': 10}, {'type': 'square...
9    [{'type': 'square', 'a': 10}, {'type': 'square...
dtype: object

Now I'll define a function that uses a dictionary comprehension to rearrange the list to something more palatable for pd.Series. In fact, the keys of the dictionary will be tuples so that the index of the produced series with be a pd.MultiIndex. This will make it easier to break into 2 separate dataframes later.

def proc(l):
    return pd.Series(
        {(li['type'], k): v for li in l for k, v in li.items() if k != 'type'})

Now I use apply

df = s.apply(proc)
df

  circle                 square                
       a   b   c   d   e      a   b   c   d   e
0     15  16  17  18  19     10  11  12  13  14
1     15  16  17  18  19     10  11  12  13  14
2     15  16  17  18  19     10  11  12  13  14
3     15  16  17  18  19     10  11  12  13  14
4     15  16  17  18  19     10  11  12  13  14
5     15  16  17  18  19     10  11  12  13  14
6     15  16  17  18  19     10  11  12  13  14
7     15  16  17  18  19     10  11  12  13  14
8     15  16  17  18  19     10  11  12  13  14
9     15  16  17  18  19     10  11  12  13  14

I can assign my 2 dataframes quite easily from this point

circle = df.circle
square = df.square

alternative approach
Instead of using apply, we could use a set of comprehensions on s

df = pd.DataFrame(
    {k: {(li['type'], k): v
         for li in l
         for k, v in li.items() if k != 'type'}
     for k, l in s.iteritems()}
).T

timing
The multi-comprehension approach seems faster

enter image description here

Upvotes: 3

Related Questions