Reputation: 33
I have to iterate through the rows of a column (in an existing dataframe), each of which contains a list of dicts, and then create two new dataframes out of that data. The general shape of one of these lists looks like this:
[
{"a": 10, "type": "square"}, {"type": "square", "b":11},
{"type": "square", "c": 12}, {"d": 13, "type": "square"},
{"type": "square", "e": 14}, {"a": 15, "type": "circle"},
{"type": "circle", "b": 16}, {"type": "circle", "c": 17},
{"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]
I have thousands of rows of these guys and want to create two new dataframes, one for circles and one for squares, resulting in a dataframe whose first row looks roughly like this:
type a b c d e
0 square 10 11 12 13 14
So far, I've tried converting the whole thing to JSON, which worked OK but seemed to change the nature of the dataframe so that it wasn't manipulable anymore. The JSON conversion also created a dataframe with multiple rows (one for each element), and I wasn't able to "flatten" the dataframe onto the one key (in this case, "type").
I've also tried DataFrame.from_records, DataFrame.from_dict, and various other similar ways of reading in the data using pandas, with no luck.
EDIT: Sorry for being unclear, the above example of a list of dicts lives in a "cell" of an existing dataframe, and I think the first step I'm looking for involves extracting it out of that "cell." So far, I've tried various ways to transform the object into something usable (like the list above), but haven't been successful. I would need to create a variable that looks something like my_list = df.column[0], for example, so I could then iterate over the rows.
Upvotes: 1
Views: 859
Reputation: 38500
This works for your example:
pd.DataFrame(myList).groupby('type').agg(lambda x: x.dropna())
a b c d e
type
circle 15 16 17 18 19
square 10 11 12 13 14
The idea is to read in the list of dicts, convert it into a single DataFrame with one row per dict, group the rows by type, and then use the agg method to drop the missing values in each column.
data
myList = [
{"a": 10, "type": "square"}, {"type": "square", "b":11},
{"type": "square", "c": 12}, {"d": 13, "type": "square"},
{"type": "square", "e": 14}, {"a": 15, "type": "circle"},
{"type": "circle", "b": 16}, {"type": "circle", "c": 17},
{"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]
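A caveat: on some recent pandas versions, agg with a lambda that returns a Series can raise "Must produce aggregated value". Since each column here holds exactly one non-null value per group, GroupBy.first (which keeps the first non-null value per column) is an equivalent way to get the same result; a self-contained sketch:

```python
import pandas as pd

myList = [
    {"a": 10, "type": "square"}, {"type": "square", "b": 11},
    {"type": "square", "c": 12}, {"d": 13, "type": "square"},
    {"type": "square", "e": 14}, {"a": 15, "type": "circle"},
    {"type": "circle", "b": 16}, {"type": "circle", "c": 17},
    {"d": 18, "type": "circle"}, {"type": "circle", "e": 19},
]

# One row per dict; keys absent from a dict become NaN.
df = pd.DataFrame(myList)

# GroupBy.first keeps the first non-null value per column,
# collapsing each type's five sparse rows into one dense row.
out = df.groupby('type').first()
```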
The above answer runs into a problem if the types are repeated as in the following list:
myList2 = [
{"a": 10, "type": "square"}, {"type": "square", "b":11},
{"type": "square", "c": 12}, {"d": 13, "type": "square"},
{"type": "square", "e": 14}, {"a": 15, "type": "circle"},
{"type": "circle", "b": 16}, {"type": "circle", "c": 17},
{"d": 18, "type": "circle"}, {"type": "circle", "e": 19},
{"a": 11, "type": "square"}, {"type": "square", "b":12},
{"type": "square", "c": 13}, {"d": 14, "type": "square"},
{"type": "square", "e": 15}, {"a": 16, "type": "circle"},
{"type": "circle", "b": 17}, {"type": "circle", "c": 18},
{"d": 20, "type": "circle"}, {"type": "circle", "e": 20}
]
As long as the list of dicts is regular, that is, there are 5 adjacent dicts of each type, you can add a list comprehension to the groupby call as follows.
import math

pd.DataFrame(myList2).groupby(
    ['type', [math.floor(i / 5) for i, _ in enumerate(myList2)]]
).agg(lambda x: x.dropna())
a b c d e
type
circle 1 15 16 17 18 19
3 16 17 18 20 20
square 0 10 11 12 13 14
2 11 12 13 14 15
Upvotes: 1
Reputation: 294218
Let l be your list of dictionaries:
l = [
{"a": 10, "type": "square"}, {"type": "square", "b":11},
{"type": "square", "c": 12}, {"d": 13, "type": "square"},
{"type": "square", "e": 14}, {"a": 15, "type": "circle"},
{"type": "circle", "b": 16}, {"type": "circle", "c": 17},
{"d": 18, "type": "circle"}, {"type": "circle", "e": 19}
]
Then let's define a series s to be 10 rows of this list:
s = pd.Series([l] * 10)
print(s)
0 [{'type': 'square', 'a': 10}, {'type': 'square...
1 [{'type': 'square', 'a': 10}, {'type': 'square...
2 [{'type': 'square', 'a': 10}, {'type': 'square...
3 [{'type': 'square', 'a': 10}, {'type': 'square...
4 [{'type': 'square', 'a': 10}, {'type': 'square...
5 [{'type': 'square', 'a': 10}, {'type': 'square...
6 [{'type': 'square', 'a': 10}, {'type': 'square...
7 [{'type': 'square', 'a': 10}, {'type': 'square...
8 [{'type': 'square', 'a': 10}, {'type': 'square...
9 [{'type': 'square', 'a': 10}, {'type': 'square...
dtype: object
Now I'll define a function that uses a dictionary comprehension to rearrange each list into something more palatable for pd.Series. In fact, the keys of the dictionary will be tuples, so that the index of the produced series will be a pd.MultiIndex. This will make it easier to break into 2 separate dataframes later.
def proc(l):
    return pd.Series(
        {(li['type'], k): v for li in l for k, v in li.items() if k != 'type'})
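For instance, applied to a single (abbreviated) list, proc yields a Series whose tuple keys become a two-level MultiIndex:

```python
import pandas as pd

def proc(l):
    # (type, field) tuple keys become a two-level MultiIndex
    return pd.Series(
        {(li['type'], k): v for li in l for k, v in li.items() if k != 'type'})

row = proc([{"a": 10, "type": "square"}, {"type": "circle", "b": 16}])
# row[('square', 'a')] -> 10, row[('circle', 'b')] -> 16
```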
Now I use apply
df = s.apply(proc)
df
circle square
a b c d e a b c d e
0 15 16 17 18 19 10 11 12 13 14
1 15 16 17 18 19 10 11 12 13 14
2 15 16 17 18 19 10 11 12 13 14
3 15 16 17 18 19 10 11 12 13 14
4 15 16 17 18 19 10 11 12 13 14
5 15 16 17 18 19 10 11 12 13 14
6 15 16 17 18 19 10 11 12 13 14
7 15 16 17 18 19 10 11 12 13 14
8 15 16 17 18 19 10 11 12 13 14
9 15 16 17 18 19 10 11 12 13 14
I can assign my 2 dataframes quite easily from this point
circle = df.circle
square = df.square
alternative approach
Instead of using apply, we could use a set of comprehensions over s directly (via Series.items; the older iteritems alias was removed in pandas 2.0):
df = pd.DataFrame(
    {k: {(li['type'], kk): v
         for li in l
         for kk, v in li.items() if kk != 'type'}
     for k, l in s.items()}
).T
timing
The multi-comprehension approach seems faster
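The original answer's timing plots aren't reproduced here; a minimal sketch of how the two approaches could be compared with the standard timeit module (sizes are illustrative, and proc is restated so the snippet runs standalone):

```python
import timeit

import pandas as pd

l = [
    {"a": 10, "type": "square"}, {"type": "square", "b": 11},
    {"a": 15, "type": "circle"}, {"type": "circle", "b": 16},
]
s = pd.Series([l] * 1000)

def proc(lst):
    return pd.Series(
        {(li['type'], k): v for li in lst for k, v in li.items() if k != 'type'})

def with_apply():
    return s.apply(proc)

def with_comprehensions():
    return pd.DataFrame(
        {k: {(li['type'], kk): v
             for li in lst
             for kk, v in li.items() if kk != 'type'}
         for k, lst in s.items()}
    ).T

# Both build the same-shaped frame; compare wall time over a few runs.
t_apply = timeit.timeit(with_apply, number=3)
t_comp = timeit.timeit(with_comprehensions, number=3)
```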
Upvotes: 3