Reputation: 23
I'm very new to Python and am using Pandas to convert a bunch of MySQL tables to JSON. My current solution works just fine, but (1) it is not very pythonic, and (2) I feel like there must be some pre-baked Pandas function that does what I need...? Any guidance on the following problem would be helpful.
Say I have two data frames: authors, and a join table plays_authors that represents a 1:many relationship of authors to plays.
print(authors)
>    author_id  dates notes
> 0          1  1700s     a
> 1          2  1800s     b
> 2          3  1900s     c
print(plays_authors)
>    author_id  play_id
> 0          1       12
> 1          1       13
> 2          1       21
> 3          2       18
> 4          3        3
> 5          3        7
I want to merge plays_authors onto authors, but instead of having multiple rows per author (one per play_id), I want one row per author with an array of play_id values, so that I can easily export them as JSON records.
print(authors)
>    author_id  dates notes       play_id
> 0          1  1700s     a  [12, 13, 21]
> 1          2  1800s     b          [18]
> 2          3  1900s     c        [3, 7]
authors.to_json(orient="records")
> '[{
> "author_id":"1",
> "dates":"1700s",
> "notes":"a",
> "play_id":["12","13","21"]
> },
> {
> "author_id":"2",
> "dates":"1800s",
> "notes":"b",
> "play_id":["18"]
> },
> {
> "author_id":"3",
> "dates":"1900s",
> "notes":"c",
> "play_id":["3","7"]
> }]'
My current solution:
import pandas as pd

# main_df: main dataframe to transform
# join_df: dataframe of the join table with values to add to main_df
# main_index: name of the main_df index column
# multi_index: name of the column with multiple values per main_index, added by the merge with join_df
# jointype: type of merge to perform, e.g. left, right, inner, outer
def consolidate(main_df, join_df, main_index, multi_index, jointype):
    # merge
    main_df = pd.merge(main_df, join_df, on=main_index, how=jointype)
    # consolidate: one row per unique main_index value
    new_df = pd.DataFrame({})
    for i in main_df[main_index].unique():
        i_rows = main_df.loc[main_df[main_index] == i]
        # take the scalar values from the first matching row
        values = []
        for column in main_df.columns:
            values.append(i_rows[:1][column].values[0])
        row_dict = dict(zip(main_df.columns, values))
        # replace the single value with the full list of values for this index
        row_dict[multi_index] = list(i_rows[multi_index])
        new_df = new_df.append(row_dict, ignore_index=True)
    return new_df
authors = consolidate(authors, plays_authors, 'author_id', 'play_id', 'left')
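(If you're on pandas >= 2.0, note that DataFrame.append was deprecated in 1.4 and has since been removed, so the loop above won't run as-is. A minimal rework of the same logic collects the row dicts in a list and builds the frame once at the end:)

def consolidate(main_df, join_df, main_index, multi_index, jointype):
    merged = pd.merge(main_df, join_df, on=main_index, how=jointype)
    rows = []
    for i in merged[main_index].unique():
        i_rows = merged.loc[merged[main_index] == i]
        row_dict = i_rows.iloc[0].to_dict()                # scalar values from the first row
        row_dict[multi_index] = list(i_rows[multi_index])  # collapse the multi-valued column
        rows.append(row_dict)
    return pd.DataFrame(rows)                              # build the frame once, not row by row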
Is there a simple groupby / better dict solution out there that's currently just over my head?
Upvotes: 2
Views: 1141
Reputation: 210832
Data:
In [131]: a
Out[131]:
   author_id  dates notes
0          1  1700s     a
1          2  1800s     b
2          3  1900s     c
In [132]: pa
Out[132]:
   author_id  play_id
0          1       12
1          1       13
2          1       21
3          2       18
4          3        3
5          3        7
Solution:
In [133]: a.merge(pa.groupby('author_id')['play_id'].apply(list).reset_index())
Out[133]:
   author_id  dates notes       play_id
0          1  1700s     a  [12, 13, 21]
1          2  1800s     b          [18]
2          3  1900s     c        [3, 7]
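Note that merge with no arguments defaults to an inner join on the shared author_id column, so authors with no plays would be dropped. A minimal variant, assuming the frames are still named a and pa, that spells out the join and chains the JSON export the question asked for:

res = a.merge(pa.groupby('author_id')['play_id'].apply(list).reset_index(),
              on='author_id', how='left')  # 'left' keeps authors without plays (their play_id becomes NaN)
res.to_json(orient='records')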
Upvotes: 2