marii

Reputation: 23

Pandas: merge dataframes and consolidate multiple joined values into an array

I'm very new to Python, and am using Pandas to convert a bunch of MySQL tables to JSON. My current solution works just fine, but (1) it is not very pythonic, and (2) I feel like there must be some pre-baked Pandas function that does what I need...? Any guidance on the following problem would be appreciated.

Say I have two dataframes: authors, and a join table plays_authors that represents a 1:many relationship of authors to plays.

print authors
>    author_id  dates notes
> 0          1  1700s     a
> 1          2  1800s     b
> 2          3  1900s     c

print plays_authors
>    author_id  play_id
> 0          1       12
> 1          1       13
> 2          1       21
> 3          2       18
> 4          3        3
> 5          3        7

I want to merge plays_authors onto authors, but instead of getting multiple rows per author (one per play_id), I want one row per author, with an array of play_id values, so that I can easily export them as JSON records.

print authors
>    author_id  dates notes       play_id
> 0          1  1700s     a  [12, 13, 21]
> 1          2  1800s     b          [18]
> 2          3  1900s     c        [3, 7]

authors.to_json(orient="records")
> '[{
>    "author_id":"1",
>    "dates":"1700s",
>    "notes":"a",
>    "play_id":["12","13","21"]
>   },
>   {
>    "author_id":"2",
>    "dates":"1800s",
>    "notes":"b",
>    "play_id":["18"]
>   },
>   {
>    "author_id":"3",
>    "dates":"1900s",
>    "notes":"c",
>    "play_id":["3","7"]
>  }]'

My current solution:

import pandas as pd

# main_df: the main dataframe to transform
# join_df: the dataframe of the join table with values to add to main_df
# main_index: name of the main_df index column
# multi_index: name of the column with multiple values per main_index, added by the merge with join_df
# jointype: type of merge to perform, e.g. left, right, inner, outer

def consolidate(main_df, join_df, main_index, multi_index, jointype):
    # merge
    main_df = pd.merge(main_df, join_df, on=main_index, how=jointype)
    # consolidate
    new_df = pd.DataFrame({})

    for i in main_df[main_index].unique():
        i_rows = main_df.loc[main_df[main_index] == i]
        values = []

        for column in main_df.columns:
            values.append(i_rows[:1][column].values[0])

        row_dict = dict(zip(main_df.columns, values))
        row_dict[multi_index] = list(i_rows[multi_index])
        new_df = new_df.append(row_dict, ignore_index=True)

    return new_df


authors = consolidate(authors, plays_authors, 'author_id', 'play_id', 'left')

Is there a simple groupby or better dict-based solution out there that's currently just over my head?

Upvotes: 2

Views: 1141

Answers (1)

MaxU - stand with Ukraine

Reputation: 210832

Data (here a is authors and pa is plays_authors):

In [131]: a
Out[131]:
   author_id  dates notes
0          1  1700s     a
1          2  1800s     b
2          3  1900s     c

In [132]: pa
Out[132]:
   author_id  play_id
0          1       12
1          1       13
2          1       21
3          2       18
4          3        3
5          3        7

Solution:

In [133]: a.merge(pa.groupby('author_id')['play_id'].apply(list).reset_index())
Out[133]:
   author_id  dates notes       play_id
0          1  1700s     a  [12, 13, 21]
1          2  1800s     b          [18]
2          3  1900s     c        [3, 7]
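
Here pa.groupby('author_id')['play_id'].apply(list) collapses each author's play IDs into a single list, reset_index() turns the resulting Series back into a DataFrame, and merge then joins it onto a via their shared author_id column (the default join key when none is given).

Note that merge defaults to an inner join, so an author with no plays at all would be dropped, whereas the question's consolidate used a left join. A minimal sketch of a left-join variant that keeps such authors and fills the gaps with empty lists (using the question's authors / plays_authors names; the isinstance check is just one way to replace the NaN gaps):

import pandas as pd

# collapse each author's play IDs into a list
plays = (plays_authors.groupby('author_id')['play_id']
                      .apply(list)
                      .reset_index())

# how='left' keeps authors that have no plays at all
result = authors.merge(plays, on='author_id', how='left')

# authors without plays come back as NaN; replace those with empty lists
result['play_id'] = result['play_id'].apply(
    lambda x: x if isinstance(x, list) else [])

result.to_json(orient='records')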

Upvotes: 2
