n8-da-gr8

Reputation: 551

pandas groupby and populate columns based on values

Given a dataframe (https://pastebin.com/MdqWz4Ke)

import pandas as pd
import numpy as np

# some data
data3 = [["Alex","Tampa","A23","1","Ax","Red"],
     ["Alex","Tampa","A23","1","Ay","Blue"],
     ["Alex","Tampa","B43","1","Bx","Green"],
     ["Alex","Tampa","B43","1","By","White"],
     ["Alex","Tampa","C55","1","Cx","Red"],
     ["Alex","Tampa","C55","1","Cy","White"],
     ["Alex","Tampa","C55","2","Cx","Purple"],
     ["Alex","Tampa","C55","2","Cy","Black"],
     ["Tim","San Diego","A23","1","Ax","Green"],
     ["Tim","San Diego","A23","1","Ay","Black"],
     ["Tim","San Diego","B43","1","Bx","Yellow"],
     ["Tim","San Diego","B43","1","By","Black"],
     ["Tim","San Diego","C55","1","Cx","Pink"],
     ["Tim","San Diego","C55","1","Cy","Orange"],
     ["Tim","San Diego","A23","2","Ax","Green"],
     ["Tim","San Diego","A23","2","Ay","Red"],
     ["Tim","San Diego","B43","2","Bx",""],
     ["Tim","San Diego","B43","2","By",""],        
     ["Mark","Houston","A23","1","Ax","Purple"],
     ["Mark","Houston","A23","1","Ay","Yellow"],
     ["Mark","Houston","B43","1","Bx","Gray"],
     ["Mark","Houston","B43","1","By","White"],
     ["Mark","Houston","C55","1","Cx",""],
     ["Mark","Houston","C55","1","Cy",""],        
     ["Anthony","Seattle","A23","","Ax","Orange"],
     ["Anthony","Seattle","A23","","Ay","Black"],
     ["Anthony","Seattle","B43","","Bx","Red"],
     ["Anthony","Seattle","B43","","By","Black"],
     ["Anthony","Seattle","C55","","Cx","Blue"],
     ["Anthony","Seattle","C55","","Cy","Pink"]]

# create dataframe
df3 = pd.DataFrame(data3,columns=[
        "Name","City","Domain","Sequence","Group","Value"])

How can I compare values in groups and conditionally populate columns with those values?

# add Compared columns
df3["Compared Group"] = ""
df3["Compared Value"] = ""

# replace empty/whitespace-only strings with np.nan
df3.replace(r"^\s*$", np.nan, regex=True, inplace=True)

# fillna for missing Sequence and Value
df3.fillna({"Sequence":"N/A","Value":"NULL"},inplace=True)

# expected result
result = [["Alex","Tampa","A23","1","Ax","Red","Ay","Blue"],
          ["Alex","Tampa","B43","1","Bx","Green","By","White"],
          ["Alex","Tampa","C55","1","Cx","Red","Cy","White"],
          ["Alex","Tampa","C55","2","Cx","Purple","Cy","Black"],
         ["Tim","San Diego","A23","1","Ax","Green","Ay","Black"],
         ["Tim","San Diego","A23","2","Ax","Green","Ay","Red"],
         ["Tim","San Diego","B43","1","Bx","Yellow","By","Black"],
         ["Tim","San Diego","B43","2","Bx","NULL","By","NULL"],        
         ["Tim","San Diego","C55","1","Cx","Pink","Cy","Orange"],        
         ["Mark","Houston","A23","1","Ax","Purple","Ay","Yellow"],
         ["Mark","Houston","B43","1","Bx","Gray","By","White"],
         ["Mark","Houston","C55","1","Cx","NULL","Cy","NULL"],
         ["Anthony","Seattle","A23","","Ax","Orange","Ay","Black"],
         ["Anthony","Seattle","B43","","Bx","Red","By","Black"],
         ["Anthony","Seattle","C55","","Cx","Blue","Cy","Pink"]]

result_df = pd.DataFrame(result,columns=[
        "Name","City","Domain","Sequence","Group",
        "Value","Compared Group","Compared Value"])

Notes:

I've created a dictionary mapping the Group values

# map groups with dictionary
group_dict = {"Ax":"Ay","Bx":"By","Cx":"Cy"}
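For what it's worth, the dictionary idea can work without get_group at all: keep only the "x" rows, derive the partner group name via the dict, and merge the partner's value back in. A minimal sketch on a two-row sample, ignoring Sequence for brevity (column names assumed from the question):

```python
import pandas as pd

group_dict = {"Ax": "Ay", "Bx": "By", "Cx": "Cy"}

df = pd.DataFrame({
    "Name": ["Alex", "Alex"],
    "Domain": ["A23", "A23"],
    "Group": ["Ax", "Ay"],
    "Value": ["Red", "Blue"],
})

# keep only the "x" rows and derive the partner group name from the dict
left = df[df["Group"].isin(group_dict)].copy()
left["Compared Group"] = left["Group"].map(group_dict)

# look up the partner's value by merging on the partner group name
right = df.rename(columns={"Group": "Compared Group", "Value": "Compared Value"})
out = left.merge(right, on=["Name", "Domain", "Compared Group"])
```

On the full data, Sequence and City would need to be part of the merge keys as well.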

and created the groupby object

# groupby
grouped = df3.groupby(["Name","Sequence","Domain","Group"], group_keys=False)

My original plan was to use .loc to populate the Compared columns, possibly mapping the Group values through the dictionary, but when I try to access the values within the groups...

for name in df3["Name"]:
    print(grouped.get_group((name,"Ax")))

I get the following error:

ValueError: must supply a same-length tuple to get_group with multiple grouping keys

I assume this is because not all groups contain the same number and types of Group values (for example, Tim has Sequence 1 and 2 for Ax while Alex has only Sequence 1 for Ax). I'm not sure how to proceed from here in order to merge and transform these rows.
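As an aside, the error above is not about uneven group sizes: get_group simply needs one value per grouping column. With four grouping keys, the lookup tuple must also have four elements. A minimal sketch on a hypothetical two-row frame (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Alex"],
    "Sequence": ["1", "1"],
    "Domain": ["A23", "A23"],
    "Group": ["Ax", "Ay"],
    "Value": ["Red", "Blue"],
})

grouped = df.groupby(["Name", "Sequence", "Domain", "Group"])

# the key tuple must supply one value per grouping column
sub = grouped.get_group(("Alex", "1", "A23", "Ax"))
```

A two-element tuple like `("Alex", "Ax")` raises the ValueError shown above.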

Upvotes: 0

Views: 852

Answers (2)

Andy L.

Reputation: 25239

Based on your sample, you can set_index with a custom group id for Group (built from cumcount) and unstack. Then clean up the column names and reset_index back.

# number the rows within each group: the "x" row gets 0, the "y" row gets 1
s = df3.groupby(["Name","Sequence","Domain"]).Group.cumcount()
df_out = (df3.set_index(["Name", "City", "Sequence","Domain", s])
             .unstack()
             .sort_index(level=1, axis=1))
df_out.columns = (df_out.columns.set_levels(['', 'Compared '], level=1)
                                .map('{0[1]}{0[0]}'.format))
df_out.reset_index()

Out[297]:
       Name         City Sequence Domain Group   Value Compared Group  \
0      Alex        Tampa        1    A23    Ax     Red             Ay
1      Alex        Tampa        1    B43    Bx   Green             By
2      Alex        Tampa        1    C55    Cx     Red             Cy
3      Alex        Tampa        2    C55    Cx  Purple             Cy
4   Anthony      Seattle      N/A    A23    Ax  Orange             Ay
5   Anthony      Seattle      N/A    B43    Bx     Red             By
6   Anthony      Seattle      N/A    C55    Cx    Blue             Cy
7      Mark      Houston        1    A23    Ax  Purple             Ay
8      Mark      Houston        1    B43    Bx    Gray             By
9      Mark      Houston        1    C55    Cx    NULL             Cy
10      Tim    San Diego        1    A23    Ax   Green             Ay
11      Tim    San Diego        1    B43    Bx  Yellow             By
12      Tim    San Diego        1    C55    Cx    Pink             Cy
13      Tim    San Diego        2    A23    Ax   Green             Ay
14      Tim    San Diego        2    B43    Bx    NULL             By

   Compared Value
0            Blue
1           White
2           White
3           Black
4           Black
5           Black
6            Pink
7          Yellow
8           White
9            NULL
10          Black
11          Black
12         Orange
13            Red
14           NULL
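The pivotal step here is the cumcount series: it numbers the rows within each (Name, Sequence, Domain) group, so the "x" row of every pair gets 0 and the "y" row gets 1, and unstacking on that index level splits each pair into side-by-side columns. A standalone illustration on a tiny hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Alex", "Alex", "Alex"],
    "Domain": ["A23", "A23", "B43", "B43"],
    "Group": ["Ax", "Ay", "Bx", "By"],
})

# cumcount numbers rows within each group:
# each "x" row gets 0, each "y" row gets 1
s = df.groupby(["Name", "Domain"]).Group.cumcount()
print(s.tolist())  # [0, 1, 0, 1]
```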

Upvotes: 0

Quang Hoang

Reputation: 150735

Given your sample data, you could do:

def myfunc(x):
    # extract rows 0, 2, 4, ... (the "x" rows);
    # reset_index renames the rows as 0, 1, 2, ...
    df1 = x.iloc[::2].reset_index(drop=True)

    # extract rows 1, 3, 5, ... (the "y" rows)
    df2 = x.iloc[1::2].reset_index(drop=True)

    # merge the two dataframes side by side on the row number
    return df1.merge(df2, left_index=True, right_index=True)

# group by the other columns and select only ['Group', 'Value']
(df3.groupby(['Name', 'City', 'Domain', 'Sequence'])[['Group','Value']]
 .apply(myfunc)                # pair up the x/y rows
 .reset_index(-1, drop=True)   # drop the unnecessary index
 .reset_index()                # make the other original columns as data instead of index
)

Output:

       Name         City Domain Sequence Group_x Value_x Group_y Value_y
0      Alex        Tampa    A23        1      Ax     Red      Ay    Blue
1      Alex        Tampa    B43        1      Bx   Green      By   White
2      Alex        Tampa    C55        1      Cx     Red      Cy   White
3      Alex        Tampa    C55        2      Cx  Purple      Cy   Black
4   Anthony      Seattle    A23      N/A      Ax  Orange      Ay   Black
5   Anthony      Seattle    B43      N/A      Bx     Red      By   Black
6   Anthony      Seattle    C55      N/A      Cx    Blue      Cy    Pink
7      Mark      Houston    A23        1      Ax  Purple      Ay  Yellow
8      Mark      Houston    B43        1      Bx    Gray      By   White
9      Mark      Houston    C55        1      Cx    NULL      Cy    NULL
10      Tim    San Diego    A23        1      Ax   Green      Ay   Black
11      Tim    San Diego    A23        2      Ax   Green      Ay     Red
12      Tim    San Diego    B43        1      Bx  Yellow      By   Black
13      Tim    San Diego    B43        2      Bx    NULL      By    NULL
14      Tim    San Diego    C55        1      Cx    Pink      Cy  Orange
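The _x/_y column names in this output are merge's default suffixes, applied whenever both frames share column names; a minimal standalone illustration:

```python
import pandas as pd

a = pd.DataFrame({"Group": ["Ax"], "Value": ["Red"]})
b = pd.DataFrame({"Group": ["Ay"], "Value": ["Blue"]})

# merging on the index with overlapping column names
# appends the default suffixes _x and _y
pair = a.merge(b, left_index=True, right_index=True)
print(pair.columns.tolist())  # ['Group_x', 'Value_x', 'Group_y', 'Value_y']
```

The suffixes can be changed with merge's `suffixes` parameter if friendlier names are wanted.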

Upvotes: 1
