Reputation: 19
I have the following input file in csv:
INPUT
ID,GroupID,Person,Parent
ID_001,A001,John Doe,Yes
ID_002,A001,Mary Jane,No
ID_003,A001,James Smith;John Doe,Yes
ID_004,B003,Nathan Drake,Yes
ID_005,B003,Troy Baker,No
The desired output is the following:
** DESIRED OUTPUT**
ID,GroupID,Person
ID_001,A001,John Doe;Mary Jane;James Smith
ID_003,A001,John Doe;Mary Jane;James Smith
ID_004,B003,Nathan Drake;Troy Baker
Basically, I want to group by the same GroupID and then concatenate all the values present in Person column that belong to that group. Then, in my output, for each group I want to return the ID(s) of those rows where the Parent column is "Yes", the GroupID, and the concatenated person values for each group.
I am able to concatenate all person values for a particular group and remove any duplicate values from the person column in my output. Here is what I have so far:
import pandas as pd
inputcsv = path to the input csv file
outputcsv = path to the output csv file
colnames = ['ID', 'GroupID', 'Person', 'Parent']
df1 = pd.read_csv(inputcsv, names = colnames, header = None, skiprows = 1)
#First I do a groupby on GroupID, concatenate the values in the Person column, and finally remove the duplicate person values from the output before saving the df to a csv.
df2 = df1.groupby('GroupID')['Person'].apply(';'.join).str.split(';').apply(set).apply(';'.join).reset_index()
df2.to_csv(outputcsv, sep=',', index=False)
This yields the following output:
GroupID,Person
A001,John Doe;Mary Jane;James Smith
B003,Nathan Drake;Troy Baker
I can't figure out how to include the ID column and include all rows in a group where the Parent is "Yes" (as shown in the desired output above).
Upvotes: 0
Views: 1018
Reputation: 323266
IIUC
df.Person=df.Person.str.split(';')#1st split the string to list
df['Person']=df.groupby(['GroupID']).Person.transform(lambda x : ';'.join(set(sum(x,[]))))# then we do transform , this will add each group rowwise same result , link https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object
df=df.loc[df.Parent.eq('Yes')] # then using Parent to filter
df
Out[239]:
ID GroupID Person Parent
0 ID_001 A001 James Smith;John Doe;Mary Jane Yes
2 ID_003 A001 James Smith;John Doe;Mary Jane Yes
3 ID_004 B003 Troy Baker;Nathan Drake Yes
Upvotes: 1