Umar.H
Umar.H

Reputation: 23099

merge multiple string values into one df column

apologies my multiple attempts looking through SO have been in vain. I need some assistance with the below:

I have a df as follows :

import pandas as pd
df = pd.DataFrame({'subject' : ['Eng' ,'Math', 'Science'], 'role' : ['Teacher', '', 'Dr'], 'ID' : ['123', '444', 'NaN'] })
 print(df)

    ID  role    subject
0   123 Teacher Eng
1   444         Math
2   NaN Dr      Science

As this is potentially unvalidated data, what I would usually do (in excel boo) is to create a memo field which aggregates the data and does not potentially damage any good data on upload.

so I've been toying with itterrows to try to loop through the df and merge the fields into one. The caveat being that I need to use /n to separate the values in the same column (CHAR(10) for you excel buffs)

so I've tried..

for index, row in df2.iterrows():
if row['ID'] and row['subject] and row['role'] == "": #search row for any data while leaving out blanks
    row['C_MEMO'] = row[['ID', 'subject', 'role']]
else:
    pass

but this isn't really working, any advice/help would be appreciated.

my desired output:

   ID        role      subject C_MEMO
0   123     Teacher     Eng     123,
                                Teacher
                                Eng

1   444                 Math    444,
                                Math

2   NaN     Dr          Science Dr,
                                Science

things to consider:

I'm working with multiple data types.

I'm happy to have the new output as a string.

Upvotes: 3

Views: 93

Answers (3)

jpp
jpp

Reputation: 164823

You can use a list comprehension with str.join:

bad_vals = {'', 'Nan'}
df['C_Memo'] = [', '.join([i for i in row if i not in bad_vals]) for row in df.values]

print(df)

    ID     role  subject             C_Memo
0  123  Teacher      Eng  123, Teacher, Eng
1  444              Math          444, Math
2  NaN       Dr  Science        Dr, Science

This is not necessarily worse than pd.DataFrame.apply, which has overheads associated with iterating over pd.Series objects for each row.

Upvotes: 1

jezrael
jezrael

Reputation: 863531

Use:

#first replace values
df = df.replace(['', 'NaN'], np.nan)
df["C_MEMO"] = df.apply(lambda x: '\n'.join(x.dropna()), axis=1)

Or if missing values are strings only:

df["C_MEMO"] = df.apply(lambda x: '\n'.join(x[~x.isin(['', 'NaN'])]), axis=1)

print(df)
   subject     role   ID             C_MEMO
0      Eng  Teacher  123  Eng\nTeacher\n123
1     Math      NaN  444          Math\n444
2  Science       Dr  NaN        Science\nDr

Upvotes: 2

Rakesh
Rakesh

Reputation: 82795

import pandas as pd
df = pd.DataFrame({'subject' : ['Eng' ,'Math', 'Science'], 'role' : ['Teacher', '', 'Dr'], 'ID' : ['123', '444', 'NaN'] })
df["C_MEMO"] = (df["ID"] + "," + df["role"] + "," +  df["subject"]).str.replace(",,", ",")
print(df)

Output:

    ID     role  subject           C_MEMO
0  123  Teacher      Eng  123,Teacher,Eng
1  444              Math         444,Math
2  NaN       Dr  Science   NaN,Dr,Science

Or using df.apply

df["C_MEMO"] = df.apply(lambda x: ", ".join([x["ID"], x["role"], x["subject"]]).replace(", ,", ","), axis=1)

Upvotes: 3

Related Questions