Reputation: 2533
I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame({
    'name': ['John','William', 'Nancy', 'Susan', 'Robert', 'Lucy', 'Blake', 'Sally', 'Bruce'],
    'injury': ['right hand broken', 'lacerated left foot', 'foot broken', 'right foot fractured', '', 'sprained finger', 'chest pain', 'swelling in arm', 'laceration to arms, hands, and foot']
})
name injury
0 John right hand broken
1 William lacerated left foot
2 Nancy foot broken
3 Susan right foot fractured
4 Robert
5 Lucy sprained finger
6 Blake chest pain
7 Sally swelling in arm
8 Bruce laceration to arms, hands, and foot <-- this is a weird case, since there are multiple body parts
Notably, some of the values in the injury column are blank.
I want to replace the values in the injury column with only the affected body part. In my case, that would be hand, foot, finger, chest, and arm. There are dozens more; this is a small example.
The desired dataframe would look like this:
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce arm, hand, foot
I could do something like this:
df.loc[df['injury'].str.contains('hand'), 'injury'] = 'hand'
df.loc[df['injury'].str.contains('foot'), 'injury'] = 'foot'
df.loc[df['injury'].str.contains('finger'), 'injury'] = 'finger'
df.loc[df['injury'].str.contains('chest'), 'injury'] = 'chest'
df.loc[df['injury'].str.contains('arm'), 'injury'] = 'arm'
But this might not be the most elegant way.
Is there a more elegant way to do this (e.g. using a dictionary)?
Any advice on that last case with multiple body parts would also be appreciated.
Thank you!
Upvotes: 1
Views: 53
Reputation: 3706
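selected_words = ["hand", "foot", "finger", "chest", "arms", "arm", "hands"]
df["injury"] = (
    df["injury"]
    .str.replace(",", "")          # drop commas so "arms," becomes "arms"
    .str.split(" ", expand=False)  # one list of words per row
    # keep only the listed words; set() removes duplicates but does not guarantee order
    .apply(lambda x: ", ".join(set([i for i in x if i in selected_words])))
)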
print(df)
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce arms, foot, hands
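Since the question asks about a dictionary and wants singular names (arm, hand) rather than the plurals that appear in the text, here is a minimal sketch of that variant. The canonical mapping and the to_parts helper are hypothetical names made up for illustration, and the mapping only covers the words in this small example. Using dict.fromkeys instead of set removes duplicates while keeping the order in which the parts appear.

```python
import pandas as pd

# Hypothetical mapping: word as it appears in the text -> canonical body part.
canonical = {
    "hand": "hand", "hands": "hand",
    "foot": "foot", "feet": "foot",
    "finger": "finger",
    "chest": "chest",
    "arm": "arm", "arms": "arm",
}

def to_parts(text):
    # Drop commas, split on whitespace, and keep only the known words.
    words = text.replace(",", "").split()
    parts = [canonical[w] for w in words if w in canonical]
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return ", ".join(dict.fromkeys(parts))

injuries = pd.Series(["right hand broken", "", "laceration to arms, hands, and foot"])
result = injuries.apply(to_parts)
print(result.tolist())  # ['hand', '', 'arm, hand, foot']
```

Blank rows stay blank because an empty string yields no words to look up.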
Upvotes: 0
Reputation: 161
I think you should maintain a list of body parts and use the apply function:
body_parts = ['hand', 'foot', 'finger', 'chest', 'arm']

def test(value):
    body_text = []
    for body_part in body_parts:
        if body_part in value:
            body_text.append(body_part)
    if body_text:
        return ', '.join(body_text)
    return value

df['injury'] = df['injury'].apply(test)
This returns:
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce hand, foot, arm
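One thing to watch with the substring check above: 'arm' in value is also true for words like "forearm". A rough sketch of a stricter variant using word boundaries follows. The extract_parts name and the sample strings are made up for illustration, and the s? only handles simple plurals. Unlike the function above, it returns an empty string when nothing matches instead of the original text.

```python
import re
import pandas as pd

body_parts = ["hand", "foot", "finger", "chest", "arm"]

# \b keeps "arm" from matching inside "forearm"; s? allows simple plurals.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, body_parts)) + r")s?\b")

def extract_parts(value):
    # findall returns the captured (singular) part for every match, in text order.
    found = pattern.findall(value)
    # De-duplicate while keeping first-seen order.
    return ", ".join(dict.fromkeys(found))

injuries = pd.Series([
    "swelling in forearm",
    "laceration to arms, hands, and foot",
    "both hands and left hand",
])
result = injuries.apply(extract_parts)
print(result.tolist())  # ['', 'arm, hand, foot', 'hand']
```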
Upvotes: 1
Reputation: 33960
The standard way to get the first match of a regex on a string column is to use .extract(); please see the quickstart 10 minutes to pandas: working with text data.
df['injury'].str.extract('(arm|chest|finger|foot|hand)', expand=False)
0 hand
1 foot
2 foot
3 foot
4 NaN
5 finger
6 chest
7 arm
8 arm
Name: injury, dtype: object
Note that row 4 returned NaN rather than '' (but it's trivial to apply .fillna('') to the result). More importantly, in row 8 we only return the first match, not all matches. You need to decide how you want to handle this; see .extractall().
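If you do want every match for row 8, here is a sketch of two ways to get them, assuming a comma-joined string is the output you are after. It uses a three-row sample rather than the full dataframe. .str.findall() is the shorter route; .extractall() returns one row per match, so it needs to be grouped back to one row per original index.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Robert", "Bruce"],
    "injury": ["right hand broken", "", "laceration to arms, hands, and foot"],
})

pattern = r"(arm|chest|finger|foot|hand)"

# findall gives a list of all matches per row (empty list if none);
# .str.join then turns each list into one string.
all_parts = df["injury"].str.findall(pattern).str.join(", ")

# extractall gives one row per match, indexed by (original row, match number).
# Group on the original row, join, then restore rows that had no match.
grouped = (
    df["injury"].str.extractall(pattern)[0]
    .groupby(level=0)
    .agg(", ".join)
    .reindex(df.index, fill_value="")
)

print(all_parts.tolist())  # ['hand', '', 'arm, hand, foot']
```

Both give the same strings here; note that rows with no match come out as '' rather than NaN with either approach as written.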
Upvotes: 0