Reputation: 6159
I have a dataframe like this.
df = pd.DataFrame({
"Name" : ["ABC LLC Ram corp", "IJK Inc"],
"id" : [101, 102]
})
Name id
0 ABC LLC Ram corp 101
1 IJK Inc 102
I am trying to split the Name series into multiple rows based on my separator. I am able to split but unable to retain the separators too.
separators = ["inc","corp","llc"]
My expected output is,
Name id
ABC LLC 101
RAM corp 101
IJK Inc 102
Please help, thanks.
Upvotes: 2
Views: 151
Reputation: 71687
You can use str.findall
to find all the occurrence of matching regex pattern in column Name
, then assign these matching occurrences to the column Name
and explode
the dataframe on Name
:
pat = fr"(?i)(.*?(?:{'|'.join(separators)}))"
df.assign(Name=df['Name'].str.findall(pat)).explode('Name')
Regex details:
(?i)
: Case insensitive flag(
: Start of capturing group.*?
: Matches any character except line terminators between zero and unlimited times, as few times as possible (lazy match
).(?:
: start of a non capturing group{'|'.join(separators)}
: f-string
expression which evaluates to inc|corp|llc
)
: End of non-capturing group)
: End of capturing group Name id
0 ABC LLC 101
0 Ram corp 101
1 IJK Inc 102
Upvotes: 4
Reputation: 75100
A bit verbose approach , by replacing the spaces after the words with comma and then split:
d = dict(zip([f'{i} ' for i in separators],[f'{i},' for i in separators]))
#{'inc ': 'inc,', 'corp ': 'corp,', 'llc ': 'llc,'}
out = (df.assign(Name=df['Name'].str.lower()
.replace(d,regex=True).str.title().str.split(",")).explode("Name"))
print(out)
Name id
0 Abc Llc 101
0 Ram Corp 101
1 Ijk Inc 102
Upvotes: 3