Pyd
Pyd

Reputation: 6159

How to use multiple separators in a pandas Series and split into multiple rows

I have a dataframe like this.

df = pd.DataFrame({
    "Name" : ["ABC LLC Ram corp", "IJK Inc"],
    "id" : [101, 102]
 })

    Name                id
0 ABC LLC Ram corp      101
1 IJK Inc               102

I am trying to split the Name series into multiple rows based on my separator. I am able to split but unable to retain the separators too.

separators = ["inc","corp","llc"]

My expected output is,

Name       id
ABC LLC    101
RAM corp   101
IJK Inc    102

Please help, thanks.

Upvotes: 2

Views: 151

Answers (2)

Shubham Sharma
Shubham Sharma

Reputation: 71687

You can use str.findall to find all the occurrence of matching regex pattern in column Name, then assign these matching occurrences to the column Name and explode the dataframe on Name:

pat = fr"(?i)(.*?(?:{'|'.join(separators)}))"
df.assign(Name=df['Name'].str.findall(pat)).explode('Name')

Regex details:

  • (?i) : Case insensitive flag
  • ( : Start of capturing group
  • .*? : Matches any character except line terminators between zero and unlimited times, as few times as possible (lazy match).
  • (?: : start of a non capturing group
  • {'|'.join(separators)}: f-string expression which evaluates to inc|corp|llc
  • ) : End of non-capturing group
  • ) : End of capturing group

        Name   id
0    ABC LLC  101
0   Ram corp  101
1    IJK Inc  102

Upvotes: 4

anky
anky

Reputation: 75100

A bit verbose approach , by replacing the spaces after the words with comma and then split:

d = dict(zip([f'{i} ' for i in separators],[f'{i},' for i in separators]))
#{'inc ': 'inc,', 'corp ': 'corp,', 'llc ': 'llc,'}

out = (df.assign(Name=df['Name'].str.lower()
       .replace(d,regex=True).str.title().str.split(",")).explode("Name"))

print(out)

       Name   id
0   Abc Llc  101
0  Ram Corp  101
1   Ijk Inc  102

Upvotes: 3

Related Questions