youngpj
youngpj

Reputation: 41

Split string value in Pandas & add to a new column the split values one after the other

I have a Pandas dataframe like this :

id  A   B
0   1   toto+tata
1   1   toto+tata
2   2   titi+tutu
3   2   titi+tutu
4   3   toto+tata+titi
5   3   toto+tata+titi
6   3   toto+tata+titi

Thanks to the split function, I can split these string values :

for i in range(len(df)):
    var = output.iloc[i, 1].split("+")
    print(var)
['toto', 'tata']
['toto', 'tata']
['titi', 'tutu']
['titi', 'tutu']
['toto', 'tata', 'titi']
['toto', 'tata', 'titi']
['toto', 'tata', 'titi']

Now, I want to put these values in a new column, one value after the other :

id  A   B              C
0   1   toto+tata      toto
1   1   toto+tata      tata
2   2   titi+tutu      titi
3   2   titi+tutu      tutu
4   3   toto+tata+titi toto
5   3   toto+tata+titi tata
6   3   toto+tata+titi titi

I tried to do this :

for i in range(len(df)):
    var = df.iloc[i, 1].split("+")
    print(var)
    for y in range(len(var)):
        df.at[i, 'C'] = var[y]

But it always returns the last value of the split :

id  A   B              C
0   1   toto+tata      tata
1   1   toto+tata      tata
2   2   titi+tutu      tutu
3   2   titi+tutu      tutu
4   3   toto+tata+titi titi
5   3   toto+tata+titi titi
6   3   toto+tata+titi titi

I'm missing this little detail for my algo to work but I can't find it.

Upvotes: 1

Views: 211

Answers (1)

mozway
mozway

Reputation: 260690

Assuming you always have as many rows in a group that there are '+' separated items, that the groups are consecutive, and that the strings are identical per group.

A simple way is to remove the duplicates, str.split, and explode:

df['C'] = df.groupby('A')['B'].first().str.split('+').explode().values

output:

   id  A               B     C
0   0  1       toto+tata  toto
1   1  1       toto+tata  tata
2   2  2       titi+tutu  titi
3   3  2       titi+tutu  tutu
4   4  3  toto+tata+titi  toto
5   5  3  toto+tata+titi  tata
6   6  3  toto+tata+titi  titi

If the rows are not grouped per consecutive "group", apply the same logic per group with groupby+transform:

Example:

# shuffle the rows to generate an example
df2 = df.sample(frac=1)

# extract the chunks
df2['C'] = df2.groupby('A')['B'].transform(lambda x: x.head(1).str.split('+').explode().values)

output:

   id  A               B     C
4   4  3  toto+tata+titi  toto
1   1  1       toto+tata  toto
0   0  1       toto+tata  tata
3   3  2       titi+tutu  titi
6   6  3  toto+tata+titi  tata
5   5  3  toto+tata+titi  titi
2   2  2       titi+tutu  tutu

Upvotes: 1

Related Questions