Cactus Philosopher
Cactus Philosopher

Reputation: 864

Select a partial string from a list within a column in Pandas DataFrame

I have some DataFrame:

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)
df

fruit   values
0   apple   apple_1_0,peach_1_5
1   pear    pear_1_3
2   peach   mango_1_0,banana_1_0,pineapple_1_10

The strings in the values column are comma separated, and I'd like the strings that contain the substring '_1_0'.

Desired output:

    fruit   values
0   apple   apple_1_0
1   pear    NaN
2   peach   mango_1_0,banana_1_0

Something like this is somewhat close to what I'm trying to do but is painfully slow over ~100,000 rows:

for row in range(len(df)):
    print([zero for zero in df['values'].str.split(',', expand=False)[row] if "_1_0" in zero])

['apple_1_0']
[]
['mango_1_0', 'banana_1_0']

Upvotes: 2

Views: 911

Answers (4)

anubhava
anubhava

Reputation: 784908

Using findall you may do this:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'], 'values': ['apple_1_0,peach_1_5','pear_1_3','mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(','.join).replace('', np.NaN)    
print ( df )
   fruit                values
0  apple             apple_1_0
1   pear                   NaN
2  peach  mango_1_0,banana_1_0

Regex [^,]*_1_0(?=,|$) matches a non-comma string that ends in _1_0 followed by comma or end of string.


We can use a lambda as well:

df['values'] = df['values'].str.findall(r'[^,]*_1_0(?=,|$)').apply(lambda items: ','.join(items) if len(items) > 0 else np.NaN)

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28659

This is an alternative, as a list comprehension :

    df["values"] = [ ",".join(entry if entry.endswith("1_0") 
                              else "" 
                              for entry in val.split(","))
                       .rstrip(",")
                   for val in df["values"]
                   ]

     df = df.replace({"": np.nan})

    df


   fruit    values
0   apple   apple_1_0
1   pear    NaN
2   peach   mango_1_0,banana_1_0

Upvotes: 1

AMC
AMC

Reputation: 2702

Straightforward solution:

import numpy as np
import pandas as pd

d = {'fruit': ['apple', 'pear', 'peach'],
     'values': ['apple_1_0,peach_1_5', 'pear_1_3', 'mango_1_0,banana_1_0,pineapple_1_10']}
df = pd.DataFrame(data=d)

new_data = df['values'].str.split(',')
new_data = new_data.apply(lambda lst: [elem for elem in lst if '_1_0' in elem])
new_data = new_data.str.join(",")
new_data = new_data.replace('', np.NaN)

Upvotes: 1

BENY
BENY

Reputation: 323226

Let us try explode

s = df['values'].str.split(',').explode()
df['New_values'] = s.where(s.str.endswith('_1_0')).dropna().groupby(level=0).agg(','.join)
df
Out[29]: 
   fruit                               values            New_values
0  apple                  apple_1_0,peach_1_5             apple_1_0
1   pear                             pear_1_3                   NaN
2  peach  mango_1_0,banana_1_0,pineapple_1_10  mango_1_0,banana_1_0

Upvotes: 1

Related Questions