Thornhale

Reputation: 2376

How do I efficiently parse a substring in a Pandas Dataframe based on a value in a column?

Assume I have the following table:

random_string|end_location|substring
-------------|------------|---------
HappyBirthday|     4      |Happ
GoodBye      |     5      |GoodB
NaN          |    NaN     |NaN
Haensel      |     2      |Ha
...          |     ...    |...

This table represents the desired output. The initial input is just the first two columns. The question is: how do I get there elegantly? Rows without content (NaN) should not be processed.

I tried the following:

df['random_string'].str[0:4] or [0:5]

The problem with this approach is that ALL strings are cut to the same length, which is not the desired outcome.

I tried:

# dropna() skips rows without content, where int(NaN) would raise
for idx, row in df.dropna().iterrows():
    end = int(row['end_location'])
    df.loc[idx, 'substring'] = row['random_string'][0:end]

This works, but it feels inelegant and inefficient. Can the above be done in a more elegant, perhaps vectorized, way? Maybe something with apply could work?
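For reference, the apply-based idea mentioned above might look roughly like the sketch below. The helper name `prefix_or_nan` is made up for illustration, and the sample frame is rebuilt from the table in the question. Note that `apply` with `axis=1` still calls a Python function once per row, so it is tidier than the loop but not truly vectorized.

```python
import numpy as np
import pandas as pd

# Sample input rebuilt from the table in the question (first two columns only).
df = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})

def prefix_or_nan(row):
    """Hypothetical helper: slice the string up to end_location, or return NaN."""
    # Rows without content are passed through as NaN instead of being sliced.
    if pd.isna(row["random_string"]) or pd.isna(row["end_location"]):
        return np.nan
    return row["random_string"][: int(row["end_location"])]

# axis=1 hands each row to the helper as a Series.
df["substring"] = df.apply(prefix_or_nan, axis=1)
print(df)
```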

Upvotes: 2

Views: 242

Answers (2)

piRSquared

Reputation: 294328

df

   random_string  end_location
0  HappyBirthday           4.0
1        GoodBye           5.0
2            NaN           NaN
3        Haensel           2.0

Work with subset

d1 = df.dropna()
rs = d1.random_string.values.tolist()
el = d1.end_location.values.astype(int).tolist()  # Thx @MaxU for `astype(int)`
df.loc[d1.index, 'substring'] = [s[:n] for s, n in zip(rs, el)]

   random_string  end_location substring
0  HappyBirthday           4.0      Happ
1        GoodBye           5.0     GoodB
2            NaN           NaN       NaN
3        Haensel           2.0        Ha
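If you want to reuse this, the same idea can be wrapped in a function. This is only a sketch: the function name `add_prefix_column` and its parameters are my own invention, not a pandas API, and it returns a copy rather than modifying the frame in place. It assigns an index-aligned Series, so rows dropped by `dropna` simply stay NaN.

```python
import numpy as np
import pandas as pd

# Sample input rebuilt from the frame shown above.
df = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})

def add_prefix_column(frame, text_col="random_string",
                      end_col="end_location", out_col="substring"):
    """Hypothetical helper: return a copy with out_col = text_col[:end_col]."""
    result = frame.copy()
    # Only rows where both inputs are present get processed.
    complete = result.dropna(subset=[text_col, end_col])
    strings = complete[text_col].tolist()
    ends = complete[end_col].astype(int).tolist()
    prefixes = [s[:n] for s, n in zip(strings, ends)]
    # Assigning a Series aligns on the index; skipped rows become NaN.
    result[out_col] = pd.Series(prefixes, index=complete.index, dtype=object)
    return result

out = add_prefix_column(df)
print(out)
```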

Timing

df = pd.concat([df] * 10**4, ignore_index=True)

%%timeit
mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)
df.loc[mask, 'substring'] = [t[0][:int(t[1])] for t in df[mask].values.tolist()]
10 loops, best of 3: 26.1 ms per loop

%%timeit
d1 = df.dropna()
rs = d1.random_string.values.tolist()
el = d1.end_location.values.astype(int).tolist()
df.loc[d1.index, 'substring'] = [s[:int(n)] for s, n in zip(rs, el)]
10 loops, best of 3: 21.5 ms per loop
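To rerun the comparison outside IPython, something like the following should work with the standard `timeit` module. The function names `mask_variant` and `dropna_variant` are labels I made up for the two approaches, and only the list construction is timed (not the `df.loc` assignment). Absolute numbers will differ by machine and pandas version, so I am not quoting any here.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})
big = pd.concat([base] * 10**4, ignore_index=True)  # 40,000 rows

def mask_variant(frame):
    # Boolean-mask approach: comparisons against NaN are False.
    mask = (frame["random_string"].str.len() >= 0) & (frame["end_location"] >= 0)
    return [t[0][: int(t[1])] for t in frame[mask].values.tolist()]

def dropna_variant(frame):
    # dropna approach: convert both columns to plain Python lists first.
    d1 = frame.dropna()
    rs = d1["random_string"].tolist()
    el = d1["end_location"].astype(int).tolist()
    return [s[:n] for s, n in zip(rs, el)]

loops = 5
mask_s = min(timeit.repeat(lambda: mask_variant(big), number=loops, repeat=3)) / loops
dropna_s = min(timeit.repeat(lambda: dropna_variant(big), number=loops, repeat=3)) / loops
print(f"mask:   {mask_s * 1e3:.1f} ms per loop")
print(f"dropna: {dropna_s * 1e3:.1f} ms per loop")
```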

Upvotes: 2

MaxU - stand with Ukraine

Reputation: 210852

Try this:

In [24]: df
Out[24]:
   random_string  end_location
0  HappyBirthday           4.0
1        GoodBye           5.0
2            NaN           NaN
3        Haensel           2.0

In [25]: mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)

In [26]: df[mask]
Out[26]:
   random_string  end_location
0  HappyBirthday           4.0
1        GoodBye           5.0
3        Haensel           2.0

In [27]: df.loc[mask, 'substring'] = [t[0][:int(t[1])] for t in df[mask].values.tolist()]

In [28]: df
Out[28]:
   random_string  end_location substring
0  HappyBirthday           4.0      Happ
1        GoodBye           5.0     GoodB
2            NaN           NaN       NaN
3        Haensel           2.0        Ha
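A note on why the mask filters out the empty row: `.str.len()` returns NaN for missing strings, and any comparison against NaN evaluates to False. For this sample data the mask therefore matches an explicit `notna()` check, as the sketch below illustrates. The names `len_mask` and `notna_mask` are just for the demo. The two would differ if `end_location` contained negative numbers, which the `>= 0` test also rejects.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})

# Mask from the answer: NaN >= 0 is False, so the empty row is excluded.
len_mask = (df["random_string"].str.len() >= 0) & (df["end_location"] >= 0)

# Explicit missing-value check, shown only for comparison.
notna_mask = df["random_string"].notna() & df["end_location"].notna()

print(len_mask.tolist())
print(notna_mask.tolist())
```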

Timing for a larger (40K rows) DF

In [179]: df = pd.concat([df] * 10**4, ignore_index=True)

In [40]: %%timeit
    ...: mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)
    ...: [t[0][:int(t[1])] for t in df[mask].values.tolist()]
    ...:
10 loops, best of 3: 77.3 ms per loop

In [41]: df.shape
Out[41]: (40000, 2)

Upvotes: 2
