How do I efficiently parse a substring in a Pandas Dataframe based on a value in a column?

Question

Assume I have the following table:

random_string|end_location|substring
-------------|------------|---------
HappyBirthday|     4      |Happ
GoodBye      |     5      |GoodB
NaN          |    NaN     |NaN
Haensel      |     2      |Ha
...          |     ...    |...

This table represents the desired output. The initial input is just the first two columns. The question is: how do I get there elegantly? Rows without content (NaN) should not be processed.

I tried the following:

df['random_string'].str[0:4] or [0:5]

The problem with this approach is that ALL strings are cut to the same length, which is not the desired outcome.

I tried:

for row in df.iterrows():
    end = int(row[1]['end_location'])
    df.loc[row[0], 'substring'] = row[1]['random_string'][0:end]

This works, but I feel it is rather inelegant and inefficient. Can the above be performed in a more elegant, perhaps vectorized, way? If so, how? Maybe something with apply could work?

piRSquared · Accepted Answer

df

   random_string  end_location
0  HappyBirthday           4.0
1        GoodBye           5.0
2            NaN           NaN
3        Haensel           2.0

Work with subset

d1 = df.dropna()
rs = d1.random_string.values.tolist()
el = d1.end_location.values.astype(int).tolist()  # Thx @MaxU for `astype(int)`
df.loc[d1.index, 'substring'] = [s[:n] for s, n in zip(rs, el)]

   random_string  end_location substring
0  HappyBirthday           4.0      Happ
1        GoodBye           5.0     GoodB
2            NaN           NaN       NaN
3        Haensel           2.0        Ha
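
For readers who want to run the subset approach end to end, here is a minimal self-contained sketch. The DataFrame construction is an assumption reconstructed from the printed table above. It also assigns the result as an index-aligned Series rather than through `df.loc`, which is a small variation on the code above: rows missing from the subset simply come out as NaN.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})

# Keep only the rows where both columns are present.
d1 = df.dropna()
rs = d1["random_string"].tolist()
el = d1["end_location"].astype(int).tolist()  # floats such as 4.0 become 4

# Slice each string to its own length, then align the result on the
# original index; rows dropped above are left as NaN.
df["substring"] = pd.Series([s[:n] for s, n in zip(rs, el)], index=d1.index)

print(df)
```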

Timing

df = pd.concat([df] * 10**4, ignore_index=True)

%%timeit
mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)
df.loc[mask, 'substring'] = [t[0][:int(t[1])] for t in df[mask].values.tolist()]
10 loops, best of 3: 26.1 ms per loop

%%timeit
d1 = df.dropna()
rs = d1.random_string.values.tolist()
el = d1.end_location.values.astype(int).tolist()
df.loc[d1.index, 'substring'] = [s[:int(n)] for s, n in zip(rs, el)]
10 loops, best of 3: 21.5 ms per loop
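
The question also asks whether apply could work. The following is a hedged sketch of that option, not part of the timings above. The helper name `cut` is hypothetical, and the sample data is reconstructed from the question's table. Since `apply` with `axis=1` calls a Python function once per row, it is usually slower than the list comprehension on large frames, so treat it as a readability alternative rather than a faster one.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})

def cut(row):
    """Hypothetical helper: slice one row's string, skipping incomplete rows."""
    if pd.isna(row["random_string"]) or pd.isna(row["end_location"]):
        return np.nan
    return row["random_string"][: int(row["end_location"])]

# axis=1 passes each row to the helper as a Series.
df["substring"] = df.apply(cut, axis=1)

print(df)
```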
