Reputation: 2376
Assume I have the following table:
random_string | end_location | substring
--------------|--------------|----------
HappyBirthday | 4            | Happ
GoodBye       | 5            | GoodB
NaN           | NaN          | NaN
Haensel       | 2            | Ha
...           | ...          | ...
This table represents the desired output. The initial input is just the first two columns. The question is: how do I get there elegantly? Rows without content should not be processed.
I tried the following:
df['random_string'].str[0:4] or [0:5]
The problem with this approach is that ALL strings are cut to the same length, which is not the desired outcome.
I tried:
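for idx, row in df.dropna(subset=['random_string', 'end_location']).iterrows():
    # rows with missing values are skipped, since int(NaN) would raise
    end = int(row['end_location'])
    df.loc[idx, 'substring'] = row['random_string'][0:end]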
This works, but I feel it is rather inelegant and inefficient. How can the above be performed in a more elegant, perhaps vectorized, way? Maybe something with apply could work?
Upvotes: 2
Views: 242
Reputation: 294328
df
random_string end_location
0 HappyBirthday 4.0
1 GoodBye 5.0
2 NaN NaN
3 Haensel 2.0
Work with subset
d1 = df.dropna()
rs = d1.random_string.values.tolist()
el = d1.end_location.values.astype(int).tolist() # Thx @MaxU for `astype(int)`
df.loc[d1.index, 'substring'] = [s[:n] for s, n in zip(rs, el)]
random_string end_location substring
0 HappyBirthday 4.0 Happ
1 GoodBye 5.0 GoodB
2 NaN NaN NaN
3 Haensel 2.0 Ha
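For anyone who wants to run the subset approach end to end, here is a minimal self-contained sketch. The helper name `add_substring` is hypothetical (it is not part of pandas), and the sketch assumes `end_location` holds whole numbers or NaN.

```python
import numpy as np
import pandas as pd


def add_substring(df):
    """Return a copy of df with a 'substring' column; rows with NaN inputs stay NaN."""
    out = df.copy()
    # Work only with rows where both inputs are present.
    d1 = out.dropna(subset=["random_string", "end_location"])
    rs = d1["random_string"].tolist()
    el = d1["end_location"].astype(int).tolist()
    # Build the result on the subset's index; assignment aligns on the index,
    # so rows that were dropped end up as NaN.
    out["substring"] = pd.Series(
        [s[:n] for s, n in zip(rs, el)], index=d1.index, dtype=object
    )
    return out


df = pd.DataFrame(
    {
        "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
        "end_location": [4, 5, np.nan, 2],
    }
)
result = add_substring(df)
print(result)
```

Returning a copy rather than modifying `df` in place is a design choice of this sketch, not something the original code does.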
Timing
df = pd.concat([df] * 10**4, ignore_index=True)
%%timeit
mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)
df.loc[mask, 'substring'] = [t[0][:int(t[1])] for t in df[mask].values.tolist()]
10 loops, best of 3: 26.1 ms per loop
%%timeit
d1 = df.dropna()
rs = d1.random_string.values.tolist()
el = d1.end_location.values.astype(int).tolist()
df.loc[d1.index, 'substring'] = [s[:int(n)] for s, n in zip(rs, el)]
10 loops, best of 3: 21.5 ms per loop
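`%%timeit` is an IPython magic, so the cells above only run in IPython or Jupyter. A rough equivalent with the standard-library `timeit` module might look like the sketch below. The function names `with_mask` and `with_dropna` are made up for this example, and no timings are quoted because they depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
        "end_location": [4, 5, np.nan, 2],
    }
)
big = pd.concat([small] * 10**4, ignore_index=True)  # 40,000 rows


def with_mask(df):
    # Boolean-mask variant: comparisons against NaN are False.
    mask = (df["random_string"].str.len() >= 0) & (df["end_location"] >= 0)
    return [t[0][: int(t[1])] for t in df[mask].values.tolist()]


def with_dropna(df):
    # dropna variant: slice each string by its own end position.
    d1 = df.dropna()
    rs = d1["random_string"].tolist()
    el = d1["end_location"].astype(int).tolist()
    return [s[:n] for s, n in zip(rs, el)]


for fn in (with_mask, with_dropna):
    # Best of 3 repeats, 3 calls each, reported per call.
    best = min(timeit.repeat(lambda: fn(big), number=3, repeat=3)) / 3
    print(f"{fn.__name__}: {best * 1e3:.1f} ms per loop")
```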
Upvotes: 2
Reputation: 210852
Try this:
In [24]: df
Out[24]:
random_string end_location
0 HappyBirthday 4.0
1 GoodBye 5.0
2 NaN NaN
3 Haensel 2.0
In [25]: mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)
In [26]: df[mask]
Out[26]:
random_string end_location
0 HappyBirthday 4.0
1 GoodBye 5.0
3 Haensel 2.0
In [27]: df.loc[mask, 'substring'] = [t[0][:int(t[1])] for t in df[mask].values.tolist()]
In [28]: df
Out[28]:
random_string end_location substring
0 HappyBirthday 4.0 Happ
1 GoodBye 5.0 GoodB
2 NaN NaN NaN
3 Haensel 2.0 Ha
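A short sketch of why the mask filters out the empty row: any comparison against NaN evaluates to False, so `>= 0` doubles as a "not missing" test. The variable names `has_string`, `has_end` and `alt` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
        "end_location": [4, 5, np.nan, 2],
    }
)

# .str.len() yields NaN for missing strings, and NaN >= 0 is False.
has_string = df["random_string"].str.len() >= 0
has_end = df["end_location"] >= 0
mask = has_string & has_end
print(mask.tolist())  # [True, True, False, True]

# For this data, an explicit notna() check selects the same rows.
alt = df["random_string"].notna() & df["end_location"].notna()
print(alt.tolist())  # [True, True, False, True]
```

The two masks differ only if `end_location` contains negative numbers, which `>= 0` would also exclude.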
Timing for a larger DataFrame (40K rows):
In [179]: df = pd.concat([df] * 10**4, ignore_index=True)
In [40]: %%timeit
...: mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)
...: [t[0][:int(t[1])] for t in df[mask].values.tolist()]
...:
10 loops, best of 3: 77.3 ms per loop
In [41]: df.shape
Out[41]: (40000, 2)
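Since the question asks about `apply`, here is how a row-wise variant could look, for comparison only. The helper name `cut` is hypothetical. Note that `apply` with `axis=1` still calls a Python function once per row, so it should not be expected to beat the list comprehension above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
        "end_location": [4, 5, np.nan, 2],
    }
)


def cut(row):
    # Leave rows with a missing string or end position untouched (NaN).
    if pd.isna(row["random_string"]) or pd.isna(row["end_location"]):
        return np.nan
    return row["random_string"][: int(row["end_location"])]


df["substring"] = df.apply(cut, axis=1)
print(df)
```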
Upvotes: 2