Reputation: 1624
I have the following kind of strings in my column, seen below. I would like to parse out everything after the last _ of each string, and if there is no _ then leave the string as-is (as my try below will just exclude strings with no _).
So far I have tried the code below, seen here: Python pandas: remove everything after a delimiter in a string . But it just parses out everything after the first _ instead of the last one.
d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.split('_').str[0]
Here are some example strings in my SOURCE_NAME column.
Stackoverflow_1234
Stack_Over_Flow_1234
Stackoverflow
Stack_Overflow_1234
Expected:
Stackoverflow
Stack_Over_Flow
Stackoverflow
Stack_Overflow
Any help would be appreciated.
Upvotes: 2
Views: 12212
Reputation: 828
Use a combination of str.rsplit and str.get for your desired outcome. str.rsplit splits a string starting from the end, while str.get picks the nth element of each list-like entry in a pd.Series.
d6['SOURCE_NAME'] = d6['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
The n argument in rsplit limits the number of splits, so you keep everything before the last '_' in a single element. Even though a solution using pd.Series.apply is roughly twice as fast, I like this one because it is more expressive in its syntax. If you want the faster pd.Series.apply solution, check the timing part below!
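For intuition, here is a minimal sketch of the two steps on a small Series built from two of the example strings (the variable names are just for illustration):
import pandas as pd

s = pd.Series(['Stack_Over_Flow_1234', 'Stackoverflow'])

# n=1 means at most one split, working from the right, so each row
# becomes a list: ['Stack_Over_Flow', '1234'] and ['Stackoverflow']
parts = s.str.rsplit('_', n=1)

# .str.get(0) picks element 0 of each list, so a string with no
# underscore comes back unchanged
print(parts.str.get(0))  # -> Stack_Over_Flow, Stackoverflow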
strs = ['Stackoverflow_1234',
'Stack_Over_Flow_1234',
'Stackoverflow',
'Stack_Overflow_1234']
df = pd.DataFrame(data={'SOURCE_NAME': strs})
This results in
print(df)
SOURCE_NAME
0 Stackoverflow_1234
1 Stack_Over_Flow_1234
2 Stackoverflow
3 Stack_Overflow_1234
Using the proposed solution:
df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
0 Stackoverflow
1 Stack_Over_Flow
2 Stackoverflow
3 Stack_Overflow
Name: SOURCE_NAME, dtype: object
Interestingly, using pd.Series.str is not necessarily faster than using pd.Series.apply:
import pandas as pd
df = pd.DataFrame(data={'SOURCE_NAME': ['stackoverflow_1234_abcd'] * 1000})
%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
497 µs ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
1.04 ms ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# increasing the number of rows x 100
df = pd.concat([df] * 100)
%timeit df['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
31.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['SOURCE_NAME'].str.rsplit('_', n=1).str.get(0)
84.1 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 9
Reputation: 33
You can use str.split('_') to split the string into a list of substrings at every underscore, then recombine them without the last element. Here is a snippet using your examples:
a = ["Stackoverflow_1234", "Stack_Over_Flow_1234", "Stackoverflow", "Stack_Overflow_1234"]
for e in a:
    # Split the string into a list, separated at '_'
    splitStr = e.split("_")
    # If there is only 1 element, we can use it directly
    if len(splitStr) == 1:
        print(splitStr[0])
    else:
        # Slice off the final substring and join the remaining
        # substrings back together with underscores
        print("_".join(splitStr[:-1]))
Upvotes: 1
Reputation: 922
Using rsplit() gives you what you want to achieve, since you can tell it how many times to split your string.
s = "Stack_Over_Flow_1234"
s.rsplit('_', 1)[0] # Split my string one time and get the first part of it
This then returns 'Stack_Over_Flow'
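The same call also covers the no-underscore case from the question; a quick check using one of the question's example values:
# rsplit finds nothing to split on, so index 0 is the original string
"Stackoverflow".rsplit('_', 1)[0]  # returns 'Stackoverflow'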
Upvotes: 1
Reputation: 515
You could try applying a lambda as such (using rsplit with one split, so only the part after the last _ is removed):
d6['SOURCE_NAME'] = d6['SOURCE_NAME'].apply(lambda x: x.rsplit('_', 1)[0])
Hope that helps!
Upvotes: 1