Reputation: 103
I have a pandas dataframe column with strings that look like
'2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC'
and I want to convert it to look like
'ABACUS LABS, INC'.
My piece code :
list1 = data_df['Vendor'].str.split()
print(list1)
excludeList = list()
for y in list1:
if (any([x for x in y if x.isalpha()]) and any([x for x in y if x.isdigit()])) :
excludeList.append(y)
if y.isdigit() or len(y) == 1:
excludeList.append(y)
resList = [x for x in list1 if x not in excludeList]
print(restList)
It however gives me an error for
'list' object has no attribute 'isdigit'
Can anyone help me how I can remove the alphanumeric words from the string and only retain the text part in my pandas dataframe column?
Upvotes: 3
Views: 1588
Reputation: 626896
You can use
data_df = pd.DataFrame({'Vendor': ['2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC', 'abc123 CAT LABS, INC']})
data_df['Vendor'].str.replace(r'^(?:[A-Za-z-]+\d|[\d-]+[A-Za-z])[\w-]*\s*', '', regex=True)
# => 0 ABACUS LABS, INC
# 1 CAT LABS, INC
# Name: Vendor, dtype: object
See the regex demo.
Regex details
^
- start of string(?:[A-Za-z-]+\d|[\d-]+[A-Za-z])
- either one or more letters/dashes and then a digit or a one or more digits/dashes and then a letter[\w-]*
- zero or more word or -
chars\s*
- zero or more whitespace chars.Upvotes: 2
Reputation: 23217
Assuming you have column Text
in dataframe df
, you can try:
df2 = df['Text'].str.split().explode()
m = df2.str.contains(r'[A-Za-z]') & df2.str.contains(r'\d')
df_out = df2[~m].groupby(level=0).agg(' '.join)
df_out = df_out.to_frame(name='Text')
Explanation
We split the text into separate words then explode the list of words into multiple rows with one word in one row. Then we test whether the word contains any alpha character(s) and digit(s) by regex by using .str.contains()
as follows:
.str.contains(r'[A-Za-z]') # test any character in [A-Za-z] in string
and
.str.contains(r'\d') # test any numeric digit in string
Then with the boolean mask m
of the alpha and digit tests, we select only those row entries that does not contains both the alpha and digits by:
df2[~m]
Then, we assemble the filtered words (without alphanumeric words) back to a sentence by using
groupby(level=0).agg(' '.join)
Here, we group by level=0
which is the original row index before explode (i.e. original row number).
Demo
data = {'Text': ['2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC', 'abc123 CAT LABS, INC']}
df = pd.DataFrame(data)
Text
0 2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC
1 abc123 CAT LABS, INC
df2 = df['Text'].str.split().explode()
m = df2.str.contains(r'[A-Za-z]') & df2.str.contains(r'\d')
df_out = df2[~m].groupby(level=0).agg(' '.join)
df_out = df_out.to_frame(name='Text')
Text
0 ABACUS LABS, INC
1 CAT LABS, INC
We can also simplify it as:
df2 = df['Text'].str.findall(r'\b(?!.*[A-Za-z]+.*\d+)(?!.*\d+.*[A-Za-z]+.*).+\b').str.join(' ').str.strip()
Explanation
Here the regex we use is still to adhere to the requirement of excluding alphanumeric word. Regex:
r'\b(?!.*[A-Za-z]+.*\d+)(?!.*\d+.*[A-Za-z]+.*).+\b'
Within the word boundary \b
.... \b
, we use 2 negative lookaheads to check against both alpha and numeric characters. We need 2 negative lookaheads instead of one because it is possible the alpha can appear before the digit or vice versa.
Upvotes: 2
Reputation: 16660
You can use regular expressions to ensure quick and elegant solution:
df2 = df['Text'].str.findall(r'((?<=\s)[a-zA-Z,]+(?=\s|$))').agg(' '.join)
Let's break it down:
df['Text']
extract list of matches of the regex.' '.join
function that concatenates the values in the list adding space in-between.The regex is doing this:
[a-zA-Z,]
which allows letters lowercase and uppercase as well as a comma.Comparing with @SeaBean solution the time difference on my machine is notable (per 2 million records dataframe):
There is also smaller memory impact of my solution vs. SeaBean's as he is creating additional temporary dataframe.
Upvotes: 3