Reputation: 1343
I have read multiple posts regarding this error, but I still can't figure it out. The error occurs when I call my function in a loop:
def fix_Plan(location):
    letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          location)     # Column and row to search
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    return (" ".join(meaningful_words))

col_Plan = fix_Plan(train["Plan"][0])
num_responses = train["Plan"].size
clean_Plan_responses = []
for i in range(0,num_responses):
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
Here is the error:
Traceback (most recent call last):
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 48, in <module>
    clean_Plan_responses.append(fix_Plan(train["Plan"][i]))
  File "C:/Users/xxxxx/PycharmProjects/tronc/tronc2.py", line 22, in fix_Plan
    location) # Column and row to search
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
Upvotes: 132
Views: 551073
Reputation: 999
I had the same problem. Nothing I tried solved it until I realized that the string contained two special characters.
In my case, the text contained two invisible characters: U+200E (Left-to-Right Mark) and U+200C (Zero-width non-joiner).
The solution for me was to delete these two characters.
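import re
mystring = "\u200eSome Time W\u200ce"  # contains U+200E (LRM) and U+200C (ZWNJ)
mystring = re.sub("\u200e", "", mystring)  # remove the Left-to-Right Mark
mystring = re.sub("\u200c", "", mystring)  # remove the Zero-width non-joiner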
I hope this has helped someone who has a problem like me.
Upvotes: 4
Reputation: 23081
str.replace instead
This is about 7 years too late for OP, but if you got here because of a similar error from using re.sub on a pandas column, consider using the str.replace method built into pandas instead. The most common cause of this error is a pandas column containing (unexpected) NaN values, which re.sub cannot handle, whereas str.replace handles them under the hood for us.
Example:
import re
import pandas as pd

train = pd.DataFrame({'Plan': ["th1s", '1s', 'N01ce', 'and', float('nan')]})
[re.sub("[^a-zA-Z]", " ", x) for x in train['Plan']] # <--- TypeError: expected string or bytes-like object
train['Plan'].str.replace(r"[^a-zA-Z]", " ", regex=True) # <--- OK
Now for OP, their fix_Plan function does more than just replace strings; however, we can still do all of that in a vectorized way as follows (more or less replacing the re functions with their pandas counterparts).
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
stop_words = '|'.join(fr"\b{w}\b" for w in stops) # pattern to catch stop words
clean_Plan_responses = (
    train['Plan']
    .str.replace("[^a-zA-Z]", " ", regex=True) # replace all non-letters with spaces
    .str.lower()                               # convert to lower case
    .str.replace(stop_words, "", regex=True)   # remove all stop words
    .str.split().str.join(" ")                 # remove extraneous space characters
)
Upvotes: 1
Reputation: 15
In my experience with Python, this is caused by a None value passed as the second argument to re.findall().
import re
x = re.findall(r"\[(.*?)\]", None)
One can reproduce the error with this code sample.
To avoid this error, filter out the null values or add a condition that skips them during processing.
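As a minimal sketch of that filtering, assuming the inputs sit in a plain Python list (the `texts` list and its contents are made up for illustration):

```python
import re

# Hypothetical input: some entries are None instead of strings.
texts = ["[a] and [b]", None, "no brackets", "[c]"]

# Skip None entries before calling re.findall, which only accepts str or bytes.
matches = [re.findall(r"\[(.*?)\]", t) for t in texts if t is not None]
```

Here the None entry is dropped entirely; if positions must be preserved, substituting an empty list for missing entries is another option.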
Upvotes: 0
Reputation: 291
The simplest solution is to apply the Python str function to the column you are trying to loop through.
If you are using pandas, this can be implemented as:
dataframe['column_name']=dataframe['column_name'].apply(str)
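One thing to keep in mind with this approach is that str turns a missing value into the literal text "nan". The sketch below shows that effect on a small made-up frame, together with one possible alternative (filling missing values with an empty string first); the data and the variable names as_text, blank_filled, and cleaned are assumptions for illustration.

```python
import re
import pandas as pd

# Hypothetical data with one missing value.
dataframe = pd.DataFrame({"column_name": ["th1s", "1s", float("nan")]})

# apply(str) makes every value a string, so NaN becomes the text "nan".
as_text = dataframe["column_name"].apply(str)

# Alternative: replace missing values with "" before converting.
blank_filled = dataframe["column_name"].fillna("").apply(str)

# re.sub now works on every element because all of them are strings.
cleaned = [re.sub("[^a-zA-Z]", " ", s) for s in blank_filled]
```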
Upvotes: 29
Reputation: 89
I suppose it would be better to use the re.match() function. Here is an example that may help you.
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
sentences = word_tokenize("I love to learn NLP \n 'a :(")
#for i in range(len(sentences)):
sentences = [word.lower() for word in sentences if re.match('^[a-zA-Z]+', word)]
sentences
Upvotes: 0
Reputation: 33714
As you stated in the comments, some of the values appeared to be floats, not strings. You will need to convert them to strings before passing them to re.sub. The simplest way is to change location to str(location) when calling re.sub. It wouldn't hurt to do it anyway, even if the value is already a str.
letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                      " ",          # Replace all non-letters with spaces
                      str(location))
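Note that str(float("nan")) is the text "nan", so with this fix a missing value ends up as the word nan in the cleaned output. If that is not wanted, a variant that returns an empty string for non-string input might look like the sketch below. The name fix_plan_safe and the small STOPS set are hypothetical; STOPS stands in for set(stopwords.words("english")) so the example runs without nltk.

```python
import re

# Hypothetical stand-in for set(stopwords.words("english")).
STOPS = {"the", "is", "a", "and"}

def fix_plan_safe(location):
    """Clean one value; non-string input (e.g. a float NaN or None) yields ""."""
    if not isinstance(location, str):
        return ""
    letters_only = re.sub("[^a-zA-Z]", " ", location)  # non-letters -> spaces
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOPS)

values = ["The plan is 2 simple!", float("nan"), None]
cleaned = [fix_plan_safe(v) for v in values]

# For comparison: the str() conversion keeps the missing value as the word "nan".
nan_as_text = re.sub("[^a-zA-Z]", " ", str(float("nan")))
```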
Upvotes: 175