I retrieved a set of text records from my PostgreSQL database and want to preprocess these documents before analyzing them.
I want to tokenize the documents, but I ran into a problem during tokenization:
# ... some other regex replacements ...
# toTokens is the text string
toTokens = self.regexClitics1.sub(" \\1",toTokens)
toTokens = self.regexClitics2.sub(" \\1 \\2",toTokens)
toTokens = str.strip(toTokens)
The error is TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'
I'm curious: why does this error occur when the encoding of the database is UTF-8?
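For reference, here is a minimal sketch of what the error message seems to describe. In Python 2, `str` and `unicode` are separate types, and `str.strip` is a method descriptor that only accepts `str` instances, so the check fails on the Python type of the object regardless of the database encoding. The sketch below is a Python 3 analogue (an assumption made so that it runs on a current interpreter): `bytes` stands in for the "other" string type, and the helper names `strip_via_type` and `strip_via_instance` are hypothetical, not part of the original code.

```python
# Python 3 analogue of the failure: str.strip is a method descriptor tied to
# the str type, so it rejects an instance of any other type. In Python 2 the
# same rule rejects unicode, which is a separate type from str.


def strip_via_type(value):
    """Mirror the failing call: invoke strip through the str type."""
    return str.strip(value)


def strip_via_instance(value):
    """Call strip on the object itself; works for any string-like type."""
    return value.strip()


text = "  don't stop  "
raw = b"  don't stop  "  # stands in for the "other" string type

# Calling the method on the instance works for both types.
print(strip_via_instance(text))
print(strip_via_instance(raw))

# Calling it through str works only for real str instances.
print(strip_via_type(text))
try:
    strip_via_type(raw)
except TypeError as exc:
    print("TypeError:", exc)
```

If that is what is happening here, writing `toTokens = toTokens.strip()` instead of `str.strip(toTokens)` should sidestep the type check, since the method is then looked up on whatever type the object actually has.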