Hessiann

Reputation: 51

Replacing characters in Python returns TypeError: string operation on non-string array

I have a dataframe 'data' and I want to replace all the punctuation marks in a given column with nothing (so I want to remove them).

I'm using Python 3 with pandas and NumPy to preprocess some text before feeding it to a neural network.

symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
dataClean = data['description']

for i in symbols:
    dataClean = np.char.replace(dataClean,i,"")

I expected every punctuation mark to be removed from the string in each row of dataClean (its index runs from 0 to 2549). Instead, I got this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-87-aa944ae6e61c> in <module>
      3 
      4 for i in symbols:
----> 5     dataClean = np.char.replace(dataClean,i,"")
      6 
      7 print(dataClean[2])

~\Anaconda3\lib\site-packages\numpy\core\defchararray.py in replace(a, old, new, count)
   1184     return _to_string_or_unicode_array(
   1185         _vec_string(
-> 1186             a, object_, 'replace', [old, new] + _clean_args(count)))
   1187 
   1188 

TypeError: string operation on non-string array

Upvotes: 2

Views: 8310

Answers (2)

unutbu

Reputation: 879451

If dataClean is a Pandas Series of strings, you could use the Series.str.translate method:

symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"  # "\\" is one literal backslash
dataClean = data['description']
dataClean = dataClean.str.translate({ord(symbol):"" for symbol in symbols})

For example, suppose we had the DataFrame, df:

In [59]: df = pd.DataFrame({'data':['[Yes?]', '(No!)', 100]}); df
Out[59]: 
     data
0  [Yes?]
1   (No!)
2     100

Then we can make a dict mapping Unicode ordinals to strings (in this case, the empty string):

In [52]: symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
In [57]: {ord(symbol):"" for symbol in symbols}
Out[57]: 
{33: '',
 34: '',
 ...
 126: '',
 10: ''}

Every Unicode ordinal, or code point, corresponds to a Unicode character, and a Python 3 string is a sequence of Unicode characters. For each string in the Series, the translate method replaces each character whose ordinal is a key in the dict with the corresponding string; characters that are not in the dict are left unchanged.

In [60]: df['data'].str.translate({ord(symbol):"" for symbol in symbols})
Out[60]: 
0    Yes
1     No
2    NaN
Name: data, dtype: object

Notice that translate maps non-strings such as the 100 in the third row to NaN.
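If the non-string rows should be cleaned rather than turned into NaN, one option is to cast the column to str before translating. The following is only a sketch under that assumption: the DataFrame is the hypothetical example from above, and str.maketrans("", "", symbols) is used as an equivalent way to build a deletion table. Be aware that, depending on the pandas version, astype(str) may turn missing values into the literal string 'nan'.

```python
import pandas as pd

# Same symbols as above; the backslash is written as "\\" (one literal backslash).
symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"

# Hypothetical example data: two strings and one integer.
df = pd.DataFrame({'data': ['[Yes?]', '(No!)', 100]})

# With three arguments, str.maketrans builds a table that deletes
# every character listed in the third argument.
table = str.maketrans("", "", symbols)

# Cast to str first so that the integer 100 becomes '100' instead of NaN.
cleaned = df['data'].astype(str).str.translate(table)
print(cleaned.tolist())
```

Whether casting is appropriate depends on the data: if a non-string value signals a bad row, keeping the NaN may be the more useful behaviour.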

Upvotes: 4

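You can use str.replace with a regular-expression character class:

symbols = "[!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n]"
# regex=True is needed in pandas 2.0 and later, where str.replace matches literally by default
dataClean = dataClean.str.replace(symbols, "", regex=True)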
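A hand-written character class is easy to get subtly wrong. In the pattern above, +-. is read as a range from + to ., which also matches a comma, and a literal backslash is not matched at all. A possible way around this, sketched here with a hypothetical example Series, is to let re.escape build the class from the plain list of symbols:

```python
import re
import pandas as pd

# Plain list of characters to remove (no surrounding brackets).
symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"

# re.escape neutralises characters such as '-', ']' and '^'
# that have a special meaning inside [...].
pattern = "[" + re.escape(symbols) + "]"

# Hypothetical example data.
s = pd.Series(["[Yes?]", "(No!)", "a,b-c"])

# regex=True makes pandas treat the pattern as a regular expression.
cleaned = s.str.replace(pattern, "", regex=True)
print(cleaned.tolist())
```

With this pattern the comma in the last row is kept, because it is not in symbols, while the hyphen is removed.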

Upvotes: 1
