Reputation: 51
I have a DataFrame 'data' and I want to remove all punctuation marks from a given column (that is, replace them with nothing).
I'm using Python 3 with Pandas and NumPy to pre-format some text before working with a neural network.
symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
dataClean = data['description']
for i in symbols:
    dataClean = np.char.replace(dataClean,i,"")
I expected the punctuation marks to be deleted from the string in every row of dataClean (indexed from 0 to 2549). Instead I got this:
TypeError Traceback (most recent call last)
<ipython-input-87-aa944ae6e61c> in <module>
3
4 for i in symbols:
----> 5 dataClean = np.char.replace(dataClean,i,"")
6
7 print(dataClean[2])
~\Anaconda3\lib\site-packages\numpy\core\defchararray.py in replace(a, old, new, count)
1184 return _to_string_or_unicode_array(
1185 _vec_string(
-> 1186 a, object_, 'replace', [old, new] + _clean_args(count)))
1187
1188
TypeError: string operation on non-string array
Upvotes: 2
Views: 8310
Reputation: 879451
If dataClean is a Pandas Series of strings, you could use the Series.str.translate method:
symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
dataClean = data['description']
dataClean = dataClean.str.translate({ord(symbol):"" for symbol in symbols})
For example, suppose we had the DataFrame, df:
In [59]: df = pd.DataFrame({'data':['[Yes?]', '(No!)', 100]}); df
Out[59]:
     data
0  [Yes?]
1   (No!)
2     100
Then we can make a dict mapping unicode ordinals to strings (or, in this case, the empty string):
In [52]: symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
In [57]: {ord(symbol):"" for symbol in symbols}
Out[57]:
{33: '',
34: '',
...
126: '',
10: ''}
Every Unicode ordinal, or code point, corresponds to a Unicode character.
A Python 3 string is a sequence of Unicode characters. For each string in the Series, the translate method replaces each character that appears as a key in the dict with the corresponding string (here the empty string, which deletes it); characters not in the dict are left unchanged.
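As a minimal sketch of the same idea on a single plain Python string (the sample text is made up for illustration, and the backslash in symbols is written as `\\` only to avoid an invalid-escape warning; the set of characters is unchanged):

```python
# Same punctuation set as above; "\\" is one literal backslash.
symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"

# Dict mapping code points (ints) to the empty string, as in the answer.
table = {ord(symbol): "" for symbol in symbols}

# str.maketrans("", "", symbols) builds an equivalent table that maps each
# code point to None, which str.translate also treats as "delete".
table_alt = str.maketrans("", "", symbols)

sample = "[Yes?] (No!)\n"  # hypothetical sample text
cleaned = sample.translate(table)
cleaned_alt = sample.translate(table_alt)

print(repr(cleaned))  # 'Yes No'
```

Either table should also be usable with Series.str.translate, since that method applies str.translate to each element.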
In [60]: df['data'].str.translate({ord(symbol):"" for symbol in symbols})
Out[60]:
0    Yes
1     No
2    NaN
Name: data, dtype: object
Notice that translate maps non-strings such as the 100 in the third row to NaN.
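If the column may contain non-string values and you would rather keep them than get NaN, one option is to convert the column to strings first. This is a sketch under that assumption, not necessarily what the question needs; note that astype(str) also turns missing values into the literal text 'nan'.

```python
import pandas as pd

symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
table = str.maketrans("", "", symbols)

df = pd.DataFrame({"data": ["[Yes?]", "(No!)", 100]})

# astype(str) turns 100 into "100", so .str.translate has a string to work on.
cleaned = df["data"].astype(str).str.translate(table)

print(cleaned.tolist())  # ['Yes', 'No', '100']
```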
Upvotes: 4
Reputation: 599
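You can also use Series.str.replace with a regular-expression character class. Since pandas 2.0 the pattern is treated as a literal string by default, so pass regex=True explicitly:
symbols = "[!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n]"
dataClean = dataClean.str.replace(symbols, "", regex=True)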
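A hand-written character class is easy to get subtly wrong (for example, `+-.` inside brackets is read as a range). As a hedged sketch, the class can be built with re.escape and passed with regex=True, which recent pandas versions need in order to treat the pattern as a regular expression; the sample Series here is invented for illustration.

```python
import re

import pandas as pd

symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"

# re.escape backslash-escapes the regex metacharacters (-, ], \, ^, ...),
# so every symbol is matched literally inside the character class.
pattern = "[" + re.escape(symbols) + "]"

s = pd.Series(["[Yes?]", "(No!)", "a+b-c"])  # hypothetical sample data
cleaned = s.str.replace(pattern, "", regex=True)

print(cleaned.tolist())  # ['Yes', 'No', 'abc']
```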
Upvotes: 1