Paul

Reputation: 259

How to clean \xc2\xa0 \xc2\xa0..... in text data

I am trying to read a text file with the following Python code:

     with open(file, 'r') as myfile:
          data = myfile.read()

The result contains some weird characters that start with \x. What do they stand for, and how can I get rid of them when reading a text file?

e.g.

...... \xc2\xa0 \xc2\xa0 chapter 1 tuesday 1984 \xe2\x80\x9chey , jake , your mom sent me to pick you up \xe2\x80\x9d jacob robbins knew better than to accept a ride from a stranger , but when his mom\xe2\x80\x99s friend ronny was waiting for him in front of school he reluctantly got in the car \xe2\x80\x9cmy name is jacob........

Upvotes: 4

Views: 11874

Answers (4)

gajendran c

Reputation: 1

The code below clears the issue:

path.decode('utf-8','ignore').strip()
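For reference, here is a minimal sketch of what that call does in Python 3, where only bytes objects have a decode method. It assumes the data was read in binary mode; the variable names and the sample bytes (including the stray \xff) are made up for illustration.

```python
# Hypothetical sample: UTF-8 text from the question plus one invalid byte (\xff).
raw = b"\xc2\xa0 \xc2\xa0 chapter 1 \xff"

# 'ignore' silently drops bytes that are not valid UTF-8;
# strip() then removes leading/trailing whitespace, which in Python 3
# includes the non-breaking space U+00A0.
text = raw.decode("utf-8", "ignore").strip()

print(repr(text))
```

Note that 'ignore' discards data without warning, so it may hide a wrong encoding guess rather than fix it.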

Upvotes: 0

gajendran c

Reputation: 1

 def main():
      args = parse_args()
      if args.file :
          # To clean \xc2\xa0 \xc2\xa0… in text data
          file_to_read = args.file.decode('utf-8','ignore').strip() 
          f = open(file_to_read, "r+")
          text_from_file = f.read()  
      else :
          text_from_file = sys.argv[1]

Upvotes: 0

Zach Gates

Reputation: 4155

Those are string escapes. They represent a character by its hexadecimal value. For example, \x24 is 0x24, which is the dollar sign.

>>> '\x24'
'$'
>>> chr(0x24)
'$'

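One such escape (from the ones you provided) is \xc2. On its own, byte 0xC2 read as Latin-1 is Â, a capital A with a circumflex; in your text, though, it is the first byte of the two-byte UTF-8 sequence \xc2\xa0, which encodes a non-breaking space (U+00A0).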
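To see what the escapes in the question stand for, one option is to decode each byte sequence as UTF-8 and ask Python for the character's Unicode name. This is only a sketch for Python 3; the sequences are copied from the sample text in the question, and the variable names are arbitrary.

```python
import unicodedata

# Byte sequences copied from the question's sample text.
sequences = [b"\xc2\xa0", b"\xe2\x80\x9c", b"\xe2\x80\x9d", b"\xe2\x80\x99"]

# Decode each sequence as UTF-8 and look up the official Unicode name.
names = {}
for seq in sequences:
    char = seq.decode("utf-8")
    names[seq] = unicodedata.name(char)
    print(seq, "-> U+%04X" % ord(char), names[seq])
```

So the text holds non-breaking spaces and curly quotes, each stored as a multi-byte UTF-8 sequence.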

Upvotes: 2

Ignacio Vazquez-Abrams

Reputation: 799230

That's UTF-8 encoded text. Open the file as UTF-8:

with open(file, 'r', encoding='utf-8') as myfile:
   ...

In Python 2.x:

import codecs

with codecs.open(file, 'r', encoding='utf-8') as myfile:
   ...
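Once the file is decoded as UTF-8, the non-breaking spaces are ordinary characters and can be replaced if they are unwanted. A rough Python 3 sketch follows; read_clean is a hypothetical helper name, and the temporary file only stands in for the real text file.

```python
import os
import tempfile


def read_clean(path):
    """Read a UTF-8 file and replace non-breaking spaces with plain spaces."""
    with open(path, "r", encoding="utf-8") as myfile:
        data = myfile.read()
    return data.replace("\u00a0", " ")


# Demo only: write a few bytes from the question's sample to a temporary file.
sample = b"\xc2\xa0 \xc2\xa0 chapter 1 \xe2\x80\x9chey , jake\xe2\x80\x9d"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    demo_path = tmp.name

cleaned = read_clean(demo_path)
os.remove(demo_path)
print(cleaned)
```

The curly quotes are left alone here; whether to keep or replace them depends on what the text is used for.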

Unicode In Python, Completely Demystified

Upvotes: 7
