emvl

Reputation: 21

Python how to replace \xa0 with space

I have bytes in which \xa0 appears in place of a space. I tried many things to replace it with a space:

tmpstr = tmpstr.decode()

and then

tmpstr = unicodedata.normalize('NFD', tmpstr)

or

tmpstr = unicodedata.normalize('NFC', tmpstr)

or

tmpstr = unicodedata.normalize('NFKD', tmpstr)

I also tried:

tmpstr = tmpstr.replace(u'\xa0', u' ')

or

tmpstr = tmpstr.replace('\xa0', ' ')

Nothing works. Any ideas? Here are two sample values:

tmpstr = 'reimbursement\xa0up to 100 Euros per day'
tmpstr = '{"amount_paid":0,"checked_in":false,"checkin_date":"","checkin_secret":"xxx","data":{"Accomodation":{"Accomodation":"Of your choice booked by yourself (reimbursement\u00a0up to 100 Euros per day)","Additional information":""},"Administration":{"Accomodation 1":"","Accomodation 2":"","Accomodation price 1":"","Accomodation price 2":"","Date of birth":"","From 2":"","Membership":"Accepted","Nationality":"","Notes":"","Other reimbursements":"","Phone number":"","To 2":"","Transport reimbursement":""},"Financial help for travel":{"Apply for reimbursement":"No","Could you please elaborate a bit upon the reasons why ?":"","What would be the estimated amount of the travel reimbursement you would need ?":""},"Participation mode":{"From":"31/10/2021","How do you plan to participate?":"In person","To":"13/11/2021"},"Personal Data":{"Affiliation":"xxx","Country":"Italy","Email Address":"xxx","Expertise and topic of research":"xxx","First Name":"xxx","Gender":"Male","I do not want my name and email address to be kept and used by the Institut Pascal for future mailings for and/or by the Institut Pascal":"No","I do not want my pictures to be published on the IPa website and social networks":"No","Last Name":"xxx","Position":"PostDoc","Reason that you are interested in participating in this program":"xxx","Special Requirements":"xxx","Would you need an official invitation for visa-purposes ?":"No"}},"event_id":xxx,"full_name":"xxx","paid":false,"personal_data":{"affiliation":"xxx","country":"xxx","email":"xxx","firstName":"xxx","phone":"","position":"xxx","surname":"xxx","title":"Male"},"price":0,"registrant_id":"11349","registration_date":"2021-07-10T22:08:12.380610+00:00","ticket_price":0}'

Thanks

Upvotes: 2

Views: 797

Answers (1)

Danish Bansal

Reputation: 700

You can try passing your string through a regular expression like the one below:

[0-9a-zA-Z \.\-_{add more special characters}]*
  1. * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  2. 0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
  3. a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
  4. A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
  5. "space" matches the character with index 32 decimal (20 hex, 40 octal) literally (case sensitive)
  6. . matches the character . with index 46 decimal (2E hex, 56 octal) literally (case sensitive)
  7. - matches the character - with index 45 decimal (2D hex, 55 octal) literally (case sensitive)
  8. _ matches the character _ with index 95 decimal (5F hex, 137 octal) literally (case sensitive)
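
To check which characters this class accepts, here is a minimal sketch. The helper name `allowed` is hypothetical, and the class contains only the characters listed above, so extend it if you add more special characters.

```python
import re

# Character class from the list above: digits, ASCII letters, space, '.', '-', '_'
ALLOWED = re.compile(r'[0-9a-zA-Z .\-_]')

def allowed(ch: str) -> bool:
    """Return True if the single character ch is matched by the class."""
    return ALLOWED.fullmatch(ch) is not None

# Print each character, its code point, and whether the class accepts it.
for ch in [' ', '.', '-', '_', '\xa0']:
    print(repr(ch), ord(ch), allowed(ch))
```

The non-breaking space \xa0 has code point 160, which is outside every range in the class, so it is the only character in this loop that is rejected.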

You can paste your string here and see if this works for you before implementing in code

You can also look at this answer, which takes the opposite approach to mine: it removes the \x characters using a regex.

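Example Python code:

import re

text = "reimbursement\xa0up to 100 Euros per day"

# Approach 1: keep only runs of allowed characters and join them with spaces.
# Use + rather than * so that findall does not return empty matches.
pattern = r'[0-9a-zA-Z.\-_]+'
print(" ".join(re.findall(pattern, text)))

# -----------------------------------------------

# Approach 2: replace every run of non-ASCII characters with a single space.
print(re.sub(r'[^\x00-\x7F]+', ' ', text).strip())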
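
If the goal is only to turn the non-breaking space into a regular space, a regex may not be needed. The sketch below rests on an assumption about the question's data: that the bytes hold JSON text in which the character is written as the six-character escape \u00a0, as in the second sample. In that case a replace on the decoded text finds nothing until the JSON has been parsed. The sample `raw` and the helper `clean` are hypothetical.

```python
import json

# Hypothetical sample: JSON bytes with the non-breaking space written as an escape sequence.
raw = b'{"Accomodation": "reimbursement\\u00a0up to 100 Euros per day"}'

text = raw.decode('utf-8')
# Before parsing, the text holds a backslash followed by 'u00a0', not U+00A0 itself.
print('\xa0' in text)  # False

def clean(value):
    """Recursively replace U+00A0 with a plain space in all strings."""
    if isinstance(value, str):
        return value.replace('\xa0', ' ')
    if isinstance(value, dict):
        return {clean(k): clean(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean(v) for v in value]
    return value

# json.loads turns the escape into the real character, which clean() then replaces.
data = clean(json.loads(text))
print(data['Accomodation'])  # reimbursement up to 100 Euros per day
```

If the string already contains the real U+00A0 character, `tmpstr.replace('\xa0', ' ')` or `unicodedata.normalize('NFKC', tmpstr)` should be enough on its own.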

Upvotes: 1
