fnsjdnfksjdb
fnsjdnfksjdb

Reputation: 1663

Removing certain characters from a long string in python

I'm working on a project that involves taking some source code and boiling it down to just the words that are displayed on the page. I can get it to remove all the html tags, and all of the stuff between script tags, but I can't figure out how to remove all the characters that start with a backslash. A page will have \t, \n, and \x** where * seem to be any lowercase letter or number.

How would I write a code that would replace all these parts of the strings with spaces? I'm working in python.

For example, this is a string from a webpage:

\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0

would become:

Apple - Wikipedia, the free encyclopedia Language:English sAsturianuAz rbaycanca Basa Banyumasan

Upvotes: 1

Views: 1560

Answers (5)

fraxel
fraxel

Reputation: 35269

Assuming default ascii encoding, we can do this quite nicely in one line, without evil regex ;), by iterating over the string and removing values based on their encoding value using ord(i) < 128, or whatever specification we choose:

>>> ' '.join(''.join([i if ord(i) < 128 else ' ' for i in mystring]).split())
#Output:
Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan

Or we could specify a string of allowed characters and use 'in', like this using the builtin string.ascii_letters:

>>> import string
>>> ' '.join(''.join([i if i in string.ascii_letters else ' ' for i in mystring]).split())
#Output:
Apple Wikipedia the free encyclopedia Language English Aragon sAsturianuAz rbaycanca B n l m g Basa Banyumasan

This removes punctuation too (but we could easily avoid that by adding those characters back into a string check definition if we want, check = string.ascii_letters + ',.-:')

Upvotes: 0

Hugh Bothwell
Hugh Bothwell

Reputation: 56654

Wikipedia uses UTF-8 string encoding. To convert to plain ASCII, you must

  1. convert from UTF-8 to unicode
  2. convert from unicode to ASCII, with replacement of uncodable characters
  3. convert the uncodable-character-replacements to spaces
  4. convert multiple whitespaces (tabs, linefeeds, etc) to single spaces
  5. strip leading and trailing spaces

.

s = "\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba"

import re
whitespaces = re.compile('\s+', flags=re.M)
def utf8_to_ascii(s, ws=whitespaces):
    s = s.encode("utf8")
    s = s.decode("ascii", errors="replace")
    s = s.replace(u"\ufffd", " ")
    s = ws.sub(" ", s)
    return s.strip()

s = utf8_to_ascii(s)

which finally results in the string

Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan

Upvotes: 0

Vidul
Vidul

Reputation: 10538

Given that the plain text words contains at least three chars:

' '.join(re.findall(r'\w{3,}', s)) # where s represents the string

Or:

' '.join(re.findall(r'(?:\w{3,}|-(?=\s))', s)) # in order to preserve the dash char

Upvotes: 0

Maria Zverina
Maria Zverina

Reputation: 11173

s = repr('''\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0''')
s =  re.sub(r'\\[tn]', '', s)
s =  re.sub(r'\\x..', '', s)
print s

Upvotes: 1

varunl
varunl

Reputation: 20229

Write a regex to match all the desired patters and then replace them with a space.

Upvotes: 0

Related Questions