Reputation: 1663
I'm working on a project that involves taking some source code and boiling it down to just the words that are displayed on the page. I can get it to remove all the html tags, and all of the stuff between script tags, but I can't figure out how to remove all the characters that start with a backslash. A page will have \t, \n, and \x** where * seem to be any lowercase letter or number.
How would I write a code that would replace all these parts of the strings with spaces? I'm working in python.
For example, this is a string from a webpage:
\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0
would become:
Apple - Wikipedia, the free encyclopedia Language:English sAsturianuAz rbaycanca Basa Banyumasan
Upvotes: 1
Views: 1560
Reputation: 35269
Assuming default ascii encoding, we can do this quite nicely in one line, without evil regex ;), by iterating over the string and removing values based on their encoding value using ord(i) < 128
, or whatever specification we choose:
>>> ' '.join(''.join([i if ord(i) < 128 else ' ' for i in mystring]).split())
#Output:
Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan
Or we could specify a string of allowed characters and use 'in', like this using the builtin
string.ascii_letters
:
>>> import string
>>> ' '.join(''.join([i if i in string.ascii_letters else ' ' for i in mystring]).split())
#Output:
Apple Wikipedia the free encyclopedia Language English Aragon sAsturianuAz rbaycanca B n l m g Basa Banyumasan
This removes punctuation too (but we could easily avoid that by adding those characters back into a string check definition if we want, check = string.ascii_letters + ',.-:'
)
Upvotes: 0
Reputation: 56654
Wikipedia uses UTF-8 string encoding. To convert to plain ASCII, you must
.
s = "\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba"
import re
whitespaces = re.compile('\s+', flags=re.M)
def utf8_to_ascii(s, ws=whitespaces):
s = s.encode("utf8")
s = s.decode("ascii", errors="replace")
s = s.replace(u"\ufffd", " ")
s = ws.sub(" ", s)
return s.strip()
s = utf8_to_ascii(s)
which finally results in the string
Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan
Upvotes: 0
Reputation: 10538
Given that the plain text words contains at least three chars:
' '.join(re.findall(r'\w{3,}', s)) # where s represents the string
Or:
' '.join(re.findall(r'(?:\w{3,}|-(?=\s))', s)) # in order to preserve the dash char
Upvotes: 0
Reputation: 11173
s = repr('''\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0''')
s = re.sub(r'\\[tn]', '', s)
s = re.sub(r'\\x..', '', s)
print s
Upvotes: 1