Why does this regular expression not work?

Question

I have a function that parses HTML code so it is easy to read and write. To do this I must split the string on multiple delimiters, so I have used re.split(), and I cannot find a better solution. However, when I submit some HTML such as this, it has absolutely no effect. This has led me to believe that my regular expression is incorrectly written. What should be there instead?

def parsed(data):
    """Removes junk from the data so it can be easily processed."""
    data = str(data)
    # This checks for a cruft and removes it if it exists.
    if re.search("b'", data):
        data = data[2:-1]
    lines = re.split(r'\n|\r', data)  # This clarifies the lines for writing.
    return lines

This isn't a duplicate, even if you find a similar question; I've been searching for ages and it still doesn't work.

Martijn Pieters · Accepted Answer

You are converting a bytes value to string:

data = str(data)
# This checks for a cruft and removes it if it exists.
if re.search("b'", data):
    data = data[2:-1]

which means that all line delimiters have been converted to their Python escape codes:

>>> str(b'\n')
"b'\\n'"

That is a literal b, a literal quote, a literal backslash (\), a literal n, and a literal quote. You would have to split on r'(\\n|\\r)' instead, but more importantly, you shouldn't turn bytes values into string representations here. Python produced the representation of the bytes value as a literal string you can paste back into your Python interpreter, which is not the same thing as the value contained in the object.
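To make the difference concrete, here is a minimal sketch. The sample payload `raw` is hypothetical and not taken from the question; the slicing mirrors what the question's code does after calling str().

```python
import re

raw = b'<p>one</p>\r\n<p>two</p>\n'  # hypothetical sample payload

# What the question's code effectively produces: the repr, minus b' and '.
text = str(raw)[2:-1]

# No real line-break characters survive, only backslash-letter pairs.
has_real_breaks = '\n' in text or '\r' in text

# The original pattern therefore finds nothing to split on.
unsplit = re.split(r'\n|\r', text)

# Matching the literal backslash sequences does split,
# but it works around the problem rather than fixing it.
workaround = re.split(r'\\n|\\r', text)

print(has_real_breaks)
print(unsplit)
print(workaround)
```

The workaround also leaves empty strings wherever \r\n appeared, because each escape sequence is treated as its own delimiter.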

You want to decode to string instead:

if isinstance(data, bytes):
    data = data.decode('utf8')

where I am assuming that the data is encoded as UTF-8. If this is data from a web request, the response headers quite often state the character set used to encode the data in the Content-Type header; look for the charset= parameter.

A response produced by the urllib.request module has an .info() method, and the character set can be extracted (if provided) with:

charset = response.info().get_param('charset')

where the return value is None if no character set was provided.
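A small sketch of how the None case might be handled. The helper name pick_charset and the UTF-8 fallback are assumptions made for illustration, not part of urllib. Because http.client.HTTPMessage subclasses email.message.Message, a plain Message stands in for response.info() here so the example runs without a network connection.

```python
from email.message import Message


def pick_charset(headers, default='utf-8'):
    """Return the declared charset, or `default` when none is given.

    `default` is an assumption; choose whatever suits your data.
    """
    return headers.get_param('charset') or default


# Stand-in for response.info(): same get_param() interface.
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'

# A header without a charset parameter falls back to the default.
bare = Message()
bare['Content-Type'] = 'text/html'

body = b'caf\xe9'  # hypothetical payload, Latin-1 encoded
decoded = body.decode(pick_charset(headers))

print(pick_charset(headers))
print(pick_charset(bare))
print(decoded)
```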

You don't need to use a regular expression to split lines; the str type has a dedicated method, str.splitlines():

Return a list of the lines in the string, breaking at line boundaries. This method uses the universal newlines approach to splitting lines. Line breaks are not included in the resulting list unless keepends is given and true.

For example, 'ab c\n\nde fg\rkl\r\n'.splitlines() returns ['ab c', '', 'de fg', 'kl'], while the same call with splitlines(True) returns ['ab c\n', '\n', 'de fg\r', 'kl\r\n'].
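Putting the pieces together, the function from the question could be sketched as follows. The encoding parameter and its UTF-8 default are assumptions; pass the charset from the response headers when one is available.

```python
def parsed(data, encoding='utf-8'):
    """Decode bytes if needed and return the text as a list of lines.

    `encoding` is an assumed parameter, not part of the original function.
    """
    if isinstance(data, bytes):
        data = data.decode(encoding)
    # splitlines() treats \n, \r and \r\n each as a single line boundary.
    return data.splitlines()


# Hypothetical input mixing all three line-ending styles.
lines = parsed(b'<html>\r\n<body>\n</body>\r</html>')
print(lines)
```

No regular expression and no slicing of the repr are needed, and str input passes through unchanged.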
