Ben Usman

Reputation: 8387

Python: re.sub changes nothing

I have the following code:

import re
from lxml import etree  # assumed: tree.xpath() and method="text" suggest lxml

def gettextbyxpath(tree, xpath):
    node = tree.xpath(xpath)[0]
    try:
        text = etree.tostring(node, method="text", encoding='UTF-8').strip()
        text = re.sub(' +',' ', text)
        text = re.sub('\n+','\n', text)
        text = re.sub('\n \n','\n', text)
    except:
        text = 'ERROR'
    return text

With the last re.sub call I am trying to get rid of lines that contain only a single space. There are quite a lot of them in the real data.

When I run the code above as an isolated test it works fine, but in real code the last line doesn't do anything at all! I've tried comparing files generated with and without it - there are no differences.

Example input:

        Brand:

   777,Royal Lion



    Main Products:

           battery, 777, carbon zinc, paper jacket,

I'm trying to get rid of the vertical white space between the lines.

Any ideas of why my code could be behaving like this?

Upvotes: 2

Views: 609

Answers (2)

itsjeyd

Reputation: 5290

As to why your code behaves the way you described: The value of text that you obtain from the second call to re.sub does not contain the pattern you are trying to substitute in your last call to re.sub:

>>> text = re.sub('\n+', '\n', text) # 2nd call to re.sub
>>> text
'Brand:\n 777,Royal Lion\n Main Products:\n battery, 777, carbon zinc, paper jacket,'

So, you need to remove the second \n from the pattern in your last call to re.sub:

text = re.sub('\n ','\n', text)

This will yield:

Brand:
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,
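The diagnosis above can be checked step by step. This is a minimal sketch, assuming the blank lines in the sample are truly empty (no stray spaces); the `raw` string and the `step1`/`step2`/`step3` names are illustrative and not part of the original code.

```python
import re

# Sample text modelled on the example input in the question.
# Assumption: the blank lines contain no spaces at all.
raw = (
    "        Brand:\n"
    "\n"
    "   777,Royal Lion\n"
    "\n\n\n"
    "    Main Products:\n"
    "\n"
    "           battery, 777, carbon zinc, paper jacket,\n"
)

text = raw.strip()
step1 = re.sub(' +', ' ', text)      # collapse runs of spaces into one space
step2 = re.sub('\n+', '\n', step1)   # collapse runs of newlines into one newline

# After step 2 each newline is followed by a single space and then content,
# so the pattern '\n \n' has nothing left to match.
print('\n \n' in step2)

step3 = re.sub('\n ', '\n', step2)   # drop the space that starts each line
print(step3)
```

If the real data does contain lines holding only a space, `'\n '` alone would leave empty lines behind, so it is worth inspecting `repr(text)` on a real record before settling on a pattern.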

Alternative solution

def gettextbyxpath(tree, xpath):
    node = tree.xpath(xpath)[0]
    try:
        text = etree.tostring(node, method="text", encoding='UTF-8').strip()
        text = '\n'.join(line.strip() for line in text.split('\n') if line.strip())
    except:
        text = 'ERROR'
    return text

Output

Brand:
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,

What's different about this approach is that instead of doing successive substitutions with re.sub we split the output of etree.tostring at \n. Then, we filter the result of that to exclude all lines that are reduced to the empty string when calling .strip() on them. This leaves us with just the lines that have actual content, with all white space removed from the left and right side. To get the final result, we join the lines with a single newline (\n).
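The split, filter, and join steps described above can be tried in isolation. This is a small sketch on a plain string; the helper name `squeeze_lines` and the `sample` text are made up for illustration, and the sample deliberately includes lines holding only a space or a tab.

```python
def squeeze_lines(text):
    """Strip every line and drop the ones that end up empty."""
    stripped = (line.strip() for line in text.split('\n'))
    return '\n'.join(line for line in stripped if line)


# Hypothetical sample with empty, space-only, and tab-only lines.
sample = "  Brand:\n \n   777,Royal Lion\n\n\t\n \n  Main Products:\n"
print(squeeze_lines(sample))
```

One caveat, assuming `etree` is `lxml.etree`: on Python 3, `etree.tostring(..., encoding='UTF-8')` returns bytes, so the result would need to be decoded (or requested with `encoding='unicode'`) before it is passed to a str-based helper like this one.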

Upvotes: 2

aar cee

Reputation: 249

The following code should remove tabs, newlines, and runs of two or more spaces, while leaving single spaces in place.

import re

a ="""
 Brand:

 777,Royal Lion



 Main Products:

 battery, 777, carbon zinc, paper jacket,
"""
p = re.compile(r'[\n\t]+|[ ]{2,}')
print(p.sub('', a))
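Because the pattern also matches newlines, the substitution joins all the content onto a single line. The sketch below shows that on a sample string, and adds a hypothetical variant (not from this answer) that collapses each whitespace run containing a newline into one newline, in case the line breaks should be kept.

```python
import re

# Sample text equivalent to the triple-quoted string above.
sample = (
    "\n Brand:\n\n 777,Royal Lion\n\n\n\n Main Products:\n\n"
    " battery, 777, carbon zinc, paper jacket,\n"
)

# Pattern from the answer: deletes newlines, tabs, and runs of 2+ spaces.
flattened = re.sub(r'[\n\t]+|[ ]{2,}', '', sample)
print(repr(flattened))

# Hypothetical variant: replace any whitespace run that contains a newline
# with a single newline, then trim the ends. Line breaks are preserved.
kept_lines = re.sub(r'\s*\n\s*', '\n', sample).strip()
print(kept_lines)
```

The variant does not collapse repeated spaces inside a line; that would need a separate substitution such as `re.sub(' +', ' ', ...)`.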

Upvotes: 1
