Reputation: 8387
I have the following code:
def gettextbyxpath(tree, xpath):
    node = tree.xpath(xpath)[0]
    try:
        text = etree.tostring(node, method="text", encoding='UTF-8').strip()
        text = re.sub(' +',' ', text)
        text = re.sub('\n+','\n', text)
        text = re.sub('\n \n','\n', text)
    except:
        text = 'ERROR'
    return text
With the last re.sub call I try to get rid of lines that contain just a single space. There are quite a lot of them in the real data.
When I run the code above as an isolated test it works fine, but in the real code that last call doesn't do anything at all! I've tried comparing the files generated with and without it, and there are no differences.
Example input:
Brand:
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,
I'm trying to get rid of the vertical white space between the lines.
Any ideas of why my code could be behaving like this?
Upvotes: 2
Views: 609
Reputation: 5290
As to why your code behaves the way you described: the value of text that you obtain from the second call to re.sub does not contain the pattern you are trying to substitute in your last call to re.sub:
>>> text = re.sub('\n+', '\n', text) # 2nd call to re.sub
>>> text
'Brand:\n 777,Royal Lion\n Main Products:\n battery, 777, carbon zinc, paper jacket,'
So, you need to remove the second \n from the pattern in your last call to re.sub:
text = re.sub('\n ','\n', text)
This will yield:
Brand:
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,
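To check this on a small string, here is a minimal sketch. The value of raw is an assumption made up for illustration (the real output of etree.tostring is not shown in the question), and it uses a plain str rather than whatever tostring returns:

```python
import re

# Hypothetical raw text: empty lines between entries, content lines indented.
raw = ('Brand:\n\n   777,Royal Lion\n\n   Main Products:\n\n'
       '   battery, 777, carbon zinc, paper jacket,')

step1 = re.sub(' +', ' ', raw)       # collapse runs of spaces
step2 = re.sub('\n+', '\n', step1)   # collapse runs of newlines

# No whitespace-only line is left, so '\n \n' has nothing to match.
print('\n \n' in step2)
unchanged = re.sub('\n \n', '\n', step2)

# Dropping the second \n from the pattern removes the space after each newline.
fixed = re.sub('\n ', '\n', step2)
print(fixed)
```

With this sample, the '\n \n' substitution leaves the string as it was, while '\n ' strips the leading space from every line after the first.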
Alternative solution
def gettextbyxpath(tree, xpath):
    node = tree.xpath(xpath)[0]
    try:
        text = etree.tostring(node, method="text", encoding='UTF-8').strip()
        text = '\n'.join(line.strip() for line in text.split('\n') if line.strip())
    except:
        text = 'ERROR'
    return text
Output
Brand:
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,
What's different about this approach is that instead of doing successive substitutions with re.sub, we split the output of etree.tostring at \n. Then, we filter the result to exclude all lines that are reduced to the empty string when calling .strip() on them. This leaves us with just the lines that have actual content, with all whitespace removed from the left and right side. To get the final result, we join the lines with a single newline (\n).
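The split, filter, and join steps can also be tried on their own, without lxml. This is only a sketch: collapse_lines is a hypothetical helper name, the sample string is invented, and the bytes check assumes you are on Python 3, where tostring with encoding='UTF-8' returns bytes rather than str:

```python
def collapse_lines(text):
    """Strip every line and drop the ones that end up empty."""
    if isinstance(text, bytes):
        # Assumption: the bytes are UTF-8, matching encoding='UTF-8' above.
        text = text.decode('utf-8')
    return '\n'.join(line.strip() for line in text.split('\n') if line.strip())


# Invented sample with space-only lines, empty lines, and indentation.
sample = ('Brand:\n \n   777,Royal Lion\n\n \n Main Products:\n   \n'
          ' battery, 777, carbon zinc, paper jacket,\n')
print(collapse_lines(sample))
```

Because every line is stripped before the emptiness check, it does not matter whether a "blank" line holds one space, several spaces, or tabs.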
Upvotes: 2
Reputation: 249
The following code should remove tabs, newlines, and runs of two or more spaces, while leaving single spaces in place.
import re
a ="""
Brand:
777,Royal Lion
Main Products:
battery, 777, carbon zinc, paper jacket,
"""
# Match runs of newlines/tabs, or runs of two or more spaces.
p = re.compile(r'[\n\t]+|[ ]{2,}')
print(p.sub('', a))
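Note that the pattern above replaces newlines with nothing, so the lines end up joined together. If the line breaks should survive, a variant along these lines may work. It is a sketch only: squeeze is a hypothetical name and the sample text is invented:

```python
import re


def squeeze(text):
    # Any whitespace run that contains a newline becomes one newline.
    text = re.sub(r'\s*\n\s*', '\n', text)
    # Remaining runs of spaces or tabs inside a line become one space.
    text = re.sub(r'[ \t]+', ' ', text)
    # Drop the newline left over at the very start or end.
    return text.strip()


sample = '\nBrand:\n \n  777,Royal  Lion\n\t\nMain Products:\n'
print(squeeze(sample))
```

This keeps exactly one newline between lines that have content and trims the whitespace around each of them.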
Upvotes: 1