Reputation: 4122
I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista (64-bit) to run Scrapy. I use the following to remove \r and \n characters and HTML tags from my screen output:
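import re
from w3lib.html import remove_tags  # assumed source of remove_tags; the post does not show its imports

body = response.xpath("//p").extract()  # list of <p> elements as strings
body2 = str(body)  # string form of the whole list
body3 = re.sub(r'\s{2,}', ' ', body2)  # collapse runs of whitespace
print remove_tags(body3)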
This removes the HTML tags fine, but the \r and \n characters are still present in the final output. Am I doing something wrong?
Thanks
Upvotes: 0
Views: 903
Reputation: 182
What you need is this regex:
(\\[rn]|\s){2,}
Try this out and let me know if it works.
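Here is a minimal sketch of why this pattern can help, assuming the text went through str() on a list as in the question. The sample list is a hypothetical stand-in for the output of response.xpath("//p").extract(). str() on a list shows each element in its repr form, so the carriage return and line feed end up as the two-character escapes (a backslash followed by r or n), which plain \s never matches.

```python
import re

# Hypothetical stand-in for response.xpath("//p").extract()
body = ["<p>first line\r\n   second line</p>"]

# str() on a list uses repr() for each element, so the real CR and LF
# characters become literal backslash-r and backslash-n sequences.
body2 = str(body)

# The question's pattern only sees the three spaces; the escapes survive.
naive = re.sub(r'\s{2,}', ' ', body2)

# The pattern from this answer: a run of two or more tokens, where a token
# is a literal backslash-r / backslash-n or any whitespace character.
pattern = re.compile(r'(\\[rn]|\s){2,}')
cleaned = pattern.sub(' ', body2)

print(naive)
print(cleaned)
```

One caveat with the {2,} quantifier: a lone escaped newline with no whitespace next to it counts as a single token, so it would not be replaced.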
Upvotes: 1
Reputation: 89567
Yes. Since you can't be sure which type of newline the document contains, you should replace your pattern with:
\s{2,}|[\r\n]
Indeed, newlines can be CRLF (the Windows convention), LF only (the Unix convention, which is probably the case with your current document), or CR only (older Apple operating systems).
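As a rough illustration, the pattern can be checked against all three conventions. The sample strings and the squash helper are made up for this sketch. It also assumes the text holds real CR and LF characters (for example after ' '.join(body)), not the escaped form that str() produces for a list.

```python
import re

# Pattern from this answer: collapse runs of whitespace, and also catch a
# single CR or LF that stands on its own.
NEWLINES = re.compile(r'\s{2,}|[\r\n]')


def squash(text):
    """Replace whitespace runs and lone newlines with a single space."""
    return NEWLINES.sub(' ', text)


# Hypothetical samples, one per newline convention.
samples = {
    'CRLF': 'one\r\ntwo',
    'LF': 'one\ntwo',
    'CR': 'one\rtwo',
}
results = {name: squash(text) for name, text in samples.items()}

for name in sorted(results):
    print(name + ': ' + results[name])
```

In each case the newline is replaced by one space: CRLF is caught by the \s{2,} branch, while a single LF or CR falls through to [\r\n].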
Upvotes: 1