Reputation: 189
I'm writing a spider in Scrapy 1.0.3 that scrapes an archive of Unicode pages, extracts the text within the p tags of each page, and dumps it into a JSON file. My code looks like this:
    def parse(self, response):
        sel = Selector(response)
        list = response.xpath('//p[@class="articletext"]/font').extract()
        list0 = response.xpath('//p[@class="titletext"]').extract()
        string = ''.join(list).encode('utf-8').strip('\r\t\n')
        string0 = ''.join(list0).encode('utf-8').strip('\r\t\n')
        fullstring = string0 + string
        stringjson = json.dumps(fullstring)
        with open('output.json', 'w') as f:
            f.write(stringjson)
        try:
            json.loads(stringjson)
            print("Valid JSON")
        except ValueError:
            print("Not valid JSON")
However, I get unwanted sequences of \r, \t and \n characters that I am unable to remove despite using strip(). Why isn't it working, and how would I go about making it work?
Upvotes: 1
Views: 4357
Reputation: 161
Alternative solution: the "normalize-space" function for xpath.
For example:
list=response.xpath('normalize-space(//p[@class="articletext"]/font)').extract()
instead of
list=response.xpath('//p[@class="articletext"]/font').extract()
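The normalize-space function strips leading and trailing whitespace from a string, replaces each sequence of whitespace characters with a single space, and returns the resulting string. Be aware that when it is given a node-set, as in the example, XPath 1.0 converts only the first matching node to a string, so only that element's text is returned.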
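For strings that have already been extracted into Python, the same cleanup can be approximated without XPath. This is only a minimal sketch, assuming Python 3; the helper name normalize_space is made up for illustration and is not part of Scrapy. It treats only space, tab, carriage return and line feed as whitespace, which is how XPath 1.0 defines it.

```python
import re

# XPath 1.0 whitespace: space, tab, carriage return, line feed.
_XPATH_WS = re.compile(r'[ \t\r\n]+')

def normalize_space(text):
    """Hypothetical helper mimicking XPath's normalize-space() on a Python string."""
    # Collapse each whitespace run to one space, then trim the ends.
    return _XPATH_WS.sub(' ', text).strip(' ')

# Invented sample resembling text pulled from a page.
sample = '\r\n\t  Some   article\r\n\ttext \r\n'
cleaned = normalize_space(sample)
print(repr(cleaned))
```

Unlike deleting the characters outright, this keeps a single space between words that were separated only by a line break.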
Upvotes: 2
Reputation: 4491
You will want to use one of the many approaches to removing characters from a string in Python. strip() only removes the given characters from the start and end of the string, not from the middle. Going with a method similar to what you're already doing:
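    # extract() returns a list of strings, so join first, then filter character by character
    string = ''.join(c for c in ''.join(list) if c not in '\r\t\n')
    string0 = ''.join(c for c in ''.join(list0) if c not in '\r\t\n')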
You could also concatenate string and string0 before doing this, so that you only have to do it once.
EDIT (Response to comment):
>>> test_string
'This\r\n \tis\t\t \t\t\t(only) a \r\ntest. \r\n\r\n\r\nCarry\t \ton'
>>> ''.join(c for c in test_string if c not in '\r\t\n')
'This is (only) a test. Carry on'
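To make the difference concrete, here is a small sketch, assuming Python 3 (the question's code looks like Python 2, where str.translate takes different arguments). It shows strip() leaving interior characters alone, and str.translate with a deletion table as one alternative to the generator expression. The sample string is invented.

```python
raw = '\r\n\tfirst line\r\n\tsecond line\t\r\n'

# strip() only trims the listed characters from both ends.
stripped = raw.strip('\r\t\n')
print(repr(stripped))  # the interior \r\n\t is still present

# The third argument to str.maketrans lists characters to delete.
delete_table = str.maketrans('', '', '\r\t\n')
removed = raw.translate(delete_table)
print(repr(removed))
```

Note that deleting the characters outright glues "line" and "second" together; replace them with a space instead if the words need to stay separate.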
Upvotes: 4
Reputation: 963
What do you mean "unable to remove"? Do you have a string with content already? Removing them is fairly easy:
    # named text rather than str, to avoid shadowing the built-in str type
    text = "Test\r\n\twhatever\r\n\t"
    text = text.replace("\r", '')
    text = text.replace("\n", '')
    text = text.replace("\t", '')
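Since extract() returns a list of strings rather than a single string, the same replacements can be applied to each fragment before joining. A rough sketch, assuming Python 3; the fragments list and the remove_chars helper are invented for illustration and stand in for whatever the spider actually extracted.

```python
import json

def remove_chars(text, chars='\r\n\t'):
    """Hypothetical helper: drop every occurrence of each character in chars."""
    for ch in chars:
        text = text.replace(ch, '')
    return text

# Stand-in for the list returned by response.xpath(...).extract().
fragments = [
    '<p class="titletext">\r\n\tTitle</p>',
    '<font>Body\r\n\ttext</font>',
]

cleaned = ''.join(remove_chars(f) for f in fragments)
payload = json.dumps(cleaned)
print(payload)
```

The cleaning happens before json.dumps, so the serialized string no longer contains escaped \r, \n or \t sequences.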
Upvotes: 1