Reputation: 189

Getting rid of unwanted characters in scrapy response

I'm writing a spider in Scrapy 1.0.3 which would scrape an archive of Unicode pages and yield the text within the p tags of a page and dump it into a JSON file. My code looks like this:

  def parse(self,response):
    sel = Selector(response)
    list=response.xpath('//p[@class="articletext"]/font').extract()
    list0=response.xpath('//p[@class="titletext"]').extract()
    string = ''.join(list).encode('utf-8').strip('\r\t\n')
    string0 = ''.join(list0).encode('utf-8').strip('\r\t\n')
    fullstring = string0 + string
    stringjson=json.dumps(fullstring)

    with open('output.json', 'w') as f:
        f.write(stringjson)

    try:
        json.loads(stringjson)
        print("Valid JSON")
    except ValueError:
        print("Not valid JSON")

However, I get unwanted sequences of \r, \t, and \n characters that I am unable to remove despite using strip(). Why isn't it working, and how would I go about making it work?

Upvotes: 1

Answers (3)

Mark

Reputation: 161

An alternative solution is the XPath normalize-space() function.

For example:

list=response.xpath('normalize-space(//p[@class="articletext"]/font)').extract()

instead of

list=response.xpath('//p[@class="articletext"]/font').extract()

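The normalize-space function strips leading and trailing whitespace from a string, replaces each sequence of whitespace characters with a single space, and returns the resulting string. Note that when it is given a node-set, as here, XPath 1.0 converts only the first matching node to a string, so you get the text of that node rather than its markup.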
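To make the behaviour described above concrete, here is a rough pure-Python equivalent. It is only a sketch: normalize_space is a hypothetical helper name, not something Scrapy provides, and it assumes Python 3 and the XPath 1.0 definition of whitespace (space, tab, CR, LF).

```python
import re

# Characters that XPath 1.0 treats as whitespace.
XPATH_WHITESPACE = " \t\r\n"


def normalize_space(text):
    """Hypothetical helper mimicking XPath normalize-space() on a plain string."""
    # Collapse every run of whitespace into a single space...
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    # ...then trim whatever is left at the two ends.
    return collapsed.strip(XPATH_WHITESPACE)


sample = "\r\n\t  First line\r\n\t\tsecond   line \n"
print(normalize_space(sample))  # First line second line
```

Because each run is replaced by a space rather than deleted, neighbouring words stay separated.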

Upvotes: 2

Rejected

Reputation: 4491

You can use any of the several ways to remove characters from a string in Python. strip() only removes the given characters from the start and end of the string, so any \r, \t, or \n in the middle is left untouched. Going with a method similar to what you're already doing:

# Join the extracted fragments first, then filter character by character.
string = ''.join(c for c in ''.join(list) if c not in '\r\t\n')
string0 = ''.join(c for c in ''.join(list0) if c not in '\r\t\n')

You could also just add string and string0 together before doing this so that you only have to do it once.
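A minimal sketch of that "combine first, clean once" idea, assuming Python 3. The names title_parts, body_parts, and remove_unwanted are made up for illustration and stand in for the lists returned by extract().

```python
UNWANTED = "\r\t\n"


def remove_unwanted(text, unwanted=UNWANTED):
    """Drop every character of `unwanted` from `text`, wherever it occurs."""
    return "".join(ch for ch in text if ch not in unwanted)


# Hypothetical stand-ins for the extracted title and article fragments.
title_parts = ["\r\n\tTitle\r\n"]
body_parts = ["\tFirst\r\nparagraph.", "\r\n\tSecond paragraph.\r\n"]

# Concatenate everything once, then clean the result in a single pass.
combined = "".join(title_parts) + "".join(body_parts)

stripped = combined.strip(UNWANTED)  # only trims the two ends
cleaned = remove_unwanted(combined)  # removes the characters everywhere

print(repr(stripped))
print(repr(cleaned))
```

Note that deleting the characters outright can fuse words that were only separated by a line break (here "Title" and "First"), so replacing them with a space may be the better choice for readable text.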

EDIT (Response to comment):

>>> test_string
'This\r\n \tis\t\t \t\t\t(only) a \r\ntest. \r\n\r\n\r\nCarry\t \ton'
>>> ''.join(c for c in test_string if c not in '\r\t\n')
'This is (only) a test. Carry on'

Upvotes: 4

piezol

Reputation: 963

What do you mean by "unable to remove"? Do you already have a string with the content? If so, removing them is fairly easy:

# Use a name other than str so the built-in type is not shadowed.
text = "Test\r\n\twhatever\r\n\t"
text = text.replace("\r", '')
text = text.replace("\n", '')
text = text.replace("\t", '')
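If you'd rather do this in one pass than chain three replace() calls, str.translate() is one option. This is just a sketch and assumes Python 3, where translate() accepts a dict mapping code points to None for deletion; DELETE_TABLE and strip_control_whitespace are names invented for the example.

```python
# Map each unwanted character's code point to None so translate() deletes it.
DELETE_TABLE = {ord(ch): None for ch in "\r\n\t"}


def strip_control_whitespace(text):
    """Hypothetical helper: remove every CR, LF, and tab from `text`."""
    return text.translate(DELETE_TABLE)


text = "Test\r\n\twhatever\r\n\t"
print(repr(strip_control_whitespace(text)))  # 'Testwhatever'
```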

Upvotes: 1
