dev-x

Reputation: 899

Remove symbols when crawling data using Scrapy

I want to extract text from a website using Scrapy. This is my sample code:

def parse(self, response):
        for kamusset in response.css("div#d1"):
            text = kamusset.css("div b::text").extract()
            print(dict(text=text))

This is the result: [screenshot of the output]

I want to remove the '.' symbol and every digit, so I used a regular expression and changed my code to:

def parse(self, response):
        for kamusset in response.css("div#d1"):
            text = kamusset.css("div b::text").re(r'[a-z]+')
            print(dict(text=text))

But the result is: [screenshot of the output]

That is not the result I expected. I want to get this:

{'text': ['abadi', 'mengabadi', 'mengabadikan', 'pengabadian', 'keabadian']}

How can I do that?

Upvotes: 0

Views: 238

Answers (1)

Tiny.D

Reputation: 6556

You can clean the text you scraped with the re module:

import re
text = ['aba.di','meng.a.ba.di','meng.a.ba.di.kan','1','2','peng.a.ba.di.an','ke.a.ba.di.an','1','2']
# Remove every character that is not a letter (dots and digits)
stack = [re.sub(r'[^a-zA-Z]+', '', e) for e in text]
# Drop the entries that became empty, i.e. those that were only digits
text_new = [i for i in stack if i != ""]
print(text_new)

text_new will be:

['abadi', 'mengabadi', 'mengabadikan', 'pengabadian', 'keabadian']
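
To use this in the spider itself, the same substitution can be wrapped in a small helper and called on the list that extract() returns. This is only a minimal sketch: clean_words is a hypothetical name, and the sample input is assumed to look like the list above, since the real page output is only shown as an image.

```python
import re


def clean_words(fragments):
    """Strip every non-letter character from each string and drop empty results."""
    # Remove dots, digits and anything else that is not an ASCII letter
    stripped = (re.sub(r'[^a-zA-Z]+', '', fragment) for fragment in fragments)
    # Strings that contained only digits are now empty, so filter them out
    return [word for word in stripped if word]


# Hypothetical sample standing in for kamusset.css("div b::text").extract()
sample = ['aba.di', 'meng.a.ba.di', '1', '2', 'ke.a.ba.di.an']
cleaned = clean_words(sample)
print(dict(text=cleaned))
```

Inside parse, the call would then be text = clean_words(kamusset.css("div b::text").extract()). The original attempt with .re(r'[a-z]+') most likely split each word at the dots, because that pattern returns every separate run of letters as its own match instead of joining them.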

Upvotes: 1
