Simo

Reputation: 57

Scrapy not returning results for specific tags

I just started using Scrapy today, but I have prior programming experience with JavaScript, so please bear with me. I'll give a detailed explanation:

I'm using GramReport to analyze some Instagram profiles (extracting the number of followers, the number of posts, and other data). Since I have a list of different profiles, I wanted to automate this task.

The final idea would be like this:

1. Use Scrapy to crawl a specific profile (by appending the profile name to 'gramreport.com/user/')
2. Extract specific data and save it in a CSV file
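
As a rough sketch of those two steps, using only the standard library: the helper names (profile_urls, write_rows), the column names, and the placeholder values are assumptions for illustration, and the actual fetching and extraction would still be done by the spider.

```python
import csv
import io

# Base path taken from the question; profile names get appended to it.
BASE_URL = "http://gramreport.com/user/"


def profile_urls(profiles):
    """Step 1: build one URL per profile name."""
    return [BASE_URL + name for name in profiles]


def write_rows(rows, fileobj, fieldnames):
    """Step 2: write the extracted rows (dicts) as CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


urls = profile_urls(["cats.gato"])

# Placeholder values only; real numbers would come from the spider.
rows = [{"profile": "cats.gato", "followers": 0, "posts": 0}]
buffer = io.StringIO()  # stands in for open("profiles.csv", "w", newline="")
write_rows(rows, buffer, ["profile", "followers", "posts"])
```

The list returned by profile_urls could be assigned to the spider's start_urls so that every profile is crawled in one run.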

I thought that Python would do the job, started searching, and found Scrapy. The documentation was perfect for me: https://doc.scrapy.org/en/latest/intro/tutorial.html

I decided to give it a go just like the tutorial, so I created a spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "profile"
    start_urls = [
        'http://gramreport.com/user/cats.gato'
    ]

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename = 'profile-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

Running scrapy crawl profile works perfectly and I can get the HTML page. Next, I tried using the shell:

scrapy shell 'http://gramreport.com/user/cats.gato'

Great, I can get some data via XPath or CSS:

# Followers:
response.xpath('/html/body/div[3]/table[1]/tr/td[2]/table/tr[1]/td/div/table/tr[2]/td/text()').extract()

# Posts:
response.xpath('/html/body/div[3]/table[1]/tr/td[2]/table/tr[3]/td/div/table/tr[2]/td/text()').extract()

# Page Name:
response.xpath('/html/body/div[3]/table[1]/tr/td[1]/div/div/div/span[2]/text()').extract()

# Average Likes:
response.xpath('/html/body/div[3]/div[1]/div/div/div[1]/div/text()').extract()

# Average Comments:
response.xpath('/html/body/div[3]/div[1]/div/div/div[2]/div/text()').extract()

Most of the results I get have the u prefix and whitespace escape characters, such as [u'\n\t\t\t252,124\t\t'], but I think there are already answered posts for that.
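
For what it's worth, the u prefix is only how Python 2 prints unicode strings, and the \n and \t are ordinary whitespace around the value. A minimal cleanup sketch follows; clean_number is a hypothetical helper name, and it assumes the value is always an integer with comma thousands separators:

```python
def clean_number(raw):
    """Strip surrounding whitespace and comma separators, then parse an int."""
    return int(raw.strip().replace(",", ""))


sample = [u'\n\t\t\t252,124\t\t']  # the value shown above
cleaned = [clean_number(s) for s in sample]
print(cleaned)  # [252124]
```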

But there is some data that I can't extract; I just get no results at all.

The first is the Recent Interactions table. This happens because of AJAX, but I just can't understand how to fix it. Maybe by using a delay?

The second is the Top Hashtags and Top User Mentions tables.

Their XPaths don't work, nor do the CSS selectors. I can't figure out why.

Upvotes: 0

Views: 319

Answers (1)

Granitosaurus

Reputation: 21406

There's an AJAX request being made when the page loads.

If you open the web inspector while the page loads, you'll see an AJAX request like this:

[Screenshot: the AJAX request in the web inspector]

If you Ctrl+F some of the IDs used in this request in the page source, you'll see some JavaScript like this:

[Screenshot: the JavaScript in the page source that makes the request]

You can find this URL using Scrapy and just forward the request:

from scrapy import Request

def parse(self, response):
    # Find the inline script that makes the AJAX call
    script = response.xpath("//script[contains(text(), 'getresultsb')]")
    # Capture the URL between the double quotes (first match only)
    url = script.re_first('url:"(.+?)"')
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': 'XMLHttpRequest',
    }
    yield Request(
        response.urljoin(url),  # also works if the captured URL is relative
        method='POST',
        body='dmn=ok',
        callback=self.parse_recent,
        headers=headers,
    )

def parse_recent(self, response):
    # parse recent data here
    pass
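
To see what the url:"(.+?)" pattern captures, here is a small standalone sketch using only the standard library. The script text and the path in it are made up for illustration; the real inline script on the page may look different.

```python
import re
from urllib.parse import urljoin

# Hypothetical inline script; the path below is invented.
script_text = '''
$.ajax({
    type: "POST",
    url:"/getresultsb/abc123",
    data: "dmn=ok"
});
'''

# Same pattern as in the spider: non-greedy capture between the quotes.
match = re.search(r'url:"(.+?)"', script_text)
ajax_url = match.group(1) if match else None
print(ajax_url)  # /getresultsb/abc123

# A relative path has to be resolved against the page URL before requesting it.
full_url = urljoin("http://gramreport.com/user/cats.gato", ajax_url)
print(full_url)  # http://gramreport.com/getresultsb/abc123
```

The non-greedy (.+?) matters here: a greedy (.+) would run on to the last double quote on the line instead of stopping at the first closing one.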

Upvotes: 2
