Madhur

Reputation: 857

CSS selector using ID not working in Scrapy

I am scraping Commonwealth Games medal counts from this page: https://en.wikipedia.org/wiki/1930_British_Empire_Games

Once the data is scraped, I want to move to the next page. To do so, I want to select a <table> tag whose id attribute is 'collapsibleTable1'.

Now here comes the interesting part. When I run $('#collapsibleTable1') in the Chrome console, I get the desired output.

However, when I run response.css('#collapsibleTable1') in the Scrapy shell, it returns an empty list.

It would be of great help if somebody could explain why it's behaving this way.

Upvotes: 0

Views: 825

Answers (2)

csamleong

Reputation: 859

I had the same problem when I was just starting out with web crawling: I couldn't scrape certain content from a website. As stranac put it, some content is rendered dynamically by JavaScript, so the solution is to go to the data source.

I'm adding my answer because some people, like me, don't know where to start and might need some direction. Please see the official Scrapy documentation on how to get the data from the data source; there are multiple ways to handle it, depending on your situation.

  • If the data is defined in JavaScript code - use wgrep to find the URL of the data source
  • If the data is coming from the original URL - inspect the page source to see where the data is embedded
  • If the data is hardcoded in JavaScript - parse the JavaScript and get the data from there
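As a rough illustration of the last case, here is a minimal sketch using only the standard library. The variable name `data`, the sample markup, and the helper `extract_js_object` are all made up for this example; a real page will need its own pattern, and a simple regex like this only copes with flat objects.

```python
import json
import re

# Hypothetical page source: the values are hardcoded in an inline script.
SAMPLE_HTML = '<script>var data = {"gold": 3, "silver": 1};</script>'


def extract_js_object(html, var_name="data"):
    """Return the JSON object assigned to `var_name`, or None if not found.

    Assumes the object is flat (no nested braces) and is valid JSON.
    """
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


medals = extract_js_object(SAMPLE_HTML)
```

In a Scrapy callback you would pass `response.text` instead of the sample string.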

My understanding from the above is that there are two ways to deal with this problem:

  • Use scrapy-splash to render the page and retrieve the HTML of the resulting DOM; your CSS selector will then work
  • Use a headless browser, for example driven by Selenium, which is very popular for dynamic websites; the program then gets what you see in the browser

More details are covered in the official doc. Hope the reference helps.

Upvotes: 0

stranac

Reputation: 28256

It looks like there is some JavaScript manipulation happening, as that id isn't contained in the actual HTML source (which you can see if you print(response.text)).

Chrome's dev tools show the current state of the DOM after all the JavaScript has been executed, which is not what Scrapy sees.
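One way to confirm this for yourself is to list the ids that actually appear in the raw HTML. The sketch below uses only the standard library; `IdCollector`, `ids_in_html`, and the shortened `raw_html` string are illustrative stand-ins, and in the Scrapy shell you would feed it `response.text` instead.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def ids_in_html(html):
    collector = IdCollector()
    collector.feed(html)
    return collector.ids


# Shortened stand-in for the page source as downloaded (before any JavaScript runs).
raw_html = (
    '<table class="nowraplinks collapsible autocollapse navbox-inner">'
    '<tr><th><div id="Commonwealth_Games_medal_tables"></div></th></tr>'
    "</table>"
)

found = ids_in_html(raw_html)
has_collapsible_id = "collapsibleTable1" in found
```

If `has_collapsible_id` is false for `response.text`, the id is being added in the browser and a selector based on it cannot match in Scrapy.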

Looking at the source, the data you want is shown as:

<table class="nowraplinks collapsible autocollapse navbox-inner" style="border-spacing:0;background:transparent;color:inherit">
<tr>
<th scope="col" class="navbox-title" colspan="2">
<div class="plainlinks hlist navbar mini">
<ul>
<li class="nv-view"><a href="/wiki/Template:Commonwealth_Games_Medal_Counts" title="Template:Commonwealth Games Medal Counts"><abbr title="View this template" style=";;background:none transparent;border:none;-moz-box-shadow:none;-webkit-box-shadow:none;box-shadow:none;">v</abbr></a></li>
<li class="nv-talk"><a href="/wiki/Template_talk:Commonwealth_Games_Medal_Counts" title="Template talk:Commonwealth Games Medal Counts"><abbr title="Discuss this template" style=";;background:none transparent;border:none;-moz-box-shadow:none;-webkit-box-shadow:none;box-shadow:none;">t</abbr></a></li>
<li class="nv-edit"><a class="external text" href="//en.wikipedia.org/w/index.php?title=Template:Commonwealth_Games_Medal_Counts&amp;action=edit"><abbr title="Edit this template" style=";;background:none transparent;border:none;-moz-box-shadow:none;-webkit-box-shadow:none;box-shadow:none;">e</abbr></a></li>
</ul>
</div>
<div id="Commonwealth_Games_medal_tables" style="font-size:114%;margin:0 4em"><a href="/wiki/All-time_Commonwealth_Games_medal_table" title="All-time Commonwealth Games medal table">Commonwealth Games medal tables</a></div>
</th>
</tr>
<tr>
<td colspan="2" class="navbox-list navbox-odd hlist" style="width:100%;padding:0px">
<div style="padding:0em 0.25em">
<ul>
<li><a href="/wiki/1930_British_Empire_Games#Medal_table" title="1930 British Empire Games">1930</a></li>
<li><a href="/wiki/1934_British_Empire_Games#Medals_by_country" title="1934 British Empire Games">1934</a></li>
<li><a href="/wiki/1938_British_Empire_Games#Medals_by_country" title="1938 British Empire Games">1938</a></li>
<li><a href="/wiki/1950_British_Empire_Games#Medals_by_country" title="1950 British Empire Games">1950</a></li>
<li><a href="/wiki/1954_British_Empire_and_Commonwealth_Games#Medal_table" title="1954 British Empire and Commonwealth Games">1954</a></li>
<li><a href="/wiki/1958_British_Empire_and_Commonwealth_Games#Medals_by_country" title="1958 British Empire and Commonwealth Games">1958</a></li>
<li><a href="/wiki/1962_British_Empire_and_Commonwealth_Games#Medals_by_country" title="1962 British Empire and Commonwealth Games">1962</a></li>
<li><a href="/wiki/1966_British_Empire_and_Commonwealth_Games#Medals_by_country" title="1966 British Empire and Commonwealth Games">1966</a></li>
<li><a href="/wiki/1970_British_Commonwealth_Games#Medals_by_country" title="1970 British Commonwealth Games">1970</a></li>
<li><a href="/wiki/1974_British_Commonwealth_Games#Medals_by_country" title="1974 British Commonwealth Games">1974</a></li>
<li><a href="/wiki/1978_Commonwealth_Games#Medals_by_country" title="1978 Commonwealth Games">1978</a></li>
<li><a href="/wiki/1982_Commonwealth_Games#Medals_by_country" title="1982 Commonwealth Games">1982</a></li>
<li><a href="/wiki/1986_Commonwealth_Games#Medals_by_country" title="1986 Commonwealth Games">1986</a></li>
<li><a href="/wiki/1990_Commonwealth_Games#Medals_by_country" title="1990 Commonwealth Games">1990</a></li>
<li><a href="/wiki/1994_Commonwealth_Games#Medal_table" title="1994 Commonwealth Games">1994</a></li>
<li><a href="/wiki/1998_Commonwealth_Games#Medal_table" title="1998 Commonwealth Games">1998</a></li>
<li><a href="/wiki/2002_Commonwealth_Games#Final_medal_table" title="2002 Commonwealth Games">2002</a></li>
<li><a href="/wiki/2006_Commonwealth_Games_medal_table" title="2006 Commonwealth Games medal table">2006</a></li>
<li><a href="/wiki/2010_Commonwealth_Games_medal_table" title="2010 Commonwealth Games medal table">2010</a></li>
<li><a href="/wiki/2014_Commonwealth_Games_medal_table" title="2014 Commonwealth Games medal table">2014</a></li>
<li><a href="/wiki/2018_Commonwealth_Games_medal_table" title="2018 Commonwealth Games medal table">2018</a></li>
</ul>
</div>
</td>
</tr>
</table>
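Since that markup is in the raw source, you can select the table by something that is really there, such as the `Commonwealth_Games_medal_tables` id on the title div. The following is only a sketch based on the snippet above: the function name `medal_table_urls` is made up, and the XPath may need adjusting if Wikipedia changes the template.

```python
def medal_table_urls(response):
    """Return absolute URLs of the per-year medal table links in the navbox.

    `response` is expected to be a Scrapy response (anything offering
    .xpath() and .urljoin() works). The table is located through the
    div id that is present in the raw HTML, not the JavaScript-added id.
    """
    hrefs = response.xpath(
        '//table[.//div[@id="Commonwealth_Games_medal_tables"]]'
        '//td[contains(@class, "navbox-list")]//a/@href'
    ).getall()
    # The hrefs are relative ("/wiki/..."), so resolve them against the page URL.
    return [response.urljoin(href) for href in hrefs]
```

Inside a spider callback you could then yield a request for each URL, for example with `response.follow(url, callback=self.parse)`.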

Upvotes: 1
