user3175993

Reputation: 19

Web scraping with nested frames and javascript

I want to get answers from an online chatbot: http://talkingbox.dyndns.org:49495/braintalk? (the ? is part of the link).

To send a question you just have to send a simple request:

http://talkingbox.dyndns.org:49495/in?id=3B9054BC032E53EF691A9A1803040F1C&msg=[Here the question]

The page source looks like this:

<frameset cols="*,185" frameborder="no" border="0" framespacing="0">
<frameset rows="100,*,82" frameborder="no" border="0" framespacing="0">
    <frame src="http://thebot.de/bt_banner.html" marginwidth="0" name="frtop" scrolling="no" marginheight="0" frameborder="no">
    <frame src="out?id=3B9054BC032E53EF691A9A1803040F1C" name="frout" marginwidth="0" marginheight="0">
    <frameset rows="100%,*" border="0" framespacing="0" frameborder="no">
        <frame src="bt_in?id=3B9054BC032E53EF691A9A1803040F1C" name="frin" scrolling="no" marginwidth="0" marginheight="0" noresize>
        <frame src="" name="frempty" marginwidth="0" marginheight="0" scrolling="auto" frameborder="no" >
    </frameset>
</frameset>
<frameset frameborder="no" border="0" framespacing="0" rows="82,*">
    <frame src="stats?" name="fr1" scrolling="no" marginwidth="0" marginheight="0" frameborder="no">
    <frame src="http://thebot.de/bt_rechts.html" name="fr2" scrolling="auto" marginwidth="0" marginheight="0" frameborder="no" >
</frameset>
</frameset>

I was using mechanize and BeautifulSoup for web scraping, but I suppose mechanize does not support dynamic web pages.

How can I get the answers in this case?

I am also looking for a solution that works well on both Windows and Linux.

Upvotes: 0

Views: 2301

Answers (2)

Guy Gavriely

Reputation: 11396

Whether you use BeautifulSoup, mechanize, Requests or even Scrapy, loading the dynamic page has to be done in a separate step that you write yourself.

For example, using Scrapy this may look something like:

import scrapy


class TheBotSpider(scrapy.Spider):
    name = 'thebot'
    allowed_domains = ['thebot.de', 'talkingbox.dyndns.org']

    def __init__(self, *a, **kw):
        # "-a question=..." arrives as a keyword argument, which the
        # base class stores as self.question
        super(TheBotSpider, self).__init__(*a, **kw)
        self.domain = 'http://talkingbox.dyndns.org:49495/'
        self.start_urls = [self.domain +
                           'in?id=3B9054BC032E53EF691A9A1803040F1C&msg=' +
                           self.question]

    def parse(self, response):
        # the frame src is relative ("out?id=..."), so resolve it
        # against the URL of the page it came from
        src = response.xpath('//frame[@name="frout"]/@src').extract()[0]
        yield scrapy.Request(url=response.urljoin(src),
                             callback=self.dynamic_page)

    def dynamic_page(self, response):
        # .... xpath to scrape answer
        pass

Run it with a question as an argument:

scrapy crawl thebot -a question=[Here the question]

For more details on how to use Scrapy, see the Scrapy tutorial.
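
The first step (finding the frame that holds the output) can also be sketched with the standard library alone, which may help when checking what the XPath above is expected to return. This is only a sketch: `FrameCollector`, `frame_urls` and `BASE_URL` are names made up for illustration, and it assumes the frameset was served from the braintalk? address given in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the address the frameset in the question was served from.
BASE_URL = "http://talkingbox.dyndns.org:49495/braintalk?"


class FrameCollector(HTMLParser):
    """Collect name -> absolute URL for every <frame> with a non-empty src."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag != "frame":
            return
        attrs = dict(attrs)
        src = attrs.get("src")
        if src:  # skip placeholders such as the empty "frempty" frame
            self.frames[attrs.get("name")] = urljoin(self.base_url, src)


def frame_urls(html, base_url=BASE_URL):
    """Return a dict mapping frame names to absolute URLs."""
    parser = FrameCollector(base_url)
    parser.feed(html)
    return parser.frames


# Shortened copy of the frameset from the question.
sample = '''
<frameset rows="100,*,82">
    <frame src="http://thebot.de/bt_banner.html" name="frtop">
    <frame src="out?id=3B9054BC032E53EF691A9A1803040F1C" name="frout">
    <frame src="" name="frempty">
</frameset>
'''
print(frame_urls(sample)["frout"])
```

The relative src of frout is resolved against the page address, which is the same job `response.urljoin` does inside a Scrapy callback.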

Upvotes: 1

laike9m

Reputation: 19388

I would use Requests for a task like this.

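import requests

your_question = "Hello"  # placeholder, replace with your own question
# passing the query as params lets Requests URL-encode the question
r = requests.get("http://talkingbox.dyndns.org:49495/in",
                 params={"id": "3B9054BC032E53EF691A9A1803040F1C", "msg": your_question})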

For webpages that do not contain dynamic content, r.text is what you want.

Since you didn't provide more information about the dynamic webpage, there is not much more to say.
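
If the reply turns out to be rendered in the frout frame from the question's source (out?id=...), one possible two-step flow is to send the message and then fetch that frame. The sketch below rests on that assumption, which has not been checked against the server; it also assumes the id stays valid between the two requests and that the server accepts standard URL encoding. `in_url`, `out_url` and `ask` are hypothetical helper names.

```python
from urllib.parse import urlencode

BASE = "http://talkingbox.dyndns.org:49495/"
SESSION_ID = "3B9054BC032E53EF691A9A1803040F1C"  # id taken from the question


def in_url(question, session_id=SESSION_ID):
    """URL that submits a question; urlencode escapes spaces and punctuation."""
    return BASE + "in?" + urlencode({"id": session_id, "msg": question})


def out_url(session_id=SESSION_ID):
    """URL of the 'frout' frame, assumed here to hold the bot's reply."""
    return BASE + "out?" + urlencode({"id": session_id})


def ask(question, fetch, session_id=SESSION_ID):
    """Send the question, then return the raw HTML of the output frame.

    `fetch` is any callable that maps a URL to the response text.
    """
    fetch(in_url(question, session_id))
    return fetch(out_url(session_id))
```

With Requests, this could be called as `ask("Hello", lambda url: requests.get(url).text)`; the returned HTML would still need to be parsed, for example with BeautifulSoup, to pull out the answer text.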

Upvotes: 0
