gongarek

Reputation: 1034

Scraping complex comments in Scrapy

I am using Scrapy and want to scrape comments, for example from this page: https://www.thingiverse.com/thing:2/comments

I will scrape more sites, so I want the code to be flexible.

I have no idea how to scrape comments without losing the information about which 'container' a comment is in and what the comment's 'depth' is.

Let's say I have three fields: Id_container, content and depth. This information is enough to reconstruct the relations between comments. How can I write the code so that every comment carries this information?

The question is general, so any tips will be useful.

Upvotes: 0

Views: 154

Answers (1)

Thiago Curvelo

Reputation: 3740

To avoid losing the hierarchy information, you could start by selecting all depth-1 comments and then recursing into each of them, e.g.:

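from collections import OrderedDict
from pprint import pprint

def get_children_hierarchy(selector, depth=1):
    """Map each comment id at this depth to the hierarchy of its replies."""
    hierarchy = OrderedDict()
    # Find elements marked depth-N, then step up to the parent that carries the id.
    children = selector.css(f'.depth-{depth}').xpath('..')
    for child in children:
        key = child.xpath('./@id').get()
        # Recurse inside this comment only, looking one level deeper.
        hierarchy[key] = get_children_hierarchy(child, depth + 1)
    # A comment without replies is stored as None.
    return hierarchy or None

# `response` is the Scrapy response for the comments page (scrapy shell or a spider callback).
pprint(get_children_hierarchy(response))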

Output:

OrderedDict([('comment-2217537', None),
             ('comment-1518847', None),
             ('comment-1507448', None),
             ('comment-1233476', None),
             ('comment-1109024',
              OrderedDict([('comment-1554022', None),
                           ('comment-1215964', None)])),
             ('comment-874441', None),
             ('comment-712565',
              OrderedDict([('comment-731427',
                            OrderedDict([('comment-809279',
                                          OrderedDict([('comment-819752',
                                                        OrderedDict([('comment-1696778',
                                                                      None)]))]))]))])),
             ('comment-472013', None),
             ('comment-472012', OrderedDict([('comment-858213', None)])),
             ('comment-403673', None)])

Then, with the comment id, you can look up all the information you want for that particular comment.
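If you would rather have the flat fields from the question than a nested dict, the hierarchy can be walked once more to produce one record per comment. This is only a sketch: `flatten_hierarchy` and the record keys are names made up for illustration, and the sample data is a small excerpt of the output above. The content field is left out because the selector for a comment's text depends on the site.

```python
from collections import OrderedDict


def flatten_hierarchy(hierarchy, parent_id=None, depth=1):
    """Yield one flat record per comment from the nested mapping (hypothetical helper)."""
    # Leaves are stored as None, so treat None as an empty mapping.
    for comment_id, children in (hierarchy or {}).items():
        yield {"id": comment_id, "id_container": parent_id, "depth": depth}
        # Replies belong to the current comment and sit one level deeper.
        yield from flatten_hierarchy(children, parent_id=comment_id, depth=depth + 1)


# Small excerpt of the nested output shown above.
sample = OrderedDict([
    ("comment-1233476", None),
    ("comment-472012", OrderedDict([("comment-858213", None)])),
])

rows = list(flatten_hierarchy(sample))
for row in rows:
    print(row)
```

Each record keeps the id of the containing comment (None for top-level comments) and the depth, so the tree can be rebuilt later from the flat rows.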

Upvotes: 2
