edyvedy13

Reputation: 2296

Scraping data from html table, selecting elements between titles

I am trying to scrape information from the following URL: http://www.mobygames.com/game/xbox360/wheelman/credits with this code:

# Imports
import requests
from bs4 import BeautifulSoup

credit_link = "http://www.mobygames.com/game/xbox360/wheelman/credits"
response = requests.get(credit_link)
soup = BeautifulSoup(response.text, "lxml")

# Container that holds the credits table
credit_infor = soup.find("div", class_="col-md-8 col-lg-8")
# All rows (title rows and credit rows) of the credits table
credit_infor1 = credit_infor.select('table[summary="List of Credits"]')[0].find_all('tr')

This is the format that I need to get:

info          credit_to  studio                   game       console
starring      138920     starring                 Wheelman   Xbox 360
Studio Heads  151851     Midway Newcastle Studio  Wheelman   Xbox 360
Studio Heads  73709      Midway Newcastle Studio  Wheelman   Xbox 360

Here, info corresponds to the first "td" in each row, credit_to is the ID of a particular contributor (e.g. 138920 is the ID of Vin Diesel), and studio is the title under which the row appears (e.g. Starring). I think I can handle everything except getting the studio name (i.e. the title) for each row; it switches from Midway Newcastle Studio to San Diego QA Team later on, and so on. How can I do that?

Upvotes: 1

Views: 65

Answers (1)

Keyur Potdar

Reputation: 7238

According to your program, credit_infor1 holds a list of all the tr tags (rows). If you check the HTML, the rows that contain a title (studio) have no class attribute, while all the other rows have the class="crln" attribute.

So you can iterate over all the rows and check whether the current row has a class attribute using the has_attr() function (which is somewhat hidden in the docs). If the attribute is not present, the row is a title row, so update the current title; otherwise, scrape the other data from the row.

Continuing your program:

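studio = ''  # title (studio) of the section currently being read
for row in credit_infor1:
    # Title rows have no class attribute: remember the title and move on
    if not row.has_attr('class'):
        studio = row.h2.text
        continue

    # get other values that you want from this row below

    info = row.find('td').text  # first cell of the credit row
    # similarly get all the other values you need each time

    print(info + ' | ' + studio)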

Partial output:

Starring | Starring
Studio Heads | Midway Newcastle Studio
Executive Producers | Midway Newcastle Studio
Technical Directors | Midway Newcastle Studio
Lead Programmers | Midway Newcastle Studio
...
QA Manager | San Diego QA Team
Compliance QA Manager | San Diego QA Team
QA Data Analyst | San Diego QA Team
...
SQA Analyst | SQS India QA
QA Team | SQS India QA
Executive Producers | Tigon Studios
Head of Game Production | Tigon Studios
...
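
To get closer to the format asked for in the question, the same loop could also collect the contributor IDs and append the game and console to each record. The following is only a sketch run against a small hand-written sample, not the live page: SAMPLE_HTML, the "developerId,<number>" pattern in the link href, the placeholder names "Person A" and "Person B", and the parse_credits helper are all assumptions made for illustration, so check them against the real HTML before relying on them.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample imitating the layout described above: title rows have
# no class attribute, credit rows have class="crln". The href format is an
# assumption and may differ on the real page.
SAMPLE_HTML = """
<table summary="List of Credits">
  <tr><td colspan="2"><h2>Starring</h2></td></tr>
  <tr class="crln"><td>Starring</td>
      <td><a href="/developer/sheet/view/developerId,138920/">Vin Diesel</a></td></tr>
  <tr><td colspan="2"><h2>Midway Newcastle Studio</h2></td></tr>
  <tr class="crln"><td>Studio Heads</td>
      <td><a href="/developer/sheet/view/developerId,151851/">Person A</a>,
          <a href="/developer/sheet/view/developerId,73709/">Person B</a></td></tr>
</table>
"""


def parse_credits(html, game, console):
    """Return (info, credit_to, studio, game, console) tuples, one per contributor."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select('table[summary="List of Credits"]')[0].find_all("tr")

    studio = ""  # last title seen; applies to every credit row below it
    records = []
    for row in rows:
        if not row.has_attr("class"):
            # Title row: remember the studio name and move on
            studio = row.h2.get_text(strip=True)
            continue

        info = row.find("td").get_text(strip=True)
        # One record per contributor link found in the row
        for link in row.find_all("a", href=True):
            match = re.search(r"developerId,(\d+)", link["href"])
            if match:
                records.append((info, match.group(1), studio, game, console))
    return records


records = parse_credits(SAMPLE_HTML, "Wheelman", "Xbox 360")
for record in records:
    print(" | ".join(record))
```

With the real page, you would pass response.text instead of SAMPLE_HTML. Because each record is a plain tuple, the list can be handed directly to csv.writer or to a pandas DataFrame if you need the tabular layout.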

Upvotes: 1
