Marie10

Reputation: 161

Map h4 to div siblings in dataframe Beautifulsoup python

I'm scraping a webpage but having trouble mapping the information into a dataframe. There are no tables in the HTML. Here is an example of the HTML:

html= [
<h2>Event Title</h2>
<div class="row">
    <h4>Category 1</h4>
    <div>A</div>
    <h4>Category 2</h4>
    <div>B</div>
    <h4>Category 3</h4>
    <div>C</div>
    <h4>Category 4</h4>
    <div>D</div>
</div>
]

Here is my code using requests and BeautifulSoup in Python:

data = []
event = soup.find('h2')
for i in soup.find_all('div', {'class': 'row'}):
    categories = [x.text for x in i.findAll('h4')]
    info = [x.text for x in i.findAll('div')]

    datum = {'event': event.get_text().replace('\n', '').replace('\r', ''),
             'categories': categories,
             'info': info}

    data.append(datum)

df = pd.DataFrame(data)
df

The resulting dataframe has a single row with one event title and two lists:

index - event - categories - info
1 - Event Title - ['Category 1','Category 2','Category 3','Category 4'] - ["Category 1 \n A\n Category 2\n B\n Category 3\n C\n Category 4\n D\n"]

I would like to map the data so that each h4 ends up in the same row as the div that follows it, for example h4 Category 1 with div A:

index - event - categories - info
1 - Event Title - Category 1 - A
2 - Event Title - Category 2 - B
3 - Event Title - Category 3 - C
4 - Event Title - Category 4 - D

Since h4 and div are siblings and not parent and child, is it possible to separate them in my web scraping code? I have multiple pages with different event titles, and there is too much data to do it by hand.

I have also tried, among others:

data = []

event = soup.find('h2').get_text()

for i in soup.find_all('div', {'class': 'row'}):
    categories = [x.text for x in soup.findAll('h4')]
    cats = soup.find_all('h4')
    cat = cats[3]
    info = cat.findNextSiblings('div')

    datum = {'event': event, 'categories': categories, 'info': info}
    data.append(datum)

    df1 = pd.DataFrame(data)
df1

The result of this one gives me a df of:

index - event - categories - info
1 - Event Title - ['Category 1','Category 2','Category 3','Category 4'] - [<div>A</div>, <div>B</div>, <div>C</div>, <div>D</div>]

Here is the weblink to inspect the elements: https://www.ibjjfdb.com/ChampionshipResults/926/PublicResults

Any ideas would be helpful. Thank you!

Upvotes: 0

Views: 396

Answers (1)

Stef

Reputation: 30579

Type, category and info are all at the same level in your linked example, so you'll have to iterate through them in document order and update the current type and category as soon as a new type or category is encountered. (Please note that I had to introduce a new column, Type, for the result type.)
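The idea of walking the siblings in order and remembering the most recent headings can be sketched on a small inline document. This is only an illustration under assumptions: SAMPLE_HTML, the h3 "Type" headings in it, the "row" class and the collect_rows helper are made up for the example and are not taken from the linked page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: headings and info blocks are siblings.
SAMPLE_HTML = """
<h2>Event Title</h2>
<div class="row">
    <h3>Type 1</h3>
    <h4>Category 1</h4>
    <div>A</div>
    <h4>Category 2</h4>
    <div>B</div>
    <h3>Type 2</h3>
    <h4>Category 3</h4>
    <div>C</div>
    <div>D</div>
</div>
"""


def collect_rows(html):
    """Return (event, type, category, info) tuples, one per info div."""
    soup = BeautifulSoup(html, "html.parser")
    event = soup.find("h2").get_text(strip=True)
    rows = []
    for container in soup.find_all("div", class_="row"):
        typ = cat = None  # nothing seen yet in this container
        # recursive=False keeps the walk on direct children, in document order
        for tag in container.find_all(["h3", "h4", "div"], recursive=False):
            if tag.name == "h3":
                typ = tag.get_text(strip=True)  # new type applies to later rows
            elif tag.name == "h4":
                cat = tag.get_text(strip=True)  # new category applies to later rows
            else:
                # an info div: emit one row with the headings seen so far
                rows.append((event, typ, cat, tag.get_text(strip=True)))
    return rows


rows = collect_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Initialising typ and cat to None means an info div that appears before any heading still produces a row instead of raising a NameError.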

Regarding the pandas dataframe: it performs much better, and the code is easier to read, if you first collect all the data in a list and only build the dataframe from that list at the end.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

data = []
r = requests.get("https://www.ibjjfdb.com/ChampionshipResults/926/PublicResults")
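# name the parser explicitly (html.parser is an assumption; the original left the choice to bs4)
soup = BeautifulSoup(r.content, 'html.parser')

event = soup.find('h2').get_text(strip=True)
for i in soup.find_all('div', {'class': 'col-xs-12'}):
    # h3 (type), h4 (category) and div (info) are siblings: walk the direct children in order
    for s in i.find_all(['h3', 'h4', 'div'], recursive=False):
        if s.name == 'h3':
            # a new type heading applies to every row that follows
            typ = re.sub(r'\s+', ' ', s.get_text(strip=True))
        elif s.name == 'h4':
            cat = re.sub(r'\s+', ' ', s.get_text(strip=True))
        elif s.name == 'div':
            divs = s.find_all('div')
            if len(divs) > 0:
                # nested divs: info ends up holding the text of the last one
                for di in divs:
                    info = re.sub(r'\s+', ' ', di.get_text(strip=True))
            else:
                info = re.sub(r'\s+', ' ', s.get_text(strip=True))
            data.append((event, typ, cat, info))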

df = pd.DataFrame(data, columns=['Event','Type','Category','Info'])

This yields a dataframe with 452 rows and 4 columns. Sample output of df.iloc[0]:

Event       World Jiu-Jitsu IBJJF Championship 2018
Type                           Results of Academies
Category                                 Adult Male
Info                    10 - Ribeiro Jiu-Jitsu - 15
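For the simplified HTML in the question, where every h4 is directly followed by its own div, a shorter option may be to pair each h4 with its next div sibling via find_next_sibling. The following is a rough sketch under that assumption; SAMPLE_HTML is a cleaned-up copy of the question's snippet, and the pairs list and its dictionary keys are names chosen for illustration.

```python
from bs4 import BeautifulSoup

# Cleaned-up copy of the snippet from the question (closing tags added).
SAMPLE_HTML = """
<h2>Event Title</h2>
<div class="row">
    <h4>Category 1</h4>
    <div>A</div>
    <h4>Category 2</h4>
    <div>B</div>
    <h4>Category 3</h4>
    <div>C</div>
    <h4>Category 4</h4>
    <div>D</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
event = soup.find("h2").get_text(strip=True)

pairs = []
for row in soup.find_all("div", class_="row"):
    for h4 in row.find_all("h4"):
        # first following sibling that is a div; whitespace text nodes are skipped
        info = h4.find_next_sibling("div")
        if info is not None:
            pairs.append({
                "event": event,
                "category": h4.get_text(strip=True),
                "info": info.get_text(strip=True),
            })

for pair in pairs:
    print(pair)
```

Passing pairs to pd.DataFrame would then give one row per category. One caveat: if an h4 has no div of its own, find_next_sibling returns the div belonging to a later h4, so this only fits pages where the pairing is strictly one to one.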

Upvotes: 1
