zalexhp

Reputation: 201

Python Web Scraping: Output to csv

I'm making some progress with web scraping, but I still need help with a few operations:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

for tr in soup.select('.col-md-4 tbody tr'):
    pass  # stuck here: how to get each name plus its table header?

I know there are three tables under the class col-md-4. I want to generate a CSV with three values per row: first name, last name, and the header of the table the name comes from.

first name, last name, header table

Any help would be appreciated.

Upvotes: 0

Views: 109

Answers (3)

twhitcomb

Reputation: 63

You need to first iterate through each table you want to scrape, then for each table, get its header and rows of data. For each row of data, you want to parse out the First Name and Last Name (along with the header of the table).

Here's a verbose working example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

# Iterate through each of the three tables
for table in soup.select(".col-md-4 table"):

    # Grab the header and rows from the table
    header = table.select("thead th")[0].text.strip()
    rows = [s.text.strip() for s in table.select("tbody tr")]

    t = []  # This list will contain the rows of data for this table

    # Iterate through rows in this table
    for row in rows:

        # Split by comma (last_name, first_name)
        split = row.split(",")

        last_name = split[0].strip()
        first_name = split[1].strip()

        # Create the row of data
        t.append([first_name, last_name, header])

    # Convert list of rows to a DataFrame
    df = pd.DataFrame(t, columns=["first_name", "last_name", "table_name"])

    # Append to list of DataFrames
    out.append(df)

# Write to CSVs...
out[0].to_csv("first_table.csv", index=False)  # etc...

Whenever you're web scraping, I highly recommend using strip() on all of the text you parse to make sure you don't have superfluous spaces in your data.
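
To illustrate the point about strip(), here is a small sketch on a made-up cell string (the name is invented, not taken from the page). It only assumes cells look like `LAST, FIRST` surrounded by whitespace:

```python
# Hypothetical raw cell text, as it might come back from `.text` on a <tr>
raw = "\n      GARCIA LOPEZ, JOAN   \n"

# Without strip(), the surrounding whitespace leaks into the fields
last_raw, first_raw = raw.split(",")

# With strip() on the whole cell and on each part, the fields come out clean
last_name, first_name = (part.strip() for part in raw.strip().split(","))

print(repr(last_raw), repr(first_raw))
print(repr(last_name), repr(first_name))
```

The unstripped values still compare unequal to the clean ones, which is the kind of mismatch that tends to show up later when filtering or joining on names.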

I hope this helps!

Upvotes: 1

Milan Cermak

Reputation: 8074

This might work:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = soup.select('.col-md-4 table')
rows = []

for table in tables:
    cleaned = list(table.stripped_strings)
    header, names = cleaned[0], cleaned[1:]
    data = [name.split(', ') + [header] for name in names]
    rows.extend(data)

result = pd.DataFrame.from_records(rows, columns=['surname', 'name', 'table'])
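
The snippet above stops at the DataFrame. With pandas, `result.to_csv('players.csv', index=False)` should produce the file the question asks for (the filename is an assumption). The same step can be sketched offline with the standard `csv` module and invented rows in the `[surname, name, table]` shape built above; `rows_to_csv` is a hypothetical helper:

```python
import csv
import io

# Invented sample rows in the [surname, name, table] shape; not real page data
rows = [
    ["GARCIA LOPEZ", "JOAN", "Jugadors"],
    ["MARTI PUIG", "PERE", "Tecnics"],
]

def rows_to_csv(rows, header=("surname", "name", "table")):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping `io.StringIO()` for `open(path, "w", newline="")` would write a real file instead.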

Upvotes: 1

zalexhp

Reputation: 201

This is what I have done on my own:

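import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Name the CSV after the last URL segment, e.g. 'sant-ildefons-ue-b.csv'
filename = url.rsplit('/', 1)[1] + '.csv'

tables = soup.select('.col-md-4 table')
rows = []

# Each table becomes one row: its header followed by the names it lists
for table in tables:
    t = table.get_text(strip=True, separator='|').split('|')
    rows.append(t)

# Build the DataFrame and write the CSV once, after all tables are collected
df = pd.DataFrame(rows)
print(df)
df.to_csv(filename)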
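
The code above writes one row per table (the header followed by every name). To get the `first name, last name, header table` layout from the question, each of those lists can be unpacked into one record per person. This is only a sketch with invented data, assuming names have the form `LAST, FIRST`; `to_records` is a hypothetical helper:

```python
def to_records(table_cells):
    """Turn [header, 'LAST, FIRST', ...] into [first, last, header] records."""
    header, names = table_cells[0], table_cells[1:]
    records = []
    for name in names:
        # partition() never raises: a name without a comma gets an empty first name
        last, _, first = name.partition(",")
        records.append([first.strip(), last.strip(), header])
    return records

# Invented stand-in for one entry of `rows`; not real page data
sample = ["Jugadors", "GARCIA LOPEZ, JOAN", "MARTI PUIG, PERE", "DELEGAT"]
records = to_records(sample)
print(records)
```

Using `partition(",")` instead of `split(",")` is a deliberate choice here so that an entry without a comma does not abort the whole run.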

Thanks,

Upvotes: 1
