matt

Reputation: 97

Beautiful Soup to Scrape Data from Static Webpages

I am trying to scrape values from a table on multiple static webpages. It is the verb conjugation data for Korean verbs here: https://koreanverb.app/

My Python script uses Beautiful Soup. The goal is to grab all conjugations from multiple URL inputs and output the data to a CSV file.

Conjugations are stored on the page in a table inside a div with class "table-responsive", in table rows with class "conjugation-row". There are multiple "conjugation-row" table rows on each page. My script is somehow only grabbing the first table row with class "conjugation-row".

Why isn't the for loop grabbing all the tr elements with class "conjugation-row"? I would appreciate a solution that grabs all tr elements with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but I get the following error:

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
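For reference, the difference between find() and find_all() can be seen on a toy table shaped like the one described above (a sketch, not the real koreanverb.app markup):

```python
from bs4 import BeautifulSoup

# Toy markup mimicking the structure described in the question.
html = """
<div class="table-responsive"><table>
  <tr class="conjugation-row"><td class="conjugation-name">a</td><td>1</td></tr>
  <tr class="conjugation-row"><td class="conjugation-name">b</td><td>2</td></tr>
</table></div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find("div", class_="table-responsive")

# find() returns only the FIRST matching Tag ...
first = results.find("tr", class_="conjugation-row")
print(first.td.text)

# ... while find_all() returns a ResultSet (a list) of every match.
rows = results.find_all("tr", class_="conjugation-row")
print(len(rows))
```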

Furthermore, when I do get the data and output it to a CSV file, the data is in separate rows as expected, but leaves empty cells: the data rows for the second URL start at the index after all the data rows for the first URL. See example output here:

(screenshot of the CSV output omitted)

See code here:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

# create csv file
outfile = open("scrape.csv","w",newline='')
writer = csv.writer(outfile)

## define first URL to grab conjugation names
url1 = 'https://koreanverb.app/?search=%ED%95%98%EB%8B%A4'

# define dataframe columns
df = pd.DataFrame(columns=['conjugation name'])

# get URL content
response = requests.get(url1)
soup = BeautifulSoup(response.content, 'html.parser')
    
# get table with all verb conjugations
results = soup.find("div", class_="table-responsive")


##### GET CONJUGATIONS AND APPEND TO CSV

# define URLs
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4', 
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

# loop to get data
for url in urls:
    response = requests.get(url)
    soup2 = BeautifulSoup(response.content, 'html.parser')
    
    # get table with all verb conjugations
    results2 = soup2.find("div", class_="table-responsive")
    
    # get dictionary form of verb/adjective
    verb_results = soup2.find('dl', class_='dl-horizontal')
    verb_title = verb_results.find('dd')
    verb_title_text = verb_title.text

    job_elements = results2.find_all("tr", class_="conjugation-row")
    for job_element in job_elements:
        conjugation_name = job_element.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_name_text = conjugation_name.text
        conjugation_korean_text = conjugation_korean.text
        data_column = pd.DataFrame({    'conjugation name': [conjugation_name_text],
                                        verb_title_text: [conjugation_korean_text],

        })
        #data_column = pd.DataFrame({verb_title_text: [conjugation_korean_text]})        
        df = df.append(data_column, ignore_index = True)
        
# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Verb Conjugations Collected and Appended to CSV, one per column')

Upvotes: 0

Views: 207

Answers (1)

kite

Reputation: 541

Get all the job_elements using find_all(), since find() only returns the first occurrence, and iterate over them in a for loop like below.

job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text

    # append element to data (DataFrame.append was removed in pandas 2.0,
    # so use pd.concat instead)
    df2 = pd.DataFrame([[conjugation_name_text, conjugation_korean_text]],
                       columns=['conjugation_name', 'conjugation_korean'])
    df = pd.concat([df, df2], ignore_index=True)

The error occurs because you are calling find() on a ResultSet, which is a list of elements rather than a single element.
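To see why, here is a minimal reproduction with toy markup: ResultSet subclasses list, and Beautiful Soup raises that exact AttributeError when you call find() on one.

```python
from bs4 import BeautifulSoup

# Minimal reproduction of the error from the question (toy markup).
soup = BeautifulSoup(
    "<tr class='x'><td>1</td></tr><tr class='x'><td>2</td></tr>",
    "html.parser")
rows = soup.find_all("tr", class_="x")  # rows is a ResultSet (a list subclass)

try:
    rows.find("td")  # lists have no find(); bs4 raises a helpful AttributeError
except AttributeError as err:
    print(err)
```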

Since your script is growing, I made some modifications, such as using a get_conjugations() function and some more descriptive names. First, conjugation_names and conjugation_korean_names are added as pandas DataFrame columns, and then the remaining columns (korean0, korean1, ...) are added one per additional URL.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# function to parse the html data & get conjugations
def get_conjugations(url):
    #set return lists
    conjugation_names = []
    conjugation_korean_names = []
    #get html text
    html = requests.get(url).text
    #parse the html text
    soup = BeautifulSoup(html, 'html.parser')
    #get table
    table = soup.find("div", class_="table-responsive")
    table_rows = table.find_all("tr", class_="conjugation-row")
    for row in table_rows:
        conjugation_name = row.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_names.append(conjugation_name.text)
        conjugation_korean_names.append(conjugation_korean.text)
    #return both lists
    return conjugation_names, conjugation_korean_names


urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

# define dataframe columns
df = pd.DataFrame(columns=['conjugation_name', 'conjugation_korean', 'korean0', 'korean1'])

conjugation_names, conjugation_korean_names = get_conjugations(urls[0])
df['conjugation_name'] = conjugation_names
df['conjugation_korean'] = conjugation_korean_names

for index, url in enumerate(urls[1:]):
    conjugation_names, conjugation_korean_names = get_conjugations(url)
    #set column name
    column_name = 'korean' + str(index)
    df[column_name] = conjugation_korean_names

#save to csv
df.to_csv('scrape.csv')

# print completion message
print('Export to CSV Complete')

Output:

,conjugation_name,conjugation_korean,korean0,korean1
0,declarative present informal low,해,먹어,마셔
1,declarative present informal high,해요,먹어요,마셔요
2,declarative present formal low,한다,먹는다,마신다
3,declarative present formal high,합니다,먹습니다,마십니다
...

Note: This assumes that elements in different URLs are in same order.
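If that assumption doesn't hold, one way to make the script order-independent is to merge on the conjugation name instead of assigning columns positionally. A sketch with toy data; build_table and its input shape are hypothetical, not part of the answer's code:

```python
import pandas as pd

def build_table(url_results):
    """url_results: list of (conjugation_names, korean_forms) tuples, one per URL."""
    merged = None
    for i, (names, korean) in enumerate(url_results):
        df = pd.DataFrame({"conjugation_name": names, f"korean{i}": korean})
        # outer-merge on the name, so rows line up even if their order differs
        merged = df if merged is None else merged.merge(
            df, on="conjugation_name", how="outer")
    return merged

# toy data: the second URL returns its rows in a different order
table = build_table([
    (["present", "past"], ["해", "했어"]),
    (["past", "present"], ["먹었어", "먹어"]),
])
print(table)
```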

Upvotes: 2
