Reputation: 168
I'm trying to scrape a website where each profile lists a different number of companies — for some profiles it's 5, for some 0, for some 50+, etc. All companies are in 'li' tags under a 'div' tag, and I can locate them fine; I just can't get them into the Excel DataFrame layout I need.
I want to scrape the company names and put all of a profile's companies on the same row, each company in its own column, with a new row for each profile. If I iterate over "companies" with a for loop, it creates both a new row and a new column for each company name.
I could do this by locating each 'li' tag separately, but that's not an elegant solution since I don't know the number of companies under each profile. Any help making this work with short code is appreciated. Thank you for reading!
My Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

company_list = []
for profile in profiles:
    page = requests.get(profile, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    companies = soup.find("div", class_="spec-list attributes-insurance").find_all("li")
    for company in companies:
        item = {f'company_{companies.index(company)}': company.text.strip()}
        company_list.append(item)
pd.DataFrame(company_list).to_excel('company_list.xlsx')
What I get from this code: https://i.sstatic.net/C9q7J.png
What I want: https://i.sstatic.net/etgrH.png
Upvotes: 0
Views: 274
Reputation: 5601
Try storing each profile's company list in a dict keyed by the profile's index.
# save data to a dict
data_list = dict()
for tag, profile in enumerate(profiles):
    # every profile stores its own company_list
    company_list = []
    ########### same code ###########
    page = requests.get(profile, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    companies = soup.find("div", class_="spec-list attributes-insurance").find_all("li")
    ########### same code ###########
    for company in companies:
        # append the plain text so each profile maps to a list of names
        company_list.append(company.text.strip())
    data_list[tag] = company_list
output sample:
data_list = {
    0: ['1199SEIU', 'AMERIGROUP', 'Aetna', 'Anthem', 'Beacon'],
    1: ['Aetna', 'Anthem', 'Beacon'],
    2: ['Beacon', 'AMERIGROUP'],
}
# explode the lists: one row per (profile index, company name)
dfn = pd.Series(data_list).explode().to_frame().reset_index()
dfn.columns = ['idx', 'company']
# duplicate the names so they survive as values after the unstack
dfn['tag'] = dfn['company']
# print(dfn)
df_output = dfn.set_index(['idx', 'company'])['tag'].unstack()
df_output.columns = [f'company_{i}' for (i, col) in enumerate(df_output.columns)]
print(df_output)
company_0 company_1 company_2 company_3 company_4
idx
0 1199SEIU AMERIGROUP Aetna Anthem Beacon
1 NaN NaN Aetna Anthem Beacon
2 NaN AMERIGROUP NaN NaN Beacon
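If the goal is simply one row per profile with the companies in positional columns (as in the asker's second screenshot), a shorter route is to hand the dict straight to the DataFrame constructor — a sketch assuming data_list holds plain lists of strings as in the sample above:

```python
import pandas as pd

data_list = {
    0: ['1199SEIU', 'AMERIGROUP', 'Aetna', 'Anthem', 'Beacon'],
    1: ['Aetna', 'Anthem', 'Beacon'],
    2: ['Beacon', 'AMERIGROUP'],
}

# orient='index' makes each dict entry a row; shorter lists are padded with NaN
df = pd.DataFrame.from_dict(data_list, orient='index')
df.columns = [f'company_{i}' for i in df.columns]
print(df)
```

Note that unlike the unstack approach, this keeps each profile's companies in their scraped order rather than giving every distinct company its own column.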
Upvotes: 1
Reputation: 173
I think you might just need to loop through each column and drop the empty rows from it.
I used a CSV, but you can change that line to read_excel.
import pandas as pd

df_original = pd.read_csv("test.csv", index_col="index")
print(df_original)

df_new = pd.DataFrame()
for col in df_original.columns:
    print(col)
    # drop the NaNs so the remaining values shift up to the top of the column
    df_tmp = df_original[col].dropna().reset_index(drop=True)
    print(df_tmp)
    df_new[col] = df_tmp
print(df_new)
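One caveat with assigning columns one at a time: when df_new starts empty, the first assignment fixes its index, and any later column with more non-NaN values gets truncated to that index. Building all the shifted columns in a single constructor call avoids that, since the constructor pads shorter columns with NaN instead. A minimal sketch on an in-memory frame with made-up data shaped like the output above:

```python
import numpy as np
import pandas as pd

# Made-up frame with NaNs scattered per column, as in the scraped output.
df_original = pd.DataFrame({
    "company_0": ["1199SEIU", np.nan, np.nan],
    "company_1": ["AMERIGROUP", np.nan, "AMERIGROUP"],
    "company_4": ["Beacon", "Beacon", "Beacon"],
})

# Build every shifted column at once so no column is truncated.
df_new = pd.DataFrame({
    col: df_original[col].dropna().reset_index(drop=True)
    for col in df_original.columns
})
print(df_new)
```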
Upvotes: 0