yf879

Reputation: 168

Pandas + BeautifulSoup - How to add data to new column but same row under for loop

I'm trying to scrape a website where each profile lists a different number of companies: for some profiles it's 5, for some 0, for some 50+, etc. All companies are in 'li' tags under a 'div' tag. I can locate the companies fine, but I can't get them into the Excel DataFrame in the shape I need.

I want to scrape the company names and keep all of a profile's companies in the same row: each company in its own column, and a new row for each profile. If I iterate over "companies" with a for loop, it creates both a new row and a new column for every company name.

I could do this by locating each 'li' tag separately, but that isn't an elegant solution since I don't know the number of companies under each profile. Any help making this work with short code is appreciated. Thank you for reading!

My Code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

company_list = []
for profile in profiles:
    page = requests.get(profile, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    companies = soup.find("div", class_="spec-list attributes-insurance").find_all("li")

    # every company becomes its own single-key dict, i.e. its own future row
    for company in companies:
        item = {f'company_{companies.index(company)}': company.text.strip()}
        company_list.append(item)

pd.DataFrame(company_list).to_excel('company_list.xlsx')

What I get from this code: https://i.sstatic.net/C9q7J.png

What I want: https://i.sstatic.net/etgrH.png
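For context, pd.DataFrame treats each dict in the list as a separate row and each distinct key as a column, so appending one single-key dict per company produces the staircase pattern in the first screenshot. A minimal sketch of that behaviour (the sample names are made up):

import pandas as pd

# one single-key dict per company -> one row (and one column) per company
print(pd.DataFrame([{'company_0': 'Aetna'}, {'company_1': 'Anthem'}]))
#   company_0 company_1
# 0     Aetna       NaN
# 1       NaN    Anthem

# one dict per profile -> one row per profile, one column per company
print(pd.DataFrame([{'company_0': 'Aetna', 'company_1': 'Anthem'}]))
#   company_0 company_1
# 0     Aetna    Anthem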

Upvotes: 0

Views: 274

Answers (2)

Ferris

Reputation: 5601

Try storing each profile's company list in a dict.

# save data to a dict: one entry per profile
data_list = dict()
for tag, profile in enumerate(profiles):
    # every profile stores its own company_list
    company_list = []

    ########### same code as the question ###########
    page = requests.get(profile, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    companies = soup.find("div", class_="spec-list attributes-insurance").find_all("li")
    ##################################################

    # append plain strings so data_list matches the sample below
    for company in companies:
        company_list.append(company.text.strip())

    data_list[tag] = company_list

Output sample of data_list:

data_list = {
    0: ['1199SEIU', 'AMERIGROUP', 'Aetna', 'Anthem', 'Beacon'],
    1: ['Aetna', 'Anthem', 'Beacon'],
    2: ['Beacon', 'AMERIGROUP'], 
}

# explode the per-profile lists: one row per (profile, company) pair
dfn = pd.Series(data_list).explode().to_frame().reset_index()
dfn.columns = ['idx', 'company']
dfn['tag'] = dfn['company']

# pivot: profiles become rows, company names become columns
df_output = dfn.set_index(['idx', 'company'])['tag'].unstack()
df_output.columns = [f'company_{i}' for (i, col) in enumerate(df_output.columns)]

print(df_output)

    company_0   company_1 company_2 company_3 company_4
idx                                                    
0    1199SEIU  AMERIGROUP     Aetna    Anthem    Beacon
1         NaN         NaN     Aetna    Anthem    Beacon
2         NaN  AMERIGROUP       NaN       NaN    Beacon
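Note that df_output keeps NaN gaps because the columns come from the company names themselves. If you want the left-packed layout from the question (company_0, company_1, ... filled from the left for every profile), a shorter route, sketched here on the assumption that data_list holds plain lists of names as in the sample above, is to build one dict per profile:

import pandas as pd

data_list = {
    0: ['1199SEIU', 'AMERIGROUP', 'Aetna', 'Anthem', 'Beacon'],
    1: ['Aetna', 'Anthem', 'Beacon'],
    2: ['Beacon', 'AMERIGROUP'],
}

# one dict per profile: keys become columns, each profile becomes one row
rows = [{f'company_{i}': name for i, name in enumerate(names)}
        for names in data_list.values()]
print(pd.DataFrame(rows))
#   company_0   company_1 company_2 company_3 company_4
# 0  1199SEIU  AMERIGROUP     Aetna    Anthem    Beacon
# 1     Aetna      Anthem    Beacon       NaN       NaN
# 2    Beacon  AMERIGROUP       NaN       NaN       NaN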

Upvotes: 1

nicktids

Reputation: 173

I think you might just need to loop through each column and drop the empty rows from it.

I used a CSV, but you can change that line to read_excel.

import pandas as pd

df_original = pd.read_csv("test.csv", index_col="index")
print(df_original)
df_new = pd.DataFrame()

for col in df_original.columns:
    print(col)
    # drop the NaN gaps in this column and re-number the rows from 0
    df_tmp = df_original[col].dropna().reset_index(drop=True)
    print(df_tmp)
    df_new[col] = df_tmp

print(df_new)
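For illustration, here is the same idea on a small in-memory frame instead of test.csv (the sample values are made up): dropping the NaNs in each column shifts the remaining values to the top, which turns the staircase from the question back into one row per profile.

import pandas as pd

# hypothetical staircase input, like the question's first screenshot:
# every scraped company ended up on its own row
df_original = pd.DataFrame({
    'company_0': ['Aetna', None, 'Beacon', None],
    'company_1': [None, 'Anthem', None, 'AMERIGROUP'],
})

df_new = pd.DataFrame()
for col in df_original.columns:
    # drop NaNs so the remaining values shift to the top of the column
    df_new[col] = df_original[col].dropna().reset_index(drop=True)

print(df_new)
#   company_0   company_1
# 0     Aetna      Anthem
# 1    Beacon  AMERIGROUP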

Upvotes: 0
