Reputation: 168
I'm trying to scrape a website where each profile lists a different number of companies — for some profiles it's 5, for some 0, for some 50+, etc. All companies are in 'li' tags under a 'div' tag, and I can locate them fine; I just can't get them into the Excel DataFrame layout I need.
I want to scrape the company names and put all of a profile's companies on the same row, each company in its own column, with a new row for each profile. If I iterate over "companies" with a for loop, it creates both a new row and a new column for each company name.
I could do this by locating each 'li' tag separately, but that's not an elegant solution since I don't know the number of companies under each profile. Any help making this work with short code is appreciated. Thank you for reading!
My Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

company_list = []
for profile in profiles:
    page = requests.get(profile, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    companies = soup.find("div", class_="spec-list attributes-insurance").find_all("li")
    for company in companies:
        item = {f'company_{companies.index(company)}': company.text.strip()}
        company_list.append(item)
pd.DataFrame(company_list).to_excel('company_list.xlsx')
What I get from this code: https://i.sstatic.net/C9q7J.png
What I want: https://i.sstatic.net/etgrH.png
Upvotes: 0
Views: 274
Reputation: 5601
Try storing each profile's company list in a dict keyed by the profile's index.
# save data to a dict
data_list = dict()
for tag, profile in enumerate(profiles):
    # every profile stores its own company_list
    company_list = []
    ########### same code ###########
    page = requests.get(profile, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    companies = soup.find("div", class_="spec-list attributes-insurance").find_all("li")
    ########### same code ###########
    for company in companies:
        # append the plain text so each profile maps to a list of names
        company_list.append(company.text.strip())
    data_list[tag] = company_list
output sample:
data_list = {
    0: ['1199SEIU', 'AMERIGROUP', 'Aetna', 'Anthem', 'Beacon'],
    1: ['Aetna', 'Anthem', 'Beacon'],
    2: ['Beacon', 'AMERIGROUP'],
}
# explode the lists: one row per (profile index, company name)
dfn = pd.Series(data_list).explode().to_frame().reset_index()
dfn.columns = ['idx', 'company']
# duplicate the names so they survive as values after the unstack
dfn['tag'] = dfn['company']
# print(dfn)
df_output = dfn.set_index(['idx', 'company'])['tag'].unstack()
df_output.columns = [f'company_{i}' for (i, col) in enumerate(df_output.columns)]
print(df_output)
company_0 company_1 company_2 company_3 company_4
idx
0 1199SEIU AMERIGROUP Aetna Anthem Beacon
1 NaN NaN Aetna Anthem Beacon
2 NaN AMERIGROUP NaN NaN Beacon
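If the goal is simply one row per profile with the companies in positional columns (as in the asker's second screenshot), a shorter route is to hand the dict straight to the DataFrame constructor — a sketch assuming data_list holds plain lists of strings as in the sample above:

```python
import pandas as pd

data_list = {
    0: ['1199SEIU', 'AMERIGROUP', 'Aetna', 'Anthem', 'Beacon'],
    1: ['Aetna', 'Anthem', 'Beacon'],
    2: ['Beacon', 'AMERIGROUP'],
}

# orient='index' makes each dict entry a row; shorter lists are padded with NaN
df = pd.DataFrame.from_dict(data_list, orient='index')
df.columns = [f'company_{i}' for i in df.columns]
print(df)
```

Note that unlike the unstack approach, this keeps each profile's companies in their scraped order rather than giving every distinct company its own column.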
Upvotes: 1
Reputation: 173
I think you might just need to loop through each column and drop the empty rows from it.
I used a CSV, but you can change that line to read_excel.
import pandas as pd

df_original = pd.read_csv("test.csv", index_col="index")
print(df_original)

df_new = pd.DataFrame()
for col in df_original.columns:
    print(col)
    # drop the NaNs so the remaining values shift up to the top of the column
    df_tmp = df_original[col].dropna().reset_index(drop=True)
    print(df_tmp)
    df_new[col] = df_tmp
print(df_new)
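One caveat with assigning columns one at a time: when df_new starts empty, the first assignment fixes its index, and any later column with more non-NaN values gets truncated to that index. Building all the shifted columns in a single constructor call avoids that, since the constructor pads shorter columns with NaN instead. A minimal sketch on an in-memory frame with made-up data shaped like the output above:

```python
import numpy as np
import pandas as pd

# Made-up frame with NaNs scattered per column, as in the scraped output.
df_original = pd.DataFrame({
    "company_0": ["1199SEIU", np.nan, np.nan],
    "company_1": ["AMERIGROUP", np.nan, "AMERIGROUP"],
    "company_4": ["Beacon", "Beacon", "Beacon"],
})

# Build every shifted column at once so no column is truncated.
df_new = pd.DataFrame({
    col: df_original[col].dropna().reset_index(drop=True)
    for col in df_original.columns
})
print(df_new)
```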
Upvotes: 0