Reputation: 155
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers= {'User-Agent': 'Mozilla/5.0'}
#put all item in this array
response = requests.get('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
soup = BeautifulSoup(response.content, 'html.parser')
table=soup.find_all('table', class_='expo-table general-color')
for row in table:
    for up in row.find_all('td'):
        text_list = [text for text in up.stripped_strings]
        print(text_list)
This code works and returns the correct data, but each cell is printed separately. I want the output in the format shown below, with one label and its value per line. Can you help me?
Indirizzo Bliedinghauserstrasse 27
Città Remscheid
Nazionalità Germania
Sito web www.amannesmann.de
Stand Pad. 3 E14 F11
Telefono +492191989-0
Fax +492191989-201
E-mail [email protected]
Membro di Cecimo
Social
Upvotes: 0
Views: 108
Reputation: 4779
Instead of selecting <td>, select <tr> and use .stripped_strings on each row to get the data row by row, then build the DataFrame from the collected rows.
Here is the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
# collect one list of strings per table row
temp = []
# pass the headers so the User-Agent defined above is actually sent
response = requests.get('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find_all('table', class_='expo-table general-color')
for row in table:
    for up in row.find_all('tr'):
        temp.append([text for text in up.stripped_strings])
df = pd.DataFrame(temp)
print(df)
             0                         1
0    Indirizzo  Bliedinghauserstrasse 27
1        Città                 Remscheid
2  Nazionalità                  Germania
3     Sito web        www.amannesmann.de
4        Stand            Pad. 3 E14 F11
5     Telefono              +492191989-0
6          Fax            +492191989-201
7       E-mail         [email protected]
8    Membro di                      None
9       Social                      None
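If you only need the plain "label value" lines from the question rather than a DataFrame, the same row-wise idea can be sketched without any network access. The HTML below is a hypothetical stand-in for the exhibitor page (the real markup may differ), and rows_as_lines is a name made up for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the exhibitor page; the real markup may differ.
SAMPLE_HTML = """
<table class="expo-table general-color">
  <tr><td>Indirizzo</td><td>Bliedinghauserstrasse 27</td></tr>
  <tr><td>Città</td><td>Remscheid</td></tr>
  <tr><td>Social</td><td></td></tr>
</table>
"""


def rows_as_lines(html):
    """Return one 'label value' string per table row."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for table in soup.find_all("table", class_="expo-table general-color"):
        for tr in table.find_all("tr"):
            # stripped_strings yields every non-empty text fragment in the row,
            # so an empty value cell simply contributes nothing
            lines.append(" ".join(tr.stripped_strings))
    return lines


lines = rows_as_lines(SAMPLE_HTML)
print("\n".join(lines))
```

Rows whose value cell is empty (such as "Social" here) come out as the label alone, which matches the desired output in the question.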
Upvotes: 0
Reputation: 9619
pandas has a built-in HTML table scraper, so you can run:
import pandas as pd
df = pd.read_html('http://smartcatalog.emo-milano.com/it/espositore/a-mannesmann-maschinenfabrik-gmbh')
This returns a list of all tables on the page as DataFrames; you can access your data with df[0]:
| | 0 | 1 |
|---|---|---|
| 0 | Indirizzo | Bliedinghauserstrasse 27 |
| 1 | Città | Remscheid |
| 2 | Nazionalità | Germania |
| 3 | Sito web | www.amannesmann.de |
| 4 | Stand | Pad. 3 E14 F11 |
| 5 | Telefono | +492191989-0 |
| 6 | Fax | +492191989-201 |
| 7 | E-mail | [email protected] |
| 8 | Membro di | nan |
| 9 | Social | nan |
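To get from a two-column frame like this to the "label value" lines asked for in the question, something along these lines should work. This is only a sketch: the frame is built by hand from a few sample rows as a stand-in for the first table returned by read_html, and the names table, lines and info are made up for the example.

```python
import pandas as pd

# Hypothetical sample mirroring the two-column frame shown above
# (a stand-in for the first frame in the list returned by pd.read_html).
table = pd.DataFrame([
    ["Indirizzo", "Bliedinghauserstrasse 27"],
    ["Città", "Remscheid"],
    ["Membro di", None],
])

# Replace missing values with empty strings, then join label and value;
# strip() drops the trailing space left behind by an empty value.
lines = [
    f"{label} {value}".strip()
    for label, value in table.fillna("").itertuples(index=False)
]
print("\n".join(lines))

# Indexing by the label column gives direct lookups by field name.
info = table.set_index(0)[1]
print(info["Città"])
```

The set_index step is optional, but it makes it easy to pull out a single field by its label instead of by row position.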
Upvotes: 1
Reputation: 3400
You can use the .get_text() method to extract the text of each row. Passing strip=True removes the whitespace around each piece of text, and separator=" " puts a single space between the pieces.
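# assumes `soup` from the question; find() returns one table rather than a list of tables
table = soup.find('table', class_='expo-table general-color')
for row in table.find_all('tr'):
    # strip=True trims each text fragment, separator=' ' joins the fragments with a space
    print(row.get_text(strip=True, separator=' '))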
Output:
Indirizzo Bliedinghauserstrasse 27
Città Remscheid
...
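The separator matters more than it might seem: with strip=True alone, the text of neighbouring cells is glued together. A small sketch on a hypothetical single row (the real page's markup is assumed to be similar) shows the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical single row; the real page's markup is assumed to be similar.
row = BeautifulSoup(
    "<table><tr><td> Stand </td><td>Pad. 3 <b>E14</b> F11</td></tr></table>",
    "html.parser",
).find("tr")

# Without a separator the stripped fragments are concatenated directly.
joined = row.get_text(strip=True)
# With separator=" " each stripped fragment is joined by a single space.
spaced = row.get_text(separator=" ", strip=True)

print(joined)  # StandPad. 3E14F11
print(spaced)  # Stand Pad. 3 E14 F11
```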
Upvotes: 0