Reputation: 1360
I want to extract data (name, city, and address) located in a div tag from an HTML file like this:
<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
<a href="/Wiki/Province/Tehran"></a>
city
<a href="/Wiki/City/Tehran"></a>
Address
</div>
</div>
I don't know how to get the data I want from that specific tag. I'm using Python with the BeautifulSoup library.
Upvotes: 0
Views: 7516
Reputation: 107347
You can do it with the lxml.html module (lxml is a third-party package, not part of the standard library):
>>> s="""<div class="mainInfoWrapper">
... <h4 itemprop="name">name</h4>
... <div>
... <a href="/Wiki/Province/Tehran"></a>
... city
... <a href="/Wiki/City/Tehran"></a>
... Address
... </div>
... </div>"""
>>>
>>> import lxml.html
>>> document = lxml.html.document_fromstring(s)
>>> print(document.text_content().split())
['name', 'city', 'Address']
And with BeautifulSoup, to get the text between your tags:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> print(soup.text)
To get the text from a specific tag, just use soup.find_all:
soup = BeautifulSoup(your_HTML_source, 'html.parser')
# Print the text of every <div class="mainInfoWrapper"> in the document
for line in soup.find_all('div', attrs={"class": "mainInfoWrapper"}):
    print(line.text)
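One caveat with splitting the text on whitespace is that a multi-word name or address gets broken into separate items. A minimal sketch of an alternative, assuming the markup from the question: BeautifulSoup's stripped_strings yields each non-empty text node as a whole. The values below are made-up placeholders, not data from the real page.

```python
from bs4 import BeautifulSoup

# Placeholder values (hypothetical); the structure mirrors the HTML in the question.
html = """<div class="mainInfoWrapper">
<h4 itemprop="name">Example Hospital</h4>
<div>
<a href="/Wiki/Province/Tehran"></a>
Tehran
<a href="/Wiki/City/Tehran"></a>
12 Example Street
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find("div", class_="mainInfoWrapper")

# stripped_strings yields every non-empty text node with surrounding
# whitespace removed, so multi-word values stay in one piece.
fields = list(wrapper.stripped_strings)
name, city, address = fields
print(fields)
```

The three-way unpacking only works if the wrapper contains exactly three text nodes, so it would need adjusting for markup where the links themselves contain text.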
Upvotes: 0
Reputation: 87134
There are several <h4> tags in the source HTML, but only one <h4> with the itemprop="name" attribute, so you can search for that first and then access the remaining values from there. Note that the following HTML is correctly reproduced from the source page, whereas the HTML in the question was not:
from bs4 import BeautifulSoup
html = '''<div class="mainInfoWrapper">
<h4 itemprop="name">
NAME
</h4>
<div>
<a href="/Wiki/Province/Tehran">PROVINCE</a> - <a href="/Wiki/City/Tehran">CITY</a> ADDRESS
</div>
</div>'''
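soup = BeautifulSoup(html, 'html.parser')
# The <h4 itemprop="name"> tag is unique, so use it as the starting point
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
# The address is the bare text node that follows the city link
address = city_tag.next_sibling.strip()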
When run against the URL that you provided:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://goo.gl/sCXNp2')
soup = BeautifulSoup(r.content, 'html.parser')
name_tag = soup.find('h4', itemprop='name')
addr_div = name_tag.find_next_sibling('div')
province_tag, city_tag = addr_div.find_all('a')
name, province, city = [t.text.strip() for t in (name_tag, province_tag, city_tag)]
address = city_tag.next_sibling.strip()
>>> print(name)
بیمارستان حضرت فاطمه (س)
>>> print(province)
تهران
>>> print(city)
تهران
>>> print(address)
یوسف آباد، خیابان بیست و یکم، جنب پارک شفق، بیمارستان ترمیمی پلاستیک فک و صورت
I'm not sure that the printed output is displayed correctly on my terminal; however, this code should produce the correct text on a properly configured terminal.
Upvotes: 2
Reputation: 6950
If h4 is used only once, you can do this:
name = soup.find('h4', attrs={'itemprop': 'name'})
print(name.text)
parentdiv = name.find_parent('div', class_='mainInfoWrapper')
cityaddressdiv = name.find_next_sibling('div')
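The snippet above locates the inner div but stops before pulling out the values. A minimal sketch of one way to finish, assuming the markup shown in the question, where the links are empty and each value is the bare text node that follows a link:

```python
from bs4 import BeautifulSoup

# The HTML exactly as posted in the question
html = """<div class="mainInfoWrapper">
<h4 itemprop="name">name</h4>
<div>
<a href="/Wiki/Province/Tehran"></a>
city
<a href="/Wiki/City/Tehran"></a>
Address
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
name = soup.find("h4", attrs={"itemprop": "name"})
cityaddressdiv = name.find_next_sibling("div")

# Assumption: every <a> is immediately followed by a text node holding the value.
city, address = [a.next_sibling.strip() for a in cityaddressdiv.find_all("a")]

print(name.get_text(strip=True))
print(city)
print(address)
```

If the links contain text themselves, as in the real page quoted in another answer, the link text and the trailing text node would have to be read separately.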
Upvotes: -1