Reputation: 85
Here's the thing:
I want to crawl only these tags, which are buried in a lot of other messy HTML:
<table bgcolor="FFFFFF" border="0" cellpadding="5" cellspacing="0" align="center">
<tr>
<td>
<a href="./index.html?id=subjective&page=2">
<img src='https://www.dogdrip.net/?module=file&act=procFileDownload&file_srl=224868098&sid=cc8c0afbb679bef6420500988a756054&module_srl=78' style='max-width:180px;max-height:270px' align='absmiddle' title="cutie cat">
</a>
</td>
</tr>
</table>
First I tried a CSS selector. The selector was:
#div_article_contents > tr:nth-child(1) > th:nth-child(1) > table > tbody > tr:nth-child(1) > td > table > tbody > tr > td > a > img
but soup.select('selector')
didn't work. It returned an empty list.
I don't know why.
Second, every tag that I want to crawl has a specific style, so I tried:
soup.select('img[style = fixedstyle]')
but that didn't work either. I think it is a syntax error...
All I want to crawl is a list of the href links and a list of the img titles.
Please help me.
Upvotes: 0
Views: 235
Reputation: 873
If the img
tag has a specific style value, you can use what you tried. Just remove the extra spaces and put the attribute value in quotes:
from bs4 import BeautifulSoup
html='''
<a href='link'>
<img src='address' style='max-width:222px;max-height:222px' title='owntitle'>
</a>
<a href='link'>
<img src='address1' style='max-width:222px;max-height:222px' title='owntitle1'>
</a>
<a href='link'>
<img src='address2' style='max-width:222px;max-height:222px' title='owntitle2'>
</a>
'''
srcs=[]
titles=[]
soup=BeautifulSoup(html,'html.parser')
# Attribute selector: the quotes go around the value, not the whole expression
for img in soup.select('img[style="max-width:222px;max-height:222px"]'):
    srcs.append(img['src'])
    titles.append(img['title'])
print(srcs)
print(titles)
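An exact match on style stops working as soon as one image uses different pixel sizes. As a minimal sketch, assuming the target images only share the max-width part of their style, the CSS substring operator `*=` can be used instead. The sample markup and the file names `cat.jpg` and `logo.png` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one styled image inside a link, one unrelated image
html = '''
<a href="./index.html?id=subjective&page=2">
<img src="cat.jpg" style="max-width:180px;max-height:270px" title="cutie cat">
</a>
<img src="logo.png" title="site logo">
'''

soup = BeautifulSoup(html, 'html.parser')

# [style*="max-width"] matches any img whose style contains that substring
styled_titles = [img['title'] for img in soup.select('img[style*="max-width"]')]
print(styled_titles)
```

The image without a style attribute is skipped, so only the styled one ends up in the list.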
Otherwise you can start with the a
tag and work down to the img
like this:
for a in soup.select('a'):
    img = a.select_one('img')
    if img is None:  # skip links that do not wrap an image
        continue
    srcs.append(img['src'])
    titles.append(img['title'])
print(srcs)
print(titles)
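Since the question asks for the href links rather than the image sources, here is a sketch that collects those, using markup modeled on the question (the `cat.jpg` source is a placeholder). It also shows one likely reason the long selector returned an empty list: selectors copied from a browser usually contain tbody steps, because browsers insert tbody into tables, while 'html.parser' keeps the markup as written. Whether that is the cause on the real page is an assumption.

```python
from bs4 import BeautifulSoup

# Sample modeled on the question; the real page will contain much more
html = '''
<table bgcolor="FFFFFF" border="0" cellpadding="5" cellspacing="0" align="center">
<tr>
<td>
<a href="./index.html?id=subjective&page=2">
<img src="cat.jpg" style="max-width:180px;max-height:270px" align="absmiddle" title="cutie cat">
</a>
</td>
</tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# A selector with a tbody step finds nothing: the raw markup has no tbody
with_tbody = soup.select('table > tbody > tr > td > a > img')
print(with_tbody)

hrefs = []
titles = []
# No tbody step here; [title] keeps only images that carry a title
for img in soup.select('td > a > img[title]'):
    hrefs.append(img.parent['href'])  # the parent is the <a>, per the selector
    titles.append(img['title'])

print(hrefs)
print(titles)
```

Because the selector requires `a > img`, `img.parent` is always the enclosing link, so the two lists stay aligned.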
Upvotes: 1