user2859603

Reputation: 255

Python webscraping and getting contents of first div tag of its class

I'm working with Python 3.3 and this website: http://www.nasdaq.com/markets/ipos/

My goal is to read only the companies listed under Upcoming IPO. They are inside a div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'}) [0]: # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))

I'd like it to return only:

3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.

But it prints every div that matches the re.match pattern, and it prints them multiple times as well. I tried adding [0] to the for divparent loop to retrieve only the first div, but this caused the repetition instead.

EDIT: Here is the updated code, based on warunsl's solution. This works.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)

divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table = divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))

Upvotes: 1

Views: 11230

Answers (1)

shaktimaan

Reputation: 12092

You mentioned that two elements match the 'class':'genTable thin floatL' criteria. Running a for loop over the first of them does not make sense: iterating over a single element walks through its children, so the loop body runs once per child.
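
To make the repetition concrete, here is a minimal sketch on made-up markup (the SAMPLE string and the variable names are assumptions for illustration, not the real NASDAQ page). It shows that iterating over a Tag yields its direct children rather than the Tag itself:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE = (
    '<div class="genTable thin floatL"><p>a</p><p>b</p><p>c</p></div>'
    '<div class="genTable thin floatL"><p>d</p></div>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
first = soup.find_all("div", attrs={"class": "genTable thin floatL"})[0]

# Iterating over a Tag walks its direct children, not the Tag itself,
# so a loop body placed here would run once per child node.
children = [child.name for child in first]
print(children)
```

With three children in the first div, any inner loop that searches the whole soup would run three times, which matches the repeated output described in the question.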

So replace your outer for loop with

divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]

You also do not need to call soup.find_all again, because that searches the entire document. Restrict the search to divparent instead:

table = divparent.find('table')

The rest of the code that extracts the dates and company names stays the same, except that it should search from the table variable rather than from soup. For example, to print every cell in the table:

for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)  # print() is a function in Python 3
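
Putting the pieces together, a self-contained sketch might look like the following. The SAMPLE markup, the upcoming function name, and the "OTHER CO" row are hypothetical; the real page layout may differ. It assumes each date div is immediately followed by a div holding the company name, as in the question's code:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; only the first div should be searched.
SAMPLE = """
<div class="genTable thin floatL"><table>
<tr><td><div class="ipo-cell-height">3/7/2014</div></td>
<td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td></tr>
</table></div>
<div class="genTable thin floatL"><table>
<tr><td><div class="ipo-cell-height">1/2/2014</div></td>
<td><div class="ipo-cell-height">OTHER CO</div></td></tr>
</table></div>
"""

DATE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')

def upcoming(html):
    """Return (date, company) pairs from the first matching div only."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first match, like find_all(...)[0].
    first = soup.find("div", attrs={"class": "genTable thin floatL"})
    results = []
    # Search inside `first`, not `soup`, so the second div is ignored.
    for div in first.find_all("div", attrs={"class": "ipo-cell-height"}):
        text = div.string
        if text and DATE.match(text):
            results.append((text, div.find_next("div").string))
    return results

rows = upcoming(SAMPLE)
for date, name in rows:
    print('{} - {}'.format(date, name))
```

The `if text` check guards against divs whose .string is None (for example, divs with nested tags), which would otherwise make re.match raise a TypeError.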

Hope it helps.

Upvotes: 2
