Reputation: 255
I'm working with Python 3.3 and this website: http://www.nasdaq.com/markets/ipos/
My goal is to read only the companies listed under Upcoming IPOs. That data is in a div tag with class="genTable thin floatL". There are two divs with this class, and the target data is in the first one.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
for divparent in soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]:  # I tried putting a [0] so it will only return divs in the first genTable thin floatL class
    for div in soup.find_all('div', attrs={'class':'ipo-cell-height'}):
        s = div.string
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
            div_next = div.find_next('div')
            print('{} - {}'.format(s, div_next.string))
I'd like it to return only:
3/7/2014 - RECRO PHARMA, INC.
2/28/2014 - VARONIS SYSTEMS INC
2/27/2014 - LUMENIS LTD
2/21/2014 - SUNDANCE ENERGY AUSTRALIA LTD
2/21/2014 - SEMLER SCIENTIFIC, INC.
Instead, it prints every div that matches the re.match pattern, and it prints them multiple times. I tried adding [0] to the for divparent loop to retrieve only the first container, but that caused the repeated output instead.
EDIT: Here is the updated code, following warunsl's solution. This works.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("http://www.nasdaq.com/markets/ipos/").read()
soup = BeautifulSoup(html)
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
table = divparent.find('table')
for div in table.find_all('div', attrs={'class':'ipo-cell-height'}):
    s = div.string
    if re.match(r'\d{1,2}/\d{1,2}/\d{4}$', s):
        div_next = div.find_next('div')
        print('{} - {}'.format(s, div_next.string))
Upvotes: 1
Views: 11230
Reputation: 12092
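You mentioned that there are two elements that fit the 'class':'genTable thin floatL' criteria, so running a for loop over the first element does not make sense. Iterating over a single element walks its children, and the inner soup.find_all then runs once per child, which is why the output repeats.
Replace your outer for loop with: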
divparent = soup.find_all('div', attrs={'class':'genTable thin floatL'})[0]
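To see where the repetition comes from, here is a minimal sketch. It assumes a small made-up HTML snippet rather than the real page; the names sample_html, first and children are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: two containers
# that share the same class string.
sample_html = (
    '<div class="genTable thin floatL"><p>a</p><p>b</p><p>c</p></div>'
    '<div class="genTable thin floatL"><p>d</p></div>'
)
soup = BeautifulSoup(sample_html, "html.parser")

# Pick the first matching container, as in the line above.
first = soup.find_all('div', attrs={'class': 'genTable thin floatL'})[0]

# Iterating over a Tag yields its direct children, one per pass.
children = [child.name for child in first]
print(children)       # ['p', 'p', 'p']

# Any document-wide search placed inside such a loop would run this many times.
print(len(children))  # 3
```

So a for loop over the [0] element runs its body once per child of that div, not once for the div itself.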
Now you do not need to call soup.find_all again. Doing so would search the entire document. Instead, restrict the search to divparent:
table = divparent.find('table')
The rest of the code that extracts the dates and company names stays the same, except that it searches from the table variable. For example:
for row in table.find_all('tr'):
    for data in row.find_all('td'):
        print(data.string)  # print() is a function in Python 3
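Putting the pieces together, here is a rough, self-contained sketch of the scoped search. It runs against made-up markup that only imitates the structure described in the question (the real page may differ), and the helper name upcoming, the constant DATE_RE and the "EXAMPLE CO" row are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two containers with the same class, where only
# the first one holds the rows we want.
sample_html = """
<div class="genTable thin floatL">
  <table>
    <tr><td><div class="ipo-cell-height">3/7/2014</div></td>
        <td><div class="ipo-cell-height">RECRO PHARMA, INC.</div></td></tr>
    <tr><td><div class="ipo-cell-height">2/28/2014</div></td>
        <td><div class="ipo-cell-height">VARONIS SYSTEMS INC</div></td></tr>
  </table>
</div>
<div class="genTable thin floatL">
  <table>
    <tr><td><div class="ipo-cell-height">1/2/2014</div></td>
        <td><div class="ipo-cell-height">EXAMPLE CO</div></td></tr>
  </table>
</div>
"""

DATE_RE = re.compile(r'\d{1,2}/\d{1,2}/\d{4}$')

def upcoming(html):
    """Return 'date - company' strings from the first matching container only."""
    soup = BeautifulSoup(html, "html.parser")
    divparent = soup.find_all('div', attrs={'class': 'genTable thin floatL'})[0]
    table = divparent.find('table')
    results = []
    # Search from table, not soup, so the second container is never visited.
    for div in table.find_all('div', attrs={'class': 'ipo-cell-height'}):
        s = div.string
        # .string is None when a tag has more than one child, so guard first.
        if s and DATE_RE.match(s):
            div_next = div.find_next('div')
            results.append('{} - {}'.format(s, div_next.string))
    return results

for line in upcoming(sample_html):
    print(line)
```

The guard on s is an extra precaution that is not in the code above: re.match raises a TypeError if it is handed None.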
Hope it helps.
Upvotes: 2