Reputation: 1495
I am stuck after looking through many other questions. My code currently breaks the data into named rows, but it returns the entire HTML line for each cell instead of just the text inside it. I am not sure how to pull out just that text from the row. For example, I only want ASCO VALVE MFG., INC. from the following line:
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">ASCO VALVE MFG., INC.</font></td>
My input looks like this. The header row:
<tr>
<td align="center" id="ColHead_0"><font size="3" face="Arial,Helvetica,sans-serif"><b>WH</b></font></td>
<td align="center" id="ColHead_1"><font size="3" face="Arial,Helvetica,sans-serif"><b>OrderNo.</b></font></td>
<td align="center" id="ColHead_2"><font size="3" face="Arial,Helvetica,sans-serif"><b>Cust.</b></font></td>
<td align="left" id="ColHead_3"><font size="3" face="Arial,Helvetica,sans-serif"><b>Customer Name</b></font></td>
<td align="center" id="ColHead_4"><font size="3" face="Arial,Helvetica,sans-serif"><b>Item Number</b></font></td>
<td align="center" id="ColHead_5"><font size="3" face="Arial,Helvetica,sans-serif"><b>Item Description 1</b></font></td>
<td align="center" id="ColHead_6"><font size="3" face="Arial,Helvetica,sans-serif"><b>Item Description 2</b></font></td>
<td align="center" id="ColHead_7"><font size="3" face="Arial,Helvetica,sans-serif"><b>Qty</b></font></td>
<td align="center" id="ColHead_8"><font size="3" face="Arial,Helvetica,sans-serif"><b>S/N </b></font></td>
</tr>
Data rows are as below:
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">09</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">92427</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20668</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">ASCO VALVE MFG., INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77333</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">S/N 50742543</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">50742543</font></td>
</tr>
My code currently breaks the data into named rows, but each name ends up holding the whole HTML line rather than its text:
soup1 = BeautifulSoup(output, "html.parser")
find_string = soup1.body.find_all(text="-")
Customer_No = []
Serial_No = []
rows = soup1.find_all("tr")
title = rows[0]
headers = rows[1]
datarows = rows[2:]
for row in datarows:
    if len(row) > 7:
        WHID = row.contents[1]
        ORNO = row.contents[3]
        CSNO = row.contents[5]
        CSNM = row.contents[7]
        ITNO = row.contents[9]
        DESC = row.contents[11]
        DESC2 = row.contents[13]
        QTY = row.contents[15]
        SN = row.contents[17]
        print(ITNO)
    else:
        continue
What I am trying to end up with is, I guess, a dictionary of [text in CSNO] and [text in SN] pairs that I can match against a second CSV file. I hope that all makes sense.
Upvotes: 1
Views: 2039
Reputation: 11961
You can extract the text of each element using its .text attribute. Something along the following lines should give you the idea:
from bs4 import BeautifulSoup
content = '''
<tr>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">09</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">92427</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">20668</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">ASCO VALVE MFG., INC.</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">EQPRAN77333</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">RANPAK FILLPAK TT</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">S/N 50742543</font></td>
<td nowrap="nowrap" align="right"><font size="3" face="Arial,Helvetica,sans-serif">1</font></td>
<td nowrap="nowrap" align="left"><font size="3" face="Arial,Helvetica,sans-serif">50742543</font></td>
</tr>'''
soup = BeautifulSoup(content, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    # Search within the current row, not the whole document
    td_cells = row.find_all('td')
    for td_cell in td_cells:
        print(td_cell.text)
Output
09
92427
20668
ASCO VALVE MFG., INC.
EQPRAN77333
RANPAK FILLPAK TT
S/N 50742543
1
50742543
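As for why the code in the question printed markup: row.contents holds the child nodes of the row, which includes the newline text between cells (hence the odd indices) and the td tags themselves, and printing a tag prints its HTML. A minimal sketch of getting named text fields instead, assuming every data row has the nine columns shown in the question; the FIELD_NAMES list is my own illustration built from the question's variable names, and the cell attributes are trimmed for brevity:

```python
from bs4 import BeautifulSoup

# One data row shaped like the question's input (attributes trimmed).
html = """
<tr>
<td><font>09</font></td>
<td><font>92427</font></td>
<td><font>20668</font></td>
<td><font>ASCO VALVE MFG., INC.</font></td>
<td><font>EQPRAN77333</font></td>
<td><font>RANPAK FILLPAK TT</font></td>
<td><font>S/N 50742543</font></td>
<td><font>1</font></td>
<td><font>50742543</font></td>
</tr>
"""

# Illustrative names only, borrowed from the variables in the question.
FIELD_NAMES = ["WHID", "ORNO", "CSNO", "CSNM", "ITNO", "DESC", "DESC2", "QTY", "SN"]

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# find_all("td") returns only the cell tags, so there are no whitespace
# nodes to step over; get_text(strip=True) drops the surrounding markup.
cells = [td.get_text(strip=True) for td in row.find_all("td")]
fields = dict(zip(FIELD_NAMES, cells))

print(fields["CSNM"])  # ASCO VALVE MFG., INC.
```

Indexing the list returned by find_all("td") also means the positions are simply 0 to 8, in column order.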
To store the text, you could do the following:
soup = BeautifulSoup(content, 'html.parser')
rows = soup.find_all('tr')
table_text = []
for row in rows:
    row_text = []
    # Collect the cells of this row only
    td_cells = row.find_all('td')
    for td_cell in td_cells:
        row_text.append(td_cell.text)
    table_text.append(row_text)
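Since the question asks for customer number and serial number pairs, here is a rough sketch of going one step further and building that dictionary. It assumes the customer number is the third column and the serial number the ninth, as in the sample row, and that header cells can be recognised by their id attribute (ColHead_0 and so on). The function name, the title row, and the second data row are made up for illustration:

```python
from bs4 import BeautifulSoup

CSNO_COL = 2  # "Cust." column, assuming the layout shown in the question
SN_COL = 8    # "S/N" column, same assumption


def customer_serial_pairs(html):
    """Return {customer number: serial number} for every data row."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Skip the title row (too few cells) and the header row (cells have ids).
        if len(cells) < 9 or cells[0].has_attr("id"):
            continue
        customer = cells[CSNO_COL].get_text(strip=True)
        serial = cells[SN_COL].get_text(strip=True)
        pairs[customer] = serial
    return pairs


# Sample input: a title row, the header row, the question's data row,
# and a second, invented data row.
sample = """
<table>
<tr><td>Open Orders</td></tr>
<tr>
<td id="ColHead_0"><b>WH</b></td><td id="ColHead_1"><b>OrderNo.</b></td>
<td id="ColHead_2"><b>Cust.</b></td><td id="ColHead_3"><b>Customer Name</b></td>
<td id="ColHead_4"><b>Item Number</b></td><td id="ColHead_5"><b>Item Description 1</b></td>
<td id="ColHead_6"><b>Item Description 2</b></td><td id="ColHead_7"><b>Qty</b></td>
<td id="ColHead_8"><b>S/N </b></td>
</tr>
<tr>
<td>09</td><td>92427</td><td>20668</td><td>ASCO VALVE MFG., INC.</td>
<td>EQPRAN77333</td><td>RANPAK FILLPAK TT</td><td>S/N 50742543</td>
<td>1</td><td>50742543</td>
</tr>
<tr>
<td>09</td><td>92500</td><td>20700</td><td>EXAMPLE CUSTOMER</td>
<td>EQPRAN77333</td><td>RANPAK FILLPAK TT</td><td>S/N 50749999</td>
<td>1</td><td>50749999</td>
</tr>
</table>
"""

pairs = customer_serial_pairs(sample)
print(pairs)
```

Note that a plain dictionary keeps only the last serial number seen for each customer. If one customer can appear on several rows, a list of (customer, serial) tuples or a dictionary of lists would probably be a safer shape for matching against the CSV file.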
Upvotes: 3