Reputation: 435
I have the following tables:
<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tbody><tr>
<td class="labelplain">ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>MAGSAYSAY JR., RAMON B.<br>ROXAS, MAR<br>GORDON, RICHARD "DICK" J.<br>FLAVIER, JUAN M.<br>MADRIGAL, M. A.<br>ARROYO, JOKER P.<br>RECTO, RALPH G.<br></td>
</tr>
</tbody></table>
I can navigate to this part of the HTML with the following code:
coauthor = soup.find('td', text=re.compile(r'Co-author\(s'), attrs={'class': 'labelplain'}).find_next('td')
I am able to get the text using the loop below:
for br in coauthor.find_all('br'):
    firstcoauthor = br.previousSibling
    print(firstcoauthor)
The output I want is all of the text on a single line, separated by semicolons (;), like so: ANGARA, EDGARDO J.;ENRILE, JUAN PONCE;MAGSAYSAY JR., RAMON B.;ROXAS, MAR;GORDON, RICHARD "DICK" J.;FLAVIER, JUAN M.;MADRIGAL, M. A.;ARROYO, JOKER P.;RECTO, RALPH G.
But the code above prints each name on its own line:
ANGARA, EDGARDO J.
ENRILE, JUAN PONCE
MAGSAYSAY JR., RAMON B.
ROXAS, MAR
GORDON, RICHARD "DICK" J.
FLAVIER, JUAN M.
MADRIGAL, M. A.
ARROYO, JOKER P.
RECTO, RALPH G.
I've tried the replace function but to no avail.
print (firstcoauthor.replace("\n", ";"))
and
print (firstcoauthor.replace("\r\n", ";"))
I even tried escaping \r\n and \n like so:
print (firstcoauthor.replace("\\n", ";"))
How do I address my use case?
Upvotes: 1
Views: 1403
Reputation: 25196
I think it is much simpler to get that result by passing a separator (the string used to join the text pieces) to get_text():
soup.find('td').get_text(';')
Based on your example you will get:
ANGARA, EDGARDO J.;ENRILE, JUAN PONCE;MAGSAYSAY JR., RAMON B.;ROXAS, MAR;GORDON, RICHARD "DICK" J.;FLAVIER, JUAN M.;MADRIGAL, M. A.;ARROYO, JOKER P.;RECTO, RALPH G.
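As a self-contained check, here is a minimal sketch of that call. It assumes the built-in html.parser and a shortened copy of the table from the question (three names instead of nine); the variable names cell and joined are only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumption: three names only).
html = '''<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tbody><tr>
<td class="labelplain">ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>RECTO, RALPH G.<br></td>
</tr>
</tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('td', attrs={'class': 'labelplain'})

# The first argument of get_text() is the separator placed between text nodes.
joined = cell.get_text(';')
print(joined)
```

The original loop printed one line per name because each print() call handles a single text node and adds its own newline; there was never a "\n" inside the strings for replace() to find.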
Given the extra semicolons mentioned in your comment, I suspect that the actual element is structured differently from the one in the question and contains extra line breaks or whitespace.
In that case, I would change strategy and recommend one of the following:
pass the additional strip parameter to get_text():
soup.find('td').get_text(';', strip=True)
or use join() on stripped_strings, which does almost the same thing:
';'.join(soup.find('td').stripped_strings)
Example with additional <br> elements, spaces and line breaks added to the HTML; both variants give the output shown after it:
html = '''
<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tbody><tr>
<br>
<td class="labelplain">
ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>MAGSAYSAY JR., RAMON B.<br>ROXAS, MAR<br>GORDON, RICHARD "DICK" J.<br>FLAVIER, JUAN M.<br>MADRIGAL, M. A.<br>ARROYO, JOKER P.<br>RECTO, RALPH G.<br>
<br>
</td>
</tr>
</tbody></table>'''
ANGARA, EDGARDO J.;ENRILE, JUAN PONCE;MAGSAYSAY JR., RAMON B.;ROXAS, MAR;GORDON, RICHARD "DICK" J.;FLAVIER, JUAN M.;MADRIGAL, M. A.;ARROYO, JOKER P.;RECTO, RALPH G.
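To compare the variants side by side, here is a rough sketch run against a shortened version of the messy HTML above. It assumes the built-in html.parser; the names plain, stripped and joined are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the messy HTML (assumption: three names only).
html = '''
<table>
<tbody><tr>
<br>
<td class="labelplain">
ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>RECTO, RALPH G.<br>
<br>
</td>
</tr>
</tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('td')

# Without strip, whitespace-only text nodes are kept and joined too,
# which is where the extra semicolons come from.
plain = cell.get_text(';')

# strip=True strips every text node and drops the ones that end up empty.
stripped = cell.get_text(';', strip=True)

# stripped_strings yields the same stripped, non-empty pieces.
joined = ';'.join(cell.stripped_strings)

print(repr(plain))
print(stripped)
print(joined)
```

Printing repr(plain) makes the stray newlines and semicolons visible, which can help confirm whether your real page has this kind of structure.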
Upvotes: 5
Reputation: 142206
Select the element you want - I'm just picking the td here... (but use whatever selector gets you the element with the embedded br elements)
data = soup.select_one('td')
Then... replace all the br elements with a semicolon:
for br in data.select('br'):
    br.replace_with(';')
Get your element's text:
output = data.get_text()
# 'ANGARA, EDGARDO J.;ENRILE, JUAN PONCE;MAGSAYSAY JR., RAMON B.;ROXAS, MAR;GORDON, RICHARD "DICK" J.;FLAVIER, JUAN M.;MADRIGAL, M. A.;ARROYO, JOKER P.;RECTO, RALPH G.;'
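Put together, the steps above might look like the sketch below. It assumes the built-in html.parser and a shortened td with three names; the final rstrip(';') is my own addition (not part of the steps above) to drop the trailing semicolon left by the last br.

```python
from bs4 import BeautifulSoup

# Shortened cell from the question (assumption: three names only).
html = '<td class="labelplain">ANGARA, EDGARDO J.<br>ENRILE, JUAN PONCE<br>RECTO, RALPH G.<br></td>'

soup = BeautifulSoup(html, 'html.parser')
data = soup.select_one('td')

# Swap every <br> element for a literal semicolon text node.
for br in data.select('br'):
    br.replace_with(';')

output = data.get_text()

# Optional: the final <br> leaves a trailing ';', which rstrip removes.
trimmed = output.rstrip(';')
print(trimmed)
```

Note that replace_with() modifies the parsed tree in place, so the br elements are gone from soup afterwards; parse the HTML again if you still need the original structure.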
Upvotes: 4