Reputation: 85
I'm using Python to extract a buyer's country of residence from an HTML page. The relevant lines are (address faked):
<HR NOSHADE SIZE="1" COLOR="#000000"><B>Buyer Information</B><HR NOSHADE SIZE="1" COLOR="#000000">
<TABLE WIDTH="100%" BORDER="0" CELLPADDING="1" CELLSPACING="0" CLASS="ta"><TR BGCOLOR="#EEEEEE">
<TD WIDTH="25%"> Username:</TD>
<TD WIDTH="75%"><B>joedane</B> <A HREF="http://www.bricklink.com/feedback.asp?u=joedane">(6)</A><IMG BORDER=0 ALT="" SRC="/images/dot.gif" ALIGN="ABSMIDDLE" WIDTH="4" HEIGHT="16"></TD></TR><TR BGCOLOR="#EEEEEE">
<TD> E-Mail:</TD><TD><A HREF="mailto:[email protected]">[email protected]</A></TD></TR><TR BGCOLOR="#EEEEEE">
<TD WIDTH="25%" VALIGN="TOP"> Name & Address:</TD>
<TD WIDTH="75%">Joe Dane
<BR>XXXX 24
<BR>12345 QWERTY
<BR>Germany</TD>
</TR></TABLE>
<HR NOSHADE SIZE="1" COLOR="#000000"><B>Seller Information</B><HR NOSHADE SIZE="1" COLOR="#000000">
I need to get that 'Germany' on the third-to-last line. The country and address will be different each time, so I need a way to extract the country that does not depend on the address lines before it.
I have tried:
#get Shipping Destination
shippingDest = order.split('</TD></TR></TABLE><HR NOSHADE SIZE="1" COLOR="#000000"><B>Seller Information</B>')[0].split('<BR>')[1]
But it doesn't stop at the first <BR> before that line. Presumably my understanding of split is wrong. This should be an easy problem. Any help?
EDIT:
The actual HTML continues: after Seller Information there is a block similar to the buyer information, but with my own country instead of Germany. The script yields Spain, my own country. Can I somehow make it skip my country and return the second one? Reading backwards from the end, that would be the first one after Seller Information.
This is the actual HTML through to the end of the file. Everything after Germany is always the same.
<HR NOSHADE SIZE="1" COLOR="#000000"><B>Buyer Information</B><HR NOSHADE SIZE="1" COLOR="#000000">
<TABLE WIDTH="100%" BORDER="0" CELLPADDING="1" CELLSPACING="0" CLASS="ta"><TR BGCOLOR="#EEEEEE">
<TD WIDTH="25%"> Username:</TD>
<TD WIDTH="75%"><B>joedane</B> <A HREF="http://www.bricklink.com/feedback.asp?u=joedane">(6)</A><IMG BORDER=0 ALT="" SRC="/images/dot.gif" ALIGN="ABSMIDDLE" WIDTH="4" HEIGHT="16"></TD></TR><TR BGCOLOR="#EEEEEE">
<TD> E-Mail:</TD><TD><A HREF="mailto:[email protected]">[email protected]</A></TD></TR><TR BGCOLOR="#EEEEEE">
<TD WIDTH="25%" VALIGN="TOP"> Name & Address:</TD>
<TD WIDTH="75%">Joe Dane
<BR>XXXX 24
<BR>12345 QWERTY
<BR>Germany</TD>
</TR></TABLE>
<HR NOSHADE SIZE="1" COLOR="#000000"><B>Seller Information</B><HR NOSHADE SIZE="1" COLOR="#000000">
<TABLE WIDTH="100%" BORDER="0" CELLPADDING="1" CELLSPACING="0" CLASS="ta">
<TR BGCOLOR="#EEEEEE">
<TD WIDTH="25%"> Username:</TD><TD WIDTH="75%"><B>Brick_Top</B> <A HREF="http://www.bricklink.com/feedback.asp?u=Brick_Top">(466)</A>
<A HREF="http://www.bricklink.com/help.asp?helpID=54">
<IMG ALT="" WIDTH="16" HSPACE="3" ALIGN="ABSMIDDLE" HEIGHT="16" BORDER="0" SRC="/images/bricks/star2.png"></A>
<A HREF="http://www.bricklink.com/aboutMe.asp?u=Brick_Top">
<IMG ALT="" WIDTH="18" ALIGN="ABSMIDDLE" HEIGHT="16" BORDER="0" SRC="/images/bricks/me.png"></A></TD></TR><TR BGCOLOR="#EEEEEE">
<TD> Store Name:</TD><TD><B>Top Bricks from Brick Top</B></TD></TR><TR BGCOLOR="#EEEEEE">
<TD> Store Link:</TD><TD><A HREF="/store.asp?p=Brick_Top">http://www.bricklink.com/store.asp?p=Brick_Top</A></TD></TR><TR BGCOLOR="#EEEEEE">
<TD> E-Mail:</TD><TD><A HREF="mailto:[email protected]">[email protected]</A></TD></TR><TR BGCOLOR="#EEEEEE">
<TD WIDTH="25%" VALIGN="TOP"> Name & Address:</TD>
<TD WIDTH="75%">Gerald Me
<BR>qwerty 234
<BR>Sevilla 41500
<BR>Spain</TD></TR></TABLE>
All I want to get is that Germany (the first of the two countries). Many, many thanks.
EDIT 2.0:
Interestingly enough, I was able to do it just by using [-5] as the index. I don't fully understand why, but my guess is that it picks the fifth <br> from the end of the first table.
from bs4 import BeautifulSoup
import sys
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'html')
country = soup.find('table').find_all('br')[-5]
print(country.find_next(text=True).string)
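To make the indexing concrete, here is a minimal sketch run against a trimmed copy of the buyer table. The `html` string and the `following` list are illustrative names, not part of the original script, and `string=True` assumes a reasonably recent bs4 release. A negative index counts back from the end of the list that `find_all('br')` returns. The excerpt's first table has only three <br> tags, so `[-5]` would raise an IndexError on it; presumably the first table in the real file contains more of them, which cannot be confirmed from the excerpt.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the buyer table from the question (address faked).
html = """
<TABLE CLASS="ta"><TR><TD>Name & Address:</TD>
<TD>Joe Dane
<BR>XXXX 24
<BR>12345 QWERTY
<BR>Germany</TD>
</TR></TABLE>
"""

soup = BeautifulSoup(html, "html.parser")
breaks = soup.find("table").find_all("br")

# Text node that follows each <br>, in document order.
following = [br.find_next(string=True).strip() for br in breaks]
print(following)      # ['XXXX 24', '12345 QWERTY', 'Germany']
print(following[-1])  # Germany: negative indices count back from the end
```

Printing the whole list for the real file should show which offset lands on the buyer's country.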
Upvotes: 0
Views: 243
Reputation: 36262
I suggest using an HTML parser such as BeautifulSoup. The script below finds the last <br> of the first table and, from there, takes the next text node in document order, which is the country:
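from bs4 import BeautifulSoup
import sys

# Parse the HTML file named on the command line
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Last <br> inside the first table on the page
country = soup.find('table').find_all('br')[-1]
# The text node right after that <br> is the country
print(country.find_next(string=True).strip())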
Run it like:
python3 script.py htmlfile
That yields:
Germany
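If the first table on the real page is not the buyer table, a fixed index becomes fragile. One possible variation, sketched here under assumptions, is to anchor on the "Buyer Information" label and take the first table after it. The function name `buyer_country` and the shortened `sample` markup are hypothetical, and the sketch assumes the label text appears exactly as in the question.

```python
import re
from bs4 import BeautifulSoup

def buyer_country(html):
    """Return the text after the last <br> in the table following 'Buyer Information'."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the label text node, then the first <table> after it.
    heading = soup.find(string=re.compile("Buyer Information"))
    if heading is None:
        return None
    table = heading.find_next("table")
    if table is None:
        return None
    breaks = table.find_all("br")
    if not breaks:
        return None
    # The text node right after the last <br> is the final address line.
    return breaks[-1].find_next(string=True).strip()

# Shortened stand-in for the page in the question (assumed structure).
sample = """
<B>Buyer Information</B>
<TABLE><TR><TD>Name & Address:</TD>
<TD>Joe Dane
<BR>XXXX 24
<BR>Germany</TD>
</TR></TABLE>
<B>Seller Information</B>
<TABLE><TR><TD>Name & Address:</TD>
<TD>Gerald Me
<BR>Sevilla 41500
<BR>Spain</TD></TR></TABLE>
"""

print(buyer_country(sample))  # Germany
```

Anchoring on the label keeps the lookup independent of how many tables or <br> tags come before or after the buyer block.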
Upvotes: 4
Reputation: 1472
You can use a regular expression:
import re

target_str = "<HR NOSHADE SI..."  # the full HTML as a string
# Capture the text between a <BR> and the closing </TD>
results = re.findall(r"<BR>([^<]+)</TD>", target_str)
for country in results:
    # On the full page in the question this prints Germany, then Spain
    print(country)
# results[0] is the buyer's country, since the buyer block comes first
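On the full page, a pattern of the form <BR>...</TD> matches both the buyer's and the seller's last address line. The following is a rough sketch that first cuts the page at the "Seller Information" label and then takes the last match in what remains. The helper name `buyer_country_regex` and the shortened `sample` string are made up for illustration. It also shows where the split attempt in the question goes wrong: index `[1]` after splitting on '<BR>' is the piece after the first <BR>, whereas the country follows the last one.

```python
import re

def buyer_country_regex(order):
    """Return the last '<BR>text</TD>' value that appears before the seller block."""
    # Keep only the part of the page before the seller label.
    buyer_part = order.split("<B>Seller Information</B>")[0]
    # Capture text that sits between a <BR> and a closing </TD>.
    matches = re.findall(r"<BR>\s*([^<]+?)\s*</TD>", buyer_part, flags=re.IGNORECASE)
    return matches[-1] if matches else None

# Shortened stand-in for the page in the question (assumed structure).
sample = (
    "<B>Buyer Information</B>\n"
    "<TABLE><TR><TD>Joe Dane\n<BR>XXXX 24\n<BR>12345 QWERTY\n<BR>Germany</TD>\n</TR></TABLE>\n"
    "<B>Seller Information</B>\n"
    "<TABLE><TR><TD>Gerald Me\n<BR>qwerty 234\n<BR>Spain</TD></TR></TABLE>\n"
)

print(buyer_country_regex(sample))  # Germany
```

If the seller label is missing, the split leaves the page whole and the last match would be the seller's country, so checking for the label first may be worthwhile.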
Upvotes: 1