Reputation: 4940
I am using BeautifulSoup in Python to parse some HTML. One of the problems I am dealing with is that I have situations where the colspans are different across header rows. (Header rows are the rows that need to be combined to get the column headings in my jargon) That is one column may span a number of columns above or below it and the words need to be appended or prepended based on the spanning. Below is a routine to do this. I use BeautifulSoup to pull the colspans and to pull the contents of each cell in each row. longHeader is the contents of the header row with the most items, spanLong is a list with the colspans of each item in the row. This works but it is not looking very Pythonic.
Alos-it is not going to work if the diff is <0, I can fix that with the same approach I used to get this to work. But before I do I wonder if anyone can quickly look at this and suggest a more Pythonic approach. I am a long time SAS programmer and so I struggle to break the mold-well I will write code as if I am writing a SAS macro.
longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0
for each in range(len(shortHeader)):
sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
sumSpanShort=sumSpanShort+spanShort[each]
spanDiff=sumSpanShort-sumSpanLong
if spanDiff==0:
combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
longHeaderCount=longHeaderCount+1
continue
for i in range(0,spanDiff):
combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
longHeaderCount=longHeaderCount+1
sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
spanDiff=sumSpanShort-sumSpanLong
if spanDiff==0:
combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
longHeaderCount=longHeaderCount+1
break
print combinedHeader
Upvotes: 1
Views: 863
Reputation: 4940
Well I have an answer now. I was thinking through this and decided that I needed to use parts of every answer. I still need to figure out if I want a class or a function. But I have the algorithm that I think is probably more Pythonic than any of the others. But, it borrows heavily from the answers that some very generous people provided. I appreciate those a lot because I have learned quite a bit.
To save the time of having to make test cases I am going to paste the the complete code I have been banging away with in IDLE and follow that with an HTML sample file. Other than making a decision about class/function (and I need to think about how I am using this code in my program) I would be happy to see any improvements that make the code more Pythonic.
from BeautifulSoup import BeautifulSoup
original=file(r"C:\testheaders.htm").read()
soupOriginal=BeautifulSoup(original)
all_Rows=soupOriginal.findAll('tr')
header_Rows=[]
for each in range(len(all_Rows)):
header_Rows.append(all_Rows[each])
header_Cells=[]
for each in header_Rows:
header_Cells.append(each.findAll('td'))
temp_Header_Row=[]
header=[]
for row in range(len(header_Cells)):
for column in range(len(header_Cells[row])):
x=int(header_Cells[row][column].get("colspan","1"))
if x==1:
temp_Header_Row.append( ' '.join(header_Cells[row][column]) )
else:
for item in range(x):
temp_Header_Row.append( ''.join(header_Cells[row][column]) )
header.append(temp_Header_Row)
temp_Header_Row=[]
combined_Header=zip(*header)
for each in combined_Header:
print each
Okay test file contents are below Sorry I tried to attach these but couldn't make it happen:
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<TR valign="bottom">
<TD width="40%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">FOODS WE LIKE</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2"> </TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2"> </TD>
<TD> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD> </TD>
<TD> </TD>
<TD nowrap align="CENTER" colspan="6">SILLY STUFF</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">OTHER THAN</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="CENTER" colspan="6">FAVORITE PEOPLE</TD>
<TD> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">MONTY PYTHON</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">CHERRYPY</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">APPLE PIE</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">MOTHERS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">FATHERS</TD>
<TD> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD nowrap align="left">Name</TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">SHOWS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">PROGRAMS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">BANANAS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">PERFUME</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">TOOLS</TD>
<TD> </TD>
</TR>
</TABLE>
Upvotes: 0
Reputation: 4940
I guess I am going to answer my own question but I did receive a lot of help. Thanks for all of the help. I made S.LOTT's answer work after a few small corrections. (They may be so small as to not be visible (inside joke)). So now the question is why is this more Pythonic? I think I see that it is less denser / works with the raw inputs instead of derivations / I cannot judge if it is easier to read ---> though it is easy to read
row1=headerCells[0]
row2=headerCells[1]
i1= 0
i2= 0
result= []
while i1 != len(row1) or i2 != len(row2):
if i1 == len(row1):
result.append( ' '.join(row1[i1]) )
i2 += 1
elif i2 == len(row2):
result.append( ' '.join(row2[i2]) )
i1 += 1
else:
if int(row1[i1].get("colspan","1")) < int(row2[i2].get("colspan","1")):
c1= int(row1[i1].get("colspan","1"))
while c1 != int(row2[i2].get("colspan","1")):
txt1= ' '.join(row1[i1]) # needed to add when working adjust opposing case
txt2= ' '.join(row2[i2]) # needed to add when working adjust opposing case
result.append( txt1 + " " + txt2 ) # needed to add when working adjust opposing case
print 'stayed in middle', 'i1=',i1,'i2=',i2, ' c1=',c1
c1 += 1
i1 += 1 # Is this the problem it
elif int(row1[i1].get("colspan","1"))> int(row2[i2].get("colspan","1")):
# Fill extra cols from row2 Make same adjustment as above
c2= int(row2[i2].get("colspan","1"))
while int(row1[i1].get("colspan","1")) != c2:
result.append( ' '.join(row1[i1]) )
c2 += 1
i2 += 1
else:
assert int(row1[i1].get("colspan","1")) == int(row2[i2].get("colspan","1"))
pass
txt1= ' '.join(row1[i1])
txt2= ' '.join(row2[i2])
result.append( txt1 + " " + txt2 )
print 'went to bottom', 'i1=',i1,'i2=',i2
i1 += 1
i2 += 1
print result
Upvotes: 0
Reputation: 34388
Maybe look at the zip function for parts of the problem:
>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]
>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>>
enumerate
can help avoid some of the complex indexing with counters:
>>> diff_list = []
>>> for place, header in enumerate(short_header):
diff_list.append(abs(span_short[place] - span_long[place]))
>>> for place, num in enumerate(diff_list):
if num:
new_shortlist.extend(short_header[place] for item in range(num+1))
else:
new_shortlist.append(short_header[place])
>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',...
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...
Also more pythonic naming may add clarity:
for each in range(len(short_header)):
sum_span_long += span_long[long_header_count]
sum_span_short += span_short[each]
span_diff = sum_span_short - sum_span_long
if not span_diff:
combined_header.append...
Upvotes: 1
Reputation: 391856
You've actually got a lot going on in this example.
You've "over-processed" the Beautiful Soup Tag objects to make lists. Leave them as Tags.
All of these kinds of merge algorithms are hard. It helps to treat the two things being merged symmetrically.
Here's a version that should work directly with the Beautiful Soup Tag objects. Also, this version doesn't assume anything about the lengths of the two rows.
def merge3( row1, row2 ):
i1= 0
i2= 0
result= []
while i1 != len(row1) or i2 != len(row2):
if i1 == len(row1):
result.append( ' '.join(row1[i1].contents) )
i2 += 1
elif i2 == len(row2):
result.append( ' '.join(row2[i2].contents) )
i1 += 1
else:
if row1[i1]['colspan'] < row2[i2]['colspan']:
# Fill extra cols from row1
c1= row1[i1]['colspan']
while c1 != row2[i2]['colspan']:
result.append( ' '.join(row2[i2].contents) )
c1 += 1
elif row1[i1]['colspan'] > row2[i2]['colspan']:
# Fill extra cols from row2
c2= row2[i2]['colspan']
while row1[i1]['colspan'] != c2:
result.append( ' '.join(row1[i1].contents) )
c2 += 1
else:
assert row1[i1]['colspan'] == row2[i2]['colspan']
pass
txt1= ' '.join(row1[i1].contents)
txt2= ' '.join(row2[i2].contents)
result.append( txt1 + " " + txt2 )
i1 += 1
i2 += 1
return result
Upvotes: 2
Reputation: 86362
Here is a modified version of your algorithm. zip is used to iterate over short lengths and headers and a class object is used to count and iterate the long items, as well as combine the headers. while is more appropriate for the inner loop. (forgive the too short names).
class collector(object):
def __init__(self, header):
self.longHeader = header
self.combinedHeader = []
self.longHeaderCount = 0
def combine(self, shortValue):
self.combinedHeader.append(
[self.longHeader[self.longHeaderCount]+' '+shortValue] )
self.longHeaderCount += 1
return self.longHeaderCount
def main():
longHeader = [
'','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader = [
'','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
sumSpanLong=0
sumSpanShort=0
combiner = collector(longHeader)
for sLen,sHead in zip(spanShort,shortHeader):
sumSpanLong += spanLong[combiner.longHeaderCount]
sumSpanShort += sLen
while sumSpanShort - sumSpanLong > 0:
combiner.combine(sHead)
sumSpanLong += spanLong[combiner.longHeaderCount]
combiner.combine(sHead)
return combiner.combinedHeader
Upvotes: 3