Reputation: 107
I am extracting data from PDFs into lists:
list1 = []
for page in pages:
    for lobj in element:
        if isinstance(lobj, LTTextBox):
            x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
            if isinstance(lobj, LTTextContainer):
                for text_line in lobj:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            Font_size = character.size
            list1.append([Font_size, (lobj.get_text())])
        if isinstance(lobj, LTTextContainer):
            for text_line in lobj:
                for character in text_line:
                    if isinstance(character, LTChar):
                        font_name = character.fontname
        list1.append(font_name)
print(list1)
This gives me a list in which each font_name sits beside, not inside, the sublist that holds the size and text:
list = [[12.0, 'aaa'], 'IJEAMP+Times-Bold', [12.0, 'bbb'], 'IJEAOO+Times-Roman', [12.0, 'ccc'], 'IJEAMP+Times-Bold', [10.0, 'ddd'], 'IJEAOO+Times-Roman', [10.0, 'eee'], 'IJEAOO+Times-Roman', [8.0, '2\n'], 'IJEAOO+Times-Roman', 'IJEAOO+Times-Roman']
This is how the list of lists should look:
list = [[12.0, 'aaa', 'IJEAMP+Times-Bold'], [12.0, 'bbb', 'IJEAOO+Times-Roman'], [12.0, 'ccc', 'IJEAMP+Times-Bold'], [10.0, 'ddd', 'IJEAOO+Times-Roman'], [10.0, 'eee', 'IJEAOO+Times-Roman'], [8.0, '2\n', 'IJEAOO+Times-Roman'], 'IJEAOO+Times-Roman']
If possible, I would like an answer that fixes the error in my code. I believe this is possible, so that I don't need to create two lists and zip
them afterwards.
I tried list2.extend([list1, font_name])
but that doesn't work, because the font_name
keeps getting split into individual letters.
Upvotes: 1
Views: 24
Reputation: 51683
You are appending to the outer list, not the list you just added into it. This adds your inner list:
list1.append([Font_size,(lobj.get_text())])
If you want to extend that added list, you can do so by using
list1[-1].append(font_name)
instead of
list1.append(font_name)
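To see the difference without a PDF at hand, here is a minimal sketch that uses made-up stand-in values in place of the pdfminer objects. The names blocks and letters are hypothetical and only serve the illustration; the sizes, texts and font names are borrowed from the sample output in the question. It also shows one common way a string ends up split into letters, which may or may not be what happened with the extend attempt.

```python
# Stand-in data: (font size, text, font name) per text box.
# In the real code these values come from pdfminer's layout objects.
blocks = [
    (12.0, "aaa", "IJEAMP+Times-Bold"),
    (12.0, "bbb", "IJEAOO+Times-Roman"),
    (10.0, "ddd", "IJEAOO+Times-Roman"),
]

list1 = []
for font_size, text, font_name in blocks:
    # This creates the inner list and makes it the last element of list1.
    list1.append([font_size, text])
    # list1[-1] is that inner list, so font_name lands inside it
    # instead of next to it in the outer list.
    list1[-1].append(font_name)

print(list1)

# Pitfall: extend() iterates over its argument, so passing a bare string
# adds each character as a separate element.
letters = []
letters.extend("Times")
print(letters)
```

If size, text and font name are all known at the same point in the loop, building the inner list in one go with list1.append([font_size, text, font_name]) would avoid the second step altogether.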
Upvotes: 1