Reputation: 25
I am trying to loop through PDF bookmarks and save them in an object, including their parent bookmark.
Thanks to PyPDF4 i can get those bookmarks as arrays within arrays.
For example:
[]
0: Root 1
1: []
0: Child layer 1.1
1: Child layer 1.2
2: []
0: Child layer 1.2.1
1: Child layer 1.2.2
1: Child layer 1.2.3
2: Root 2
Since these arrays vary from file to file, I dont know how this array is structured. So I ended up choosing a recursive function to save them.
Function
def __iterate_through_bookmarks(outlines, bookmarks, layer = 0, parent = None):
layer += 1
print(layer)
if isinstance(outlines, list):
for item in outlines:
__iterate_through_bookmarks(item, bookmarks, layer, parent)
return bookmarks
print(outlines.title)
bookmarks.append(bookmark(outlines, parent))
I experimented a bit, but i could not get it right. But seeing how the layer counter in the function gets the layers right gives me the hope, that this is possible.
Layer output:
1
2
Root 1
2
3
Child layer 1.1
3
Child layer 1.2
3
4
Child layer 1.2.1
4
Child layer 1.2.2
4
Child layer 1.2.3
2
Root 2
The child layers 1.x have the same layer counter because of how the recursive function works. However, i cannot find a solution for how to save their (identical) parent in an object.
The end goal is to return an array with bookmark objects. A class that stores title, position and parent of the outlines.
Does anyone have a suggestion?
Upvotes: 2
Views: 490
Reputation: 15525
Recursively browse a nested list of dicts, where each dict has a "name" and a "children" field.
nested_list = [
{'name': 'act IV',
'children': [{'name': 'scene 4',
'children': [{'name': 'dual seduction',
'children': []},
]}
]},
{'name': 'act I',
'children': [{'name': 'scene 2',
'children': [{'name': 'inconstancy praise',
'children': []},
{'name': 'sganarelle monologue',
'children': []}
]}
]},
]
def get_list_of_bookmarks(nested_list, parent_name=None):
bookmark_list = []
for bookmark in nested_list:
bookmark_list.append({'name': bookmark['name'], 'parent': parent_name})
bookmark_list.extend(get_list_of_bookmarks(bookmark['children'], parent_name=bookmark['name']))
return bookmark_list
print(get_list_of_bookmarks(nested_list))
# [{'name': 'act IV', 'parent': None},
# {'name': 'scene 4', 'parent': 'act IV'},
# {'name': 'dual seduction', 'parent': 'scene 4'},
# {'name': 'act I', 'parent': None},
# {'name': 'scene 2', 'parent': 'act I'},
# {'name': 'inconstancy praise', 'parent': 'scene 2'},
# {'name': 'sganarelle monologue', 'parent': 'scene 2'}]
Or alternatively, if you want to store a reference to the parent, rather than just the parent's name:
def get_list_of_bookmarks(nested_list, parent=None):
bookmark_list = []
for bookmark_with_children in nested_list:
bookmark_with_parent = {'name': bookmark_with_children['name'], 'parent': parent}
bookmark_list.append(bookmark_with_parent)
bookmark_list.extend(get_list_of_bookmarks(bookmark_with_children['children'], parent=bookmark_with_parent))
return bookmark_list
print(get_list_of_bookmarks(nested_list))
# [{'name': 'act IV', 'parent': None},
# {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}},
# {'name': 'dual seduction', 'parent': {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}}},
# {'name': 'act I', 'parent': None},
# {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}},
# {'name': 'inconstancy praise', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}},
# {'name': 'sganarelle monologue', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}}]
Note that the output of print
looks like it's very redundant and full of copies, but it's not. Every bookmark contains a reference to its parent; it does not contain of copy of its parent.
Now assuming you don't have an explicit 'children'
field, and your nested list is simply a list of lists; sublists are considered children of the previous element.
nested_list = [
'act IV',
['scene 4', ['dual seduction']],
'act I',
['scene 2', ['inconstancy praise', 'sganarelle monologue']],
]
We can draw inspiration from the related question: Flatten an irregular list of lists.
from collections.abc import Iterable
def flatten_with_depth(l, depth=0):
for el in l:
if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
yield from flatten_with_depth(el, depth=depth+1)
else:
yield (el, depth)
# flatten_with_depth() adapted from flatten() at https://stackoverflow.com/a/2158532/3080723
print(list(flatten_with_depth(nested_list)))
# [('act IV', 0), ('scene 4', 1), ('dual seduction', 2), ('act I', 0), ('scene 2', 1), ('inconstancy praise', 2), ('sganarelle monologue', 2)]
Now the list of bookmarks is flattened, and with each bookmark we have its depths. The parent of a given bookmark at depth depth
is the closest previous bookmark at depth depth-1
. An efficient way to easily find the parents is to maintain a stack of the ancestors in the current branch.
def get_parents_knowing_depths(list_with_depths):
ancestor_stack = []
result = []
for (bookmark_name, depth) in list_with_depths:
if depth == 0:
bookmark = {'name': bookmark_name, 'parent': None}
ancestor_stack = [bookmark]
else:
while len(ancestor_stack) > depth:
ancestor_stack.pop()
bookmark = {'name': bookmark_name, 'parent': ancestor_stack[-1]}
ancestor_stack.append(bookmark)
result.append(bookmark)
return result
print(get_parents_knowing_depths(flatten_with_depth(nested_list)))
# [{'name': 'act IV', 'parent': None},
# {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}},
# {'name': 'dual seduction', 'parent': {'name': 'scene 4', 'parent': {'name': 'act IV', 'parent': None}}},
# {'name': 'act I', 'parent': None},
# {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}},
# {'name': 'inconstancy praise', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}},
# {'name': 'sganarelle monologue', 'parent': {'name': 'scene 2', 'parent': {'name': 'act I', 'parent': None}}}]
Upvotes: 1