Reputation: 475
Could someone recommend the best data structure for the FinalResults
described below:
I'm extracting various pieces of information from XML documents. Roughly, here's what I do: First use find_all to locate the text
elements that contain a keyword. Then for each result:
text
elementtext
element for additional text using regex. This last search yields a result with up to 6 match groups.
This whole operation could end up returning something like this:
FinalResult 1: [parent, parent-attr, match.group(1), match.group(2) ... ,match.group(6)]
FinalResult 2: [parent, parent-attr, match.group(1), match.group(2) ... ,match.group(6)]
There is no maximum number of FinalResults
that I might get. But on average I expect fewer than 10 from each XML doc. I plan to use each FinalResult
for other processing but won't be changing or adding anything in the FinalResults
. For example I might say: go back to the <parent>
with attribute XYZ and get other data, then go get a file by the name of match.group(2)
from elsewhere.
I'll probably be accessing each FinalResult only a few times. If it matters, some of the match.groups could be "None"
Here's an example. Assume this is FinalResult[0]: ['paragraph', '39871234', '42', '103', 'b', '1', None, None]
Paragraph would be the parent tag of the tag containing the keywords I found. 39871234 would be the id attribute of the paragraph tag 42 indicates a volume number 103 is a section in that volume b and 1 are subdivisions of that section
I would use 42/103/b/1 to extract info from another xml file.
Paragraph and the id would be used in case I need to tell one keyword search result from another because the file will have multiple text elements. (Ex. Paragraph id=39871234
text
[string containing keyword]
)
My question is should I store all the FinalResults as a dictionary, a list, a tuple, or something else?
Upvotes: 0
Views: 197
Reputation: 189648
A real data structure recommendation question would have some actual requirements for what the data structure should do, or help you achieve. In the absence of any such information in your question, I guess the simple and straightforward answer you are looking for is this:
In any modern object-oriented language, the standard way to represent a collection of related attributes is to create a simple class with getter and maybe setter methods (except if the objects are immutable after creation, in which the only way to set an attribute is when you first instantiate its containing object).
Your example suggests a class with attribute()
, parent_attribute()
, and matches()
methods, where the first two would apparently return simple strings, and the last, a list of strings. Your main program would probably have one or more lists of these objects, or perhaps a dictionary keyed on a feature you want to use for accessing previous objects (the identifying attribute?)
class Match (object):
def __init__ (self, attrib, parent_attr, matches):
self.attrib, self.parent_attr, self.matches = attrib, parent_attr, matches
def attribute (self):
return self.attrib
def parent_attribute (self):
return self.parent_attr
def matches (self):
return self.matches
The benefit over a list should be immediately obvious: instead of match[0]
your code says match.attribute()
which immediately conveys what is going on.
The benefit over a dict is less obvious, but is frequent enough in practice that you want to be prepared for it: When you want to refactor the code, changing the class implementation is a much simpler task than changing every place your code uses one of these instances.
So for example, if for some odd reason you realize you wanted to use a list after using the class for a while, you would only have to change the initialization code and the getters, rather than every piece of code which manipulates these instances; and the details of the list implementation behind the scenes would be completely transparent to any code using this class.
There are many additional benefits of modular design; find a good intro to OOP if you want more details.
If the performance of this design is not satisfactory, a new question with some actual requirements (speed, memory, etc) might be the way to go.
Upvotes: 1