Reputation: 1214
I have a Python dictionary that I ultimately want to insert into a MySQL database. I'm parsing data from a list called "entries", which looks like this (each # stands for a digit):
entries = [ "['data'] runtime: ###, scan: ###",
"['data'] ctime: ###, scan: ###",
"['data'] runtime: ###", ... ]
Each quoted string is a separate entry. I use regular expressions to extract the runtimes, ctimes, and scans associated with each entry, like so:
import re
terms = (["runtime", r"runtime\s?:\s?(\d+)"],
         ["ctime", r"ctime\s?:\s?(\d+)"],
         ["scan", r"scan\s?:\s?(\d+)"])
d = {}
for i in range(len(terms)):
    def getTerm(term, entries):
        pattern = re.compile(term)
        output = pattern.findall(str(entries))
        return output
    d[terms[i][0]] = getTerm(terms[i][1], entries)
This works. However, as you can see, not all of the entries have a runtime, ctime, and scan. If a value doesn't appear in an entry, I want it to be entered into my dictionary as [] or NULL (or None), so that the element at a given index under every key belongs to the same entry. I want my dictionary to look like this:
d = {'ctime': [None, '###', None], 'runtime': ['###', None, '###'], 'scan': ['###', '###', None]}
How do I do this?
Upvotes: 1
Views: 204
Reputation: 3493
If entries is a list of strings that may or may not contain the keywords, and order is important, then we'll need to iterate over the entries:
First option:
import re

entries = ["['data'] runtime: ###, scan: ###",
           "['data'] ctime: ###, scan: ###",
           "['data'] runtime: ###"]

allterms = (["runtime", r"runtime\s?:\s?([a-zA-Z0-9_#]*)"],
            ["ctime", r"ctime\s?:\s?([a-zA-Z0-9_#]*)"],
            ["scan", r"scan\s?:\s?([a-zA-Z0-9_#]*)"])
terms = [name for name, _ in allterms]
patterns = [pattern for _, pattern in allterms]

def get_terms(entry):
    # Append exactly one value per term so every list stays aligned with entries.
    for i in range(len(terms)):
        term = re.search(patterns[i], entry)
        term = term.groups()[0] if term else None
        d[terms[i]] += [term]

# Key the dict by term name; the [name, pattern] lists in allterms are unhashable.
d = {t: [] for t in terms}
for entry in entries:
    get_terms(entry)
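Second option, using a thread pool:
# pip install futures  # if using Python 2
from concurrent.futures import ThreadPoolExecutor

def extract(entry):
    # Return the values instead of mutating d from worker threads.
    return [m.group(1) if m else None for m in (re.search(p, entry) for p in patterns)]

d = {t: [] for t in terms}
with ThreadPoolExecutor() as executor:
    # map() yields results in input order, keeping the lists aligned with entries.
    for values in executor.map(extract, entries):
        for t, value in zip(terms, values):
            d[t].append(value)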
Edit: Solution developed in chat collab with @Wynne :)
Upvotes: 1
Reputation: 2525
re.findall() returns an empty list ([]) when no match is found, so you don't need a separate fallback for the empty case. If you want None when no term is found, as Brennan said, use findall(string) or None.
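A minimal sketch of that behaviour. The sample entry and the variable names are made up for illustration and are not data from the question:

```python
import re

# Hypothetical sample entry; real entries come from the question's data.
entry = "['data'] runtime: 12, scan: 7"

runtime_pattern = re.compile(r"runtime\s?:\s?(\d+)")
ctime_pattern = re.compile(r"ctime\s?:\s?(\d+)")

# findall returns a list of the captured groups.
found = runtime_pattern.findall(entry)

# When nothing matches, the result is an empty list rather than None.
missing = ctime_pattern.findall(entry)

# An empty list is falsy, so `or None` swaps it for None.
missing_or_none = ctime_pattern.findall(entry) or None
```

Note that a successful match still yields a list, so a found value is stored as a one-element list rather than a bare string.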
Consider using list comprehension to loop over all your entries, and dict comprehension to apply your regex patterns over the same entry and save the result in a dict.
import re
terms = (["runtime", re.compile(r"runtime\s?:\s?(\d+)")],
         ["ctime", re.compile(r"ctime\s?:\s?(\d+)")],
         ["scan", re.compile(r"scan\s?:\s?(\d+)")])
results = [{term: pattern.findall(entry) or None for term, pattern in terms} for entry in entries]
Now you have something like this (findall returns a list of matches, so each found value is a list):
[{"runtime": ['###'], "ctime": None, "scan": ['###']}, {"runtime": None, "ctime": ['###'], "scan": ['###']}, {"runtime": ['###'], "ctime": None, "scan": None}, ...]
The comprehension above is equivalent to (and usually slightly faster than) the following loop:
results = []
for entry in entries:
    entry_dict = {}
    for term, regex_pattern in terms:
        entry_dict[term] = regex_pattern.findall(entry) or None
    results.append(entry_dict)
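If the end goal is the column-oriented dictionary from the question (one list per term, aligned by entry index), the same idea can be written with the two loops swapped. This is only a sketch under assumptions: the sample entries are invented, first_match is a hypothetical helper, and it uses search instead of findall so that each slot holds a single string or None.

```python
import re

# Hypothetical sample data standing in for the question's entries.
entries = ["['data'] runtime: 12, scan: 7",
           "['data'] ctime: 30, scan: 8",
           "['data'] runtime: 15"]

terms = (("runtime", re.compile(r"runtime\s?:\s?(\d+)")),
         ("ctime", re.compile(r"ctime\s?:\s?(\d+)")),
         ("scan", re.compile(r"scan\s?:\s?(\d+)")))

def first_match(pattern, entry):
    """Return the first captured group, or None if the pattern is absent."""
    match = pattern.search(entry)
    return match.group(1) if match else None

# One list per term; position i in every list belongs to entries[i].
d = {term: [first_match(pattern, entry) for entry in entries]
     for term, pattern in terms}
```

Because every term contributes exactly one value per entry, d["runtime"][i], d["ctime"][i], and d["scan"][i] all describe entries[i].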
Upvotes: 0