Reputation: 809
For a Django application, I need to turn all occurrences of a pattern in a string into a link if I have the resource related to the match in my database.
Right now, here's the process: - I use re.sub to process a very long string of text - When re.sub finds a pattern match, it runs a function that looks up whether that pattern matches an entry in the database - If there is a match, it wraps the link wraps a link around the match.
The problem is that there are sometimes hundreds of hits on the database. What I'd like to be able to do is a single bulk query to the database.
So: can you do a bulk find and replace using regular expressions in Python?
For reference, here's the code (for the curious, the patterns I'm looking up are for legal citations):
def add_linked_citations(text):
linked_text = re.sub(r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3}))', create_citation_link, text)
return linked_text
def create_citation_link(match_object):
volume = None
reporter = None
page = None
if match_object.group("volume") not in [None, '']:
volume = match_object.group("volume")
if match_object.group("reporter") not in [None, '']:
reporter = match_object.group("reporter")
if match_object.group("page") not in [None, '']:
page = match_object.group("page")
if volume and reporter and page: # These should all be here...
# !!! Here's where I keep hitting the database
citations = Citation.objects.filter(volume=volume, reporter=reporter, page=page)
if citations.exists():
citation = citations[0]
document = citation.document
url = document.url()
return '<a href="%s">%s %s %s</a>' % (url, volume, reporter, page)
else:
return '%s %s %s' % (volume, reporter, page)
Upvotes: 1
Views: 750
Reputation: 38247
You can do it with a single regexp pass, by using finditer
which returns match objects.
The match object have:
groupdict()
span()
group()
So I would suggest that you:
finditer
I've implemented the database lookup by combining a list of Q(volume=foo1,reporter=bar2,page=baz3)|Q(volume=foo1,reporter=bar2,page=baz3)...
. There maybe be more efficient approaches.
Here's an untested implementation:
from django.db.models import Q
from collections import namedtuple
Triplet = namedtuple('Triplet',['volume','reporter','page'])
def lookup_references(matches):
match_to_triplet = {}
triplet_to_url = {}
for m in matches:
group_dict = m.groupdict()
if any(not(x) for x in group_dict.values()): # Filter out matches we don't want to lookup
continue
match_to_triplet[m] = Triplet(**group_dict)
# Build query
unique_triplets = set(match_to_triplet.values())
# List of Q objects
q_list = [Q(**trip._asdict()) for trip in unique_triplets]
# Consolidated Q
single_q = reduce(Q.__or__,q_list)
for row in Citations.objects.filter(single_q).values('volume','reporter','page','url'):
url = row.pop('url')
triplet_to_url[Triplet(**row)] = url
# Now pair original match objects with URL where found
lookups = {}
for match, triplet in match_to_triplet.items():
if triplet in triplet_to_url:
lookups[match] = triplet_to_url[triplet]
return lookups
def interpolate_citation_matches(text,matches,lookups):
result = []
prev = m_start = 0
last = m_end = len(text)
for m in matches:
m_start, m_end = m.span()
if prev != m_start:
result.append(text[prev:m_start])
# Now check match
if m in lookups:
result.append('<a href="%s">%s</a>' % (lookups[m],m.group()))
else:
result.append(m.group())
if m_end != last:
result.append(text[m_end:last])
return ''.join(result)
def process_citations(text):
citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3}))'
matches = list(re.finditer(citation_regex,text))
lookups = lookup_references(matches)
new_text = interpolate_citation_matches(text,matches,lookups)
return new_text
Upvotes: 1
Reputation: 46882
Sorry if this is obvious and wrong (that no-one has suggested it in 4 hours is worrying!), but why not search for all matches, do a batch query for everything (easy once you have all matches), and then call sub with the dictionary of results (so the function pulls the data from the dict)?
You have to run the regexp twice, but it seems like the database access is the expensive part anyway.
Upvotes: 1