Jake

Reputation: 809

Bulk replace with regular expressions in Python

For a Django application, I need to turn all occurrences of a pattern in a string into a link if I have the resource related to the match in my database.

Right now, here's the process:

  • I use re.sub to process a very long string of text.
  • When re.sub finds a pattern match, it runs a function that looks up whether that pattern matches an entry in the database.
  • If there is a match, it wraps a link around the matched text.

The problem is that there are sometimes hundreds of hits on the database. What I'd like to be able to do is a single bulk query to the database.

So: can you do a bulk find and replace using regular expressions in Python?

For reference, here's the code (for the curious, the patterns I'm looking up are for legal citations):

import re

def add_linked_citations(text):
    linked_text = re.sub(r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})', create_citation_link, text)
    return linked_text

def create_citation_link(match_object):
    volume = match_object.group("volume")
    reporter = match_object.group("reporter")
    page = match_object.group("page")

    if volume and reporter and page: # These should all be here...
        # !!! Here's where I keep hitting the database
        citations = Citation.objects.filter(volume=volume, reporter=reporter, page=page)
        if citations.exists():
            citation = citations[0]
            document = citation.document
            url = document.url()
            return '<a href="%s">%s %s %s</a>' % (url, volume, reporter, page)
        else:
            return '%s %s %s' % (volume, reporter, page)
    return match_object.group(0) # re.sub needs a string back, never None

Upvotes: 1

Views: 750

Answers (2)

MattH

Reputation: 38247

You can do it with a single regexp pass, by using finditer, which returns match objects.

Each match object has:

  • a method returning a dict of the named groups, groupdict()
  • the start and the end positions of the match in the original text, span()
  • the original matching text, group()
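A quick illustration of those three methods on a toy pattern (the citation strings and the simplified regex here are invented for the example):

```python
import re

pattern = r'(?P<volume>[0-9]+)\s+(?P<reporter>[A-Z][a-zA-Z\.]+?)\s+(?P<page>[0-9]+)'
text = "See 410 U.S. 113 and 347 U.S. 483 for details."

for m in re.finditer(pattern, text):
    print(m.groupdict())  # e.g. {'volume': '410', 'reporter': 'U.S.', 'page': '113'}
    print(m.span())       # (start, end) offsets of the match within text
    print(m.group())      # the full matched text, e.g. '410 U.S. 113'
```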

So I would suggest that you:

  • Make a list of all the matches in your text using finditer
  • Make a list of all the unique volume, reporter, page triplets in the matches
  • Lookup those triplets
  • Correlate each match object with the result of the triplet lookup if found
  • Process the original text, splitting by the match spans and interpolating lookup results.

I've implemented the database lookup by OR-ing together a list of Q objects: Q(volume=foo1, reporter=bar1, page=baz1) | Q(volume=foo2, reporter=bar2, page=baz2) | .... There may be more efficient approaches.

Here's an untested implementation:

import re
from functools import reduce  # reduce is not a builtin in Python 3
from collections import namedtuple

from django.db.models import Q

Triplet = namedtuple('Triplet',['volume','reporter','page'])

def lookup_references(matches):
  match_to_triplet = {}
  triplet_to_url = {}
  for m in matches:
    group_dict = m.groupdict()
    if any(not x for x in group_dict.values()): # Filter out matches we don't want to look up
      continue
    match_to_triplet[m] = Triplet(**group_dict)
  # Build query
  unique_triplets = set(match_to_triplet.values())
  if not unique_triplets: # reduce() would fail on an empty list
    return {}
  # List of Q objects
  q_list = [Q(**trip._asdict()) for trip in unique_triplets]
  # Consolidated Q
  single_q = reduce(Q.__or__, q_list)
  for row in Citation.objects.filter(single_q).values('volume', 'reporter', 'page', 'url'):
    url = row.pop('url')
    triplet_to_url[Triplet(**row)] = url
  # Now pair original match objects with URL where found
  lookups = {}
  for match, triplet in match_to_triplet.items():
    if triplet in triplet_to_url:
      lookups[match] = triplet_to_url[triplet]
  return lookups

def interpolate_citation_matches(text, matches, lookups):
  result = []
  prev = 0
  for m in matches:
    m_start, m_end = m.span()
    if prev != m_start:
      result.append(text[prev:m_start])
    # Now check match
    if m in lookups:
      result.append('<a href="%s">%s</a>' % (lookups[m], m.group()))
    else:
      result.append(m.group())
    prev = m_end # advance past this match so text isn't duplicated
  result.append(text[prev:]) # trailing text after the last match
  return ''.join(result)

def process_citations(text):
  citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3})'
  matches = list(re.finditer(citation_regex, text))
  lookups = lookup_references(matches)
  new_text = interpolate_citation_matches(text, matches, lookups)
  return new_text

Upvotes: 1

andrew cooke

Reputation: 46882

Sorry if this is obvious and wrong (that no-one has suggested it in 4 hours is worrying!), but why not search for all matches, do a batch query for everything (easy once you have all matches), and then call sub with the dictionary of results (so the function pulls the data from the dict)?

You have to run the regexp twice, but it seems like the database access is the expensive part anyway.
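This two-pass idea can be sketched without the ORM by passing in a bulk-lookup callable; the `bulk_lookup` parameter here is a stand-in (an assumption for the sketch) for a single `Citation.objects.filter(...)` query that returns a dict mapping each found (volume, reporter, page) triplet to a URL:

```python
import re

CITATION_RE = re.compile(
    r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+'
    r'(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+'
    r'(?P<page>[0-9]+[a-zA-Z]{0,3})'
)

def add_linked_citations(text, bulk_lookup):
    # Pass 1: collect every unique (volume, reporter, page) triplet.
    triplets = {m.group('volume', 'reporter', 'page')
                for m in CITATION_RE.finditer(text)}
    urls = bulk_lookup(triplets)  # one database round-trip, not one per match

    # Pass 2: re.sub's callback pulls from the dict, never the database.
    def link(m):
        key = m.group('volume', 'reporter', 'page')
        if key in urls:
            return '<a href="%s">%s</a>' % (urls[key], m.group())
        return m.group()

    return CITATION_RE.sub(link, text)
```

Matches whose triplet isn't in the dict are left untouched, so unknown citations pass through verbatim.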

Upvotes: 1
