Philipp Chapkovski
Philipp Chapkovski

Reputation: 2069

how to find the shortest distance between two regex

I have a set of documents, where I can search for specific entities, and I need to find the shortest distance between two. Let's say I have a document where I search for Trump and Ukraine, and I get the list of mentions, with their start and end positions:

import re

text = """
 Three constitutional scholars invited by Democrats to testify at Wednesday’s impeachment hearings said that President Trump’s efforts to pressure Ukraine for political gain clearly meet the historical definition of impeachable offenses, according to copies of their opening statements.
 ˜Noah Feldman, a professor at Harvard, argued that attempts by Mr. Trump to withhold a White House meeting and military assistance from Ukraine as leverage for political favors constitute impeachable conduct, as does the act of soliciting foreign assistance on a phone call with Ukraine’s leader.
"""
p1 = re.compile("Trump")
p2 = re.compile("Ukraine")
res1 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p1.finditer(text)]
res2 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p2.finditer(text)]
print(res1)
print(res2)

output:

[{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

In this specific case the answer is 148 - 125 = 23. How would you recommend to do it in the most pythonic way?

Upvotes: 0

Views: 212

Answers (3)

Anteino
Anteino

Reputation: 1136

Don't forget to take the absolute value of the distance between two points, otherwise the shortest distance will become negative, which is I assume not what you want:

dict = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}, {'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

shortest = 99999999
start = -1
end = -1

for i in range(len(dict)):
    for j in range(len(dict)):
        if(i != j):
            dist = abs(dict[i]['start'] - dict[j]['end'])
            if(dist < shortest):
                shortest = dist
                start = i
                end = j

print("Start: {}, end: {}, distance: {}\n".format(dict[start]['name'], dict[end]['name'], shortest))

Upvotes: 0

Austin
Austin

Reputation: 26039

Use itertools.product:

min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0)

Or with latest Python 3.8+ leveraging the walrus operator, I guess you can also do (untested):

min(res for x, y in product(res2, res1) if res := x['start'] - y['end'] > 0)

Code:

from itertools import product

res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 =[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]

print(min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0))
# 23

Upvotes: 2

Prince Francis
Prince Francis

Reputation: 3097

one solutions is to extract the match and find it's length as below

min([len(x) for x in re.findall(r'Trump(.*?)Ukraine', text)])

Here it prints 23

Upvotes: 2

Related Questions