Reputation: 2069
I have a set of documents, where I can search for specific entities, and I need to find the shortest distance between two.
Let's say I have a document where I search for Trump
and Ukraine
, and I get the list of mentions, with their start and end positions:
import re
text = """
Three constitutional scholars invited by Democrats to testify at Wednesday’s impeachment hearings said that President Trump’s efforts to pressure Ukraine for political gain clearly meet the historical definition of impeachable offenses, according to copies of their opening statements.
˜Noah Feldman, a professor at Harvard, argued that attempts by Mr. Trump to withhold a White House meeting and military assistance from Ukraine as leverage for political favors constitute impeachable conduct, as does the act of soliciting foreign assistance on a phone call with Ukraine’s leader.
"""
p1 = re.compile("Trump")
p2 = re.compile("Ukraine")
res1 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p1.finditer(text)]
res2 = [{'name':m.group(), 'start': m.start(), "end":m.end()} for m in p2.finditer(text)]
print(res1)
print(res2)
output:
[{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
In this specific case the answer is 148 - 125 = 23
. How would you recommend to do it in the most pythonic way?
Upvotes: 0
Views: 212
Reputation: 1136
Don't forget to take the absolute value of the distance between two points, otherwise the shortest distance will become negative, which is I assume not what you want:
dict = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}, {'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
shortest = 99999999
start = -1
end = -1
for i in range(len(dict)):
for j in range(len(dict)):
if(i != j):
dist = abs(dict[i]['start'] - dict[j]['end'])
if(dist < shortest):
shortest = dist
start = i
end = j
print("Start: {}, end: {}, distance: {}\n".format(dict[start]['name'], dict[end]['name'], shortest))
Upvotes: 0
Reputation: 26039
Use itertools.product
:
min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0)
Or with latest Python 3.8+ leveraging the walrus operator, I guess you can also do (untested):
min(res for x, y in product(res2, res1) if res := x['start'] - y['end'] > 0)
Code:
from itertools import product
res1 = [{'name': 'Trump', 'start': 120, 'end': 125}, {'name': 'Trump', 'start': 356, 'end': 361}]
res2 =[{'name': 'Ukraine', 'start': 148, 'end': 155}, {'name': 'Ukraine', 'start': 425, 'end': 432}, {'name': 'Ukraine', 'start': 568, 'end': 575}]
print(min(x['start'] - y['end'] for x, y in product(res2, res1) if x['start'] - y['end'] > 0))
# 23
Upvotes: 2
Reputation: 3097
one solutions is to extract the match and find it's length as below
min([len(x) for x in re.findall(r'Trump(.*?)Ukraine', text)])
Here it prints 23
Upvotes: 2