Reputation: 2366
So I may have strings like 'Bank of China', 'Embassy of China', and 'International China'.
I want to remove all country instances except when they are preceded by 'of ' or 'of the '.
Clearly this can be done by iterating through a list of countries, checking if the name contains a country, then checking whether 'of ' or 'of the ' appears before the country.
If it does, we do not remove the country; otherwise we do. The examples will become:
'Bank of China', or 'Embassy of China', and 'International'
However, iteration can be slow, particularly when you have a large list of countries and a large list of texts to process.
Is there a faster, more conditional way of replacing the string, so that I can still use a simple pattern match with the Python re library?
My function is along these lines:
def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name = re.sub(country + '$', '', name).strip()
                return name
    return name
EDIT: I did find some info here. This does describe how to do an if, but I really want an "if not 'of ' and not 'of the ', then replace"...
Upvotes: 2
Views: 211
Reputation: 101969
The re.sub function accepts a function as the replacement; it is called for each match to get the text that should be substituted in. So you could do this:
import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()
regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'
The result might contain some spurious space (in the above case a final strip() is needed). You can fix this by modifying the regex to:
\s*(of(\sthe)?\s)?(?P<state>({}))
to catch the spaces before of or before the country name and avoid the bad spacing in the output.
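For example, a minimal sketch of that change, reusing remove_name from above (only the pattern differs):
def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    # \s* also swallows the whitespace in front of a bare country name
    return re.compile(r'\s*(of(\sthe)?\s)?(?P<state>({}))'.format(states))
regex = make_regex(['China', 'Italy'])
regex.sub(remove_name, 'Embassy of China, International Italy')
# result: 'Embassy of China, International' (no trailing strip() needed)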
Note that this solution can handle a whole text, not just text of the form Something of Country and Something Country. For example:
In [38]: regex = make_regex(['China'])
...: text = '''This is more complex than just "Embassy of China" and "International China"'''
In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'
Another example usage:
In [33]: countries = [
...: 'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
...: 'France', 'Italy', 'Australia', 'New Zealand', 'Brazil',
...: 'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
...: 'Spain', 'Portugal', 'Argentina', 'San Marino'
...: ]
In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'
In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)
In [36]: regex = make_regex(countries)
...: result = regex.sub(remove_name, text)
In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'
Upvotes: 0
Reputation: 1072
Not tested:
def removeCountry(name):
    for country in countries:
        # two fixed-width lookbehinds; Python's re rejects the variable-width '(?<!of (the )?)'
        name = re.sub(r'(?<!of )(?<!of the )' + re.escape(country) + r'$', '', name).strip()
    return name
Using negative lookbehinds, re.sub only matches and replaces when the country is not preceded by 'of ' or 'of the '.
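For example, assuming countries = ['China'] and the function above, the behaviour would be:
countries = ['China']
print(removeCountry('Bank of China'))        # Bank of China
print(removeCountry('Embassy of China'))     # Embassy of China
print(removeCountry('International China'))  # International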
Upvotes: 0
Reputation: 3929
You could compile a few sets of regular expressions, then pass your list of inputs through them. Something like:
import re
countries = ['foo', 'bar', 'baz']
takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]
def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))
''' Output:
the bank of foo
the bank of the baz
the nation
'''
It doesn't look like anything faster than linear time complexity is possible here. At least you can avoid recompiling the regular expressions a million times and improve the constant factor.
Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.
Upvotes: 1
Reputation: 56654
I think you could use the approach in "Python: how to determine if a list of words exist in a string" to find any countries mentioned, then do further processing from there.
Something like
countries = [
"Afghanistan",
"Albania",
"Algeria",
"Andorra",
"Angola",
"Anguilla",
"Antigua",
"Arabia",
"Argentina",
"Armenia",
"Aruba",
"Australia",
"Austria",
"Azerbaijan",
"Bahamas",
"Bahrain",
"China",
"Russia"
# etc
]
def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string
get_countries = find_words_from_set_in_string(countries)
then
get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")
returns
set(['Argentina', 'China', 'Russia'])
... which obviously needs more post-processing, but very quickly tells you exactly what you need to look for.
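One possible way to do that post-processing, reusing get_countries from above, could be (a rough sketch; the 'of '/'of the ' rule comes from the question):
import re
def remove_countries(text):
    # drop each detected country unless it is preceded by 'of ' or 'of the '
    for country in get_countries(text):
        text = re.sub(r'(?<!of )(?<!of the )' + re.escape(country), '', text)
    # collapse any doubled spaces left behind by the removal
    return re.sub(r'\s{2,}', ' ', text).strip()
remove_countries("Bank of China and International China")
# returns 'Bank of China and International'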
As pointed out in the linked article, you must be wary of words ending in punctuation. str.split only accepts a single separator string, so this is better handled by splitting on a character class with re.split, e.g. [ \t\r\n,.!?;:'\"]. You may also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.
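For example, a small sketch of such a split using re.split:
import re
text = 'The Embassy of China, next to "International Russia".'
# split on runs of whitespace or punctuation and drop empty tokens
words = [w for w in re.split(r'[\s,.!?;:\'"]+', text) if w]
# ['The', 'Embassy', 'of', 'China', 'next', 'to', 'International', 'Russia']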
Upvotes: 1