Reputation: 1237
I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the origin city and the destination city (I would like to ultimately have two new columns titled 'origin' and 'destination').
The data:
df_col = [
'new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
This should result in:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
Thus far I have tried:
A variety of NLTK methods, but what has gotten me closest is using the nltk.pos_tag method to tag each word in the string. The result is a list of tuples, each containing a word and its associated tag. Here's an example...
[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]
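For reference, that tagging can be reproduced with something like the snippet below (a sketch; it assumes the NLTK tokenizer and tagger models have been downloaded):
import nltk
nltk.download('punkt')                       # tokenizer model
nltk.download('averaged_perceptron_tagger')  # POS tagger model

sent = 'Fly to Australia & New Zealand from Paris from €422 return including 2 checked bags!'
# Tokenize, then tag each token, giving a list of (word, tag) tuples.
print(nltk.pos_tag(nltk.word_tokenize(sent)))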
I am stuck at this stage and am unsure how to best implement this. Can anyone point me in the right direction, please? Thanks.
Upvotes: 33
Views: 9848
Reputation: 122032
It's the post-COVID / post-ChatGPT era of computing, so let's revisit the comment:
I think you're asking for magic here =)
And to quote Arthur C. Clarke:
Any sufficiently advanced technology is indistinguishable from magic
Let's start by prompting the open-source (though not that transparent) Mistral-7B model.
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

texts = ['new york to venice, italy for usd271',
         'return flights from brussels to bangkok with etihad from €407',
         'from los angeles to guadalajara, mexico for usd191',
         'fly to australia new zealand from paris from €422 return including 2 checked bags']

prompt = "Extract the origin and destination city in the sentence: "

def magic_wand(sent):
    payload = f"{prompt} [INST] {sent} [/INST]"
    encoded = tokenizer(payload, return_tensors="pt", add_special_tokens=False)
    model_inputs = encoded.to(device)
    model.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)
    return decoded

for sent in texts:
    print(magic_wand(sent))
[out]:
['Extract the origin and destination city in the sentence: [INST] new york to venice, italy for usd271 [/INST] The origin city in the sentence is "New York" and the destination city is "Venice, Italy". The cost of the trip is USD 271.</s>']
['Extract the origin and destination city in the sentence: [INST] return flights from brussels to bangkok with etihad from €407 [/INST] The origin city is Brussels and the destination city is Bangkok.</s>']
['Extract the origin and destination city in the sentence: [INST] from los angeles to guadalajara, mexico for usd191 [/INST] The origin city is Los Angeles, and the destination city is Guadalajara, Mexico.</s>']
['Extract the origin and destination city in the sentence: [INST] fly to australia new zealand from paris from €422 return including 2 checked bags [/INST] The destination city is Australia and New Zealand, and the origin city is Paris.</s>']
Note: You'll have to somehow get your hands on a Colab notebook with an A100 instance, which runs at a usage rate of approximately 15.04 compute units per hour, and 500 compute units cost US$49.99. You do the math =)
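If you don't feel like doing the math in your head, a back-of-the-envelope sketch using the rates quoted above:
# Rough Colab cost estimate, using the rates quoted above.
units_per_hour = 15.04      # A100 usage rate, compute units/hour
usd_per_500_units = 49.99   # price of a 500-unit pack

usd_per_hour = units_per_hour * (usd_per_500_units / 500)
hours_per_pack = 500 / units_per_hour
print(f"~US${usd_per_hour:.2f}/hour, ~{hours_per_pack:.0f} hours per pack")
That's roughly US$1.50/hour, or about 33 hours of A100 time per US$49.99 pack.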
A: Okay, fine, let's do some string magic...
def magic_wand(sent):
    payload = f"{prompt} [INST] {sent} [/INST]"
    encoded = tokenizer(payload, return_tensors="pt", add_special_tokens=False)
    model_inputs = encoded.to(device)
    model.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)[0]
    # Strip the echoed prompt so only the model's answer remains.
    return decoded[len(payload):]

for sent in texts:
    print(magic_wand(sent))
[out]:
The origin city is New York and the destination city is Venice, Italy. The cost of the trip is USD271.</s>
The origin city in the sentence is "Brussels" and the destination city is "Bangkok". The airline company is "Etihad" and the price of the flights is "€407".
The origin city is Los Angeles and the destination city is Guadalajara, Mexico. The cost of the trip is USD191.</s>
The origin city is "Paris" and the destination cities are "Australia" and "New Zealand".</s>
A: Really that lazy? Fine, let's do more string munging and get the answer into the right shape.
texts = ['new york to venice, italy for usd271',
         'return flights from brussels to bangkok with etihad from €407',
         'from los angeles to guadalajara, mexico for usd191',
         'fly to australia new zealand from paris from €422 return including 2 checked bags']

prompt = "Extract the origin and destination city in the sentence: "

import re

def magic_wand(sent):
    payload = f"{prompt} [INST] {sent} [/INST]"
    encoded = tokenizer(payload, return_tensors="pt", add_special_tokens=False)
    model_inputs = encoded.to(device)
    model.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)[0]
    answer = decoded[len(payload):].strip()
    # Cross fingers that the model always answers in this exact template.
    expected_pattern = r"The origin city is (.*) and the destination city is (.*?)\.<\/s>"
    matches = list(re.finditer(expected_pattern, answer, re.MULTILINE))
    print(answer)
    return {"from": matches[0].groups()[0], "to": matches[0].groups()[1]}

for sent in texts:
    print(magic_wand(sent))
[out]:
The origin city is "new york" and the destination city is "venice, italy".</s>
{'from': '"new york"', 'to': '"venice, italy"'}
The origin city is Brussels and the destination city is Bangkok. The airline is Etihad. The price of the flights is around 407 euros.</s>
{'from': 'Brussels', 'to': 'Bangkok. The airline is Etihad. The price of the flights is around 407 euros'}
The origin city is Los Angeles, and the destination city is Guadalajara, Mexico.</s>
{'from': 'Los Angeles,', 'to': 'Guadalajara, Mexico'}
The origin city is Paris, and the destination cities are Australia and New Zealand.</s>
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-29-37b2fdd228da> in <cell line: 29>()
28
29 for sent in texts:
---> 30 print(magic_wand(sent))
<ipython-input-29-37b2fdd228da> in magic_wand(sent)
24 matches = list(re.finditer(expected_pattern, answer, re.MULTILINE))
25 print(answer)
---> 26 print(matches[0].groups())
27 return {"from": matches[0].groups()[0], "to": matches[0].groups()[1]}
28
IndexError: list index out of range
Seems like the pattern of the LLM's answer isn't as fixed as we thought it would be...
The origin city is Paris, and the destination cities are Australia and New Zealand.
Originally, the first time we ran:
[in]:
Extract the origin and destination city in the sentence: [INST] new york to venice, italy for usd271 [/INST]
[out]:
The origin city is New York and the destination city is Venice, Italy. The cost of the trip is USD271.
But the second time we ran the function with the regex, it output:
[out]:
The origin city is "new york" and the destination city is "venice, italy".
A: Maybe we can, let's try this:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

import re

texts = [#'new york to venice, italy for usd271',
         'return flights from brussels to bangkok with etihad from €407',
         'from los angeles to guadalajara, mexico for usd191',
         'fly to australia new zealand from paris from €422 return including 2 checked bags']

prompt = "Extract the origin and destination city in the sentence: "

one_shot = f"""{prompt} <s>[INST] new york to venice, italy for usd271 [/INST]("New York", "Venice, Italy")</s>"""

def magic_wand(sent):
    payload = "\n".join([one_shot, f"{prompt} [INST] {sent} [/INST]"])
    encoded = tokenizer(payload, return_tensors="pt", add_special_tokens=False)
    model_inputs = encoded.to(device)
    model.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)[0]
    # The demonstration teaches the model to answer as ("origin", "destination"),
    # so grab the last "(...)"-shaped span in the decoded text.
    return list(re.findall(r'(\(.*?\))', decoded))[-1]

for sent in texts:
    print(magic_wand(sent))
[out]:
("Brussels", "Bangkok")
("Los Angeles", "Guadalajara, Mexico")
("Paris", "Australia, Australia/New Zealand")
A: We took one example and gave it to the model as context, then asked the model to repeat that pattern when answering, i.e. when extracting the origin and destination cities. A.k.a. one-shot/few-shot prompting.
See https://www.promptingguide.ai/techniques/fewshot
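For the record, going from one-shot to few-shot is just stacking more demonstrations before the query. A sketch (the second demonstration recycles one of our own examples, purely for illustration):
few_shot = "\n".join([
    f"""{prompt} <s>[INST] new york to venice, italy for usd271 [/INST]("New York", "Venice, Italy")</s>""",
    f"""{prompt} <s>[INST] from los angeles to guadalajara, mexico for usd191 [/INST]("Los Angeles", "Guadalajara, Mexico")</s>""",
])

def magic_wand_few_shot(sent):
    payload = "\n".join([few_shot, f"{prompt} [INST] {sent} [/INST]"])
    encoded = tokenizer(payload, return_tensors="pt", add_special_tokens=False)
    model_inputs = encoded.to(device)
    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
    decoded = tokenizer.batch_decode(generated_ids)[0]
    # As before, take the last "(...)"-shaped span as the answer to our query.
    return list(re.findall(r'(\(.*?\))', decoded))[-1]
Whether more shots actually cure the brittleness is an empirical question, not a guarantee.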
A: We have to define what "this" is first.
We have solved an efficiency problem in coding: we no longer have to write messy regexes and dictionaries to extract the origin/destination cities.
But we have created new problems...
A: I guess we can call it "productivity improvements" =)
Upvotes: 4
Reputation: 122032
Pretty much impossible at first glance, unless you have access to some API that contains pretty sophisticated components.
At first look, it seems like you're asking to solve a natural language problem magically. But let's break it down and scope it to a point where something is buildable.
First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json
And at the top of the search results, we find https://datahub.io/core/world-cities, which leads to the world-cities.json file. Now we load them into sets of countries and cities.
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
Let's put them together.
import requests
import json
from flashtext import KeywordProcessor
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
Doing due diligence, the first hunch is that "new york" is not in the data:
>>> "New York" in cities
False
What the?! #$%^&* For sanity's sake, we check these:
>>> len(countries)
244
>>> len(cities)
21940
Yes, you cannot just trust a single data source, so let's try to fetch all data sources.
From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you find another link, https://github.com/dr5hn/countries-states-cities-database. Let's munge this...
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries1 = set([city['country'] for city in cities1_json])
cities1 = set([city['name'] for city in cities1_json])
dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"
cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))
countries2 = set([c['name'] for c in countries2_json])
cities2 = set([c['name'] for c in cities2_json])
countries = countries2.union(countries1)
cities = cities2.union(cities1)
>>> len(countries)
282
>>> len(cities)
127793
Wow, that's a lot more cities than previously. Let's try the flashtext code again.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
Okay, for more sanity checks, let's just look for "york" in the list of cities.
>>> [c for c in cities if 'york' in c.lower()]
['Yorklyn',
'West York',
'West New York',
'Yorktown Heights',
'East Riding of Yorkshire',
'Yorke Peninsula',
'Yorke Hill',
'Yorktown',
'Jefferson Valley-Yorktown',
'New York Mills',
'City of York',
'Yorkville',
'Yorkton',
'New York County',
'East York',
'East New York',
'York Castle',
'York County',
'Yorketown',
'New York City',
'York Beach',
'Yorkshire',
'North Yorkshire',
'Yorkeys Knob',
'York',
'York Town',
'York Harbor',
'North York']
You: What kind of prank is this?!
Linguist: Welcome to the world of natural language processing, where natural language is a social construct, subject to communal and idiolectal variation.
You: Cut the crap, tell me how to solve this.
NLP Practitioner (a real one, who works on noisy user-generated texts): You just have to add to the list. But before that, check your metric given the list you already have.
from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
                ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
                ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
                ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()
Actually, it doesn't look that bad. We get an accuracy of 90%:
>>> true_positives / total_truth
0.9
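Pedantic aside: true_positives / total_truth is really recall, not accuracy. Since the loop also tallied false_positives, precision is one line away:
recall = true_positives / total_truth                            # 9 / 10 = 0.9
precision = true_positives / (true_positives + false_positives)  # 9 / 11 ≈ 0.82
print(f"precision={precision:.2f}, recall={recall:.2f}")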
Alright, alright, so looking at the "only" error the above approach makes: it's simply that "New York" isn't in the list of cities.
You: Why don't we just add "New York" to the list of cities, i.e.
keyword_processor.add_keyword('New York')
print(texts[0])
print(keyword_processor.extract_keywords(texts[0]))
[out]:
new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
You: See, I did it!!! Now I deserve a beer.
Linguist: How about 'I live in Marawi'?
>>> keyword_processor.extract_keywords('I live in Marawi')
[]
NLP Practitioner (chiming in): How about 'I live in Jeju'?
>>> keyword_processor.extract_keywords('I live in Jeju')
[]
A Raymond Hettinger fan (from faraway): "There must be a better way!"
Yes, there is. What if we just try something silly, like adding the keywords of cities that end with "City" to our keyword_processor?
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
            print(c[:-5])
Now let's retry our regression test examples:
from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
                ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
                ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
                ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
                ('I live in Florida', ('Florida',)),  # note the trailing comma: ('Florida') is just a string
                ('I live in Marawi', ('Marawi',)),
                ('I live in jeju', ('Jeju',))]

# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0

for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)

    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)

    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()
[out]:
new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
('New York', 'Venice', 'Italy')
return flights from brussels to bangkok with etihad from €407
['Brussels', 'Bangkok']
('Brussels', 'Bangkok')
from los angeles to guadalajara, mexico for usd191
['Los Angeles', 'Guadalajara', 'Mexico']
('Los Angeles', 'Guadalajara')
fly to australia new zealand from paris from €422 return including 2 checked bags
['Australia', 'New Zealand', 'Paris']
('Australia', 'New Zealand', 'Paris')
I live in Florida
['Florida']
('Florida',)
I live in Marawi
['Marawi']
('Marawi',)
I live in jeju
['Jeju']
('Jeju',)
But seriously, this is only the tip of the iceberg. What happens if you have a sentence like this:
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
['Adam', 'Bangkok', 'Singapore', 'China']
WHY is Adam extracted as a city?!
Then you do some more neurotic checks:
>>> 'Adam' in cities
True
Congratulations, you've jumped into another NLP rabbit hole: polysemy, where the same word has different meanings. In this case, Adam most probably refers to a person in the sentence, but it is also, coincidentally, the name of a city (according to the data you've pulled from).
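One cheap, admittedly crude mitigation (a sketch, not a fix): filter extractions against a list of common first names, e.g. NLTK's names corpus. The trade-off is losing any genuine city that happens to share a spelling with a first name.
import nltk
nltk.download('names')  # one-time download of the first-names corpus
from nltk.corpus import names

# A few thousand English first names (male + female).
first_names = set(names.words())

def extract_places(text):
    # Drop any extracted keyword that doubles as a common first name.
    # Crude: this also drops real cities like Adam (or Florence!) when
    # they collide with a person's name, so use with eyes open.
    return [kw for kw in keyword_processor.extract_keywords(text)
            if kw not in first_names]

>>> extract_places('Adam flew to Bangkok from Singapore and then to China')
['Bangkok', 'Singapore', 'China']
With that caveat noted, back to the original question and the desired output: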
[in]:
['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
[out]:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
Linguist: Even with the assumption that the preposition (e.g. from, to) preceding the city gives you the "origin" / "destination" tag, how are you going to handle the case of "multi-leg" flights, e.g.
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
What's the desired output of this sentence:
> Adam flew to Bangkok from Singapore and then to China
Perhaps like this? What is the specification? How (un-)structured is your input text?
> Origin: Singapore
> Destination: Bangkok
> Destination: China
Let's take that assumption you have and try some hacks with the same flashtext methods. What if we add to and from to the keyword list?
from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
         'return flights from brussels to bangkok with etihad from €407',
         'from los angeles to guadalajara, mexico for usd191',
         'fly to australia new zealand from paris from €422 return including 2 checked bags']

for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)
    print(extracted)
    print()
[out]:
new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']
return flights from brussels to bangkok with etihad from €407
['from', 'Brussels', 'to', 'Bangkok', 'from']
from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
fly to australia new zealand from paris from €422 return including 2 checked bags
['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']
Okay, let's work with the above output and see what we can do about problem no. 1, the dangling to/from. Maybe check whether the term after the from is a city; if not, remove the to/from?
from itertools import zip_longest
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))

for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])

keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')

texts = ['new york to venice, italy for usd271',
         'return flights from brussels to bangkok with etihad from €407',
         'from los angeles to guadalajara, mexico for usd191',
         'fly to australia new zealand from paris from €422 return including 2 checked bags']

for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)

    new_extracted = []
    extracted_next = extracted[1:]
    for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
        if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
            print(e_i, e_iplus1)
            continue
        elif e_i == 'from' and e_iplus1 == None: # last word in the list.
            continue
        else:
            new_extracted.append(e_i)

    print(new_extracted)
    print()
That seems to do the trick and removes the from that doesn't precede a city/country.
[out]:
new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']
return flights from brussels to bangkok with etihad from €407
from None
['from', 'Brussels', 'to', 'Bangkok']
from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
fly to australia new zealand from paris from €422 return including 2 checked bags
from None
['to', 'Australia', 'New Zealand', 'from', 'Paris']
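And to finally loop back to the original ask, two new pandas columns, here's a sketch built on a naive assumption: whatever follows from is the origin, whatever follows to is the destination, and sentence-initial places with no preposition (like "new york to ...") count as origins. Multi-leg itineraries still need a real specification.
import pandas as pd

def to_origin_destination(extracted):
    # Naive bucket-switching: 'from' switches to the origin bucket,
    # 'to' switches to the destination bucket; places land in the
    # currently active bucket.
    origin, destination = [], []
    bucket = origin  # places before any preposition count as origin
    for token in extracted:
        if token == 'from':
            bucket = origin
        elif token == 'to':
            bucket = destination
        else:
            bucket.append(token)
    return pd.Series({'origin': ', '.join(origin),
                      'destination': ', '.join(destination)})

df = pd.DataFrame({'text': texts})
df[['origin', 'destination']] = df['text'].apply(
    lambda t: to_origin_destination(keyword_processor.extract_keywords(t)))
print(df[['origin', 'destination']])
On the four texts above, this should give New York → Venice, Italy; Brussels → Bangkok; Los Angeles → Guadalajara, Mexico; and Paris → Australia, New Zealand. But remember Adam, Marawi and Jeju: the moment the text stops cooperating, so does the heuristic.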
Linguist: Think carefully, should ambiguity be resolved by making an informed decision to make the ambiguous phrase obvious? If so, what is the "information" in the informed decision? Should it follow a certain template first to detect the information before filling in the ambiguity?
You: I'm losing my patience with you... You're leading me around in circles and circles; where's that AI that can understand human language that I keep hearing about in the news and from Google and Facebook and all?!
You: What you gave me is rule-based, so where's the AI in all this?
NLP Practitioner: Didn't you want 100%? Writing "business logic" or rule-based systems is the only way to really achieve that "100%" on a specific data set, without any preset data set that one can use for "training an AI".
You: What do you mean by training an AI? Why can't I just use Google or Facebook or Amazon or Microsoft or even IBM's AI?
NLP Practitioner: Let me introduce you to...
Welcome to the world of Computational Linguistics and NLP!
Yes, there's no real ready-made magical solution, and if you want to use an "AI" or machine learning algorithm, most probably you would need a lot more training data, like the texts_labels pairs shown in the example above.
Upvotes: 159