Reputation: 84465
Pretty sure there must be an answer for this on SO but my google fu is failing me.
I have a js file which contains a javascript array of dictionaries which starts as:
var a = t.locales = [{
countryCode: "AF",
countryName: "Afghanistan"
}, {
countryCode: "AL",
countryName: "Albania"
},
In the actual response there are no spaces (unlike the layout shown above), so the part of the script with the country names is a longer version of the following:
[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"},{countryCode:"DZ",countryName:"Algeria"},{countryCode:"AS",countryName:"American Samoa"},{countryCode:"AD",countryName:"Andorra"},{countryCode:"AO",countryName:"Angola"},{countryCode:"AI",countryName:"Anguilla"},{countryCode:"AG",countryName:"Antigua & Barbuda"},{countryCode:"AR",countryName:"Argentina"},{countryCode:"AM",countryName:"Armenia"},{countryCode:"AW",countryName:"Aruba"},{countryCode:"AU",countryName:"Australia"},{countryCode:"AT",countryName:"Austria"},{countryCode:"AZ",countryName:"Azerbaijan"},{countryCode:"BS",countryName:"Bahamas"},{countryCode:"BH",countryName:"Bahrain"},{countryCode:"BD",countryName:"Bangladesh"},{countryCode:"BB",countryName:"Barbados"},{countryCode:"BY",countryName:"Belarus"},{countryCode:"BE",countryName:"Belgium"},{countryCode:"BZ",countryName:"Belize"},{countryCode:"BJ",countryName:"Benin"},{countryCode:"BM",countryName:"Bermuda"},{countryCode:"BT",countryName:"Bhutan"},{countryCode:"BO",countryName:"Bolivia"},{countryCode:"BQ",countryName:"Bonaire"},{countryCode:"BA",countryName:"Bosnia & Herzegovina"},{countryCode:"BW",countryName:"Botswana"}]
I want to regex out the country names e.g. 'Afghanistan','Albania'...... I can't seem to write the regex pattern which will return me a list of matches rather than one big long match.
For example,
countryName:"(.*)"
This returns a single greedy match rather than a list of the individual countries.
I appreciate this is probably a very simple thing, but all the different regexes I have tried fail, e.g. p = re.compile(r'(?<=countryCode:")(.*)[^"]'). Can anyone provide an appropriate regex with an explanation?
N.B. This is specifically a "how do I do this with regex" question rather than a question about whether regex is the right tool for the job.
Essentially, I think I need to define a pattern that stops before the " after the country name each time (rather than at the " after the last country name, for example, or much further on in some cases).
The expected result is a list of countries from that object, e.g.
['Afghanistan','Albania',.....]
Python:
import re, requests
r = requests.get('https://www.nexmo.com/static/bundle.js')
p = re.compile(r'(?<=countryCode:")(.*)[^"]')
countries = p.findall(r.text)
print(countries)
Upvotes: 0
Views: 63
Reputation: 52009
Use a non-greedy version of your first variant:
p = re.compile(r'countryName:"(.*?)"')
countries = p.findall(text)
The problem with using a greedy match like "(.*)" is that it will match up until the end of the last ".
{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}
                  ^ match
                               ^ capture start
                                          ^ still matches .*
                                                                                   ^ final match of "
However, you want it to end on the smallest match, which is expressed by a non-greedy match:
{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}
                  ^ match
                               ^ capture start
                                          ^ first match of "
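To see the difference side by side, here is a minimal sketch that runs both variants on a shortened copy of the minified sample from the question (the variable names text, greedy and lazy are only for illustration):

```python
import re

# Shortened sample in the minified form shown in the question.
text = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}]'

# Greedy: .* runs to the end of the line, then backtracks to the last ".
greedy = re.findall(r'countryName:"(.*)"', text)

# Non-greedy: .*? stops at the first " after each countryName.
lazy = re.findall(r'countryName:"(.*?)"', text)

print(len(greedy))  # one long match
print(lazy)         # one entry per country
```

The greedy variant gives a single string that runs from Afghanistan through to Albania, while the non-greedy variant gives one entry per country.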
Upvotes: 1
Reputation: 18357
You need to change your regex to (?<=countryName: ")[^"]+ instead of your current one. Your current one uses .*, which is greedy and therefore matches as much as it possibly can, which is what is happening in your case. Note that this pattern expects a space after the colon, as in the formatted sample; for the minified response, drop that space.
Try this Python code:
import re
s = '''[{
countryCode: "AF",
countryName: "Afghanistan"
}, {
countryCode: "AL",
countryName: "Albania"
},'''
print(re.findall(r'(?<=countryName: ")[^"]+', s))
Prints:
['Afghanistan', 'Albania']
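If you do not know whether the text has a space after the colon, one option is a capture group instead of a lookbehind, because Python's re module only accepts fixed-width lookbehinds and so cannot take \s* inside one. The following is a sketch under that assumption; the names pretty, minified and pattern are made up for the example:

```python
import re

# Formatted sample, as in the code above.
pretty = '''[{
countryCode: "AF",
countryName: "Afghanistan"
}, {
countryCode: "AL",
countryName: "Albania"
},'''

# Minified sample, as in the question.
minified = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}]'

# \s* allows zero or more whitespace characters after the colon;
# [^"]+ matches one or more characters that are not a quote, so it cannot run past the closing ".
pattern = re.compile(r'countryName:\s*"([^"]+)"')

pretty_names = pattern.findall(pretty)
minified_names = pattern.findall(minified)
print(pretty_names)
print(minified_names)
```

Both calls should give the same two names, since the negated character class stops at the first closing quote regardless of layout.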
Upvotes: 1
Reputation: 82785
Use the pattern r'countryName:\"(.*?)\"'
Example:
import re
data = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"},{countryCode:"DZ",countryName:"Algeria"},{countryCode:"AS",countryName:"American Samoa"},{countryCode:"AD",countryName:"Andorra"},{countryCode:"AO",countryName:"Angola"},{countryCode:"AI",countryName:"Anguilla"},{countryCode:"AG",countryName:"Antigua & Barbuda"},{countryCode:"AR",countryName:"Argentina"},{countryCode:"AM",countryName:"Armenia"},{countryCode:"AW",countryName:"Aruba"},{countryCode:"AU",countryName:"Australia"},{countryCode:"AT",countryName:"Austria"},{countryCode:"AZ",countryName:"Azerbaijan"},{countryCode:"BS",countryName:"Bahamas"},{countryCode:"BH",countryName:"Bahrain"},{countryCode:"BD",countryName:"Bangladesh"},{countryCode:"BB",countryName:"Barbados"},{countryCode:"BY",countryName:"Belarus"},{countryCode:"BE",countryName:"Belgium"},{countryCode:"BZ",countryName:"Belize"},{countryCode:"BJ",countryName:"Benin"},{countryCode:"BM",countryName:"Bermuda"},{countryCode:"BT",countryName:"Bhutan"},{countryCode:"BO",countryName:"Bolivia"},{countryCode:"BQ",countryName:"Bonaire"},{countryCode:"BA",countryName:"Bosnia & Herzegovina"},{countryCode:"BW",countryName:"Botswana"}]'
countries = re.findall(r'countryName:\"(.*?)\"', data)
print(countries)
Output:
['Afghanistan',
'Albania',
'Algeria',
'American Samoa',
'Andorra',
'Angola',
'Anguilla',
'Antigua & Barbuda',
'Argentina',
'Armenia',
'Aruba',
'Australia',
'Austria',
'Azerbaijan',
'Bahamas',
'Bahrain',
'Bangladesh',
'Barbados',
'Belarus',
'Belgium',
'Belize',
'Benin',
'Bermuda',
'Bhutan',
'Bolivia',
'Bonaire',
'Bosnia & Herzegovina',
'Botswana']
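If you also want the codes, the same non-greedy idea extends to two groups. This is only a sketch: it assumes every entry has countryCode immediately followed by countryName, as in the sample, and the group names code and name are my own choice:

```python
import re

# Shortened sample in the same minified form as above.
data = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"},{countryCode:"AG",countryName:"Antigua & Barbuda"}]'

# Two non-greedy named groups: one for the code, one for the name.
pattern = re.compile(r'countryCode:"(?P<code>.*?)",countryName:"(?P<name>.*?)"')

# Build a dict mapping each country code to its name.
names_by_code = {m.group('code'): m.group('name') for m in pattern.finditer(data)}
print(names_by_code)
```

Each non-greedy group stops at the first quote that lets the rest of the pattern match, so the pairs stay aligned entry by entry.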
Upvotes: 1