QHarr

Reputation: 84465

Return individual matches not one long match regex

Pretty sure there must be an answer for this on SO but my google fu is failing me.

I have a js file which contains a javascript array of dictionaries which starts as:

var a = t.locales = [{
        countryCode: "AF",
        countryName: "Afghanistan"
    }, {
        countryCode: "AL",
        countryName: "Albania"
    },

In the actual response there is no whitespace (unlike the layout shown above), so the part of the script with the country names is a long version of the following:

[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"},{countryCode:"DZ",countryName:"Algeria"},{countryCode:"AS",countryName:"American Samoa"},{countryCode:"AD",countryName:"Andorra"},{countryCode:"AO",countryName:"Angola"},{countryCode:"AI",countryName:"Anguilla"},{countryCode:"AG",countryName:"Antigua & Barbuda"},{countryCode:"AR",countryName:"Argentina"},{countryCode:"AM",countryName:"Armenia"},{countryCode:"AW",countryName:"Aruba"},{countryCode:"AU",countryName:"Australia"},{countryCode:"AT",countryName:"Austria"},{countryCode:"AZ",countryName:"Azerbaijan"},{countryCode:"BS",countryName:"Bahamas"},{countryCode:"BH",countryName:"Bahrain"},{countryCode:"BD",countryName:"Bangladesh"},{countryCode:"BB",countryName:"Barbados"},{countryCode:"BY",countryName:"Belarus"},{countryCode:"BE",countryName:"Belgium"},{countryCode:"BZ",countryName:"Belize"},{countryCode:"BJ",countryName:"Benin"},{countryCode:"BM",countryName:"Bermuda"},{countryCode:"BT",countryName:"Bhutan"},{countryCode:"BO",countryName:"Bolivia"},{countryCode:"BQ",countryName:"Bonaire"},{countryCode:"BA",countryName:"Bosnia & Herzegovina"},{countryCode:"BW",countryName:"Botswana"}]

I want to regex out the country names e.g. 'Afghanistan','Albania'...... I can't seem to write the regex pattern which will return me a list of matches rather than one big long match.

For example,

countryName:"(.*)"

This returns a single greedy match rather than a list of the individual countries.

I appreciate this is probably a very simple thing, but all the different regexes I have tried fail, e.g. p = re.compile(r'(?<=countryCode:")(.*)[^"]'). Can anyone provide an appropriate regex with an explanation?

N.B. This is specifically a "how do I do this with regex" question, not a question of whether regex is the right tool for the job.

Essentially, I think I need to define a pattern that stops at the " after each country name (rather than at the " after the last country name, for example, or much further on in some cases).

Expected result is list of countries from that object e.g.

['Afghanistan','Albania',.....]

Python:

import re, requests

r = requests.get('https://www.nexmo.com/static/bundle.js')
p = re.compile(r'(?<=countryCode:")(.*)[^"]')     
countries = p.findall(r.text)
print(countries)

Upvotes: 0

Views: 63

Answers (3)

MisterMiyagi

Reputation: 52009

Use a non-greedy version of your first variant:

p = re.compile(r'countryName:"(.*?)"')     
countries = p.findall(text)

The problem with using a greedy match like "(.*)" is that it matches all the way up to the last " in the input.

{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}
                  ^match  ^ capture start ^ still matches .*      final match of " ^

However, you want it to end at the smallest possible match, which is what the non-greedy .*? expresses:

{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}
                  ^match  ^ capture start ^ first match of "
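
To see the difference side by side, here is a minimal sketch that runs both variants on a shortened two-entry sample in the minified form from the question. The names sample, greedy and lazy are only for illustration.

```python
import re

# Shortened sample in the minified form from the question (no spaces).
sample = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}]'

# Greedy: .* runs to the end of the string, then backtracks to the LAST quote,
# so everything between the first country name and the last quote is captured.
greedy = re.findall(r'countryName:"(.*)"', sample)

# Non-greedy: .*? expands one character at a time and stops at the FIRST quote
# that lets the overall pattern match, giving one capture per country.
lazy = re.findall(r'countryName:"(.*?)"', sample)

print(len(greedy))  # a single long match
print(lazy)         # one entry per country
```

The greedy version yields one element that still contains the text between the two names, while the non-greedy version yields a separate element for each country.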

Upvotes: 1

Pushpesh Kumar Rajwanshi

Reputation: 18357

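Change your regex to (?<=countryName: ")[^"]+ instead of your current one. Your current pattern uses .*, which is greedy and consumes as much as it can, so it runs past the closing quote of the first country name. [^"]+ matches only up to the next double quote. Note that the space after the colon in the lookbehind fits the formatted sample below; the minified response has no space there, so drop the space for that input.

Try this Python code: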

import re

s = '''[{
        countryCode: "AF",
        countryName: "Afghanistan"
    }, {
        countryCode: "AL",
        countryName: "Albania"
    },'''

print(re.findall(r'(?<=countryName: ")[^"]+', s))

Prints:

['Afghanistan', 'Albania']
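
If the same pattern has to cope with both the formatted and the minified layout, one option is sketched below. Python's re module requires lookbehinds to be fixed-width, so optional whitespace cannot go inside (?<=...); this sketch swaps the lookbehind for a capturing group instead. The variable names formatted, minified and pattern are only for illustration, and the samples are shortened versions of the data in the question.

```python
import re

# Formatted layout, with a space after each colon.
formatted = '''[{
        countryCode: "AF",
        countryName: "Afghanistan"
    }, {
        countryCode: "AL",
        countryName: "Albania"
    },'''

# Minified layout, as described for the real response (no spaces).
minified = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"}]'

# \s* tolerates zero or more whitespace characters after the colon;
# [^"]+ captures everything up to (but not including) the next double quote.
pattern = re.compile(r'countryName:\s*"([^"]+)"')

print(pattern.findall(formatted))
print(pattern.findall(minified))
```

Because the pattern contains exactly one capturing group, findall returns just the captured country names for both inputs.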

Upvotes: 1

Rakesh

Reputation: 82785

Use pattern r'countryName:\"(.*?)\"'

Example:

import re
data = '[{countryCode:"AF",countryName:"Afghanistan"},{countryCode:"AL",countryName:"Albania"},{countryCode:"DZ",countryName:"Algeria"},{countryCode:"AS",countryName:"American Samoa"},{countryCode:"AD",countryName:"Andorra"},{countryCode:"AO",countryName:"Angola"},{countryCode:"AI",countryName:"Anguilla"},{countryCode:"AG",countryName:"Antigua & Barbuda"},{countryCode:"AR",countryName:"Argentina"},{countryCode:"AM",countryName:"Armenia"},{countryCode:"AW",countryName:"Aruba"},{countryCode:"AU",countryName:"Australia"},{countryCode:"AT",countryName:"Austria"},{countryCode:"AZ",countryName:"Azerbaijan"},{countryCode:"BS",countryName:"Bahamas"},{countryCode:"BH",countryName:"Bahrain"},{countryCode:"BD",countryName:"Bangladesh"},{countryCode:"BB",countryName:"Barbados"},{countryCode:"BY",countryName:"Belarus"},{countryCode:"BE",countryName:"Belgium"},{countryCode:"BZ",countryName:"Belize"},{countryCode:"BJ",countryName:"Benin"},{countryCode:"BM",countryName:"Bermuda"},{countryCode:"BT",countryName:"Bhutan"},{countryCode:"BO",countryName:"Bolivia"},{countryCode:"BQ",countryName:"Bonaire"},{countryCode:"BA",countryName:"Bosnia & Herzegovina"},{countryCode:"BW",countryName:"Botswana"}]'
countries = re.findall(r'countryName:\"(.*?)\"', data)
print(countries)

Output:

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua & Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia',
 'Bonaire',
 'Bosnia & Herzegovina',
 'Botswana']

Upvotes: 1
