Raj Kumar

Reputation: 3

Python regex: non-greedy way to match a string within quotes when the string sometimes contains brackets, commas, backslashes and full stops

I want to match these lines:

line1: ,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
line2: ,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
line3: ,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]

My observation is that each line starts with a comma (,), ends with a closing square bracket (]), contains three occurrences of "null", and then has two numbers with 5 to 16 decimal places. All I want is to extract the quoted string that follows the hex identifier and the two decimal numbers at the end.

I have figured out part of it, but I am confused about how to match the text within quotes, which sometimes includes brackets, full stops, backslashes, spaces, commas, minus signs and *. Here is my half-completed expression/pattern:

(r'^\,\[\"0x[0-9a-z]{16}:0x[0-9a-z]{16}\"\,\"(.*?)\"\,null\,\[null\,null\,(\d\d\.\d{5,16})\,(\d\d\.\d{5,16})\]')

but this doesn't work. Any help is much appreciated.

Upvotes: 0

Views: 154

Answers (1)

Booboo

Reputation: 44293

Use this regex with the re.M flag:

^,\["0x[a-f0-9]{16}:0x[a-f0-9]{16}","([^"]*)",null,\[null,null,(\d+\.\d{5,16}),(\d+\.\d{5,16})\]$
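The re.M flag matters here because the input holds several lines in one string. A minimal sketch of the difference, using a made-up three-line string (the names text and pattern are illustrative only and not part of the question's data):

```python
import re

# Hypothetical multi-line input, only for illustrating the flag.
text = "a=1\nb=2\nc=3\n"
pattern = r'^(\w)=(\d)$'

# Without re.M, ^ matches only at the start of the string and $ only at
# its end (or just before a final newline), so no line matches here.
without_flag = re.findall(pattern, text)

# With re.M, ^ and $ also match at the start and end of every line.
with_flag = re.findall(pattern, text, re.M)

print(without_flag)  # []
print(with_flag)     # [('a', '1'), ('b', '2'), ('c', '3')]
```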


Most everything in the above regex is straightforward. To match the quoted string, I am assuming that the string itself does not contain a " character. So I used ...

"([^"]*)"

... which matches 0 or more non-" characters within double quotes and places those characters in capture group 1. This is a much more efficient alternative to "(.*?)".
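Apart from speed, the two forms can also capture different text: a lazy .*? is allowed to run past a closing quote if that is what it takes for the rest of the pattern to match, whereas [^"]* can never cross a quote. A small sketch with a made-up input (the string sample is hypothetical and simplified, not taken from the question):

```python
import re

# Hypothetical input: two quoted strings followed by ,null
sample = '"a","b",null'

# Lazy dot: starts at the first quote and expands until ",null fits.
lazy = re.search(r'"(.*?)",null', sample)

# Negated class: cannot step over a quote, so it settles on "b".
negated = re.search(r'"([^"]*)",null', sample)

print(lazy[1])     # a","b
print(negated[1])  # b
```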

import re

lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""

rex = re.compile(r'^,\["0x[a-f0-9]{16}:0x[a-f0-9]{16}","([^"]*)",null,\[null,null,(\d+\.\d{5,16}),(\d+\.\d{5,16})\]$', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])

Prints:

SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education) 12.784799699999999 78.7137085
Sudha Nursery & Primary School 12.7849528 78.7159848

It will not match line 3 because 0x4ea2fcc42c9f7ce contains only 15 "nibbles" (half-bytes, i.e. hex digits), while the regex requires exactly 16.
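One way to spot this kind of mismatch is to count the hex digits in each identifier. The sketch below is only a diagnostic aid; the helper name hex_lengths is made up for this example:

```python
import re

# The three identifiers from the sample lines.
ids = [
    "0x3bad08fb87bc906f:0x74d6f6242d49ab18",
    "0x3bad08e4f337028d:0x5635e172ff9d7570",
    "0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce",
]

def hex_lengths(identifier):
    """Return the number of hex digits after each 0x prefix."""
    return [len(part) for part in re.findall(r'0x([a-f0-9]+)', identifier)]

lengths = [hex_lengths(i) for i in ids]
print(lengths)  # [[16, 16], [16, 16], [16, 15]]
```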

Update

If you assume that each line should be matched and you would like a more lenient regex because there could be some variation in the lines (for example, inserted whitespace), then you might wish to use this (with flags re.M):

^,[^[]*\["0x[a-f0-9]+:0x[a-f0-9]+"[^"]*"([^"]*)"\D*(\d+\.\d{5,16}),(\d+\.\d{5,16})
  1. ^, match start of line followed by a ,
  2. [^[]* match 0 or more non [ characters
  3. \[ match a [
  4. "0x[a-f0-9]+:0x[a-f0-9]+" match quoted hex strings of arbitrary length separated by :
  5. [^"]* match 0 or more non " characters
  6. "([^"]*)" match quoted string in capture group 1
  7. \D* match 0 or more non-digits
  8. (\d+\.\d{5,16}) match decimal number in capture group 2
  9. , match a ,
  10. (\d+\.\d{5,16}) match decimal number in capture group 3
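The breakdown above can also be kept next to the pattern itself by compiling it with re.X (verbose mode), which ignores whitespace in the pattern and allows # comments. This is just an alternative way of writing the same regex; the names LENIENT and line3 are chosen for this sketch:

```python
import re

# Same lenient pattern, split across lines with one comment per step.
LENIENT = re.compile(r'''
    ^,                          # start of line, then the leading comma
    [^[]*                       # 0 or more characters that are not [
    \[                          # a literal [
    "0x[a-f0-9]+:0x[a-f0-9]+"   # quoted hex strings of arbitrary length
    [^"]*                       # skip up to the next double quote
    "([^"]*)"                   # group 1: the quoted string
    \D*                         # skip non-digits
    (\d+\.\d{5,16})             # group 2: first decimal number
    ,                           # the separating comma
    (\d+\.\d{5,16})             # group 3: second decimal number
''', re.M | re.X)

line3 = ',["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]'
m = LENIENT.search(line3)
print(m[1], m[2], m[3])  # As-Shukoor School 12.7854174 78.7196367
```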


import re

lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""

rex = re.compile(r'^,[^[]*\["0x[a-f0-9]+:0x[a-f0-9]+"[^"]*"([^"]*)"\D*(\d+\.\d{5,16}),(\d+\.\d{5,16})', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])

Prints:

SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education) 12.784799699999999 78.7137085
Sudha Nursery & Primary School 12.7849528 78.7159848
As-Shukoor School 12.7854174 78.7196367

Update 2

And if you want to be really lenient, assuming every line should be matched:

^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+)
  1. ^ match start of line
  2. [^"]*"[^"]*" skip to and match first string
  3. [^"]*"([^"]*)" skip to and match second string and place in capture group 1
  4. \D*(\d+\.\d+) skip to next digit and capture decimal number in capture group 2.
  5. \D*(\d+\.\d+) skip to next digit and capture decimal number in capture group 3.

import re

lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""

rex = re.compile(r'^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+).*$', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])
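If the numbers are needed as numbers rather than text, the captured groups can be converted while collecting the matches. A rough sketch built on the most lenient pattern; the function name extract_records and the tuple layout are assumptions made for this example, not something the question specifies:

```python
import re

REX = re.compile(r'^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+)', re.M)

def extract_records(text):
    """Return (name, first_number, second_number) tuples, numbers as floats."""
    return [(m[1], float(m[2]), float(m[3])) for m in REX.finditer(text)]

# Lines 1 and 3 of the sample data.
sample = (
    ',["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER '
    'SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,'
    '[null,null,12.784799699999999,78.7137085]\n'
    ',["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,'
    '[null,null,12.7854174,78.7196367]\n'
)

records = extract_records(sample)
for name, first, second in records:
    print(name, first, second)
```

Being this lenient has a cost: the pattern no longer checks the hex identifiers or the null fields, so it will also accept lines that only loosely resemble the expected format.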

Upvotes: 2
