Reputation: 3
I want to match these lines
line1:
,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
line2:
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
line3:
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
My observation is that each line starts with a comma (,), ends with a closing square bracket (]), and contains three occurrences of "null" followed by two numbers with 5 to 16 decimal places. All I want is to extract the string within the second pair of quotes and the two numbers at the end.
I have figured out a little, but I am confused about how to match the text within quotes, which sometimes includes brackets, full stops, backslashes, spaces, commas, and minus signs. Here is my half-completed pattern:
(r'^\,\[\"0x[0-9a-z]{16}:0x[0-9a-z]{16}\"\,\"(.*?)\"\,null\,\[null\,null\,(\d\d\.\d{5,16})\,(\d\d\.\d{5,16})\]')
but this doesn't work. Any help is much appreciated.
Upvotes: 0
Views: 154
Reputation: 44293
Use this regex with the re.M flag:
^,\["0x[a-f0-9]{16}:0x[a-f0-9]{16}","([^"]*)",null,\[null,null,(\d+\.\d{5,16}),(\d+\.\d{5,16})\]$
Almost everything in the above regex is straightforward. To match the quoted string, I am assuming that the string itself does not contain a " character, so I used ...
"([^"]*)"
... which matches 0 or more non-" characters within double quotes and places those characters in capture group 1. This is a more efficient alternative to "(.*?)".
import re
lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""
rex = re.compile(r'^,\["0x[a-f0-9]{16}:0x[a-f0-9]{16}","([^"]*)",null,\[null,null,(\d+\.\d{5,16}),(\d+\.\d{5,16})\]$', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])
Prints:
SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education) 12.784799699999999 78.7137085
Sudha Nursery & Primary School 12.7849528 78.7159848
It will not match line 3, since 0x4ea2fcc42c9f7ce only contains 15 "nibbles" (half-bytes) while the regex requires exactly 16.
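To confirm why line 3 is rejected, the hex digits on each side of the colon can be counted. This is only a quick sketch, not part of the original answer; the helper name hex_lengths is made up for illustration.

```python
import re

# The three ID strings copied from the sample lines in the question.
ids = [
    "0x3bad08fb87bc906f:0x74d6f6242d49ab18",
    "0x3bad08e4f337028d:0x5635e172ff9d7570",
    "0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce",
]

def hex_lengths(id_string):
    """Return the number of hex digits after each '0x' prefix (hypothetical helper)."""
    return [len(part) for part in re.findall(r'0x([a-f0-9]+)', id_string)]

for id_string in ids:
    print(id_string, hex_lengths(id_string))
```

Only the last ID has a 15-digit half. If such IDs are legitimate in your data, relaxing {16} to {15,16} or + in the pattern would presumably let that line match as well.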
Update
If you assume that each line should be matched and you would like a more lenient regex because there could be some variation in the lines (for example, inserted whitespace), then you might wish to use this (with the re.M flag):
^,[^[]*\["0x[a-f0-9]+:0x[a-f0-9]+"[^"]*"([^"]*)"\D*(\d+\.\d{5,16}),(\d+\.\d{5,16})
^ : match start of line
[^[]* : match 0 or more non-[ characters
\[ : match a [
"0x[a-f0-9]+:0x[a-f0-9]+" : match quoted hex strings of arbitrary length separated by :
[^"]* : match 0 or more non-" characters
"([^"]*)" : match quoted string in capture group 1
\D* : match 0 or more non-digits
(\d+\.\d{5,16}) : match decimal number in capture group 2
, : match a ,
(\d+\.\d{5,16}) : match decimal number in capture group 3
import re
lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""
rex = re.compile(r'^,[^[]*\["0x[a-f0-9]+:0x[a-f0-9]+"[^"]*"([^"]*)"\D*(\d+\.\d{5,16}),(\d+\.\d{5,16})', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])
Prints:
SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education) 12.784799699999999 78.7137085
Sudha Nursery & Primary School 12.7849528 78.7159848
As-Shukoor School 12.7854174 78.7196367
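As a rough illustration of the "inserted whitespace" case, the sketch below runs the strict and the lenient patterns against a made-up variant of line 3 with extra spaces. The spaced test line is an assumption for demonstration and does not come from the original data.

```python
import re

# Strict pattern from the first part of the answer.
strict = re.compile(r'^,\["0x[a-f0-9]{16}:0x[a-f0-9]{16}","([^"]*)",null,\[null,null,(\d+\.\d{5,16}),(\d+\.\d{5,16})\]$', re.M)
# Lenient pattern from the update.
lenient = re.compile(r'^,[^[]*\["0x[a-f0-9]+:0x[a-f0-9]+"[^"]*"([^"]*)"\D*(\d+\.\d{5,16}),(\d+\.\d{5,16})', re.M)

# Hypothetical variant of line 3 with extra spaces inserted.
spaced = ',  ["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce" , "As-Shukoor School", null, [null, null, 12.7854174,78.7196367]'

strict_match = strict.search(spaced)    # None: the strict pattern tolerates no extra spaces
lenient_match = lenient.search(spaced)  # matches despite the spaces
print(strict_match)
print(lenient_match.groups())
```

Note that the lenient pattern still expects the two numbers to be separated by a bare comma and "0x to follow the [ directly, so whitespace in those two places would not match.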
Update 2
And if you want to be really lenient, assuming every line should be matched:
^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+)
^ : match start of line
[^"]*"[^"]*" : skip to and match first string
[^"]*"([^"]*)" : skip to and match second string and place in capture group 1
\D*(\d+\.\d+) : skip to next digit and capture decimal number in capture group 2
\D*(\d+\.\d+) : skip to next digit and capture decimal number in capture group 3
import re
lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""
rex = re.compile(r'^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+).*$', re.M)
for m in rex.finditer(lines):
    print(m[1], m[2], m[3])
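Since the goal is to extract the name and the two numbers, the captured groups can be converted into typed records. The following is a minimal sketch using the most lenient pattern; the records variable and the (name, number, number) tuple layout are illustrative choices, not something the question specifies.

```python
import re

lines = """,["0x3bad08fb87bc906f:0x74d6f6242d49ab18","SRI VIVEKANANDA MATRIC HIGHER SECONDARY SCHOOL Ambur (Spiritual, Modern Scientific Education)",null,[null,null,12.784799699999999,78.7137085]
,["0x3bad08e4f337028d:0x5635e172ff9d7570","Sudha Nursery \u0026 Primary School",null,[null,null,12.7849528,78.7159848]
,["0x3bad08e6a3dfe635:0x4ea2fcc42c9f7ce","As-Shukoor School",null,[null,null,12.7854174,78.7196367]
"""

rex = re.compile(r'^[^"]*"[^"]*"[^"]*"([^"]*)"\D*(\d+\.\d+)\D*(\d+\.\d+).*$', re.M)

# Build (name, first_number, second_number) tuples, converting the
# captured numeric strings to floats.
records = [(m[1], float(m[2]), float(m[3])) for m in rex.finditer(lines)]

for record in records:
    print(record)
```

Because lines is a regular (non-raw) string literal, Python turns \u0026 into & before the regex ever sees it, which is why the second name comes out with an ampersand.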
Upvotes: 2