GwydionFR

Reputation: 787

Get named list of subgroup in golang regex

I'm looking for a function that returns a map[string]interface{}, where each interface{} value can be a slice, a map[string]interface{}, or a plain value.

My use case is parsing WKT geometry such as the following and retrieving the point values. Example for a donut polygon:

POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

The regex (I deliberately use \d, which matches only a single digit, for readability):

(POLYGON \(
    (?P<polygons>\(
        (?P<points>(?P<point>(\d \d), ){3,})
        (?P<last_point>\d \d )\),)*
    (?P<last_polygon>\(
        (?P<points>(?P<point>(\d \d), ){3,})
        (?P<last_point>\d \d)\))\)
)

I have a function (copied from SO) that retrieves some information, but it doesn't handle nested groups or repeated groups well:

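// getRegexMatchParams maps each capture group name of reg to the text it
// matched in url; the map is empty when there is no match.
func getRegexMatchParams(reg *regexp.Regexp, url string) (paramsMap map[string]string) {
    match := reg.FindStringSubmatch(url)
    paramsMap = make(map[string]string)
    for i, name := range reg.SubexpNames() {
        // Index 0 is the whole match, so start at 1 and stay inside match.
        if i > 0 && i < len(match) {
            paramsMap[name] = match[i]
        }
    }
    return paramsMap
}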

It seems that the point group captures only one point (example on the playground).

[EDIT] The result I want is something like this:

map[string]interface{}{
    "polygons": map[string]interface{}{
        "points": []interface{}{
            map[string]string{"point": "0 0"},
            map[string]string{"point": "0 10"},
            map[string]string{"point": "10 10"},
            map[string]string{"point": "10 0"},
        },
        "last_point": "0 0",
    },
    "last_polygon": map[string]interface{}{
        "points": []interface{}{
            map[string]string{"point": "3 3"},
            map[string]string{"point": "3 7"},
            map[string]string{"point": "7 7"},
            map[string]string{"point": "7 3"},
        },
        "last_point": "3 3",
    },
}

That way I can use it for different purposes, like querying databases and validating that last_point equals points[0] for each polygon.

Upvotes: 0

Views: 542

Answers (1)

user557597

Try to add some whitespace to the regex.

Also note that this engine does not retain every value captured by a group that sits inside a quantified outer group. In (a|b|c)+, for example, the group holds only the last a, b, or c it matched.
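
A minimal Go sketch of that behaviour, using the standard regexp package. The helper name lastIteration is made up for this illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// repeated is the quantified group from the note above.
var repeated = regexp.MustCompile(`(a|b|c)+`)

// lastIteration (hypothetical helper) returns what capture group 1 holds
// once the whole match is done: only the final iteration of the group.
func lastIteration(s string) string {
	m := repeated.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	fmt.Println(repeated.FindString("abc")) // whole match: abc
	fmt.Println(lastIteration("abc"))       // group 1: c
}
```

This is why the point group in the question reports a single point: each new iteration overwrites the previous capture.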

Also, your regex can be reduced to this:

(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\)(?:\s*,\s*|\s*\)))+)

https://play.golang.org/p/rLaaEa_7GX


The original:

(POLYGON\s*\((?P<polygons>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\),)*(?P<last_polygon>\(\s*(?P<points>(?P<point>\s*(\d+\s+\d+)\s*,){3,})\s*(?P<last_point>\d+\s+\d+)\s*\))\s*\))

https://play.golang.org/p/rZgJYPDMzl

See below for what the groups contain.

 (                             # (1 start)
      POLYGON \s* \(
      (?P<polygons>                 # (2 start)
           \( \s* 
           (?P<points>                   # (3 start)
                (?P<point>                    # (4 start)
                     \s* 
                     ( \d+ \s+ \d+ )               # (5)
                     \s* 
                     , 
                ){3,}                         # (4 end)
           )                             # (3 end)
           \s*            
           (?P<last_point> \d+ \s+ \d+ )  # (6)
           \s* \),
      )*                            # (2 end)
      (?P<last_polygon>             # (7 start)
           \( \s* 
           (?P<points>                   # (8 start)
                (?P<point>                    # (9 start)
                     \s* 
                     ( \d+ \s+ \d+ )               # (10)
                     \s* 
                     , 
                ){3,}                         # (9 end)
           )                             # (8 end)
           \s* 
           (?P<last_point> \d+ \s+ \d+ )  # (11)
           \s* \)
      )                             # (7 end)
      \s* \)
 )                             # (1 end)

Input

POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

Output

 **  Grp 0                -  ( pos 0 , len 65 ) 
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))  
 **  Grp 1                -  ( pos 0 , len 65 ) 
POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))  
 **  Grp 2 [polygons]     -  ( pos 9 , len 30 ) 
(0 0, 0 10, 10 10, 10 0, 0 0),  
 **  Grp 3 [points]       -  ( pos 10 , len 23 ) 
0 0, 0 10, 10 10, 10 0,  
 **  Grp 4 [point]        -  ( pos 27 , len 6 ) 
 10 0,  
 **  Grp 5                -  ( pos 28 , len 4 ) 
10 0  
 **  Grp 6 [last_point]   -  ( pos 34 , len 3 ) 
0 0  
 **  Grp 7 [last_polygon] -  ( pos 39 , len 25 ) 
(3 3, 3 7, 7 7, 7 3, 3 3)  
 **  Grp 8 [points]       -  ( pos 40 , len 19 ) 
3 3, 3 7, 7 7, 7 3,  
 **  Grp 9 [point]        -  ( pos 54 , len 5 ) 
 7 3,  
 **  Grp 10                -  ( pos 55 , len 3 ) 
7 3  
 **  Grp 11 [last_point]   -  ( pos 60 , len 3 ) 
3 3  

Possible Solution

It's not impossible. It just takes a few extra steps.
(As an aside, isn't there a WKT library that can parse this for you?)

Now, I don't know your language capabilities, so this is just a general approach.

1. Validate the form you're parsing.
This validates the input and returns all polygon sets as a single string in the All_Polygons group.

Target POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))

POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)

 **  Grp 1 [All_Polygons] -  ( pos 9 , len 55 ) 
(0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)
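
In Go, step 1 might look like the sketch below. The pattern is copied from above; the helper name allPolygons is an assumption for illustration, and SubexpIndex needs Go 1.15 or later. Note that the pattern is not anchored, so it accepts the form anywhere in the input.

```go
package main

import (
	"fmt"
	"regexp"
)

// validate is the step 1 pattern, copied unchanged from the answer.
var validate = regexp.MustCompile(`POLYGON\s*\((?P<All_Polygons>(?:\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))(?:\s*,\(\s*\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,}\s*\))*)\s*\)`)

// allPolygons (hypothetical helper) returns the All_Polygons capture and
// reports whether the input had the expected form at all.
func allPolygons(wkt string) (string, bool) {
	m := validate.FindStringSubmatch(wkt)
	if m == nil {
		return "", false
	}
	// Look the group up by name rather than by position.
	return m[validate.SubexpIndex("All_Polygons")], true
}

func main() {
	rings, ok := allPolygons("POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3))")
	fmt.Println(ok, rings)
}
```

For the sample input this should report a match and yield the same string as the Grp 1 output above.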

2. If step 1 succeeded, run a loop match over the All_Polygons string.

Target (0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)

(?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))

This step is the equivalent of a find-all match. Each successive match returns all the points of a single polygon as one string in the Single_Poly_All_Pts group.

This gives you these two separate matches, which can be stored in a temporary array of two strings:

 **  Grp 1 [Single_Poly_All_Pts] -  ( pos 1 , len 27 ) 
0 0, 0 10, 10 10, 10 0, 0 0  

 **  Grp 1 [Single_Poly_All_Pts] -  ( pos 31 , len 23 ) 
3 3, 3 7, 7 7, 7 3, 3 3  

3. If step 2 succeeded, run a loop match over each element of the temporary array from step 2.
This gives you the individual points of each polygon.

(?P<Single_Point>\d+\s+\d+)

Again, this is a loop match (a find-all match). For each array element (polygon), it produces the individual points.

Target[element 1] 0 0, 0 10, 10 10, 10 0, 0 0

 **  Grp 1 [Single_Point] -  ( pos 0 , len 3 ) 
0 0  
 **  Grp 1 [Single_Point] -  ( pos 5 , len 4 ) 
0 10  
 **  Grp 1 [Single_Point] -  ( pos 11 , len 5 ) 
10 10  
 **  Grp 1 [Single_Point] -  ( pos 18 , len 4 ) 
10 0  
 **  Grp 1 [Single_Point] -  ( pos 24 , len 3 ) 
0 0  

And,

Target[element 2] 3 3, 3 7, 7 7, 7 3, 3 3

 **  Grp 1 [Single_Point] -  ( pos 0 , len 3 ) 
3 3  
 **  Grp 1 [Single_Point] -  ( pos 5 , len 3 ) 
3 7  
 **  Grp 1 [Single_Point] -  ( pos 10 , len 3 ) 
7 7  
 **  Grp 1 [Single_Point] -  ( pos 15 , len 3 ) 
7 3  
 **  Grp 1 [Single_Point] -  ( pos 20 , len 3 ) 
3 3  
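
Steps 2 and 3 could be sketched in Go as two nested find-all loops. This is only an illustration under a few assumptions: the helper names splitPolygons and closed are invented here, the input is the All_Polygons string from step 1, and the result is a plain [][]string rather than the nested map from the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns for steps 2 and 3, copied unchanged from the answer.
var (
	singlePoly  = regexp.MustCompile(`(?:\(\s*(?P<Single_Poly_All_Pts>\d+\s+\d+(?:\s*,\s*\d+\s+\d+){2,})\s*\))`)
	singlePoint = regexp.MustCompile(`(?P<Single_Point>\d+\s+\d+)`)
)

// splitPolygons (hypothetical helper) runs both loops: the outer one finds
// each polygon, the inner one finds each point of that polygon.
func splitPolygons(allPolygons string) [][]string {
	var polygons [][]string
	for _, m := range singlePoly.FindAllStringSubmatch(allPolygons, -1) {
		// m[1] is the Single_Poly_All_Pts capture for this polygon.
		polygons = append(polygons, singlePoint.FindAllString(m[1], -1))
	}
	return polygons
}

// closed (hypothetical helper) reports whether the last point repeats the
// first. It compares the matched text, so "0 0" and "0  0" would differ.
func closed(points []string) bool {
	return len(points) > 0 && points[0] == points[len(points)-1]
}

func main() {
	for i, pts := range splitPolygons("(0 0, 0 10, 10 10, 10 0, 0 0),(3 3, 3 7, 7 7, 7 3, 3 3)") {
		fmt.Println(i, pts, closed(pts))
	}
}
```

Because every point is kept, the check that the last point equals the first becomes a simple comparison, and the slices can be repacked into whatever map shape is needed afterwards.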

Upvotes: 2
