horse

Reputation: 61

Regex: match if group is only once in text

I need a match only when the value captured by the second group occurs exactly once in the whole text. For example, the 1st, 4th, and 6th strings below should not be matched (see Regex101):

AS "(.+)\.(.+)"

table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1",

Upvotes: 3

Views: 468

Answers (2)

41686d6564

Reputation: 19641

You could just use your original pattern and then filter the matches:

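import re

# test_str is assumed to hold the input text from the question
regex = r"\bAS \"(.+)\.(.+)\""
matches = re.findall(regex, test_str, re.MULTILINE)

# Keep only the matches whose second group occurs exactly once
filtered = [x for x in matches
            if sum(y[1] == x[1] for y in matches) == 1]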

Output:

[('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5')]

Try it online.
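The filter above rescans the whole match list for every item, which is quadratic in the number of matches. Here is a minimal sketch of the same idea that counts each second-group value once with collections.Counter. It assumes the question's sample input is in test_str; the name field_counts is illustrative and not from the answer.

```python
import re
from collections import Counter

# Sample input copied from the question
test_str = '''table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1",'''

# Same pattern as in the answer above
matches = re.findall(r'\bAS "(.+)\.(.+)"', test_str)

# Count how often each second-group value occurs (single pass)
field_counts = Counter(field for _, field in matches)

# Keep only the matches whose second group is unique in the text
filtered = [m for m in matches if field_counts[m[1]] == 1]
print(filtered)
# => [('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5')]
```

For a handful of lines the difference is negligible; the Counter version mainly helps when the input has many aliases.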

Upvotes: 2

Wiktor Stribiżew

Reputation: 626825

You can actually do this in a single pattern with the PyPI regex module, which supports infinite-width lookbehind patterns:

AS "([^.]+)\.([^.]+)"(?<!AS "[^.]+\.\2"[\s\S]*AS "\1\.\2")(?![\s\S]*AS "[^.]+\.\2")

See the regex demo. Details:

  • AS "([^.]+)\.([^.]+)":
    • AS " - an AS " string
    • ([^.]+) - Group 1: any one or more chars other than .
    • \. - a . char
    • ([^.]+) - Group 2: any one or more chars other than .
    • " - a " char
  • (?<!AS "[^.]+\.\2"[\s\S]*AS "\1\.\2") - a negative lookbehind that cancels the match if, immediately to the left of the current position, there is
    • AS " - a literal string
    • [^.]+ - one or more chars other than .
    • \. - a dot
    • \2 - same value as in Group 2
    • " - a double quote
    • [\s\S]* - 0 or more chars as many as possible
    • AS "\1\.\2" - AS ", same value as in Group 1, ., same value as in Group 2 (it is necessary here to make sure we match the part of string matched with the above consuming pattern part)
  • (?![\s\S]*AS "[^.]+\.\2") - a negative lookahead that fails the match if, immediately to the right of the current location, there are any zero or more chars as many as possible, AS ", one or more chars other than ., ., the same value as in Group 2 and a ".

Python demo:

import regex
text = """table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1","""
rx = r'AS "([^.]+)\.([^.]+)"(?<!AS "[^.]+\.\2"[\s\S]*AS "\1\.\2")(?![\s\S]*AS "[^.]+\.\2")'
print(regex.findall(rx, text))
# => [('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5')]
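To see why the pattern needs both lookarounds, here is a sketch that keeps only the negative lookahead. It runs under the standard re module, because the lookahead alone needs no variable-width lookbehind. This is an illustration, not part of the original answer, and lookahead_only is a name made up for the example.

```python
import re

# Sample input copied from the question
text = '''table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1",'''

# Only the lookahead: reject a match if the same Group 2 value appears LATER
rx = r'AS "([^.]+)\.([^.]+)"(?![\s\S]*AS "[^.]+\.\2")'
lookahead_only = re.findall(rx, text)
print(lookahead_only)
# => [('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5'), ('table6', 'field1')]
```

The last occurrence of a repeated value ("table6.field1") still gets through, since nothing to its right repeats field1. The negative lookbehind in the full pattern covers that case by also looking to the left of the match.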

Upvotes: 1
