Reputation: 61
I need a match only when the value captured by the second group occurs exactly once in the whole text. For example, the 1st, 4th, and 6th strings below should not be matched (Regex101 demo):
AS "(.+)\.(.+)"
table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1",
Upvotes: 3
Views: 468
Reputation: 19641
You could just use your original pattern and then filter the matches:
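import re
# test_str holds the sample text from the question
regex = r"\bAS \"(.+)\.(.+)\""
matches = re.findall(regex, test_str, re.MULTILINE)
# keep only the matches whose second group value occurs exactly once
filtered = [x for x in matches
            if sum(y[1] == x[1] for y in matches) == 1]
print(filtered)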
Output:
[('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5')]
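The filter above re-scans `matches` once per element. A minimal sketch of the same idea with a single counting pass, assuming the sample text from the question; the helper name `unique_by_field` is hypothetical and not part of the original answer:

```python
import re
from collections import Counter

# Sample text from the question.
test_str = '''table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1",'''


def unique_by_field(text):
    """Return (table, field) pairs whose field value occurs in exactly one alias."""
    matches = re.findall(r'\bAS "(.+)\.(.+)"', text)
    # Count how often each second-group value occurs across all matches.
    counts = Counter(field for _, field in matches)
    return [m for m in matches if counts[m[1]] == 1]


print(unique_by_field(test_str))
# => [('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5')]
```

The result is the same as with the list comprehension; the only difference is that the counts are computed once up front.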
Upvotes: 2
Reputation: 626825
You can actually do what you want in a single pattern with the PyPI regex module, which allows infinite-width lookbehind patterns:
AS "([^.]+)\.([^.]+)"(?<!AS "[^.]+\.\2"[\s\S]*AS "\1\.\2")(?![\s\S]*AS "[^.]+\.\2")
See the regex demo. Details:
AS "([^.]+)\.([^.]+)" - the consuming part of the pattern:
  AS " - a literal AS " string
  ([^.]+) - Group 1: any one or more chars other than .
  \. - a . char
  ([^.]+) - Group 2: any one or more chars other than .
  " - a " char
(?<!AS "[^.]+\.\2"[\s\S]*AS "\1\.\2") - a negative lookbehind that cancels the match if, immediately to the left of the current position, there is
  AS " - a literal string
  [^.]+ - one or more chars other than .
  \. - a dot
  \2 - same value as in Group 2
  " - a double quote
  [\s\S]* - 0 or more chars, as many as possible
  AS "\1\.\2" - AS ", same value as in Group 1, ., same value as in Group 2 and a " (it is necessary here to make sure we match the part of the string matched with the above consuming pattern part)
(?![\s\S]*AS "[^.]+\.\2") - a negative lookahead that fails the match if, immediately to the right of the current location, there are any zero or more chars, as many as possible, then AS ", one or more chars other than ., a ., the same value as in Group 2 and a ".
import regex
text = """table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1","""
rx = r'AS "([^.]+)\.([^.]+)"(?<!AS "[^.]+\.\2"[\s\S]*AS "\1\.\2")(?![\s\S]*AS "[^.]+\.\2")'
print(regex.findall(rx, text))
# => [('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5')]
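To see why the lookbehind is needed in addition to the lookahead, here is a small sketch that keeps only the lookahead part. It runs with the standard library `re` module, which accepts lookaheads of any width but only fixed-width lookbehinds. The variable names are illustrative and not part of the original answer:

```python
import re

# Sample text from the question.
text = '''table1.field1 AS "table1.field1",
table2.field2 AS "table2.field2",
table3.field3 AS "table3.field3",
table4.field1 AS "table4.field1",
table4.field1 AS "table5.field5",
table6.field6 AS "table6.field1",'''

# Lookahead only: an alias is rejected when the same Group 2 value
# appears again LATER in the text, but nothing checks what came BEFORE.
lookahead_only = r'AS "([^.]+)\.([^.]+)"(?![\s\S]*AS "[^.]+\.\2")'
result = re.findall(lookahead_only, text)
print(result)
# => [('table2', 'field2'), ('table3', 'field3'), ('table5', 'field5'), ('table6', 'field1')]
```

The last occurrence of a repeated value (`table6.field1`) slips through because no later duplicate exists. The negative lookbehind in the full pattern is what rejects it by checking for an earlier duplicate.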
Upvotes: 1