Reputation: 596
I have a many Bigquery tables that contain passwords. I am trying to find reuse between passwords. For example, I want to be able to find when passwords are an extension of another
|---------------------|------------------|
| Password From | Password From |
| Table 1 | Table 2 |
|---------------------|------------------|
| computer | computer77 |
|---------------------|------------------|
To do this I join my tables and then use the function REGEXP_CONTAINS(value, regex) where I would specify a WHERE
and have REGEXP_CONTAINS(Table1.password, Table2.password)
The overall query would go something like this (my passwords are stored as BYTES
):
SELECT *
FROM (
SELECT safe_cast(A.password as STRING) as Astr, safe_cast(B.password as STRING) as Bstr
FROM Passwords.table1 as A
INNER JOIN Passwords.table2 as B
USING (email)
WHERE A.password != B.password)
WHERE (REGEXP_CONTAINS(Astr, Bstr) OR
REGEXP_CONTAINS(Bstr, Astr))
The problem with such an approach is that my password field contain characters that are not escaped when interpreted as a regular expression. It gives me an error like:
Cannot parse regular expression: no argument for repetition operator: ?
My first approach is to simply remove all the characters that are problematic with:
regexp_replace(Astr, r'[~\[\]?\(\)*\\+]', '')
This solves the problem in most cases but will change the actual number of matches when passwords contain such characters. Either it will make two passwords match when they should not or it will remove matches that should be present.
My next solution is trying to escape all these characters with the same REGEXP_REPLACE
function. I test the replacement with a query like so:
SELECT SAFE_CAST(A.password as STRING),
REGEXP_REPLACE(SAFE_CAST(A.password as STRING), r'[\.\*\+\?\^\$\{\}\(\)\|\[\]]', 'AAAAAAAAAAAAA\\0')
FROM Passwords.yahoo as A
WHERE REGEXP_CONTAINS(SAFE_CAST(A.password as STRING), r'[\.\*\+\?\^\$\{\}\(\)\|\[\]]')
Now this almost does what I want where field goes from test$
to testAAAAAAAAAAAAA$
Now if I switch the replacement from AAAAAAAAAAAAA\\0
to \\\\0
it changes test$
to test\0
. I want to end up with test$
becoming test\$
.
How can I add a \
before all .*+?^${}[]()
characters using the REGEXP_REPLACE function.
Examples:
|---------------------|------------------|
| Original | Transformed |
|---------------------|------------------|
| test$pp^ | test\$pp\^ |
|---------------------|------------------|
| pa$$word1 | pa\$\$word1 |
|---------------------|------------------|
| ^^(..)^^ | \^\^\(\.\.\)\^\^ |
|---------------------|------------------|
In the end I would like to be able to use this REGEXP_REPLACE to perform queries of reuse without getting errors. Example:
select *
from(
select A.password, B.password as Bpass, safe_cast(A.password as string) as Astr, safe_cast(B.password as string) as Bstr
from Passwords.Table1 as A
inner join Passwords.Table2 as B
using (email)
where A.password != B.password)
where REGEXP_CONTAINS(Bstr, regexp_replace(Astr, r'[\.\*\+\?\^\$\{\}\(\)\|\[\]]', !REPLACEMENT!)) or REGEXP_CONTAINS(Astr, regexp_replace(Bstr, r'[~\[\]?\(\)*\\+]', !REPLACEMENT!))
Upvotes: 2
Views: 4729
Reputation: 1
I have a similar kind of Problem where
Table A contains the value "computer77" Table B Contains the pattern "computer(\d{2})"
I try to join both the table using REGEXP_CONTAINS but getting the error "Cannot parse regular expression: invalid escape sequence: \d"
Upvotes: 0
Reputation: 666
\\\\\\0
We can use the matching string in replacement using '\0'. Add '\' before it. With escape characters, the overall replacement will look like '\\\\\\0
'
regexp_replace(Astr, r'[\.\*\+\?\^\$\{\}\(\)\|\[\]]', '\\\\\\0')
Upvotes: 1