Reputation: 2881
I have a list of inputs in google sheets,
Input | Desired Output | "To demonstrate only not an input" The repeated letters |
---|---|---|
Outdoors | Match | o |
dog | No Match | |
step | No Match | |
bee | Match | e |
Chessboard | Match | s |
Cookbooks | Match | o, k |
How do I verify if all letters are unique in a string without splitting it?
In other words if the string has one letter or more occurred twice or more, return TRUE
My process so far
I tried this solution in addition to splitting the string and dividing the length of the string on the COUNTA
of unique letters of the string, if = 1 "Match", else "No match"
Or using regex
I found a method to match a letter is occure in a string 2 times this demonstration with REGEXEXTRACT
But wait what needed is get TRUE when the letters are not unique in the string
=REGEXEXTRACT(A1,"o{2}?")
Returns:
oo
Something like this would do
=REGEXMATCH(Input,"(anyletter){2}?")
OR like this
=REGEXMATCH(lower(A6),"[a-zA-Z]{2}?")
Notes
TRUE
or FALSE
instead of Match
or No Match
to keep it simple.More examples
Input | Desired Output |
---|---|
Professionally | Match |
Attractiveness | Match |
Uncontrollably | Match |
disreputably | No Match |
Recommendation | Match |
Interrogations | Match |
Aggressiveness | Match |
doublethinks | No Match |
Upvotes: 4
Views: 1511
Reputation: 1307
Coming after the battle ^^ Why not simply compare the number of unique letters in the string and its original length ?
=COUNTUNIQUE(split(regexreplace(A2;"(.)"; "$1_"); "_")) < LEN(A2)
All my tests seem fine.
(split()
provided by this answer)
Upvotes: 0
Reputation: 19050
As Daniel Cruz said, Google Sheets functions such as regexmatch()
, regexextract()
and regexreplace()
use RE2 regexes that do not support backreferences. However, the query()
function uses Perl Compatible Regular Expressions that do support named capture groups and backreferences:
=arrayformula(
iferror( not( iserror(
match(
to_text(A3:A),
query(lower(unique(A3:A)), "where Col1 matches '.*?(?<char>.).*?\k<char>.*' ", 0),
0
)
) / (A3:A <> "") ) )
)
In my limited testing with a sample size of 1000 heterograms, pangrams, words with diacritic letters, and 10-character pseudo-random unique values from TheMaster's corpus, this PREG formula ran at about half the speed of the JvdV2 RE2 regex.
With Osm's sample of 50,000 highly repetitive sample values, the formula ran at 8x the speed of JvdV2.
A PREG regex is slower than a RE2 regex, but has the benefit that you can more easily check all characters for repeats. This lets you work with corpuses that include diacritic letters, numbers and other non-English alphabet characters:
Input | Output |
---|---|
Professionally | TRUE |
disreputably | FALSE |
Abacus | TRUE |
Élysée | TRUE |
naïve Ï | TRUE |
määräävä | TRUE |
121 | TRUE |
123 | FALSE |
You can also easily state which specific characters to check by replacing <char>.
with something like <char>[\wéäåö]
or <char>[^-;,.\s\d]
.
Upvotes: 5
Reputation: 50654
Created a benchmark here.
NOW()
to create a timestamp, when checkbox is clicked.NOW()
to create another timestamp, when the last row is filled and the checkbox is on.Math.random
between [A-Za-z]
with 10 characters per word.Formula | Round1 | Round2 | Avg | % Slower than best |
---|---|---|---|---|
Sample size | 10006 | |||
[re2](a.*a|b.*b)JvDv | 0:00:19 | 0:00:19 | 0:00:19 | -15.15% |
[re2+recursion]MASTERMATCH_RE2 | 0:00:27 | 0:00:24 | 0:00:26 | -54.55% |
[Find+recursion]MASTERMATCH | 0:00:17 | 0:00:16 | 0:00:17 | 0.00% |
[PREG]Doubleunary | 0:00:57 | 0:00:53 | 0:00:55 | -233.33% |
This varies greatly based on browser/device/mobile app and on non-randomized sample data. But I found PREG to be consistently slower than re2
Use recursion.
This seems extremely faster than the regex based approach. Create a named function:
MASTERMATCH
word
The word to check
start
Starting at
=IF(
MID(word,start,1)="",
FALSE,
IF(
ISERROR(FIND(MID(word,start,1),word,start+1)),
MASTERMATCH(word,start+1),
TRUE
)
)
=ARRAYFORMULA(MASTERMATCH(A2:INDEX(A2:A,COUNTA(A2:A)),1))
Or without case sensitivity
=ARRAYFORMULA(MASTERMATCH(lower(A2:A),1))
It recurses through each character using MID
and checks whether the same character is available after this position using FIND
. If so, returns true
and doesn't check anymore. If not, keeps checking until the last character using recursion.
Or with regex, Create a named function:
MASTERMATCH_RE2
word
The word to check
start
Starting at
IF(
MID(word,start,1)="",
FALSE,
IF(
REGEXMATCH(word,MID(word, start, 1)&"(?i).*"&MID(word,start,1)),
TRUE,
MASTERMATCH_RE2(word,start+1)
)
)
=ARRAYFORMULA(MASTERMATCH_RE2(A2:A,1))
Or
=ARRAYFORMULA(MASTERMATCH_RE2(lower(A2:A),1))
It recurses through each character and creates a regex for that character. Instead of a.*a
, b.*b
,..., it takes the first character(using MID
), eg: o
in outdoor
and creates a regex o.*o
. If regex is positive for that regex (using REGEXMATCH
), returns true and doesn't check for other letters or create other regexes.
Uses lambda, but it's efficient. Loop through each row and every character with MAP
and REDUCE
. REPLACE
each character in the word and find the difference in length. If more than 1, don't check length anymore and return Match
=MAP(
A2:INDEX(A2:A,COUNTA(A2:A)),
LAMBDA(_,
REDUCE(
"No Match",
SEQUENCE(LEN(_)),
LAMBDA(a,c,
IF(a="Match",a,
IF(
LEN(_)-LEN(
REGEXREPLACE(_,"(?i)"&MID(_,c,1),)
)>1,
"Match",a
)
)
)
)
)
)
If you do run into lambda limitations, remove the MAP
and drag fill the REDUCE
formula.
=REDUCE("No Match",SEQUENCE(LEN(A2)),LAMBDA(a,c,IF(a="Match",a,IF(LEN(A2)-LEN(REGEXREPLACE(A2, "(?i)"&MID(A2,c,1),))>1,"Match",a))))
The latter is preferred for conditional formatting as well.
Upvotes: 5
Reputation: 34275
Just to illustrate another method - not likely to be scaleable - try to substitute the second occurrence of the letter:
=ArrayFormula(if(isnumber(xmatch(len(A2)-1,len(substitute(upper(A2),char(sequence(1,26,65)),"",2)))),"Match","No match"))
If splitting were permitted, I would favour use of Frequency for speed, e.g.
=ArrayFormula(max(frequency(code(mid(upper(A2),sequence(len(A2)),1)),sequence(1,26,65)))>1)
Upvotes: 2
Reputation: 75910
You are explicitly asking for an answer using a single regular expression. Unfortunately there is no such thing as a backreference to a former capture group using RE2. So if you'd spell out the answer to your problem it would look like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)")))
Since you are looking for case-insensitive matching (?i)
modifier will help to cut down the options to just the 26 letters of the alphabet. I suppose the above can be written a bit neater like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")")))
EDIT 1:
The only other reasonable way to do this (untill I learned about the PREG supported syntax of the matches clause in QUERY()
by @DoubleUnary) with a single regex other than the above is to create your own UDF in GAS (AFAIK). It's going to be JavaScript based thus supporting a backreferences. GAS is not my forte, but a simple example could be:
function REGEXMATCH_JS(s) {
if (s.map) {
return s.map(REGEXMATCH_JS);
} else {
return /([a-z]).*?\1/gi.test(s);
}
}
The pattern ([a-z]).*?\1
means:
([a-z])
- Capture a single character in range a-z;.*?\1
- Look for 0+ (lazy) characters up to a copy of this 1st captured character with a backreference.The match is global and case-insensitive. You can now call:
=INDEX(IF(A2:A="","",REGEXMATCH_JS(A2:A)))
EDIT 2:
For those that are benchmarking speed, I am not testing this myself but maybe this would speed things up:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)"))
Or:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")"))
Or:
=REGEXMATCH_JS(A2:INDEX(A:A,COUNTA(A:A)))
Respectively. Knowing there is a header in 1st row.
Upvotes: 6
Reputation: 27222
You can give a try by using this RegEx
: /(\w).*?\1/g
in the REGEXMATCH
function in google sheets.
Explanation :
(\w)
- matches word characters (a-z
, A-Z
, 0-9
, _
), If you are sure that input will contain only alphabets then you can also use ([a-zA-Z])
; then
.*?
- zero or more characters (the ? denotes as optional that means it can match for consecutive as well as non-consecutive); until
\1
- it finds a repeat of the first matched character.
Live Demo : regex101
Upvotes: 0
Reputation: 1
try:
=INDEX(IF(IFERROR(LEN(REGEXREPLACE(A1:A6, "[^"&C1:C6&"]", )), -1)>=
(LEN(SUBSTITUTE(C1:C6, "|", ))*2), "Match", "No Match"))
create a query heat map, filter it and vlookup back row position
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&""®EXEXTRACT(a, REPT("(.)", LEN(a)))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), ""), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
case insensitive would be:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&""&LOWER(REGEXEXTRACT(a, REPT("(.)", LEN(a))))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), ""), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
Upvotes: 4