user2494770

Does Go have a case-insensitive string contains() function?

I would like to determine whether stringB is a case-insensitive substring of stringA. Looking through Go's strings package, the closest I can get is strings.Contains(strings.ToLower(stringA), strings.ToLower(stringB)). Is there a less wordy alternative that I'm not seeing?

Upvotes: 30

Views: 28971

Answers (5)

Charlie Vieth

Reputation: 21

(Disclosure: I am the owner of the repo)

github.com/charlievieth/strcase is a drop-in replacement for the strings package that uses Unicode simple-folding for all case comparisons (the Contains example is relevant here).

Performance

strcase will be at least 3-4x faster than using ToUpper/ToLower and often much faster than regexp (25x on average).

Below is Zyl's benchmark with the strcase package added. Here strcase is about 4x faster than ToLower because the text is mostly ASCII; if the text contained more multi-byte UTF-8 sequences, the advantage of strcase would be significantly larger.

package main

import (
    "regexp"
    "strings"
    "testing"

    "github.com/charlievieth/strcase"
)

const checkStringLen38 = "Hello RiCHard McCliNTock. How are you?"
const checkStringLen3091 = `What is Lorem Ipsum?

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Why do we use it?

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).

Where does it come from?

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. RiCHard McCliNTock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.

The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.
Where can I get some?

There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.`
const searchQuery = "richard mcclintock"

// Only ASCII matching here so we can be simple
func setBytes(b *testing.B, s, sep string) {
    n := strings.Index(strings.ToLower(s), strings.ToLower(sep))
    if n < 0 {
        b.Fatalf("strings.Index(%q, %q) = %d", strings.ToLower(s), strings.ToLower(sep), n)
    }
    b.SetBytes(int64(n + len(sep)))
}

func BenchmarkContainsLowerLowerShort(b *testing.B) {
    setBytes(b, checkStringLen38, searchQuery)
    for n := 0; n < b.N; n++ {
        strings.Contains(strings.ToLower(checkStringLen38), strings.ToLower(searchQuery))
    }
}

func BenchmarkContainsLowerLowerLong(b *testing.B) {
    setBytes(b, checkStringLen3091, searchQuery)
    for n := 0; n < b.N; n++ {
        strings.Contains(strings.ToLower(checkStringLen3091), strings.ToLower(searchQuery))
    }
}

func BenchmarkRegexpShort(b *testing.B) {
    setBytes(b, checkStringLen38, searchQuery)
    for n := 0; n < b.N; n++ {
        regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery)).MatchString(checkStringLen38)
    }
}

func BenchmarkRegexpLong(b *testing.B) {
    setBytes(b, checkStringLen3091, searchQuery)
    for n := 0; n < b.N; n++ {
        regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery)).MatchString(checkStringLen3091)
    }
}

func BenchmarkRegexpShortPrebuilt(b *testing.B) {
    setBytes(b, checkStringLen38, searchQuery)
    prebuiltRegExp := regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery))
    for n := 0; n < b.N; n++ {
        prebuiltRegExp.MatchString(checkStringLen38)
    }
}

func BenchmarkRegexpLongPrebuilt(b *testing.B) {
    setBytes(b, checkStringLen3091, searchQuery)
    prebuiltRegExp := regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery))
    for n := 0; n < b.N; n++ {
        prebuiltRegExp.MatchString(checkStringLen3091)
    }
}

func BenchmarkStrcaseShort(b *testing.B) {
    setBytes(b, checkStringLen38, searchQuery)
    for n := 0; n < b.N; n++ {
        strcase.Contains(checkStringLen38, searchQuery)
    }
}

func BenchmarkStrcaseLong(b *testing.B) {
    setBytes(b, checkStringLen3091, searchQuery)
    for n := 0; n < b.N; n++ {
        strcase.Contains(checkStringLen3091, searchQuery)
    }
}

Results

goos: darwin
goarch: arm64
pkg: repl/repl-33
cpu: Apple M1 Max
BenchmarkContainsLowerLowerShort-10     10275684           114.9 ns/op   208.90 MB/s          48 B/op          1 allocs/op
BenchmarkContainsLowerLowerLong-10        269336          4444 ns/op     323.80 MB/s        3200 B/op          1 allocs/op
BenchmarkRegexpShort-10                   599764          1758 ns/op      13.66 MB/s        3732 B/op         20 allocs/op
BenchmarkRegexpLong-10                     57133         20870 ns/op      68.95 MB/s        3709 B/op         20 allocs/op
BenchmarkRegexpShortPrebuilt-10          5817914           208.4 ns/op   115.17 MB/s           0 B/op          0 allocs/op
BenchmarkRegexpLongPrebuilt-10             62002         19207 ns/op      74.92 MB/s           0 B/op          0 allocs/op
BenchmarkStrcaseShort-10                35489919            32.73 ns/op  733.25 MB/s           0 B/op          0 allocs/op
BenchmarkStrcaseLong-10                  1000000          1178 ns/op    1221.54 MB/s           0 B/op          0 allocs/op
PASS
ok      repl/repl-33    10.230s

Upvotes: 2

Zyl

Reputation: 3310

Expanding on Zombo's answer with a benchmark comparing the use of strings.Contains(strings.ToLower(s), strings.ToLower(substr)) to the use of regexp.MustCompile("(?i)" + regexp.QuoteMeta(substr)).MatchString(s).

Code

package main

import (
    "regexp"
    "strings"
    "testing"
)

const checkStringLen38 = "Hello RiCHard McCliNTock. How are you?"
const checkStringLen3091 = `What is Lorem Ipsum?

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
Why do we use it?

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).

Where does it come from?

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. RiCHard McCliNTock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.

The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.
Where can I get some?

There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.`
const searchQuery = "richard mcclintock"

func BenchmarkContainsLowerLowerShort(b *testing.B) {
    for n := 0; n < b.N; n++ {
        strings.Contains(strings.ToLower(checkStringLen38), strings.ToLower(searchQuery))
    }
}

func BenchmarkContainsLowerLowerLong(b *testing.B) {
    for n := 0; n < b.N; n++ {
        strings.Contains(strings.ToLower(checkStringLen3091), strings.ToLower(searchQuery))
    }
}

func BenchmarkRegexpShort(b *testing.B) {
    for n := 0; n < b.N; n++ {
        regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery)).MatchString(checkStringLen38)
    }
}

func BenchmarkRegexpLong(b *testing.B) {
    for n := 0; n < b.N; n++ {
        regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery)).MatchString(checkStringLen3091)
    }
}

func BenchmarkRegexpShortPrebuilt(b *testing.B) {
    prebuiltRegExp := regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery))
    for n := 0; n < b.N; n++ {
        prebuiltRegExp.MatchString(checkStringLen38)
    }
}

func BenchmarkRegexpLongPrebuilt(b *testing.B) {
    prebuiltRegExp := regexp.MustCompile("(?i)" + regexp.QuoteMeta(searchQuery))
    for n := 0; n < b.N; n++ {
        prebuiltRegExp.MatchString(checkStringLen3091)
    }
}

Results

>go test -bench=. ./...
goos: windows
goarch: amd64
cpu: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
BenchmarkContainsLowerLowerShort-8       9147040               130.3 ns/op
BenchmarkContainsLowerLowerLong-8         158318              7594 ns/op
BenchmarkRegexpShort-8                    364604              3262 ns/op
BenchmarkRegexpLong-8                      40394             29851 ns/op
BenchmarkRegexpShortPrebuilt-8           3741936               328.8 ns/op
BenchmarkRegexpLongPrebuilt-8              44394             27264 ns/op

Interpretation of Results

When searching only short strings, regexp.MustCompile("(?i)" + regexp.QuoteMeta(substr)).MatchString(s) benefits greatly (one order of magnitude) from building the *regexp.Regexp only once. Even then, it takes about three times as long to execute as strings.Contains(strings.ToLower(s), strings.ToLower(substr)), for long as well as short input strings, and we did not even check how much faster the ToLower() variant would be if the query string were already lower-cased. Moreover, the prebuilt variant is only available when substr is always the same; otherwise, building the regular expression only once is not an option.
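
The case of an already lower-cased query was not benchmarked above, but it can be sketched as follows. newLowerMatcher is a hypothetical helper name, not part of the strings package: it lower-cases the query once and returns a closure, so that only the input string is lower-cased on each call. No timing is claimed for it here.

```go
package main

import (
	"fmt"
	"strings"
)

// newLowerMatcher lower-cases query once and returns a function that
// checks input strings against it. Hypothetical helper for illustration.
func newLowerMatcher(query string) func(s string) bool {
	lowered := strings.ToLower(query)
	return func(s string) bool {
		// Only s is lower-cased per call; the query work is already done.
		return strings.Contains(strings.ToLower(s), lowered)
	}
}

func main() {
	match := newLowerMatcher("Richard McClintock")
	fmt.Println(match("Hello RiCHard McCliNTock. How are you?"))
	fmt.Println(match("Hello World"))
}
```

This mirrors the prebuilt regexp variant: it only helps when the same query is checked against many inputs.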

tl;dr

There is nothing to be gained by using regexp.MustCompile("(?i)" + regexp.QuoteMeta(substr)).MatchString(s) over strings.Contains(strings.ToLower(s), strings.ToLower(substr)).

Upvotes: 2

payneio

Reputation: 453

If it's just the wordiness that you dislike, you could try making your code formatting cleaner, e.g.:

strings.Contains(
    strings.ToLower(stringA),
    strings.ToLower(stringB),
)

Or hiding it in a function in your own utils (or whatever) package:

package utils

import "strings"

func ContainsI(a string, b string) bool {
    return strings.Contains(
        strings.ToLower(a),
        strings.ToLower(b),
    )
}

Upvotes: 15

Tinkerer

Reputation: 1068

I don't see one in the standard packages. How about this?

package main

import (
    "fmt"
    "strings"
)

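// strcasestr reports whether a occurs in b, ignoring case.
// It assumes the first byte of a is an ASCII character.
func strcasestr(a, b string) bool {
    d := len(a)
    if d == 0 {
        return true
    }
    xx := strings.ToLower(a[0:1]) + strings.ToUpper(a[0:1])
    for i := 0; i <= len(b)-d; i++ {
        // Search from offset i; searching from the start of b on every
        // pass would loop forever after a failed candidate.
        j := strings.IndexAny(b[i:], xx)
        if j == -1 {
            break
        }
        i += j
        if i+d > len(b) {
            break
        }
        // For d == 1 both slices are empty, so this also returns true.
        if strings.EqualFold(a[1:], b[i+1:i+d]) {
            return true
        }
    }
    return false
}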

func main() {
    examples := []struct {
        a, b string
    }{
        {"APP", "apple pie"},
        {"Read", "banana bread"},
        {"ISP", "cherry crisp"},
        {"ago", "dragonfruit tart"},
        {"INC", "elderberry wine"},
        {"M", "Feijoa jam"},
    }
    for i, e := range examples {
        fmt.Println(i, ":", e.a, " in ", e.b, "? ", strcasestr(e.a, e.b))
    }
}

Upvotes: 0

Zombo

Reputation: 1

Another option:

package main

import "regexp"

func main() {
    b := regexp.MustCompile("(?i)we").MatchString("West East")
    println(b)
}

https://golang.org/pkg/regexp/syntax

Upvotes: 1
