con-f-use
con-f-use

Reputation: 4028

Longest common prefix of two strings in bash

I have two strings. For the sake of the example they are set like this:

string1="test toast"
string2="test test"

What I want is to find the overlap starting at the beginning of the strings. With overlap I mean the string "test t" in my above example.

# So I look for the command 
command "$string1" "$string2"
# that outputs:
"test t"

If the strings were string1="atest toast"; string2="test test" they would have no overlap since the check starts from the beginning and the "a" at the start of string1 .

Upvotes: 40

Views: 12882

Answers (15)

izissise
izissise

Reputation: 953

Another pure bash answer, works for N strings, outputs in $__ This cut with / delimiter for path, this behavior can be remove by removing the if.

longest_common_prefix() {
    __="$1"
    shift 1
    if (( $# == 0 )); then return; fi
    local i=0
    while (( $# > 0 )); do
        for ((i=0; i<${#__}; i++)); do
            if [[ "${__:i:1}" != "${1:i:1}" ]]; then
                __=${__:0:i}
            fi
        done
        if [[ "${1:i:1}" == "/" ]]; then
            __="${__}/" # common path in arg list
        fi
        shift 1
    done
    __=${__%/*} # cut to the last /
}

Upvotes: 0

Robin A. Meade
Robin A. Meade

Reputation: 2474

I've generalized @ack's answer to accommodate embedded newlines.

I'll use the following array of strings as a test case:

a=(
  $'/a\n/b/\nc  d\n/\n\ne/f'
  $'/a\n/b/\nc  d\n/\ne/f'
  $'/a\n/b/\nc  d\n/\ne\n/f'
  $'/a\n/b/\nc  d\n/\nef'
)

By inspection we can see that the longest common prefix is

$'/a\n/b/\nc  d\n/\n'

We can compute this and save the result into a variable with the following:

longest_common_prefix=$(
  printf '%s\0' "${a[@]}" \
  | sed -zE '$!{N;s/^(.*).*\x00\1.*$/\1\x00\1/;D;}' \
  | tr \\0 x # replace trailing NUL with a dummy character ①
)
longest_common_prefix=${longest_common_prefix%x} # Remove the dummy character
echo "${longest_common_prefix@Q}" # ②

Result:

$'/a\n/b/\nc  d\n/\n'

as expected. ✔️

I applied this technique in the context of path specs here: https://unix.stackexchange.com/a/639813


① To preserve any trailing newlines in this command substitution, we used the usual technique of appending a dummy character that is chopped off afterwards. We combined the removal of the trailing NUL with the addition of the dummy character (we chose x) in one step using tr \\0 x.

② The ${parameter@Q} expansion results in "a string that is the value of parameter quoted in a format that can be reused as input." – bash reference manual. Requires bash 4.4+ (discussion). Otherwise, you can inspect the result using one of the following:

Upvotes: 0

user5538922
user5538922

Reputation:

If you have an option to install a python package, you can use this python utility

# install pythonp
pythonp -m pip install pythonp

echo -e "$string1\n$string2" | pythonp 'l1,l2=lines
res=itertools.takewhile(lambda a: a[0]==a[1], zip(l1,l2)); "".join(r[0] for r in res)'

Upvotes: 1

ThorSummoner
ThorSummoner

Reputation: 18119

Another python-based answer, this one based on the os.path module's native commonprefix function

#!/bin/bash
cat mystream | python -c $'import sys, os; sys.stdout.write(os.path.commonprefix(sys.stdin.readlines()) + b\'\\n\')'

Longform, that's

import sys
import os
sys.stdout.write(
    os.path.commonprefix(sys.stdin.readlines()) + b'\n'
)

/!\ Note: the entire text of the stream will be loaded into memory as python string objects before being crunched with this method


If not buffering the entire stream in memory is a requirement, we can use the communicative property and to the prefix commonality check between every input pair

$!/bin/bash
cat mystream | python -c $'import sys\nimport os\nfor line in sys.stdin:\n\tif not os.path.isfile(line.strip()):\n\t\tcontinue\n\tsys.stdout.write(line)\n') | pythoin sys.stdin:\n\tprefix=os.path.commonprefix([line] + ([prefix] if prefix else []))\nsys.stdout.write(prefix)''

Long form

import sys
import os
prefix = None
for line in sys.stdin:
    prefix=os.path.commonprefix(
        [line] + ([prefix] if prev else [])
    )
sys.stdout.write(prefix)

Both of these methods should be binary-safe, as in they don't need input/output data to be ascii or utf-8 encoded, if you run into encoding errors, python 3 renamed sys.stdin to sys.stdin.buffer and sys.stdout to sys.stdout.buffer, which will not automatically decode/encode input/output streams on use

Upvotes: 1

Hubbitus
Hubbitus

Reputation: 5359

Grep short variant (idea borrowed from sed one):

$ echo -e "String1\nString2" | grep -zoP '^(.*)(?=.*?\n\1)'
String

Assumes string have no new line character. But easy may be tuned to use any delimiter.

Update at 2016-10-24: On modern versions of grep you may receive complain grep: unescaped ^ or $ not supported with -Pz, just use \A instead of ^:

$ echo -e "String1\nString2" | grep -zoP '\A(.*)(?=.*?\n\1)'
String

Upvotes: 7

qneill
qneill

Reputation: 1704

If using other languages, how about python:

cmnstr() { python -c "from difflib import SequenceMatcher
s1, s2 = ('''$1''', '''$2''')
m = SequenceMatcher(None,s1,s2).find_longest_match(0,len(s1),0,len(s2))
if m.a == 0: print(s1[m.a: m.a+m.size])"
}
$ cmnstr x y
$ cmnstr asdfas asd
asd

(h/t to @RickardSjogren's answer to stack overflow 18715688)

Upvotes: 0

Eugene Yarmash
Eugene Yarmash

Reputation: 149893

Yet another variant, using GNU grep:

$ string1="test toast"
$ string2="test test"
$ grep -zPo '(.*).*\n\K\1' <<< "$string1"$'\n'"$string2"
test t

Upvotes: 13

ack
ack

Reputation: 7580

An improved version of the sed example, this finds the common prefix of N strings (N>=0):

string1="test toast"
string2="test test"
string3="teaser"
{ echo "$string1"; echo "$string2"; echo "$string3"; } | sed -e 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'

If the strings are stored in an array, they can be piped to sed with printf:

strings=("test toast" "test test" "teaser")
printf "%s\n" "${strings[@]}" | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'

You can also use a here-string:

strings=("test toast" "test test" "teaser")
oIFS=$IFS
IFS=$'\n'
<<<"${strings[*]}" sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
IFS=$oIFS
# for a local IFS:
(IFS=$'\n'; sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' <<<"${strings[*]}")

The here-string (as with all redirections) can go anywhere within a simple command.

Upvotes: 24

Tanktalus
Tanktalus

Reputation: 22274

Ok, in bash:

#!/bin/bash

s="$1"
t="$2"
l=1

while [ "${t#${s:0:$l}}" != "$t" ]
do
  (( l = l + 1 ))
done
(( l = l - 1 ))

echo "${s:0:$l}"

It's the same algorithm as in other languages, but pure bash functionality. And, might I say, a bit uglier, too :-)

Upvotes: 7

chad
chad

Reputation: 21

Just yet another way using Bash only.

string1="test toast"
string2="test test"
len=${#string1}

for ((i=0; i<len; i++)); do
   if [[ "${string1:i:1}" == "${string2:i:1}" ]]; then
      continue
   else
      echo "${string1:0:i}"                       
      i=len
   fi
done

Upvotes: 2

This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.

longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    if [[ ${1:0:$n} == ${2:0:$n} ]]; then
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 = $2 ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}

The standard toolbox includes cmp to compare binary files. By default, it indicates the byte offset of the first differing bytes. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; an easy way to deal with this is to take whichever string is the shortest.

longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}

Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for examples locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.

longest_common_prefix () {
  local offset prefix LC_CTYPE="${LC_ALL:=LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32)
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset/4-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}

Upvotes: 11

jfg956
jfg956

Reputation: 16750

In sed, assuming the strings don't contain any newline characters:

string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'

Upvotes: 36

Karoly Horvath
Karoly Horvath

Reputation: 96266

Man, this is tough. It's an extremely trivial task, yet I don't know how to do this with the shell :)

here is an ugly solution:

echo "$2" | awk 'BEGIN{FS=""} { n=0; while(n<=NF) {if ($n == substr(test,n,1)) {printf("%c",$n);} n++;} print ""}' test="$1"

Upvotes: 1

jfg956
jfg956

Reputation: 16750

Without sed, using the cmp utility to get the index of the 1st different character, and using process substitution to get the 2 strings to cmp:

string1="test toast"
string2="test test"
first_diff_char=$(cmp <( echo "$string1" ) <( echo "$string2" ) | cut -d " " -f 5 | tr -d ",")
echo ${string1:0:$((first_diff_char-1))}

Upvotes: 5

Tanktalus
Tanktalus

Reputation: 22274

It's probably simpler in another language. Here's my solution:

common_bit=$(perl -le '($s,$t)=@ARGV;for(split//,$s){last unless $t=~/^\Q$z$_/;$z.=$_}print $z' "$string1" "$string2")

If this weren't a one-liner, I'd use longer variable names, more whitespace, more braces, etc. I'm also sure there's a faster way, even in perl, but, again, it's a trade-off between speed and space: this uses less space on what is already a long one-liner.

Upvotes: 3

Related Questions