Reputation: 4028
I have two strings. For the sake of the example they are set like this:
string1="test toast"
string2="test test"
What I want is to find the overlap starting at the beginning of the strings. With overlap I mean the string "test t" in my above example.
# So I look for the command
command "$string1" "$string2"
# that outputs:
"test t"
If the strings were string1="atest toast"; string2="test test"
they would have no overlap since the check starts from the beginning and the "a" at the start of string1
.
Upvotes: 40
Views: 12882
Reputation: 953
Another pure bash answer, works for N strings, outputs in $__
This cut with /
delimiter for path, this behavior can be remove by removing the if.
longest_common_prefix() {
__="$1"
shift 1
if (( $# == 0 )); then return; fi
local i=0
while (( $# > 0 )); do
for ((i=0; i<${#__}; i++)); do
if [[ "${__:i:1}" != "${1:i:1}" ]]; then
__=${__:0:i}
fi
done
if [[ "${1:i:1}" == "/" ]]; then
__="${__}/" # common path in arg list
fi
shift 1
done
__=${__%/*} # cut to the last /
}
Upvotes: 0
Reputation: 2474
I've generalized @ack's answer to accommodate embedded newlines.
I'll use the following array of strings as a test case:
a=(
$'/a\n/b/\nc d\n/\n\ne/f'
$'/a\n/b/\nc d\n/\ne/f'
$'/a\n/b/\nc d\n/\ne\n/f'
$'/a\n/b/\nc d\n/\nef'
)
By inspection we can see that the longest common prefix is
$'/a\n/b/\nc d\n/\n'
We can compute this and save the result into a variable with the following:
longest_common_prefix=$(
printf '%s\0' "${a[@]}" \
| sed -zE '$!{N;s/^(.*).*\x00\1.*$/\1\x00\1/;D;}' \
| tr \\0 x # replace trailing NUL with a dummy character ①
)
longest_common_prefix=${longest_common_prefix%x} # Remove the dummy character
echo "${longest_common_prefix@Q}" # ②
Result:
$'/a\n/b/\nc d\n/\n'
as expected. ✔️
I applied this technique in the context of path specs here: https://unix.stackexchange.com/a/639813
① To preserve any trailing newlines in this command substitution, we used the usual technique of appending a dummy character that is chopped off afterwards. We combined the removal of the trailing NUL with the addition of the dummy character (we chose x
) in one step using tr \\0 x
.
② The ${parameter@Q}
expansion results in "a string that is the value of parameter quoted in a format that can be reused as input." – bash reference manual. Requires bash 4.4+ (discussion). Otherwise, you can inspect the result using one of the following:
Upvotes: 0
Reputation:
If you have an option to install a python package, you can use this python utility
# install pythonp
pythonp -m pip install pythonp
echo -e "$string1\n$string2" | pythonp 'l1,l2=lines
res=itertools.takewhile(lambda a: a[0]==a[1], zip(l1,l2)); "".join(r[0] for r in res)'
Upvotes: 1
Reputation: 18119
Another python-based answer, this one based on the os.path
module's native commonprefix
function
#!/bin/bash
cat mystream | python -c $'import sys, os; sys.stdout.write(os.path.commonprefix(sys.stdin.readlines()) + b\'\\n\')'
Longform, that's
import sys
import os
sys.stdout.write(
os.path.commonprefix(sys.stdin.readlines()) + b'\n'
)
/!\ Note: the entire text of the stream will be loaded into memory as python string objects before being crunched with this method
If not buffering the entire stream in memory is a requirement, we can use the communicative property and to the prefix commonality check between every input pair
$!/bin/bash
cat mystream | python -c $'import sys\nimport os\nfor line in sys.stdin:\n\tif not os.path.isfile(line.strip()):\n\t\tcontinue\n\tsys.stdout.write(line)\n') | pythoin sys.stdin:\n\tprefix=os.path.commonprefix([line] + ([prefix] if prefix else []))\nsys.stdout.write(prefix)''
Long form
import sys
import os
prefix = None
for line in sys.stdin:
prefix=os.path.commonprefix(
[line] + ([prefix] if prev else [])
)
sys.stdout.write(prefix)
Both of these methods should be binary-safe, as in they don't need input/output data to be ascii or utf-8 encoded, if you run into encoding errors, python 3 renamed sys.stdin to sys.stdin.buffer and sys.stdout to sys.stdout.buffer, which will not automatically decode/encode input/output streams on use
Upvotes: 1
Reputation: 5359
Grep short variant (idea borrowed from sed one):
$ echo -e "String1\nString2" | grep -zoP '^(.*)(?=.*?\n\1)'
String
Assumes string have no new line character. But easy may be tuned to use any delimiter.
Update at 2016-10-24: On modern versions of grep you may receive complain grep: unescaped ^ or $ not supported with -Pz
, just use \A
instead of ^
:
$ echo -e "String1\nString2" | grep -zoP '\A(.*)(?=.*?\n\1)'
String
Upvotes: 7
Reputation: 1704
If using other languages, how about python:
cmnstr() { python -c "from difflib import SequenceMatcher
s1, s2 = ('''$1''', '''$2''')
m = SequenceMatcher(None,s1,s2).find_longest_match(0,len(s1),0,len(s2))
if m.a == 0: print(s1[m.a: m.a+m.size])"
}
$ cmnstr x y
$ cmnstr asdfas asd
asd
(h/t to @RickardSjogren's answer to stack overflow 18715688)
Upvotes: 0
Reputation: 149893
Yet another variant, using GNU grep:
$ string1="test toast"
$ string2="test test"
$ grep -zPo '(.*).*\n\K\1' <<< "$string1"$'\n'"$string2"
test t
Upvotes: 13
Reputation: 7580
An improved version of the sed example, this finds the common prefix of N strings (N>=0):
string1="test toast"
string2="test test"
string3="teaser"
{ echo "$string1"; echo "$string2"; echo "$string3"; } | sed -e 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'
If the strings are stored in an array, they can be piped to sed with printf:
strings=("test toast" "test test" "teaser")
printf "%s\n" "${strings[@]}" | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
You can also use a here-string:
strings=("test toast" "test test" "teaser")
oIFS=$IFS
IFS=$'\n'
<<<"${strings[*]}" sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
IFS=$oIFS
# for a local IFS:
(IFS=$'\n'; sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' <<<"${strings[*]}")
The here-string (as with all redirections) can go anywhere within a simple command.
Upvotes: 24
Reputation: 22274
Ok, in bash:
#!/bin/bash
s="$1"
t="$2"
l=1
while [ "${t#${s:0:$l}}" != "$t" ]
do
(( l = l + 1 ))
done
(( l = l - 1 ))
echo "${s:0:$l}"
It's the same algorithm as in other languages, but pure bash functionality. And, might I say, a bit uglier, too :-)
Upvotes: 7
Reputation: 21
Just yet another way using Bash only.
string1="test toast"
string2="test test"
len=${#string1}
for ((i=0; i<len; i++)); do
if [[ "${string1:i:1}" == "${string2:i:1}" ]]; then
continue
else
echo "${string1:0:i}"
i=len
fi
done
Upvotes: 2
Reputation: 107799
This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.
longest_common_prefix () {
local prefix= n
## Truncate the two strings to the minimum of their lengths
if [[ ${#1} -gt ${#2} ]]; then
set -- "${1:0:${#2}}" "$2"
else
set -- "$1" "${2:0:${#1}}"
fi
## Binary search for the first differing character, accumulating the common prefix
while [[ ${#1} -gt 1 ]]; do
n=$(((${#1}+1)/2))
if [[ ${1:0:$n} == ${2:0:$n} ]]; then
prefix=$prefix${1:0:$n}
set -- "${1:$n}" "${2:$n}"
else
set -- "${1:0:$n}" "${2:0:$n}"
fi
done
## Add the one remaining character, if common
if [[ $1 = $2 ]]; then prefix=$prefix$1; fi
printf %s "$prefix"
}
The standard toolbox includes cmp
to compare binary files. By default, it indicates the byte offset of the first differing bytes. There is a special case when one string is a prefix of the other: cmp
produces a different message on STDERR; an easy way to deal with this is to take whichever string is the shortest.
longest_common_prefix () {
local LC_ALL=C offset prefix
offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
if [[ -n $offset ]]; then
offset=${offset%,*}; offset=${offset##* }
prefix=${1:0:$((offset-1))}
else
if [[ ${#1} -lt ${#2} ]]; then
prefix=$1
else
prefix=$2
fi
fi
printf %s "$prefix"
}
Note that cmp
operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for examples locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.
longest_common_prefix () {
local offset prefix LC_CTYPE="${LC_ALL:=LC_CTYPE}"
offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32)
<(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
if [[ -n $offset ]]; then
offset=${offset%,*}; offset=${offset##* }
prefix=${1:0:$((offset/4-1))}
else
if [[ ${#1} -lt ${#2} ]]; then
prefix=$1
else
prefix=$2
fi
fi
printf %s "$prefix"
}
Upvotes: 11
Reputation: 16750
In sed, assuming the strings don't contain any newline characters:
string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
Upvotes: 36
Reputation: 96266
Man, this is tough. It's an extremely trivial task, yet I don't know how to do this with the shell :)
here is an ugly solution:
echo "$2" | awk 'BEGIN{FS=""} { n=0; while(n<=NF) {if ($n == substr(test,n,1)) {printf("%c",$n);} n++;} print ""}' test="$1"
Upvotes: 1
Reputation: 16750
Without sed, using the cmp utility to get the index of the 1st different character, and using process substitution to get the 2 strings to cmp:
string1="test toast"
string2="test test"
first_diff_char=$(cmp <( echo "$string1" ) <( echo "$string2" ) | cut -d " " -f 5 | tr -d ",")
echo ${string1:0:$((first_diff_char-1))}
Upvotes: 5
Reputation: 22274
It's probably simpler in another language. Here's my solution:
common_bit=$(perl -le '($s,$t)=@ARGV;for(split//,$s){last unless $t=~/^\Q$z$_/;$z.=$_}print $z' "$string1" "$string2")
If this weren't a one-liner, I'd use longer variable names, more whitespace, more braces, etc. I'm also sure there's a faster way, even in perl, but, again, it's a trade-off between speed and space: this uses less space on what is already a long one-liner.
Upvotes: 3