Reputation: 2610

Filter list of strings, ignoring substrings of other items

How can I filter through a list which containing strings and substrings to return only the longest strings. (If any item in the list is a substring of another, return only the longer string.)

I have this function. Is there a faster way?

def filterSublist(lst):
    uniq = lst
    for elem in lst:
        uniq = [x for x in uniq if (x == elem) or (x not in elem)]
    return uniq

lst = ["a", "abc", "b", "d", "xy", "xyz"]
print filterSublist(lst)

> ['abc', 'd', 'xyz']
> Function time: 0.000011

Upvotes: 18

Answers (4)

bcorso

Reputation: 47128

O(n) solution:

import collections
def filterSublist1(words):
    longest = collections.defaultdict(str)
    for word in words:                                 # O(n)
        for i in range(1, len(word)+1):                # O(k)
            for j in range(len(word) - i + 1):         # O(k)
                subword = word[j:j+i]                  # O(1)
                if len(longest[subword]) < len(word):  # O(1)
                    longest[subword] = word            # O(1)

    return list(set(longest.values()))                 # O(n)
                                                       # Total: O(nk²)

Explanation:

To understand the time complexity, I've given an upper bound for the complexity at each line in the above code. The bottleneck occurs at the for loops where because they are nested the overall time complexity will be O(nk²), where n is the number of words in the list and k is the length of the average/longest word (e.g. in the above code n = 6 and k = 3). However, assuming that the words are not arbitrarily long strings, we can bound k by some small value -- e.g. k=5 if we consider the average word length in the english dictionary. Therefore, because k is bounded by a value, it is not included in the time complexity, and we get the running time to be O(n). Of course, the size of k will add to the constant factor, especially if k is not smaller than n. For the English dictionary, this would mean that when n >> k² = 25 you would start to see better results than with other algorithms (see the graph below).

The algorithm works by mapping each unique subword to the longest string that contains that subword. So for example, 'xyz' would lookup longest['x'], longest['y'], longest['z'], longest['xy'], longest['yz'], and longest['xyz'] and set them all equal to 'xyz'. When this is done for every word in the list, longest.keys() will be a set of all unique subwords of all words, and longest.values() will be a list of only the words that are not subwords of other words. Finally, longest.values() can contain duplicates, so they are removed by wrapping in set.

Visualizing Time Complexity

Below I've tested the algorithm above (along with your original algorithm) to show that this solution is indeed O(n) when using English words. I've tested this using timeit on a list of up to 69000 English words. I've labeled this algorithm filterSublist1 and your original algorithm filterSublist2.

enter image description here

The plot is shown on a log-log axis, meaning that the time complexity of the algorithm for this input set is given by the slope of the lines. For filterSublist1 the slope is ~1 meaning it is O(n¹) and for filterSublist2 the slope is ~2 meaning it is O(n²).

NOTE: I'm missing the filterSublist2() time for 69000 words because I didnt feel like waiting.

Upvotes: 1

Niklas B.

Reputation: 95318

A simple quadratic time solution would be this:

res = []
n = len(lst)
for i in xrange(n):
    if not any(i != j and lst[i] in lst[j] for j in xrange(n)):
        res.append(lst[i])

But we can do much better:

Let $ be a character that does not appear in any of your strings and has a lower value than all your actual characters.

Let S be the concatenation of all your strings, with $ in between. In your example, S = a$abc$b$d$xy$xyz.

You can build the suffix array of S in linear time. You can also use a much simpler O(n log^2 n) construction algorithm that I described in another answer.

Now for every string in lst, check if it occurs in the suffix array exactly once. You can do two binary searches to find the locations of the substring, they form a contiguous range in the suffix array. If the string occurs more than once, you remove it.

With LCP information precomputed, this can be done in linear time as well.

Example O(n log^2 n) implementation, adapted from my suffix array answer:

def findFirst(lo, hi, pred):
  """ Find the first i in range(lo, hi) with pred(i) == True.
  Requires pred to be a monotone. If there is no such i, return hi. """
  while lo < hi:
    mid = (lo + hi) // 2
    if pred(mid): hi = mid;
    else: lo = mid + 1
  return lo

# uses the algorithm described in https://stackoverflow.com/a/21342145/916657
class SuffixArray(object):
  def __init__(self, s):
    """ build the suffix array of s in O(n log^2 n) where n = len(s). """
    n = len(s)
    log2 = 0
    while (1<<log2) < n:
      log2 += 1
    rank = [[0]*n for _ in xrange(log2)]
    for i in xrange(n):
      rank[0][i] = s[i]
    L = [0]*n
    for step in xrange(1, log2):
      length = 1 << step
      for i in xrange(n):
        L[i] = (rank[step - 1][i],
                rank[step - 1][i + length // 2] if i + length // 2 < n else -1,
                i)
      L.sort()
      for i in xrange(n):
        rank[step][L[i][2]] = \
          rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i
    self.log2 = log2
    self.rank = rank
    self.sa = [l[2] for l in L]
    self.s = s
    self.rev = [0]*n
    for i, j in enumerate(self.sa):
      self.rev[j] = i

  def lcp(self, x, y):
    """ compute the longest common prefix of s[x:] and s[y:] in O(log n). """
    n = len(self.s)
    if x == y:
      return n - x
    ret = 0
    for k in xrange(self.log2 - 1, -1, -1):
      if x >= n or y >= n:
        break
      if self.rank[k][x] == self.rank[k][y]:
        x += 1<<k
        y += 1<<k
        ret += 1<<k
    return ret

  def compareSubstrings(self, x, lx, y, ly):
    """ compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """
    l = min((self.lcp(x, y), lx, ly))
    if l == lx == ly: return 0
    if l == lx: return -1
    if l == ly: return 1
    return cmp(self.s[x + l], self.s[y + l])

  def count(self, x, l):
    """ count occurences of substring s[x:x+l] in O(log n). """
    n = len(self.s)
    cs = self.compareSubstrings
    lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0)
    hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0)
    return hi - lo

  def debug(self):
    """ print the suffix array for debugging purposes. """
    for i, j in enumerate(self.sa):
      print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a"

def filterSublist(lst):
  splitter = "\x00"
  s = splitter.join(lst) + splitter
  sa = SuffixArray(s)
  res = []
  offset = 0
  for x in lst:
    if sa.count(offset, len(x)) == 1:
      res.append(x)
    offset += len(x) + 1
  return res

However, the interpretation overhead likely causes this to be slower than the O(n^2) approaches unless S is really large (in the order of 10^5 characters or more).

Upvotes: 8

gog

Reputation: 11347

Does the order matter? If not,

a = ["a", "abc", "b", "d", "xy", "xyz"]

a.sort(key=len, reverse=True)
n = len(a)

for i in range(n - 1):
    if a[i]:
        for j in range(i + 1, n):
            if a[j] in a[i]:
                a[j] = ''


print filter(len, a)  # ['abc', 'xyz', 'd']

Not very efficient, but simple.

Upvotes: 2

Saullo G. P. Castro

Reputation: 58955

You can build your problem in matrix form as:

import numpy as np

lst = np.array(["a", "abc", "b", "d", "xy", "xyz"], object)
out = np.zeros((len(lst), len(lst)), dtype=int)
for i in range(len(lst)):
    for j in range(len(lst)):
        out[i,j] = lst[i] in lst[j]

from where you get out as:

array([[1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1]])

then, the answer will be the indices of lst where the sum of òut along the columns is 1 (the string is only in itself):

lst[out.sum(axis=1)==1]
#array(['abc', 'd', 'xyz'], dtype=object)

EDIT: You can do it much more efficiently with:

from numpy.lib.stride_tricks import as_strided
from string import find

size = len(lst)
a = np.char.array(lst)
a2 = np.char.array(as_strided(a, shape=(size, size),
                                 strides=(a.strides[0], 0)))

out = find(a2, a)
out[out==0] = 1
out[out==-1] = 0
print a[out.sum(axis=0)==1]
# chararray(['abc', 'd', 'xyz'], dtype='|S3')

a[out.sum(axis=0)==1]