joefromct
joefromct

Reputation: 1556

list of strings - remove commonalities of common strings

I'm struggling to come up with a way to solve this problem (and how to come up with a name for this question on stackoverflow).

I need to somehow remove commonalities of strings preserving the remainder.

Given a list such as the following:

l = ('first',
     'first.second',
     'first.second.third',
     'a.b.c',
     'x.y.z',
     'x.y') 

I'm hoping to have a list output as:

('first',
 'second',
 'third', 
 'a.b.c',
 'x.y',
 'z' ) 

as you can see, when first.second subtracts first and remains second. first.second is subtracted from first.second.third we get third. a.b.c doesn't have anything to subtract from itself, so it remains. x.y is subtracted from x.y.z and z remains.

I'm thinking maybe a sort(key=len) will be part of the solution, as well as some sort of recursion to end up with the string remainder. I'm hoping for a clean way to remove commonalities of each string in the list.

Upvotes: 1

Views: 128

Answers (4)

dkasak
dkasak

Reputation: 2703

I've thought about this interesting problem some more and came up with a solution.

The problem is fundamentally tree structured, regardless of which tree-like technique you end up using:

  • an actual tree datatype (which is how I initially solved it, but it was much more verbose)
  • recursion
  • simulating recursion using stacks (which is what I ultimately ended doing).

I'll use the following list of expanded list of words since it points out some edge cases that make other solutions fail:

li = ['First',
      'FirstSecond',
      'FirstSecondFirst',
      'FirstSecondFirstly',
      'FirstSecondThird',
      'FirstFoo',
      'FirstBarracudaSphyraena',
      'xyz',
      'xy',
      '12345',
      '123',
      '1',
      'FireTruckGarage',
      'FireTruck',
      'Fire']

The trick is in noticing that, every time there's a potential lengthening of the prefix, we must save the previous prefix on the prefix stack (called prefixes here), which contains all prefixes seen so far that haven't yet been exhausted. After we've done with some of the "deeper" words—in the sense of being nodes deeper down the tree—we may need to backtrack and use an old prefix for some shorter word yet again.

After encountering a word that is not prefixed by the current prefix, we must pop the stack of prefixes until we reach one that does prefix the word and continue from there.

def ambiguate(words):
    output = {}
    prefixes = ['']
    prefix = prefixes[0]

    for item in sorted(set(words)):
        backtracked = False

        while not item.startswith(prefix):
            prefix = prefixes.pop()
            backtracked = True

        # If we have backtracked, we put back the current prefix onto the
        # prefix stack since we may have to use it later on.
        if backtracked:
            prefixes.append(prefix)

        # Construct new output and a new prefix and append it to the
        # prefix stack
        output[item] = item.replace(prefix, '', 1)
        prefix = item
        prefixes.append(prefix)

    return output

Running:

print(ambiguate(li))

Yields:

{'1': '1',                                       
 '123': '23',                                    
 '12345': '45',                                  
 'Fire': 'Fire',                                 
 'FireTruck': 'Truck',                           
 'FireTruckGarage': 'Garage',                    
 'First': 'First',                               
 'FirstBarracudaSphyraena': 'BarracudaSphyraena',
 'FirstFoo': 'Foo',                              
 'FirstSecond': 'FirstSecond',                   
 'FirstSecondFirst': 'First',                    
 'FirstSecondFirstly': 'ly',                     
 'FirstSecondFourth': 'Fourth',                  
 'FirstSecondThird': 'FirstSecondThird',         
 'a': 'a',                                       
 'abc': 'bc',                                    
 'xy': 'xy',                                     
 'xyz': 'z'}

Upvotes: 1

joefromct
joefromct

Reputation: 1556

I think maybe i was trying too get too fancy with list and dict comprehensions.

I believe i was able to solve the problem with the following:

  1. sort in-bound list
  2. use a variable for "previous list element"
  3. loop, and output the the current element replacing previous element (if it is found)

Here's what i have so far:

li = [
     'first',
     'first.second',
     'first.second.third',
     'a',
     'a.b',
     'a.b.c',
     'x.y.z',
     'x.y'] 

li.sort() 

prev_l = ''
output = {} 
for l in li:
    if l.find(prev_l) ==0:
        output[l] = l.replace(prev_l,'')
    else: 
        output[l] = l
    prev_l = l

output:

{
'a'                   : 'a',
'a.b'                : '.b',
'a.b.c'              : '.c',
'first'              : 'first',
'first.second'       : '.second',
'first.second.third' : '.third',
'x.y'                : 'x.y',
'x.y.z'              : '.z'}

Upvotes: 0

Rockybilly
Rockybilly

Reputation: 4510

If the output order is not important, this solution will give you the expected values.

This is by implementation almost the same as @brianpck's answer. But I used sorting to deal with the "x.y.z" problem. And some extra explanations.

l = ('first',
     'first.second',
     'first.second.third',
     'a.b.c',
     'x.y.z',
     'x.y')

def clarify(lst):

    # Sort the list to deal with the order problem.
    # for ex. "x.y.z" deletes "x.y" if not sorted
    lst = sorted(lst)

    # Words encountered.
    words = set()

    new_list = []

    for elem in lst:

        # Divide elements using dots.
        divided = elem.split(".")

        # New element to be added to the result.
        new_elem = []

        # For each word in a divided element.
        for word in divided:
            # If a word in the element is not encountered before.
            # Add it to new element
            # Add it to words
            if word not in words:
                new_elem.append(word)
                words.add(word)
        # Join new element
        new_list.append(".".join(new_elem))

    return new_list

print clarify(l)

# Gives ['a.b.c', 'first', 'second', 'third', 'x.y', 'z']
# You can make this a tuple in the solution as in the input if you want.

Upvotes: 2

brianpck
brianpck

Reputation: 8254

I believe you need to define your problem a little more exactly before writing a solution. Here's what I infer from your test case:

  1. "Members" are delimited by periods: the same "member" can't appear in two tuple items.
  2. Each member should only appear once.

The problem, though, is that the precedence is ambiguous. For example, in the following sequence:

lst = ('a.b.c',
       'a.b.d')

Where do 'a' and 'b' belong? Your test case implies that a common member should go to the one with the least common members (so "z" doesn't stay with "x.y.z"), but there are plenty of edge cases that need to be considered. You will need to put your requirements in a more exact format.

Adopting the much simpler rule that a "member" should stay in the first place that it appears, the following function does the trick:

def remove_common(lst):
    seen = set()
    res = []
    for i in lst:
        members = i.split('.')
        res.append('.'.join(w for w in members if w not in seen))
        seen |= set(members)
    return tuple(res)

This gives a result very close to yours:

>>> remove_common(l)
('first', 'second', 'third', 'a.b.c', 'x.y.z', '')

Upvotes: 2

Related Questions