Reputation: 1556
I'm struggling to come up with a way to solve this problem (and how to come up with a name for this question on stackoverflow).
I need to somehow remove commonalities of strings preserving the remainder.
Given a list such as the following:
l = ('first',
'first.second',
'first.second.third',
'a.b.c',
'x.y.z',
'x.y')
I'm hoping to have a list output as:
('first',
'second',
'third',
'a.b.c',
'x.y',
'z' )
as you can see, when first.second
subtracts first
and remains second
. first.second
is subtracted from first.second.third
we get third
. a.b.c
doesn't have anything to subtract from itself, so it remains. x.y
is subtracted from x.y.z
and z
remains.
I'm thinking maybe a sort(key=len)
will be part of the solution, as well as some sort of recursion to end up with the string remainder. I'm hoping for a clean way to remove commonalities of each string in the list.
Upvotes: 1
Views: 128
Reputation: 2703
I've thought about this interesting problem some more and came up with a solution.
The problem is fundamentally tree structured, regardless of which tree-like technique you end up using:
I'll use the following list of expanded list of words since it points out some edge cases that make other solutions fail:
li = ['First',
'FirstSecond',
'FirstSecondFirst',
'FirstSecondFirstly',
'FirstSecondThird',
'FirstFoo',
'FirstBarracudaSphyraena',
'xyz',
'xy',
'12345',
'123',
'1',
'FireTruckGarage',
'FireTruck',
'Fire']
The trick is in noticing that, every time there's a potential lengthening of the prefix, we must save the previous prefix on the prefix stack (called prefixes
here), which contains all prefixes seen so far that haven't yet been exhausted. After we've done with some of the "deeper" words—in the sense of being nodes deeper down the tree—we may need to backtrack and use an old prefix for some shorter word yet again.
After encountering a word that is not prefixed by the current prefix, we must pop the stack of prefixes until we reach one that does prefix the word and continue from there.
def ambiguate(words):
output = {}
prefixes = ['']
prefix = prefixes[0]
for item in sorted(set(words)):
backtracked = False
while not item.startswith(prefix):
prefix = prefixes.pop()
backtracked = True
# If we have backtracked, we put back the current prefix onto the
# prefix stack since we may have to use it later on.
if backtracked:
prefixes.append(prefix)
# Construct new output and a new prefix and append it to the
# prefix stack
output[item] = item.replace(prefix, '', 1)
prefix = item
prefixes.append(prefix)
return output
Running:
print(ambiguate(li))
Yields:
{'1': '1',
'123': '23',
'12345': '45',
'Fire': 'Fire',
'FireTruck': 'Truck',
'FireTruckGarage': 'Garage',
'First': 'First',
'FirstBarracudaSphyraena': 'BarracudaSphyraena',
'FirstFoo': 'Foo',
'FirstSecond': 'FirstSecond',
'FirstSecondFirst': 'First',
'FirstSecondFirstly': 'ly',
'FirstSecondFourth': 'Fourth',
'FirstSecondThird': 'FirstSecondThird',
'a': 'a',
'abc': 'bc',
'xy': 'xy',
'xyz': 'z'}
Upvotes: 1
Reputation: 1556
I think maybe i was trying too get too fancy with list and dict comprehensions.
I believe i was able to solve the problem with the following:
Here's what i have so far:
li = [
'first',
'first.second',
'first.second.third',
'a',
'a.b',
'a.b.c',
'x.y.z',
'x.y']
li.sort()
prev_l = ''
output = {}
for l in li:
if l.find(prev_l) ==0:
output[l] = l.replace(prev_l,'')
else:
output[l] = l
prev_l = l
output:
{
'a' : 'a',
'a.b' : '.b',
'a.b.c' : '.c',
'first' : 'first',
'first.second' : '.second',
'first.second.third' : '.third',
'x.y' : 'x.y',
'x.y.z' : '.z'}
Upvotes: 0
Reputation: 4510
If the output order is not important, this solution will give you the expected values.
This is by implementation almost the same as @brianpck's answer. But I used sorting to deal with the "x.y.z"
problem. And some extra explanations.
l = ('first',
'first.second',
'first.second.third',
'a.b.c',
'x.y.z',
'x.y')
def clarify(lst):
# Sort the list to deal with the order problem.
# for ex. "x.y.z" deletes "x.y" if not sorted
lst = sorted(lst)
# Words encountered.
words = set()
new_list = []
for elem in lst:
# Divide elements using dots.
divided = elem.split(".")
# New element to be added to the result.
new_elem = []
# For each word in a divided element.
for word in divided:
# If a word in the element is not encountered before.
# Add it to new element
# Add it to words
if word not in words:
new_elem.append(word)
words.add(word)
# Join new element
new_list.append(".".join(new_elem))
return new_list
print clarify(l)
# Gives ['a.b.c', 'first', 'second', 'third', 'x.y', 'z']
# You can make this a tuple in the solution as in the input if you want.
Upvotes: 2
Reputation: 8254
I believe you need to define your problem a little more exactly before writing a solution. Here's what I infer from your test case:
The problem, though, is that the precedence is ambiguous. For example, in the following sequence:
lst = ('a.b.c',
'a.b.d')
Where do 'a' and 'b' belong? Your test case implies that a common member should go to the one with the least common members (so "z" doesn't stay with "x.y.z"), but there are plenty of edge cases that need to be considered. You will need to put your requirements in a more exact format.
Adopting the much simpler rule that a "member" should stay in the first place that it appears, the following function does the trick:
def remove_common(lst):
seen = set()
res = []
for i in lst:
members = i.split('.')
res.append('.'.join(w for w in members if w not in seen))
seen |= set(members)
return tuple(res)
This gives a result very close to yours:
>>> remove_common(l)
('first', 'second', 'third', 'a.b.c', 'x.y.z', '')
Upvotes: 2