pomegranate
pomegranate

Reputation: 765

Identifying elements that occur within strings in a list

Let's say I have a list, with pipes denoting a hierarchy.

l = ['animals|fish|salmon', 'fish|salmon', 'fish', 'animals', 'furniture',
    'animals|big cats|lions', 'animals|birds|fisher bird']

I want to return a list with all the entries that are not redundant:

l = ['animals|fish|salmon', 'animals|big cats|lions',
     'animals|birds|fisher bird', 'furniture']

I tried various variations of sorting the list by length and then using the "any" keyword to find elements that were contained within one of these entries.

l2 = paths.sort(key=len)
for i in l2:
    if any(i in j for j in l if i != j):
...

But this wasn't really successful. Can anyone recommend a better way of approaching this?

Thanks!

Upvotes: 2

Views: 83

Answers (4)

flakes
flakes

Reputation: 23624

Earlier I asked "what happens if you have say l = ['animals|fish', 'fish|salmon']?" And I was bored so I've fiddled around with a solution to return all unique chains in full.

now ['animals|fish', 'fish|salmon'] -> ['animals|fish|salmon']

and ['animals|fish|trout', 'fish|salmon'] -> ['animals|fish|salmon','animals|fish|trout']

You can create a Node for each species containing a list of child species, and its parent species. For each animal individual animal in the list you can create a node. For each string in the list, you link the child animal to the parent animal in-front of it.

When you're done you print the hierarchy of all animals that do not have a parent.

class SpeciesNode(object):
    def __init__(self, name):
        self.parent = None
        self.name = name
        self.children = []
    def add_node(self,s_node):
        if not s_node in self.children:
            s_node.parent = self
            self.children.append(s_node)
    def get_branches(self):
        if len(self.children) == 0:
            yield self.name
        else:
            for child in self.children:
                for branch in child.get_branches():
                    yield self.name + '|' + branch

Now you can define a function to convert a list of hierarchies into a list of nodes.

def get_s_nodes(animal_list):
    s_nodes = {}
    for hierarchy in animal_list:
        h_list = hierarchy.split('|')
        parent = None
        for species in h_list:
            if not species in s_nodes.keys():
                s_nodes[species] = SpeciesNode(species)
            if parent is not None:
                s_nodes[parent].add_node(s_nodes[species])
            parent = species
    return s_nodes.values()

Finally convert this back to a list of strings

def get_animal_list(s_nodes):
    animal_kingdom = []
    for node in s_nodes:
        if node.parent is None:
            for branch in child.get_branches():
                animal_kingdom.append(branch)
    return animal_kingdom

so:

>>> l = ['animals|fish|salmon', 'fish|salmon','fish|trout', 'salmon|salmon eggs', 'fish', 'animals',
         'furniture', 'animals|big cats|lions', 'animals|birds|fisher bird']
>>> get_animal_list(get_s_nodes(l))
['animals|fish|salmon|salmon eggs', 'animals|fish|trout', 'animals|big cats|lions', 'animals|birds|fisher bird', 'furniture']
>>> 

Upvotes: 1

Ashwini Chaudhary
Ashwini Chaudhary

Reputation: 250961

You can do something like this.

Here I am using a set to keep track of already seen items. I loop over each item, split it at whitespaces first and then at |, and next step is to check if any of the item in that list is not present in seen set or not, if yes then store that string in out list and the items of the list are added to the seen set.

lis = ['animals|fish|salmon', 'fish|salmon', 'fish', 'animals', 'furniture',
    'animals|big cats|lions', 'animals|birds|fisher bird']
seen = set()
out = []
for x in lis:
    items = [z for y in x.split() for z in y.split('|')]
    if any(y not in seen for y in items):
        seen.update(items)
        out.append(x)
print out        

#['animals|fish|salmon', 'furniture', 'animals|big cats|lions', 'animals|birds|fisher bird']        

Upvotes: 2

Oliver Dain
Oliver Dain

Reputation: 9953

I think the definition of "redundant" isn't 100% clear here. Your post seems to indicate that if all of the text in one entry appears in another that's redundant, but I don't think that's really what you mean. For example ['fish', 'fishers'] should probably not be reduced to just ['fishers'] even though 'fish' in 'fishers' == True.

I'm guessing what you want is closer to this:

def is_redundant(a, b):
   """Is a redundant given b?"""
   a_parts = set(a.split('|'))
   b_parts = set(b.split('|'))
   return len(a_parts.intersect(b_parts) == len(b_parts)

Here's a simple (but inefficient) solution:

for item in l:
    l = filter(lambda x: not is_redundant(x, item), l)

Upvotes: 0

Hyperboreus
Hyperboreus

Reputation: 32429

I am not sure if this is what you are looking for:

l = ['animals|fish|salmon', 'fish|salmon', 'fish', 'animals', 'furniture',
    'animals|big cats|lions', 'animals|birds|fisher bird']

def simplify (data):
    data = ['|{}|'.format (e) for e in data]
    return [e [1:-1] for e in data if all (e is other or e not in other for other in data) ]

print (simplify (l) )

It prints:

['animals|fish|salmon', 'furniture', 'animals|big cats|lions', 'animals|birds|fisher bird']

What I do:

First step: put pipes at the beginning and end of each item '|{}|'.format (to avoid conflicts with e.g. fish and fisher bird.

Second step: Filter the list, throwing away all items which are subpaths of another (e not in other) except themselves (e is other or). I also trim again the additional pipes (e [1:-1])

Upvotes: 3

Related Questions