Reputation: 3096

Find the common string in a subgroup in a list in Python

I'm trying to clean a list by removing duplicates. For example:

 bb = ['Gppe (Aspirin Combined)', 
       'Gppe Cap (Migraine)',  
       'Gppe Tab', 
       'Abilify', 
       'Abilify Maintena', 
       'Abstem', 
       'Abstral']

Ideally, I need to get the following list:

 bb = ['Gppe', 
       'Abilify', 
       'Abstem', 
       'Abstral']

What I tried:

Split the list and remove duplicates (a naive approach)

sorted(set(j for bb_i in bb for j in bb_i.split(' ')))

which leaves a lot of 'rubbish':

['(Aspirin',
 '(Migraine)',
 'Abilify',
 'Abstem',
 'Abstral',
 'Cap',
 'Combined)',
 'Gppe',
 'Maintena',
 'Tab']

Find the most frequent word:

from collections import Counter
Counter(w for s in ['Gppe (Aspirin Combined)', 'Gppe Cap (Migraine)', 'Gppe Tab'] for w in s.split()).most_common(1)[0][0]

But I'm not sure how to find the group of similar entries that share a word.

I am wondering whether one can use something like 'groupby()' to first group the entries by name and then remove the duplicates within each group.

Upvotes: 4

Answers (3)

automationleg

Reputation: 323

You could split every item and keep only the first string before the separator (a space):

print(list(set(item.split(' ',1)[0] for item in bb)))

This seems to give what you need (the order may vary, since a set is unordered):

['Abilify', 'Abstem', 'Gppe', 'Abstral']
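One detail worth checking is how the two ways of splitting behave on untidy input. The sketch below assumes an entry with leading spaces, which does not occur in the question's data; the variable names are only illustrative.

```python
# Hypothetical entry with leading whitespace (not part of the original list).
s = '  Gppe Tab'

# Splitting on an explicit single space yields an empty first field.
first_explicit = s.split(' ', 1)[0]

# Splitting with no argument skips leading and repeated whitespace.
first_default = s.split()[0]

print(repr(first_explicit))  # ''
print(repr(first_default))   # 'Gppe'
```

If the data may contain stray whitespace, `split()` without arguments is probably the safer choice. Note that it raises an IndexError on an empty string.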

Upvotes: 1

Pulsar

Reputation: 288

If order doesn't matter, you can use a set comprehension:

res = list({x.split()[0] for x in bb})

If order matters and you have Python 3.7 or higher (or CPython 3.6, where dict ordering is an implementation detail), you can use a dict comprehension:

res = list({x.split()[0]:None for x in bb})

If order matters and you have Python 3.5 or lower, you can use an OrderedDict:

from collections import OrderedDict
res = list(OrderedDict((x.split()[0],None) for x in bb))
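A compact variant of the order-preserving approach is `dict.fromkeys`, which keeps the first occurrence of each key. This is only a sketch and assumes Python 3.7 or higher, where dicts preserve insertion order.

```python
bb = ['Gppe (Aspirin Combined)',
      'Gppe Cap (Migraine)',
      'Gppe Tab',
      'Abilify',
      'Abilify Maintena',
      'Abstem',
      'Abstral']

# Each first word becomes a dict key; repeated keys are ignored,
# so the first appearance determines the position in the result.
res = list(dict.fromkeys(x.split()[0] for x in bb))
print(res)  # ['Gppe', 'Abilify', 'Abstem', 'Abstral']
```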

Upvotes: 5

Dani Mesejo

Reputation: 61930

Assuming you want the unique first word of each string, you could do:

bb = ['Gppe (Aspirin Combined)',
       'Gppe Cap (Migraine)',
       'Gppe Tab',
       'Abilify',
       'Abilify Maintena',
       'Abstem',
       'Abstral']


result = set(map(lambda x: x.split()[0], bb))
print(result)

Output

{'Gppe', 'Abstral', 'Abilify', 'Abstem'}

If you want a list of unique elements in the order of appearance, you could do:

bb = ['Gppe (Aspirin Combined)',
       'Gppe Cap (Migraine)',
       'Gppe Tab',
       'Abilify',
       'Abilify Maintena',
       'Abstem',
       'Abstral']

seen = set()
result = []
for e in bb:
    key = e.split()[0]
    if key not in seen:
        result.append(key)
        seen.add(key)

print(result)

Output

['Gppe', 'Abilify', 'Abstem', 'Abstral']

As an alternative to the first solution you could do:

Suggested by @Jean-FrançoisFabre:

{x.split()[0] for x in bb}

Suggested by @RoadRunner:

set(x.split()[0] for x in bb)
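Since the question asks about `groupby()`, here is a minimal sketch of that route with `itertools.groupby`. It assumes entries sharing a first word are adjacent in the list, as they are in the example; `first_word` is a hypothetical helper name.

```python
from itertools import groupby

bb = ['Gppe (Aspirin Combined)',
      'Gppe Cap (Migraine)',
      'Gppe Tab',
      'Abilify',
      'Abilify Maintena',
      'Abstem',
      'Abstral']

def first_word(s):
    # Hypothetical helper: the key that defines a group.
    return s.split()[0]

# groupby only merges consecutive items with equal keys, so this relies
# on the input already being grouped. If it is not, sort with
# key=first_word first, otherwise later runs overwrite earlier ones here.
groups = {key: list(items) for key, items in groupby(bb, key=first_word)}

result = list(groups)
print(result)          # ['Gppe', 'Abilify', 'Abstem', 'Abstral']
print(groups['Gppe'])  # the three original 'Gppe ...' entries
```

Unlike the set-based solutions, this keeps the members of each subgroup, which may help if the variants need inspecting before they are collapsed.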

Upvotes: 6
