Reputation: 362
Sample df:
import pandas as pd
filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])
I want a unique list of the words used in the Tags column. I'm trying to loop through df, pull each row of Tags, split it on commas, and add the words to a list. I could either check each word and add only the unique ones, or add them all and then pull out the unique values. I would like a solution for both methods, if possible, to see which is faster.
The expected output is:
5, Blue, Football, Baseball, College, 1993, Green
I have tried these:
tagslist = df['Tags'][0].split(',') # To give me initial starting words
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = tagslist.extend(thesetags)
    return tagslist
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
and
tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    for word in thesetags:
        if word not in tagslist:
            tagslist.append(word)
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
These two are essentially the same, except that the second one adds only unique words. Both return a list of None values.
I have also tried this:
tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = list(set(tagslist + thesetags))
    return tagslist
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]
This one adds unique values for each row, but not the individual words in each row. Even though I tried to split on the comma, it still treats the entire text as one value instead of using the individual words from the string.
Upvotes: 2
Views: 1897
Reputation: 71687
Use Series.str.split to split the strings, then np.hstack to horizontally stack all the resulting lists from the Tags column, and finally np.unique on the stacked array to find its unique elements.
import numpy as np
lst = np.unique(np.hstack(df['Tags'].str.split(','))).tolist()
Another possible idea, using Series.explode + Series.unique:
lst = df['Tags'].str.split(',').explode().unique().tolist()
Result (np.unique returns the values sorted; the explode variant keeps them in order of first appearance):
['1993', '5', 'Baseball', 'Blue', 'College', 'Football', 'Green']
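Since the question asks for both loop-based methods, here is a minimal plain-Python sketch of each. The function names unique_tags_checked and unique_tags_collected are made up for illustration, and the tags list simply mirrors the Tags column of the sample data. It also avoids the two pitfalls in the attempts above: list.extend modifies the list in place and returns None, and a function with no return statement returns None as well.

```python
# Plain-Python sketch; the tag strings mirror the sample Tags column.
filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
tags = [row[0] for row in filldata]

def unique_tags_checked(tags):
    """Method 1: add a word only if it has not been seen yet."""
    seen = set()    # set gives a fast membership test
    result = []     # list keeps the order of first appearance
    for row in tags:
        for word in row.split(','):
            if word not in seen:
                seen.add(word)
                result.append(word)
    return result

def unique_tags_collected(tags):
    """Method 2: collect every word, then drop duplicates at the end."""
    all_words = []
    for row in tags:
        # extend mutates all_words in place, so its return value is not assigned
        all_words.extend(row.split(','))
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(all_words))

print(unique_tags_checked(tags))
print(unique_tags_collected(tags))
# both print ['5', 'Blue', 'Football', 'Baseball', 'College', '1993', 'Green']
```

Passing df['Tags'] instead of tags should work the same way, since iterating over a Series yields its values. To see which method is faster on real data, both functions can be timed with the standard timeit module.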
Upvotes: 3