Pandas: Each row take a string, separate by commas, and add unique word to list

Question

Sample df:

import pandas as pd

filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])

I want a unique list of the words used in the Tags column. My plan is to loop through df, take the Tags value of each row, split it on commas, and add the words to a list. I could either check each word and add only the new ones, or add them all and then take the unique values at the end. I would like a solution for both methods, if possible, to see which is faster.
The expected output is:

5, Blue, Football, Baseball, College, 1993, Green.

I have tried these:

tagslist = df['Tags'][0].split(',')  # To give me initial starting words
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = tagslist.extend(thesetags)
    return tagslist
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]

and

tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    for word in thesetags:
        if word not in tagslist:
            tagslist.append(word)
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]

These two are essentially the same, except that the second one adds only words that are not already in the list. Both of them return a list of None values.
I have also tried this:

tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = list(set(tagslist + thesetags))
    return tagslist
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]

This one adds unique values for each row, but not the individual words in each row. So even though I tried to split on the comma, it still treats the entire text as one item instead of using the individual words from the string.

Shubham Sharma · Accepted Answer

Use Series.str.split to split the strings, then np.hstack to stack all the resulting lists in the Tags column into one flat array, and finally np.unique to get the unique elements of that array.

import numpy as np
lst = np.unique(np.hstack(df['Tags'].str.split(','))).tolist()

Another possible idea using Series.explode + Series.unique:

lst = df['Tags'].str.split(',').explode().unique().tolist()

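Result (from the np.unique version, which returns sorted values; the explode version gives the same words in order of first appearance):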

['1993', '5', 'Baseball', 'Blue', 'College', 'Football', 'Green']
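
Since the question also asks for loop-based versions of both methods, here is a minimal sketch in plain Python. It is not part of the original answer; the function names unique_while_adding and add_all_then_unique are made up for illustration. The attempts in the question return None because list.extend and list.append modify the list in place and return None, and because the second function has no return statement, so the list comprehension only collects those None values.

```python
import timeit

import pandas as pd

filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])


def unique_while_adding(tags):
    """Method 1: check each word and append it only if it is new."""
    seen = set()   # fast membership checks
    result = []    # keeps words in order of first appearance
    for row in tags:
        for word in row.split(','):
            if word not in seen:
                seen.add(word)
                result.append(word)
    return result


def add_all_then_unique(tags):
    """Method 2: collect every word, then drop duplicates at the end."""
    all_words = []
    for row in tags:
        # extend() mutates all_words and returns None, so its result is not assigned
        all_words.extend(row.split(','))
    # dict.fromkeys drops duplicates while keeping first-appearance order
    return list(dict.fromkeys(all_words))


print(unique_while_adding(df['Tags']))
print(add_all_then_unique(df['Tags']))

# Rough timing only; the numbers depend on the machine and on the data size
for fn in (unique_while_adding, add_all_then_unique):
    print(fn.__name__, timeit.timeit(lambda: fn(df['Tags']), number=1000))
```

Both functions return ['5', 'Blue', 'Football', 'Baseball', 'College', '1993', 'Green'] for the sample df. Which one is faster is best measured on the real data, since a three-row frame is too small to be representative.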
