Nicko

Reputation: 362

Pandas: split each row's string on commas and add the unique words to a list

Sample df:

import pandas as pd
filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])

I want a unique list of the words used in the Tags column. So I'm trying to loop through df, pull each row of Tags, split it on the comma, and add the words to a list. I could either check each word and add only the ones not seen yet, or add them all and then pull out the unique ones. I would like a solution for both methods, if possible, to see which is faster.
The expected output is:

5, Blue, Football, Baseball, College, 1993, Green.

I have tried these:

tagslist = df['Tags'][0].split(',')  # To give me initial starting words
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = tagslist.extend(thesetags)
    return tagslist
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]

and

tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    for word in thesetags:
        if word not in tagslist:
            tagslist.append(word)
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]

These two are essentially the same, except that the second adds only words that are not already in the list. Both return a list of None values.
I have also tried this:

tagslist = df['Tags'][0].split(',')
def adduniquetags(newtags, tagslist):
    thesetags = newtags.split(',')
    tagslist = list(set(tagslist + thesetags))
    return tagslist
tagslist = [adduniquetags(row, tagslist) for row in df['Tags']]

This one adds unique values for each row, but it does not give me the individual words across all rows. Even though I split on the comma, it still seems to treat the entire text as one item instead of using the individual words from the string.

Upvotes: 2

Views: 1897

Answers (1)

Shubham Sharma

Reputation: 71687

Use Series.str.split to split each string into a list of words, then np.hstack to flatten the lists in the Tags column into a single array, and finally np.unique to get the unique elements of that array. Note that np.unique also sorts the result.

import numpy as np
lst = np.unique(np.hstack(df['Tags'].str.split(','))).tolist()
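
To make the three steps easier to follow, here is a rough sketch that runs them one at a time on the sample df from the question. The intermediate names split_tags and stacked are only illustrative, and the per-row lists are converted with .tolist() before stacking, which is my own choice rather than part of the one-liner.

```python
import numpy as np
import pandas as pd

filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])

# Step 1: each row becomes a list of words.
split_tags = df['Tags'].str.split(',')

# Step 2: flatten the per-row lists into one 1-D array (duplicates still present).
stacked = np.hstack(split_tags.tolist())

# Step 3: drop duplicates; np.unique also sorts the values.
lst = np.unique(stacked).tolist()
print(lst)
```

Because the words are strings, the sort is lexicographic, so '1993' comes before '5'.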

Another possible idea using Series.explode + Series.unique:

lst = df['Tags'].str.split(',').explode().unique().tolist()

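Result (the np.unique version returns the words sorted; the explode version returns the same words in order of first appearance):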

['1993', '5', 'Baseball', 'Blue', 'College', 'Football', 'Green']
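
Since the question asks for both loop-based methods, here is a minimal plain-Python sketch of each, assuming the sample df from the question. The names unique_checked, all_words and unique_after are made up for illustration. The attempts in the question produce None because list.extend modifies the list in place and returns None, and a function without a return statement also returns None, so the list comprehension just collects those None values.

```python
import pandas as pd

filldata = [['5,Blue,Football', 3], ['Baseball,Blue,College,1993', 4], ['Green,5,Football', 1]]
df = pd.DataFrame(filldata, columns=['Tags', 'Count'])

# Method 1: check membership while adding; keeps first-appearance order.
unique_checked = []
for row in df['Tags']:
    for word in row.split(','):
        if word not in unique_checked:
            unique_checked.append(word)

# Method 2: collect every word first, then deduplicate once at the end.
all_words = []
for row in df['Tags']:
    # extend mutates all_words in place and returns None, so do not reassign.
    all_words.extend(row.split(','))
# dict keys keep insertion order, so this drops duplicates without reordering.
unique_after = list(dict.fromkeys(all_words))

print(unique_checked)
print(unique_after)
```

Both lists come out in the order given in the question. Which one is faster is best measured with timeit on the real data, but a membership test on a list scans the whole list, so the first method is likely to slow down as the number of distinct words grows.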

Upvotes: 3
