AlpU

Reputation: 373

How to remove duplicates in a list of strings in a pandas column Python

I am trying to remove the duplicate strings in a list of strings under a column in a Pandas DataFrame.

For example; the list value of:

[btc, btc, btc]

Should be;

[btc]

I have tried multiple methods; however, none seems to work, as I am unable to access the string values in the list. Any help is much appreciated.

DataFrame:

       dollar_sign  followers_count
0            [btc]            35946
1            [btc]            35946
2            [btc]            35946
3            [nav]            35946
4  [btc, btc, btc]            35946

Access the list of strings under a column

for row in df_twitter['dollar_sign']:
    print(row)

Output:

[btc]
[btc]
[btc]
[nav]
[btc, btc, btc]

Upvotes: 1

Views: 8051

Answers (4)

Ben Wilson

Reputation: 2676

Simpler, and will turn the Series back into lists so you can stack, unstack, etc:

df['column_name'] = df['column_name'].apply(set).apply(list)
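Note that converting through `set` does not preserve the original order of the list elements. If order matters, `dict.fromkeys` deduplicates while keeping first-occurrence order; a minimal sketch (column names are illustrative, not from the OP's data):

```python
import pandas as pd

df = pd.DataFrame({'dollar_sign': [['btc'], ['nav'], ['btc', 'btc', 'btc']]})

# set-based: removes duplicates, but element order is arbitrary
df['dedup_set'] = df['dollar_sign'].apply(set).apply(list)

# dict.fromkeys keeps first-occurrence order (dicts are ordered in Python 3.7+)
df['dedup_ordered'] = df['dollar_sign'].apply(lambda x: list(dict.fromkeys(x)))
print(df['dedup_ordered'].tolist())  # [['btc'], ['nav'], ['btc']]
```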

Upvotes: 0

Tai

Reputation: 7994

From the information revealed, I believe the OP's df actually contains not lists of strings but strings that look like lists.

From the OP's print result, we see

[btc]
[btc]
[btc]
[nav]
[btc, btc, btc]

However, if it were lists of strings, it would yield

['btc']
['btc']
['btc']
['nav']
['btc', 'btc', 'btc']

Solution:

df = pd.DataFrame({
        'dollar_sign':['[btc]','[btc]','[btc]','[nav]','[btc, btc, btc]'],
        'followers_count':[35946,35946,35946,35946,35946]}
     )


df.dollar_sign.str[1:-1].str.split(r",\s").map(set)

0    {btc}
1    {btc}
2    {btc}
3    {nav}
4    {btc}
Name: dollar_sign, dtype: object
  • .str[1:-1] removes [ and ].

  • .str.split(r",\s") splits on ", ", a comma followed by a space. (Assuming the strings use ", " as the delimiter; otherwise, you may need r"\s*,\s*" or something even more sophisticated.)

  • map(set) turns each list into a set.
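If lists rather than sets are wanted in the result, the same split-based approach can feed into `dict.fromkeys`, which drops duplicates while keeping first-occurrence order; a sketch under the same ", " delimiter assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'dollar_sign': ['[btc]', '[btc]', '[btc]', '[nav]', '[btc, btc, btc]']})

# strip the brackets, split on comma + whitespace, then dedupe into lists
deduped = (df['dollar_sign']
           .str[1:-1]
           .str.split(r",\s*")
           .map(lambda items: list(dict.fromkeys(items))))
print(deduped.tolist())  # [['btc'], ['btc'], ['btc'], ['nav'], ['btc']]
```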

Upvotes: 3

BENY

Reputation: 323266

You can use list with map, and set gets the unique values (note that the column will then hold sets rather than lists):

df['dollar_sign']=list(map(set,df['dollar_sign']))
df
Out[1068]: 
  dollar_sign  followers_count
0       {btc}            35946
1       {btc}            35946
2       {btc}            35946
3       {nav}            35946
4       {btc}            35946

This is how I create the df

df = pd.DataFrame({'dollar_sign': [['btc'], ['btc'], ['btc'], ['nav'], ['btc', 'btc', 'btc']],
                   'followers_count': [35946, 35946, 35946, 35946, 35946]})

Upvotes: 2

Mangu

Reputation: 3325

You can use sets. A set will take out the duplicates.

So, as an example, keeping the style of the output:

for row in df_twitter['dollar_sign']:
    print(list(set(row)))

Output:

[btc]
[btc]
[btc]
[nav]
[btc]
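To store the result back in the DataFrame instead of only printing it, the same idea works in a list comprehension; a minimal sketch (the sample data here stands in for the OP's df_twitter):

```python
import pandas as pd

df_twitter = pd.DataFrame({'dollar_sign': [['btc'], ['nav'], ['btc', 'btc', 'btc']]})

# overwrite the column with the deduplicated lists
df_twitter['dollar_sign'] = [list(set(row)) for row in df_twitter['dollar_sign']]
print(df_twitter['dollar_sign'].tolist())  # [['btc'], ['nav'], ['btc']]
```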

Upvotes: 3
