Reputation: 373
I am trying to remove the duplicate strings in a list of strings under a column in a Pandas DataFrame.
For example, the list value of:
[btc, btc, btc]
should become:
[btc]
I have tried multiple methods; however, none seems to work, as I am unable to access the string values in the list. Any help is much appreciated.
DataFrame:
       dollar_sign  followers_count
0            [btc]            35946
1            [btc]            35946
2            [btc]            35946
3            [nav]            35946
4  [btc, btc, btc]            35946
Accessing the list of strings under a column:
for row in df_twitter['dollar_sign']:
    print(row)
Output:
[btc]
[btc]
[btc]
[nav]
[btc, btc, btc]
Upvotes: 1
Views: 8051
Reputation: 2676
This is simpler, and it turns the Series back into lists so you can stack, unstack, etc.:
df['column_name'] = df['column_name'].apply(set).apply(list)
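A minimal sketch of this approach on a hypothetical frame mirroring the question's data (note that `set` does not preserve element order, so the resulting lists may come back reordered):

```python
import pandas as pd

# Hypothetical DataFrame whose cells are actual lists of strings
df = pd.DataFrame({'dollar_sign': [['btc'], ['btc'], ['nav'], ['btc', 'btc', 'btc']]})

# set drops duplicates; list converts each set back so the cells are lists again
df['dollar_sign'] = df['dollar_sign'].apply(set).apply(list)
print(df['dollar_sign'].tolist())
```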
Upvotes: 0
Reputation: 7994
From the information revealed, I believe the OP's df actually contains not lists of strings but strings that look like lists.
From the OP's print result, we see
[btc]
[btc]
[btc]
[nav]
[btc, btc, btc]
However, if it is of lists of strings, it should yield
['btc']
['btc']
['btc']
['nav']
['btc', 'btc', 'btc']
Solution:
df = pd.DataFrame({
    'dollar_sign': ['[btc]', '[btc]', '[btc]', '[nav]', '[btc, btc, btc]'],
    'followers_count': [35946, 35946, 35946, 35946, 35946]}
)
df.dollar_sign.str[1:-1].str.split(r",\s").map(set)
0    {btc}
1    {btc}
2    {btc}
3    {nav}
4    {btc}
Name: dollar_sign, dtype: object
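If the goal is lists rather than sets, a final `.map(sorted)` (or `.map(list)`) converts back; a runnable sketch under the same assumption that the cells are strings that merely look like lists:

```python
import pandas as pd

# Hypothetical frame where each cell is a string such as '[btc, btc, btc]'
df = pd.DataFrame({'dollar_sign': ['[btc]', '[btc]', '[nav]', '[btc, btc, btc]']})

# Strip the brackets, split on ", ", dedupe with set, then sort back into a list
deduped = (df['dollar_sign'].str[1:-1]
           .str.split(r',\s')
           .map(set)
           .map(sorted))
print(deduped.tolist())
```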
.str[1:-1] removes the leading [ and trailing ].
.str.split(r",\s") splits on ", ", a comma and a space. (This assumes the strings use ", " as the delimiter; otherwise you may need r"\s*,\s*" or something even more sophisticated.)
.map(set) turns each list into a set.
Upvotes: 3
Reputation: 323266
You can use list with map, and set will get you the unique values:
df['dollar_sign']=list(map(set,df['dollar_sign']))
df
Out[1068]:
  dollar_sign  followers_count
0       {btc}            35946
1       {btc}            35946
2       {btc}            35946
3       {nav}            35946
4       {btc}            35946
This is how I create the df:
df = pd.DataFrame({'dollar_sign': [['btc'], ['btc'], ['btc'], ['nav'], ['btc', 'btc', 'btc']],
                   'followers_count': [35946, 35946, 35946, 35946, 35946]})
Upvotes: 2
Reputation: 3325
You can use sets. A set takes out the duplicates.
So, as an example, keeping the style of the output:
for row in df_twitter['dollar_sign']:
    print(list(set(row)))
Output:
[btc]
[btc]
[btc]
[nav]
[btc]
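One caveat not covered in the answer above: `set` does not guarantee the original order of the elements. If order of first appearance matters, a plain-Python alternative is `dict.fromkeys`, whose keys keep insertion order on Python 3.7+ (the `rows` data here is a hypothetical stand-in for the column):

```python
# Hypothetical rows standing in for df_twitter['dollar_sign']
rows = [['btc'], ['btc'], ['nav'], ['eth', 'btc', 'eth']]

for row in rows:
    # dict.fromkeys keeps the first occurrence of each element, in order
    print(list(dict.fromkeys(row)))
```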
Upvotes: 3