Reputation: 321
I am translating some code from R to python to improve performance, but I am not very familiar with the pandas library.
I have a CSV file that looks like this:
O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739
I would like to split the second column on a delimiter (here, a space), and get the unique values in this column. In this case, the code should return [GO:0005737, GO:0015630, GO:0005654 GO:0005794, GO:0005739]
.
In R, I would do this using the following code:
df <- read.csv("data.csv")
unique <- unique(unlist(strsplit(df[,2], " ")))
In python, I have the following code using pandas:
df = pd.read_csv("data.csv")
split = df.iloc[:, 1].str.split(' ')
unique = pd.unique(split)
But this produces the following error:
TypeError: unhashable type: 'list'
How can I get the unique values in a column of a CSV file after splitting on a delimiter in python?
Upvotes: 5
Views: 3326
Reputation: 294478
setup
from io import StringIO
import pandas as pd
txt = """O43657,GO:0005737
A0A087WYV6,GO:0005737
A0A087WZU5,GO:0005737
Q8IZE3,GO:0015630 GO:0005654 GO:0005794
X6RHX1,GO:0015630 GO:0005654 GO:0005794
Q9NSG2,GO:0005654 GO:0005739"""
s = pd.read_csv(StringIO(txt), header=None, squeeze=True, index_col=0)
solution
pd.unique(s.str.split(expand=True).stack())
array(['GO:0005737', 'GO:0015630', 'GO:0005654', 'GO:0005794', 'GO:0005739'], dtype=object)
Upvotes: 5