Reputation: 137
I want to extract the values from two different columns of a pandas DataFrame and put them in a list with no duplicate values.
I have tried the following:
import numpy as np

arr = df[['column1', 'column2']].values
thelist = []
# Visit every cell and keep each value the first time it is seen
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
column1 column2
1 adr1 adr2
2 adr1 adr2
3 adr3 adr4
4 adr4 adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows.
Upvotes: 0
Views: 148
Reputation: 30991
You can use just np.unique(df)
(maybe this is the shortest version).
Formally, the first parameter of np.unique
should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list rather than an ndarray, write
np.unique(df).tolist()
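As a rough sketch on the question's sample data (the frame below is rebuilt by hand from the example, so its construction is an assumption), np.unique flattens the values and returns them de-duplicated and sorted:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question's example (assumed construction).
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

# np.unique flattens the 2-D values and returns the sorted unique entries.
unique_sorted = np.unique(df[["column1", "column2"]].values).tolist()
print(unique_sorted)
```

Note that the result is sorted, so it matches the order of first appearance only when the data happens to appear in sorted order, as it does in this example.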
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
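To see the difference from the sorted result, here is a small sketch of the one-liner above on made-up data (the frame and its values are hypothetical, chosen so that first-appearance order differs from sorted order). The pd.unique variant at the end is an alternative not taken from this answer; it should also keep first-appearance order:

```python
import pandas as pd

# Hypothetical frame whose first-appearance order differs from sorted order.
df = pd.DataFrame({
    "column1": ["adr3", "adr1"],
    "column2": ["adr2", "adr3"],
})

# Flatten row by row into one column, then drop repeats, keeping the first hit.
in_order = pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist()

# Alternative (not from the answer): pd.unique on the flattened values.
in_order_alt = pd.unique(df.values.ravel()).tolist()

print(in_order)
print(in_order_alt)
```

The values are visited row by row, so "order of appearance" here means reading the frame left to right, top to bottom.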
Operation order:
reshape changes the source array into a single column.
[0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
tolist converts to a plain list.
Upvotes: 1
Reputation: 3967
@ALollz gave a correct answer; I'll extend it. To get a list as expected, just use list(np.unique(df.values))
Upvotes: 2