alejo

Reputation: 137

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas DataFrame and put them in a single list with no duplicate values.

I have tried the following:

import numpy as np

arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    # "not in" scans the whole list for every element, which is why this is slow
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])

This works but it is taking too long. The dataframe contains around 30 million rows.

Example:

  column1 column2 
1   adr1   adr2   
2   adr1   adr2   
3   adr3   adr4   
4   adr4   adr5   

Should generate the list with the values:

[adr1, adr2, adr3, adr4, adr5]

Can you please help me find a more efficient way of doing this, given that the dataframe contains 30 million rows?

Upvotes: 0

Views: 148

Answers (2)

Valdi_Bo

Reputation: 30991

You can simply use np.unique(df), which is probably the shortest version. Note that np.unique returns the unique values in sorted order.

Formally, the first parameter of np.unique should be an array_like object, but as I checked, you can also pass just a DataFrame.

Of course, if you want a plain list rather than an ndarray, write np.unique(df).tolist().
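As a minimal sketch of this approach, the snippet below rebuilds the example frame from the question and applies np.unique to the two columns. The variable name unique_sorted is only illustrative, and the explicit column selection with to_numpy() is an assumption made to match the question's setup; passing the DataFrame directly should behave the same way.

```python
import numpy as np
import pandas as pd

# Example data taken from the question
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

# np.unique flattens the 2-D array and returns the sorted unique values
unique_sorted = np.unique(df[["column1", "column2"]].to_numpy()).tolist()
print(unique_sorted)  # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']
```

Because the work happens in a single vectorized call instead of a Python loop with repeated list lookups, this should scale far better to tens of millions of rows, although the exact timing depends on the data.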

Edit following your comment

If you want the list unique but in the order of appearance, write:

pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()

Operation order:

  • reshape changes the source array into a single column.
  • Then a DataFrame is created, with default column name = 0.
  • Then [0] takes just this (the only) column.
  • drop_duplicates does exactly what the name says, keeping the first occurrence of each value.
  • And the last step: tolist converts to a plain list.
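The steps above can be sketched as follows. The small frame here is hypothetical (not the question's data) and was chosen so that order of appearance differs from sorted order. The pd.unique line is an alternative that is not part of the original answer; it also keeps values in order of first appearance and avoids building an intermediate DataFrame.

```python
import pandas as pd

# Hypothetical data where order of appearance differs from sorted order
df = pd.DataFrame({
    "column1": ["b", "b", "a"],
    "column2": ["c", "c", "d"],
})

# The approach from the answer: one column, drop duplicates, convert to list
via_answer = (
    pd.DataFrame(df.to_numpy().reshape(-1, 1))[0]
    .drop_duplicates()
    .tolist()
)

# Alternative sketch: pd.unique on the flattened (row-major) values
ordered = pd.unique(df.to_numpy().ravel()).tolist()

print(via_answer)
print(ordered)
```

Both variants walk the values row by row, so a value from column2 in the first row comes before a value from column1 in the second row.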

Upvotes: 1

meW

Reputation: 3967

@ALollz gave the right answer; I'll extend it. To convert the result into a list as expected, just use list(np.unique(df.values)).
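One detail worth knowing when choosing between list(...) and .tolist(): for numeric data the two differ in element type. The sketch below uses a small hypothetical integer array to show this; for string columns like those in the question, both forms should yield ordinary Python strings.

```python
import numpy as np

# Hypothetical numeric array, used only to illustrate the element types
arr = np.array([[3, 1], [1, 2]])

as_list = list(np.unique(arr))      # elements remain NumPy scalars
as_tolist = np.unique(arr).tolist() # elements become built-in Python ints

print(type(as_list[0]))
print(type(as_tolist[0]))
```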

Upvotes: 2
