Jason Sanchez
Jason Sanchez

Reputation: 477

Pandas: Select values from specific columns of a DataFrame by row

Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?

df = pd.DataFrame({"A":[1,2,3,4], 
                   "B":[10,20,30,40], 
                   "C":[100,200,300,400]})
columns_to_select = ["B", "A", "A", "C"]

Goal: [10, 2, 3, 400]

One method that works is to use an apply statement.

df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)

Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.

Upvotes: 10

Views: 10370

Answers (2)

Divakar
Divakar

Reputation: 221584

NumPy way

Here's a vectorized NumPy way using advanced indexing -

# Extract array data
In [10]: a = df.values

# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)

# Use NumPy's advanced indexing to extract relevant elem per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10,   2,   3, 400])

If column names of df are not sorted, we need to use sorter argument with np.searchsorted. The code to extract col_idx for such a generic df would be :

# https://stackoverflow.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

So, col_idx would be obtained like so -

col_idx = column_index(df, columns_to_select)

Further optimization

Profiling it revealed that the bottleneck was processing strings with np.searchsorted, the usual NumPy weakness of not being so great with strings. So, to overcome that and using the special case scenario of column names being single letters, we could quickly convert those to numerals and then feed those to searchsorted for much faster processing.

Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -

def column_index_singlechar_sorted(df, query_cols):
    c0 = np.fromstring(''.join(df.columns), dtype=np.uint8)
    c1 = np.fromstring(''.join(query_cols), dtype=np.uint8)
    return np.searchsorted(c0, c1)

This, gives us a modified version of the solution, like so -

a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])

Timings -

In [149]: # Setup df with 26 uppercase column letters and many rows
     ...: import string
     ...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
     ...: s = list(string.uppercase[:df.shape[1]])
     ...: df.columns = s
     ...: idx = np.random.randint(0,df.shape[1],len(df))
     ...: columns_to_select = np.take(s, idx).tolist()

# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop

# With proposed one from this soln
In [151]: %%timeit
     ...: a = df.values
     ...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
     ...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop

Given that df.lookup solves for a generic case, that's a probably a better choice, but the other possible optimizations as shown in this post could be handy as well!

Upvotes: 8

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210862

Pandas approach:

In [22]: df['new'] = df.lookup(df.index, columns_to_select)

In [23]: df
Out[23]:
   A   B    C  new
0  1  10  100   10
1  2  20  200    2
2  3  30  300    3
3  4  40  400  400

Upvotes: 11

Related Questions