Josh

Reputation: 1601

Pandas - conditional column reference with regex

In pandas, I'm creating a new column which conditionally comes directly from another column based on a regex match.

It might be easier to just show an example:

HAVE:

+-------+------------+------------+-------------+-----------+
| Color | value_Gold | value_Blue | value_Green | value_Red |
+-------+------------+------------+-------------+-----------+
| Gold  |         12 |          9 |          11 |         8 |
| Gold  |          2 |          7 |          12 |        11 |
| Blue  |          1 |          5 |           7 |         1 |
| Red   |          6 |         12 |           3 |        10 |
| Green |         11 |         10 |          11 |         2 |
+-------+------------+------------+-------------+-----------+

Code so that you can recreate this dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['Gold',12,9,11,8],['Gold',2,7,12,11],['Blue',1,5,7,1],['Red',6,12,3,10],['Green',11,10,11,2]]),
                  columns=['Color','value_Gold','value_Blue','value_Green','value_Red'])

WANT:

+-------+------------+------------+-------------+-----------+------+
| Color | value_Gold | value_Blue | value_Green | value_Red | WANT |
+-------+------------+------------+-------------+-----------+------+
| Gold  |         12 |          9 |          11 |         8 |   12 |
| Gold  |          2 |          7 |          12 |        11 |    2 |
| Blue  |          1 |          5 |           7 |         1 |    5 |
| Red   |          6 |         12 |           3 |        10 |   10 |
| Green |         11 |         10 |          11 |         2 |   11 |
+-------+------------+------------+-------------+-----------+------+

At first, I was thinking I should focus on eliminating the values I don't want in the dataframe first, and then doing some sort of coalesce. But I'm really not sure what the most elegant way to do this is.

Ideas?

Edit: Another thought I had was to transpose the value_ columns and use them for a lookup, but the truth is that my real dataset has hundreds of thousands of rows, so I didn't think this route was best. In reality, think 200K rows and 20 unique colors.
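
For reference, here's a rough sketch of what I had in mind, reshaping the value_ columns into long form with melt and aligning the match back on the row index (assuming the value_<Color> naming is consistent, as in the example):

# Reshape to long form, keeping the original row index for alignment
long = df.reset_index().melt(id_vars=['index', 'Color'],
                             var_name='value_col', value_name='val')
# Pull the color name out of each value_ column name
long['match'] = long['value_col'].str.replace('value_', '', regex=False)
# Keep only the rows where Color matches its own value_ column
picked = long.loc[long['Color'] == long['match'], ['index', 'val']]
# Align back to the original rows by index
df['WANT'] = picked.set_index('index')['val']

At 200K rows and 20 value_ columns the melted frame is roughly 4M rows, which felt heavier than it needed to be.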

Upvotes: 2

Views: 361

Answers (2)

Andy L.

Reputation: 25259

Use rename to strip the value_ prefix from the column names, then call lookup:

df['WANT'] = (df.rename(lambda x: x.split('_')[-1], axis=1)
                .lookup(df.index, df.Color))

Out[1365]:
   Color value_Gold value_Blue value_Green value_Red WANT
0   Gold         12          9          11         8   12
1   Gold          2          7          12        11    2
2   Blue          1          5           7         1    5
3    Red          6         12           3        10   10
4  Green         11         10          11         2   11
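
Note: DataFrame.lookup has since been deprecated and removed in newer pandas versions, so on recent releases the same idea can be written with NumPy indexing instead (a sketch of the equivalent):

import numpy as np

# Same rename trick, then pick each row's matching column by position
renamed = df.rename(lambda x: x.split('_')[-1], axis=1)
col_idx = renamed.columns.get_indexer(df['Color'])
df['WANT'] = renamed.to_numpy()[np.arange(len(df)), col_idx]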

Upvotes: 0

Quang Hoang

Reputation: 150785

You can, for example, build a boolean mask of which value_ column matches each row's Color, then use where and bfill to pull the single matching value into one column:

# Compare each row's Color against the color extracted from each value_ column name
mask = df['Color'].values == df.columns[1:].str.extract('_(.*)$').values
# Keep only the matching cell per row, then back-fill so it lands in the first column
df['WANT'] = df.iloc[:,1:].where(mask.T).bfill(1).iloc[:,0]

Upvotes: 4
