cookie1986

Reputation: 895

Applying a function to all but one column in Pandas

I'm looking to apply a function to all but one column in pandas, whilst keeping that one column as it is. I have a working version that does what I need, but it seems unusually long for something so simple. I'm wondering if there is a better approach...

import re
import pandas as pd

df = pd.DataFrame(columns=['firstcol', 'secondcol', 'thirdcol'],
                  data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']])

My dataframe is made up of strings and integers. Each column holds values of a single type, and I am looking to standardize so that all columns are numerical (integers or floats, as I have some decimal values). So, in the toy data above, I need to transform the first and third columns and leave the second column alone...

df.loc[:, df.columns != 'secondcol'] = df.loc[:, df.columns != 'secondcol'].applymap(lambda x: float(re.sub(r'[^\d.]','', x)))

For clarity, this line: (1) selects all but the column named "secondcol", (2) uses applymap with a lambda to strip any character that isn't a digit or a decimal point, and (3) converts the result to a float.

This produces the desired output, but as I say isn't overly readable. Have I stumbled across the best way to do this, or is there a simpler alternative?

Upvotes: 4

Views: 5364

Answers (2)

Mustafa Aydın

Reputation: 18315

We can use df.columns.difference (set nomenclature) to select "all but one column", subset the frame, and apply the operation:

bad_column     = "undesired_column"
other_cols     = df.columns.difference([bad_column])
df[other_cols] = df[other_cols].apply(...)

Note that we need to pass an array-like to .difference; it won't accept the bad_column string as is, so we wrap it in a list. If you already have a list of undesired columns, it can be passed as is, e.g.,

bad_columns    = ["bad_1", "bad_7"]
other_cols     = df.columns.difference(bad_columns)
df[other_cols] = df[other_cols].apply(...)

Instead of .apply, you can of course do other things to that subset.
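
For concreteness, here is a minimal sketch of that pattern with the question's frame plugged in (the cleaning lambda is taken from the question itself):

import re
import pandas as pd

df = pd.DataFrame(columns=['firstcol', 'secondcol', 'thirdcol'],
                  data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']])

# every column except "secondcol"
other_cols = df.columns.difference(['secondcol'])

# apply the question's cleaning step only to those columns
df[other_cols] = df[other_cols].applymap(lambda x: float(re.sub(r'[^\d.]', '', x)))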


(answering the original question, which is somewhat more specific than the title suggests)

You can select the non-numeric data type columns and replace their values:

non_numerics = df.select_dtypes(exclude="number").columns

df[non_numerics] = df[non_numerics].apply(lambda x: x.str.replace(r"[^\d.]", "", regex=True).astype(float))

where I used your regex for the replacement, but column-wise with apply rather than element-wise with applymap (you can swap in your applymap line if you prefer),

to get

>>> df

   firstcol  secondcol  thirdcol
0       1.0          1       6.1
1       2.0          3       9.3
2      12.0          4       2.0
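
If you do prefer the element-wise form, the same column subset works with the question's own applymap line; a rough sketch (assuming re is imported):

import re

# element-wise variant using the question's own lambda
df[non_numerics] = df[non_numerics].applymap(lambda x: float(re.sub(r'[^\d.]', '', x)))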

Upvotes: 9

Paul H

Reputation: 68186

Readable is in the eye of the beholder. Here's what I would do:

import re

import pandas

df = pandas.DataFrame(
    columns=['firstcol', 'secondcol', 'thirdcol'],
    data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']]
)

df2 = (
    df.select_dtypes(exclude='number')                        # string columns only
      .applymap(lambda x: float(re.sub(r'[^\d.]', '', x)))    # strip junk, convert to float
      .join(df.select_dtypes(include='number'))               # add the numeric columns back
      .loc[:, df.columns.tolist()]                            # restore original column order
)
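
The final .loc[:, df.columns.tolist()] restores the original column order after the join. A quick sanity check that everything came out numeric:

print(df2.dtypes)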

Upvotes: 2
