Reputation: 895
I'm looking to apply a function to all but one column in pandas, whilst maintaining that column as it is originally. I have a working version that does what I need, but it seems unusually long for something so simple. I'm wondering if there is a better approach...
df = pd.DataFrame(columns=['firstcol', 'secondcol', 'thirdcol'],
                  data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']])
My dataframe is made up of strings and integers. Each column is exclusive of a particular type, and I am looking to standardize so that all columns are numerical (integers or floats as I have some decimal values). So, in the toy data above, I need to transform the first and third columns, and leave the second column alone...
df.loc[:, df.columns != 'secondcol'] = df.loc[:, df.columns != 'secondcol'].applymap(lambda x: float(re.sub(r'[^\d.]','', x)))
For clarity, this line: (1) selects all but the column named "secondcol", (2) uses applymap with a lambda to strip every character that is not a digit or a decimal point, and (3) converts the result to a float.
This produces the desired output, but as I say isn't overly readable. Have I stumbled across the best way to do this, or is there a simpler alternative?
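For reference, here is the snippet in full, with the imports it relies on (re and pandas as pd):

```python
import re

import pandas as pd

df = pd.DataFrame(columns=['firstcol', 'secondcol', 'thirdcol'],
                  data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']])

# apply the cleanup to every column except 'secondcol'
df.loc[:, df.columns != 'secondcol'] = df.loc[:, df.columns != 'secondcol'].applymap(
    lambda x: float(re.sub(r'[^\d.]', '', x))
)
```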
Upvotes: 4
Views: 5364
Reputation: 18315
We can use df.columns.difference (set nomenclature) to select "all but one column", subset the frame, and apply the operation:
bad_column = "undesired_column"
other_cols = df.columns.difference([bad_column])
df[other_cols] = df[other_cols].apply(...)
Note that we need to pass an array-like to .difference; it won't accept the bad_column string as is, so we wrap it in a list. If you already have a list of undesired columns, you can pass it directly, e.g.,
bad_columns = ["bad_1", "bad_7"]
other_cols = df.columns.difference(bad_columns)
df[other_cols] = df[other_cols].apply(...)
Instead of .apply, you can of course do other things to that subset.
(answering the original question, which is somewhat more specific than the title reads)
You can select the columns with a non-numeric dtype and replace their values:

non_numerics = df.select_dtypes(exclude="number").columns
df[non_numerics] = df[non_numerics].apply(lambda x: x.str.replace(r"[^\d.]", "", regex=True).astype(float))

Here I kept your regex for the replacement, but applied it column-wise with .apply rather than element-wise with .applymap (you can switch back to applymap with your original lambda if you prefer; note that str.replace needs regex=True in pandas 2.0+, where the default changed to literal replacement), to get
>>> df
firstcol secondcol thirdcol
0 1.0 1 6.1
1 2.0 3 9.3
2 12.0 4 2.0
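As a sketch of the element-wise variant mentioned above, the same dtype-based selection combines with applymap and the original lambda (pandas 2.1+ prefers the name DataFrame.map for this):

```python
import re

import pandas as pd

df = pd.DataFrame(columns=['firstcol', 'secondcol', 'thirdcol'],
                  data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']])

# columns whose dtype is not numeric
non_numerics = df.select_dtypes(exclude='number').columns

# element-wise: drop everything except digits and the decimal point
df[non_numerics] = df[non_numerics].applymap(
    lambda x: float(re.sub(r'[^\d.]', '', x))
)
```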
Upvotes: 9
Reputation: 68186
Readability is in the eye of the beholder. Here's what I would do:
import re
import pandas
df = pandas.DataFrame(
columns=['firstcol', 'secondcol', 'thirdcol'],
data=[['a1', 1, 'a6.1'], ['b2', 3, 'b9.3'], ['c12', 4, 'c2']]
)
df2 = (
df.select_dtypes(exclude='number')
.applymap(lambda x: float(re.sub(r'[^\d.]', '', x)))
.join(df.select_dtypes(include='number'))
.loc[:, df.columns.tolist()]
)
Upvotes: 2