Reputation: 21523
I have a dataset. After reading it in, df.dir.value_counts()
returns
169 23042
170 22934
168 22873
316 22872
315 22809
171 22731
317 22586
323 22561
318 22530
...
0.069 1
0.167 1
0557 1
0.093 1
1455 1
0.130 1
0.683 1
2211 1
3.714 1
1.093 1
0819 1
0.183 1
0.110 1
2241 1
0.34 1
0.330 1
0.563 1
60+9 1
0.910 1
0.232 1
1410 1
0.490 1
0.107 1
1.257 1
1704 1
0.491 1
1.180 1
5-230 1
1735 1
1.384 1
The dir column holds a direction, and the data should be integers in the range (0,361). As you can see, there is a lot of erroneous data at the end of the value_counts() output.
How can I drop the non-integer data?
There are a few possible ways:
1. Use read_csv to read the column as integer and discard all non-integer data
df = pd.read_csv("/data.dat", names=['time', 'dir'], dtype={'dir': int})
However, there is some string-like erroneous data, such as 60+9, which raises an error during parsing. I don't know how to handle that.
2. Select by isdigit(), and then downcast
df = df[df['dir'].apply(lambda x: str(x).isdigit())]
df['dir'] = pd.to_numeric(df['dir'], downcast='integer', errors='coerce')
This is from "Drop rows if value in a specific column is not an integer in pandas dataframe". It works fine for me, but it feels like a bit much. Are there better approaches?
Upvotes: 1
Views: 6800
Reputation: 294258
I like
df.dir[df.dir == df.dir // 1]
Consider the dataframe df
df = pd.DataFrame(dict(dir=[1, 1.5, 2, 2.5]))
print(df)
dir
0 1.0
1 1.5
2 2.0
3 2.5
Anything that is an integer should be equal to itself floor divided by one.
df.assign(floor_div=df.dir // 1)
dir floor_div
0 1.0 1.0
1 1.5 1.0
2 2.0 2.0
3 2.5 2.0
So we can test for when they are equal
df.assign(
    floor_div=df.dir // 1,
    is_int=df.dir // 1 == df.dir
)
dir floor_div is_int
0 1.0 1.0 True
1 1.5 1.0 False
2 2.0 2.0 True
3 2.5 2.0 False
So to filter, we can use the boolean mask in the demo column 'is_int'
df.dir[df.dir == df.dir // 1]
0 1.0
2 2.0
Name: dir, dtype: float64
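The expression above returns only the dir Series. If the goal is to drop whole rows and end up with an integer column, the same mask can be passed to .loc. This is a minimal sketch, not the only way to do it; the time column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: 'time' is invented here just to show that whole rows are kept.
df = pd.DataFrame({"time": [0, 1, 2, 3], "dir": [1, 1.5, 2, 2.5]})

# True where the value equals itself floor divided by one.
mask = df["dir"] == df["dir"] // 1

# Keep the matching rows, then cast the surviving values to int.
clean = df.loc[mask].copy()
clean["dir"] = clean["dir"].astype(int)

print(clean)
```

Taking a copy before assigning avoids pandas' SettingWithCopyWarning on the filtered frame.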
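If there are strings in this column, then you can incorporate pd.to_numeric. With errors='coerce', anything that cannot be parsed becomes NaN, and NaN never compares equal to anything, so those rows are filtered out by the same test.
df.dir = pd.to_numeric(df.dir, errors='coerce')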
df.dir[df.dir == df.dir // 1]
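Putting the pieces together on string data might look like the sketch below. The sample values are taken from the value_counts() output in the question; the time column is invented, and treating (0,361) as inclusive bounds for between() is an assumption on my part. The range check is there because strings such as 1455 parse as whole numbers but are not valid directions.

```python
import pandas as pd

# Sample strings from the question; 'time' is a made-up placeholder column.
raw = pd.DataFrame({
    "time": list(range(7)),
    "dir": ["169", "0.069", "60+9", "5-230", "0557", "1455", "316"],
})

# Unparseable strings such as '60+9' become NaN instead of raising.
nums = pd.to_numeric(raw["dir"], errors="coerce")

# Whole-number test; NaN fails it because NaN != NaN.
is_whole = nums == nums // 1

# Assumed bounds, read off the question's "(0,361)"; adjust as needed.
in_range = nums.between(0, 361)

keep = is_whole & in_range
clean = raw.loc[keep].copy()
clean["dir"] = nums[keep].astype(int)

print(clean)
```

Reading the column as strings first (for example with dtype={'dir': str} in read_csv) keeps the load step from failing, and all of the cleaning happens in one place afterwards.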
Upvotes: 5