Reputation: 1061
I have a massive dataset with more than two billion rows and 16 columns.
Two of the columns are latitude and longitude, and one column is a DateTime.
As the data is raw, I am trying to sanitize it. In the process I came across some latitude/longitude rows that contain a datetime, which is read as a string, while latitude and longitude are floats.
I am using re
for other preprocessing, but here I am not able to use it.
I want to remove every row that has anything other than latitude and longitude in those two columns.
I am looking for a simple solution in Python and pandas that removes only those rows and keeps the rest of the dataset as it is.
Edit: A mini version of the input is shared here. The last two columns are latitude & longitude,
and the column before them is the datetime.
Row 10 has the error I mentioned.
Appreciate the help.. Thanks!
Upvotes: 1
Views: 33
Reputation: 863166
There is a difference depending on whether all values are strings:
df = pd.DataFrame({'lat':[10,20,'d'], 'lon':[4,'hh', 7]}).astype(str)
mask = pd.to_numeric(df['lat'], errors='coerce').notnull() & \
       pd.to_numeric(df['lon'], errors='coerce').notnull()
df = df[mask]
print (df)
lat lon
0 10 4
Or, if values are mixed - some are numeric and some are strings:
df = pd.DataFrame({'lat':[10,20,'d'], 'lon':[4,'hh', 7]})
mask = (df['lat'].apply(type) != str) & (df['lon'].apply(type) != str)
df = df[mask]
print (df)
lat lon
0 10 4
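Note that after filtering, the surviving `lat`/`lon` values may still have `object` (string) dtype. A small sketch, reusing the sample frame above, that keeps only the valid rows and stores the parsed floats in one step:

```python
import pandas as pd

# Sample frame mirroring the mixed-type example above.
df = pd.DataFrame({'lat': [10, 20, 'd'], 'lon': [4, 'hh', 7]})

# Coerce both columns to numeric; invalid entries become NaN.
lat = pd.to_numeric(df['lat'], errors='coerce')
lon = pd.to_numeric(df['lon'], errors='coerce')

# Keep rows where both coordinates parsed, and assign the parsed
# float values back (assign aligns on the filtered index).
df = df[lat.notnull() & lon.notnull()].assign(lat=lat, lon=lon)
print(df.dtypes)
```

This avoids a second `astype(float)` pass over the cleaned columns.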
EDIT:
df = pd.read_csv('twt_mini_stack.csv')
#print (df)
mask = pd.to_numeric(df['Lat'], errors='coerce').notnull() & \
pd.to_numeric(df['Long'], errors='coerce').notnull()
df = df[mask]
print (df[['Tweeted Datetime','Lat','Long']])
Tweeted Datetime Lat Long
0 3:59 PM - 1 Jan 2016 35.68501 139.7514
1 3:59 PM - 1 Jan 2016 35.68501 139.7514
2 3:59 PM - 1 Jan 2016 35.68501 139.7514
3 3:59 PM - 1 Jan 2016 35.68501 139.7514
4 3:59 PM - 1 Jan 2016 35.68501 139.7514
5 3:59 PM - 1 Jan 2016 35.68501 139.7514
6 3:59 PM - 1 Jan 2016 35.68501 139.7514
7 3:59 PM - 1 Jan 2016 35.68501 139.7514
9 3:58 PM - 1 Jan 2016 35.68501 139.7514
10 3:58 PM - 1 Jan 2016 35.68501 139.7514
11 3:58 PM - 1 Jan 2016 35.68501 139.7514
12 3:58 PM - 1 Jan 2016 35.68501 139.7514
13 3:58 PM - 1 Jan 2016 35.68501 139.7514
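Since the question mentions more than two billion rows, the whole file may not fit in memory. A hedged sketch of the same mask applied chunk-by-chunk with `read_csv(chunksize=...)` (the inline CSV here is a hypothetical stand-in for `twt_mini_stack.csv`):

```python
import io
import pandas as pd

# Hypothetical two-row CSV mimicking the question's column layout.
csv = io.StringIO(
    "Tweeted Datetime,Lat,Long\n"
    "3:59 PM - 1 Jan 2016,35.68501,139.7514\n"
    "3:58 PM - 1 Jan 2016,not_a_lat,139.7514\n"
)

clean_chunks = []
# Read in chunks so memory stays bounded regardless of file size;
# a real run would use a much larger chunksize (e.g. 1_000_000).
for chunk in pd.read_csv(csv, chunksize=1):
    mask = (pd.to_numeric(chunk['Lat'], errors='coerce').notnull()
            & pd.to_numeric(chunk['Long'], errors='coerce').notnull())
    clean_chunks.append(chunk[mask])

clean = pd.concat(clean_chunks, ignore_index=True)
print(len(clean))
```

Each filtered chunk could also be appended to an output file instead of concatenated, which keeps peak memory at one chunk.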
Upvotes: 1