Sitz Blogz

Reputation: 1061

Remove string values from geo coordinates

I have a massive dataset with more than two billion rows and 16 columns.

Two of the columns are latitude and longitude, and another column holds a datetime.

As the data is raw, I am trying to sanitize it. In the process I came across some rows where the latitude/longitude columns contain a datetime, which is read in as a string, while proper latitudes and longitudes are floats.

I am using re for other preprocessing, but I am not able to handle this case with it.

I want to remove every row that has anything other than a latitude and longitude in those two columns.

I am looking for a simple solution in Python and pandas that removes only those rows and keeps the rest of the dataset as it is.

Edit: A mini version of the input is shared here. The last two columns are latitude and longitude, the column before them is the datetime, and row 10 shows the error I mentioned.

Appreciate the help. Thanks!

Upvotes: 1

Views: 33

Answers (1)

jezrael

Reputation: 863166

There is a difference if all values are strings:

import pandas as pd

df = pd.DataFrame({'lat':[10,20,'d'], 'lon':[4,'hh', 7]}).astype(str)

mask = pd.to_numeric(df['lat'], errors='coerce').notnull() & \
       pd.to_numeric(df['lon'], errors='coerce').notnull()

df = df[mask]
print (df)
  lat lon
0  10   4

Or if the values are mixed - some are numeric and some are strings:

df = pd.DataFrame({'lat':[10,20,'d'], 'lon':[4,'hh', 7]})

mask = (df['lat'].apply(type) != str) & (df['lon'].apply(type) != str)

df = df[mask]
print (df)
  lat lon
0  10   4
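
If the kept coordinates should also end up as floats, a variant is to coerce both columns to numeric and drop the rows that fail to parse (a minimal sketch, assuming the same 'lat'/'lon' column names):

df = pd.DataFrame({'lat':[10,20,'d'], 'lon':[4,'hh', 7]})

# coerce both columns - any value that is not a valid number becomes NaN
df[['lat','lon']] = df[['lat','lon']].apply(pd.to_numeric, errors='coerce')

# drop rows where either coordinate failed to parse
df = df.dropna(subset=['lat','lon'])
print (df)

Only the fully numeric row is kept, and lat/lon come out as float columns.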

EDIT:

df = pd.read_csv('twt_mini_stack.csv')
#print (df)

mask = pd.to_numeric(df['Lat'], errors='coerce').notnull() &  \
       pd.to_numeric(df['Long'], errors='coerce').notnull()

df = df[mask]
print (df[['Tweeted Datetime','Lat','Long']])
        Tweeted Datetime       Lat      Long
0   3:59 PM - 1 Jan 2016  35.68501  139.7514
1   3:59 PM - 1 Jan 2016  35.68501  139.7514
2   3:59 PM - 1 Jan 2016  35.68501  139.7514
3   3:59 PM - 1 Jan 2016  35.68501  139.7514
4   3:59 PM - 1 Jan 2016  35.68501  139.7514
5   3:59 PM - 1 Jan 2016  35.68501  139.7514
6   3:59 PM - 1 Jan 2016  35.68501  139.7514
7   3:59 PM - 1 Jan 2016  35.68501  139.7514
9   3:58 PM - 1 Jan 2016  35.68501  139.7514
10  3:58 PM - 1 Jan 2016  35.68501  139.7514
11  3:58 PM - 1 Jan 2016  35.68501  139.7514
12  3:58 PM - 1 Jan 2016  35.68501  139.7514
13  3:58 PM - 1 Jan 2016  35.68501  139.7514
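
Since the full dataset has more than two billion rows, it probably will not fit in memory in one piece; the same mask can be applied chunk by chunk (a sketch, assuming the same file and column names as above):

chunks = []
# read the file in pieces to keep memory usage bounded
for chunk in pd.read_csv('twt_mini_stack.csv', chunksize=10**6):
    mask = pd.to_numeric(chunk['Lat'], errors='coerce').notnull() & \
           pd.to_numeric(chunk['Long'], errors='coerce').notnull()
    chunks.append(chunk[mask])

df = pd.concat(chunks, ignore_index=True)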

Upvotes: 1
