Todd

Reputation: 1

K-means: how to determine which of several latitude/longitude centers has the most locations near it

I know the central latitude and longitude for each neighborhood in a city, and I have a data set of restaurants with their latitudes and longitudes. I need to determine which neighborhood is the most dense, using something like K-means. So let's say I have a first series of ten latitude/longitude pairs (the neighborhood centers) and a second series of about 200 (the restaurants): how would I determine which of those ten locations is the most dense, i.e. has the most restaurant locations near it?

Upvotes: -2

Views: 132

Answers (2)

ASH

Reputation: 20362

How about this? It's adapted from Geoff Boeing's spatial-clustering tutorial (linked at the bottom) and uses DBSCAN with a haversine metric, which suits raw latitude/longitude data better than K-means.

# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint


# define the number of kilometers in one radian
kms_per_radian = 6371.0088


# load the data set
df = pd.read_csv('C:\\your_path_here\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()


# how many rows are in this data set?
len(df)


# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)

 

# represent the points consistently as (lat, lon) columns
df_coords = df[['lat', 'lon']]

# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian


start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)

# get the number of clusters, ignoring noise (labeled -1) if present
num_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)


# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))

# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):

    size = 150
    if cluster_label == -1:  # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30

    # plot only the points that belong to the current cluster label
    mask = cluster_labels == cluster_label
    ax.scatter(x=df_coords.loc[mask, 'lon'], y=df_coords.loc[mask, 'lat'],
               c=[color], edgecolor='k', s=size, alpha=0.5)

ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()

[Image: scatterplot of the points, color-coded by cluster]

coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))


# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian

# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)

# get the number of clusters
num_clusters = len(set(cluster_labels))

# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))


# Result:
Silhouette coefficient: 0.854
Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds

coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))



# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print('Number of clusters: {}'.format(num_clusters))


# Result:
Number of clusters: 138


# create a series to contain the clusters - each element in the series is the set of points that compose one cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters

Final Result:

0                  lat        lon
1587  37.921659  22...
1                  lat        lon
1658  37.933609  23...
2                  lat        lon
1607  37.966766  23...
3                  lat        lon
1586  38.149019  22...
4                  lat        lon
1584  38.374766  21...
                 ...
133              lat        lon
662  50.37369  18.889205
134               lat        lon
561  50.448704  19.0...
135               lat        lon
661  50.462271  19.0...
136               lat        lon
559  50.489304  19.0...
137             lat       lon
1  51.474005 -0.450999
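
The great_circle and MultiPoint imports at the top are not actually used above; in the notebook linked below, they come in at the next step, which collapses each cluster down to the single member nearest its centroid. A sketch of that step, assuming each element of clusters is a two-column (lat, lon) DataFrame as built above:

# for each cluster, find the member point closest to the cluster's centroid
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster.values).centroid.x, MultiPoint(cluster.values).centroid.y)
    return tuple(min(cluster.values, key=lambda point: great_circle(point, centroid).m))

centermost_points = clusters.map(get_centermost_point)

centermost_points is then a series with one representative (lat, lon) point per cluster.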

https://github.com/gboeing/urban-data-science/blob/2017/15-Spatial-Cluster-Analysis/cluster-analysis.ipynb

https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/

Upvotes: 2

Andrea Grillo

Reputation: 125

If you knew the border of each neighbourhood (or its radius, as an approximation) from some cartographic data of the city, you could simply check which neighbourhood each restaurant falls inside.

Otherwise, you can compute the distance between each restaurant and the central point of each neighbourhood, and assign each of the 200 restaurants to the closest neighbourhood (see the sketch below).

Then you can approximate the density of each neighbourhood as the number of restaurants assigned to it divided by the total number of restaurants.

I don't think you need any machine learning algorithm for this. Of course, you can choose the distance metric according to your problem.
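
A minimal sketch of that approach, assuming the neighbourhood centers and the restaurants are given as plain (lat, lon) tuples (the names centers and restaurants, and their values, are placeholders):

from collections import Counter
from math import radians, sin, cos, asin, sqrt

def haversine_km(p1, p2):
    # great-circle distance in km between two (lat, lon) points
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2 * 6371.0088 * asin(sqrt(a))

# hypothetical inputs: ten neighbourhood centers and ~200 restaurants
centers = [(40.71, -74.00), (40.73, -73.99)]          # ...
restaurants = [(40.715, -74.002), (40.728, -73.995)]  # ...

# assign each restaurant to the index of its nearest center
assignments = [min(range(len(centers)), key=lambda i: haversine_km(r, centers[i]))
               for r in restaurants]

# count restaurants per neighbourhood; the most common one is the densest
densest_center, n = Counter(assignments).most_common(1)[0]
print('Neighbourhood {} has the most restaurants: {}'.format(densest_center, n))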

Upvotes: 0
