lmurdock12

Reputation: 1021

Any alternatives to this brute-force 'tally' algorithm, or ways to speed it up?

Apologies ahead of time if this is the wrong place to post this question; if there is a better Stack Exchange site, let me know.

I am currently developing a crime prediction algorithm that essentially lays a grid over a city and predicts whether each grid cell will be a hotspot for the next 30 days (i.e. at least one assault crime occurs there).

I am using the city of Nashville, with a grid overlay of 3,446 cells. I have a grid dataset that contains all the data needed to display the grid: the map coordinates of each cell as well as the neighboring cells around it (neighbor to the bottom, neighbor to the right, etc.):

[screenshot: grid dataset columns]

Here is an example of what the predictions look like: [screenshot: prediction map]

In this map, green means a correct prediction, red means a false negative, and purple means a false positive from the machine learning algorithm.

To train my neural network I am using a feature set that looks like this: [screenshot: feature set table]

Here Hotspot is the target value (either 1 or 0). Week, month, and year are the crime tallies built from the incidents pulled for the last year (crimes that occurred in the last week, last month, and last year). My issue is that creating these feature sets takes an extensive amount of time (the script takes over 6 hours):

#Loop through each grid cell in the dataset
for grid_index, grid_row in grid.iterrows():
    print("On grid number: ", grid_row['id'])
    #Reset the tallies for this grid cell
    near = 0
    countWeek = 0
    countMonth = 0
    countYear = 0

    #Loop through all of the crimes
    for crime_index, crime_row in crime.iterrows():

        #Parse out the month, day, and year
        date = crime_row['Incident Occurred']
        date_pars = date.split('/')
        month = int(date_pars[0])
        day = int(date_pars[1])
        year = int(date_pars[2].split(' ')[0])

        #Tally crimes that fall in one of the eight neighboring cells
        #(some column names contain a trailing space in the source CSV)
        if grid_row['top '] == crime_row['grid']:
            near += 1
        if grid_row['bottom '] == crime_row['grid']:
            near += 1
        if grid_row['left '] == crime_row['grid']:
            near += 1
        if grid_row['right '] == crime_row['grid']:
            near += 1
        if grid_row['topleft'] == crime_row['grid']:
            near += 1
        if grid_row['topright'] == crime_row['grid']:
            near += 1
        if grid_row['bottomright'] == crime_row['grid']:
            near += 1
        if grid_row['bottomleft'] == crime_row['grid']:
            near += 1

        #Tally crimes in this cell for the month, week, and year windows
        if month == 12 and grid_row['id'] == crime_row['grid']:
            countMonth += 1
        if day >= 25 and month == 12 and grid_row['id'] == crime_row['grid']:
            countWeek += 1
        if year == 2017 and grid_row['id'] == crime_row['grid']:
            countYear += 1

    #Append the feature row for this grid cell
    output = output.append({'Grid': grid_row['id'], 'Hotspot': 0, 'week': countWeek,
                            'month': countMonth, 'year': countYear, 'near': near},
                           ignore_index=True)

Right now this code loops through each grid cell (3,446 total) and, within each cell, loops through every crime (around 18,000), counting up the tallies and appending a row to a pandas DataFrame. 3446 * 18000 is around 62 million iterations to create this dataset. I didn't expect that to take too long, but it is taking far longer than is practical.

Any ideas on how this could be sped up efficiently? I need to run this script for each month of the last three years, so 36 runs at over 5 hours each is way too long for my time constraints.

Thanks ahead of time for any insight.

EDIT: To clarify, 'grid_row' is a record in the grid CSV file whose columns I posted above (the location of each grid cell and its neighboring cells), and 'crime_row' is a single crime incident that happened within the last year: [screenshot: crime incident records]

Upvotes: 0

Views: 210

Answers (1)

grodzi

Reputation: 5703

The way you do things now can be summarized as

forall grid
  forall crimes
    if crime.cell == grid.cell
      do something

That complexity is O(|grid| * |crimes|)

If you have 3k crimes and 5k grid cells, that makes 15e6 iterations.

A better way is to iterate over the crimes once and push each one into a bucket for its associated grid cell, so all the crimes sharing the same grid index end up in the same place:

gridIdxToCrimes = {}  # map a grid index to all the crimes in that cell

for crime_index, crime_row in crime.iterrows():
    grid_index = crime_row['grid']
    if grid_index not in gridIdxToCrimes:
        gridIdxToCrimes[grid_index] = []
    gridIdxToCrimes[grid_index].append(crime_row)

for grid_index, grid_row in grid.iterrows():
    near = 0
    topIndex = grid_row['top ']
    if topIndex in gridIdxToCrimes:
        # you get all the crimes in the cell above your current cell
        near += len(gridIdxToCrimes[topIndex])

This way you do O(|crimes| + |grid|) = 8k iterations.
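For the tallies in the question specifically, the same bucketing idea maps naturally onto pandas. Below is a rough sketch, not a drop-in replacement: it assumes the column names from the question ('grid', 'Incident Occurred', 'id', 'top ', etc.) and mirrors the hard-coded December-2017 date filters, with value_counts() playing the role of the gridIdxToCrimes buckets.

import pandas as pd

# Parse the incident timestamps once, up front.
crime['when'] = pd.to_datetime(crime['Incident Occurred'], errors='coerce')

# One vectorized pass over the crimes per tally: counts of crimes per grid cell,
# overall and restricted to the same date windows the question hard-codes.
all_counts   = crime['grid'].value_counts()
year_counts  = crime.loc[crime['when'].dt.year == 2017, 'grid'].value_counts()
month_counts = crime.loc[crime['when'].dt.month == 12, 'grid'].value_counts()
week_counts  = crime.loc[(crime['when'].dt.month == 12) &
                         (crime['when'].dt.day >= 25), 'grid'].value_counts()

neighbor_cols = ['top ', 'bottom ', 'left ', 'right ',
                 'topleft', 'topright', 'bottomright', 'bottomleft']

rows = []
for _, grid_row in grid.iterrows():
    cell = grid_row['id']
    # Sum the tallies of the eight neighboring cells with dictionary-style lookups.
    near = sum(all_counts.get(grid_row[col], 0) for col in neighbor_cols)
    rows.append({'Grid': cell, 'Hotspot': 0,
                 'week': week_counts.get(cell, 0),
                 'month': month_counts.get(cell, 0),
                 'year': year_counts.get(cell, 0),
                 'near': near})

output = pd.DataFrame(rows)

Each crime is now touched only in a few vectorized passes, and the remaining Python loop runs once per grid cell, so with ~18,000 crimes and 3,446 cells the work drops from roughly 62 million inner iterations to a few thousand dictionary lookups.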

Upvotes: 2
