mandeno
mandeno

Reputation: 21

Compare incomplete date list with a reference date list

I know this is possible. I know there is a simple solution, but everything I've tried has failed.

Here's the deal:

I have a dataset in Excel format containing 939,019 weather station records (rows). The date/time interval is every 10 minutes starting from 1/29/1993 16:30 to 6/30/2013 24:00. If I do the math, it is clear that there are missing rows.

I need to know the missing dates/times. It would be cool if I could have some little program/script that returned the start date/time and end date/time of the missing intervals. But I'll just be happy with a list of the missing dates/times.

To figure it out, I thought, oh, all I need is a reference list to compare the list with missing dates and have some way of flagging or returning the gaps.

So, in Excel, I created a column adjacent to the weather station data and populated the first row with the start date. The subsequent rows just add 10 minutes to the cell above it. Unfortunately, the number of 10 minute intervals in that 20 year span is more than excel can handle. No worries. It gets close enough (1/6/2013 10:50).

Anyway, I tried the MATCH function in excel, but that is taking way too long. In the time it is taking me to type this, it has reached 3% (using 12 processors). I have 30 weather stations (with the same date range) to do. I'm hoping I can find a faster way to do this.

So, I next tried Acess. I imported the files (the weather station data and a separate reference date list) as tables in Access and thought I'd just do an UNMATCHED query, but for some reason (no matter how I format the date column (date/time, serial number), the query returns just about all the rows as unmatched. Not sure why, and it does do it quick, but it is obviously wrong.

I then thought - Python! That'd do it, right? But I'm a GIS person. I've only ever used Python sample scripts to run geoprocessing tools (or used ESRi's Model Builder). I don't really have a clue where to start. Any pointers?

Upvotes: 2

Views: 239

Answers (1)

MattDMo
MattDMo

Reputation: 102862

First, check out python-excel.org for xlrd, xlwt, and xlutils modules and documentation (I'm assuming you're working with .xls files, and not .xlsx - if so, check out openpyxl). Once you've got them installed, read through the docs to familiarize yourself with them, they're not too long or overly complicated. The actual comparison shouldn't be too hard: all you need to do is read cell N, compare its value to cell N+1, and see if the difference is 10 minutes. If it is, great, go to the next value. If not, print the value to a new workbook (or whatever you want to do - insert a blank row with the missing time and calculate again, or what have you).

I don't know how long this will take to run through ~30 million records, but I'm willing to bet it'll be faster than doing it via Excel itself :)

Good luck!

Upvotes: 2

Related Questions