Reputation: 21
I have never used UNIX before, and am only using it because I could not find a way on Windows to sort a list this large by date.
I am trying to sort a CSV file with 14 million entries (the file is 2 GB). The file contains all of the taxi transactions that happened in January 2013. I want to sort the list by date so that I can select only the data from the first week.
I found the GNU coreutils sort documentation (https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html) and have been trying to write a command that will do what I want. What I have tried so far is
sort -t, -k 6n 8-trip_data_1.csv
but that didn't work.
I think I want to tell it to sort by the 6th column (pickup_datetime), and specifically by characters 9 and 10 of that column, because that is all that changes in the date column across the file. I have put a sample of the table below, followed by my guess at the key specification for that.
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
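From my reading of the manual, the key specification for just those two characters would be something like this (untested; the character positions are simply counted off the sample above):
sort -t, -k6.9,6.10n 8-trip_data_1.csv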
Upvotes: 1
Views: 1584
Reputation: 44213
I am sure you do not want to remove the header nor want it to "float", so create an executable file sort_csv:
#!/usr/bin/perl
use strict;
# Compare two CSV records by the "DD HH:MM" part of pickup_datetime.
# The hard-coded offset assumes the fixed field widths seen in the sample data.
sub my_cmp($$)
{
    my $a = shift;
    my $b = shift;
    return substr($a, 81, 8) cmp substr($b, 81, 8); # assuming seconds are always zero
}
# Pass the header line through unchanged, then sort the remaining records.
print scalar (<>);
print sort my_cmp <>;
And then:
# Make it executable
chmod +x sort_csv
sort_csv <input.csv >sorted.csv
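For comparison, plain sort can also keep the header on top if you split it off first; a sketch along these lines, using the same file names as above:
head -n 1 input.csv >sorted.csv
tail -n +2 input.csv | sort -t, -k6 >>sorted.csv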
Upvotes: 0
Reputation: 754480
You don't need the n; indeed, it is counterproductive. The dates are in ISO 8601 format, and they sort in time order when sorted alphanumerically. Numeric sorting only pays attention to the 2013 part of the field; the rest isn't part of a single number. You also don't need to worry about subsetting the time information; the fact that only some parts change won't matter.
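Applied to your file, that comes down to something like this (the output file name is just an example; the header line will sort in with the data, as discussed below):
sort -t, -k6 8-trip_data_1.csv >sorted.csv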
You've given a very minimal data set with the pickup-time information already in sorted order, so we have to get a little inventive. The heading information won't sort numerically; you can remove it, or let it float around. To show that the sorting works when the data is sorted, I specify r (reverse order). This puts the heading data at the top and reverses the two lines of actual data.
$ sort -t, -k6r data.file
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
$
Or, in ascending order (the heading goes at the end):
$ sort -t, -k6 data.file
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
$
Also, you can decide which dates are relevant and modify this grep command to select the correct dates for the first week, which reduces the data size to about one quarter of its original size.
grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file
That looks for dates in the range 2013-01-01 through 2013-01-07 (allowing any time for each day). You could omit the regex after the blank if you prefer; if the data is valid, it won't make any difference, but the regex avoids selecting some invalid data. Obviously, you can change the dates if you want the first week to run, for example, from the first Sunday through the first Saturday (Sunday 6th to Saturday 12th 2013):
grep -E ',2013-01-(0[6-9]|1[012]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file
You could then run this reduced data set through the sort process.
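Putting the two steps together, a sketch of the whole pipeline (using the file name from the question and an example output name; note that grep also drops the header line, since it doesn't match the date pattern):
grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' 8-trip_data_1.csv | sort -t, -k6 >week1_sorted.csv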
In future, please give 5 lines or so for sample data — it's easier to demonstrate what's working and what's not.
Upvotes: 1