user3502552
user3502552

Reputation: 75

Sorting .csv from command line

I'm trying to sort this line (2010 census block population densities for the whole US) by the last column.

12001,2,1009,Alachua FL,29.65612,-82.327274,0.0005131,0.013289229,12,902.9869232

censusBlockDensities.csv (moved here from comment)

17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
17001,1,1041,Adams IL,39.94333,-91.345319,0.000358,0.0009236128,0,0480.4506562
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427

Upvotes: 3

Views: 538

Answers (1)

Parker
Parker

Reputation: 7494

I am assuming a unix shell (i.e., bash).

Read the manual page for the sort command: man sort

From the man page:

The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

export LC_ALL=C

sort -t , -k 10,10 -n censusBlockDensities.csv

Explanation of the flags:

-t ,: Specifies commas as field delimiters.

-k 10,10: Specifies sorting on the 10th field only (start,stop) (the first field is 1, not 0).

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end. If neither -t nor -b is in effect, characters in a field are counted from the beginning of the preceding whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV], which override global ordering options for that key. If no key is given, use the entire line as the key.

-n: Perform a numeric sort instead of the default alphanumeric sort (alternatively, add the 'n' to the -k parameter as mention in a comment below).

censusBlockDensities.csv

17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
17001,1,1041,Adams IL,39.94333,-91.345319,0.000358,0.0009236128,0,0480.4506562
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427

output:

17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427
17001,1,1041,Adams IL,39.94333,-91.345319,0.000358,0.0009236128,0,0480.4506562
17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336

Edit: A helpful comment indicated an error in my answer. You also need the -n flag to perform a numeric sort (default is alphanumeric). I've amended my answer to include that. You can verify that it is working properly by also trying out the -r flag to sort in reverse order. I also added the stop field index to the -k 10 parameter, as mentioned in another post.

Additionally, you should check your input file to make sure you have the same number of fields in each line:

awk '{print gsub(/,/,"")}' censusBlockDensities.csv

9
9
10 <-- the third record has an additional field
9
9

Upvotes: 5

Related Questions