Edit values in one column in 4,000,000 row CSV file

Question

I have a CSV file that I am trying to edit to add a numeric ID column with unique integers from 1 to approximately 4,000,000. Some of the rows already have an ID value, so I was hoping I could just sort those and then fill in the rest starting at the largest value + 1. However, I cannot open this file to edit in Excel because of its size (Excel only shows its maximum of 1,048,576 rows). Is there an easy way to do this? I am not familiar with coding, so I was hoping there was a way to do it manually that is similar to Excel's fill series feature.

Thanks!

Also, I know there are threads on how to edit a large CSV file, but I was hoping for help with this specific task. Thanks!

Basically, I want to sort the rows based on idnumber and then add unique IDs to the rows without an ID value. [Screenshot of file]

Luuk · Accepted Answer

One way is to use Notepad++ with a plugin named SQL:

1. Load the CSV in Notepad++
2. Enter the query SELECT a+1,b,c FROM data
3. Hit 'start'

When starting with a file like this:

a,b,c
1,2,3
4,5,6
7,8,9

The results after look like this:

SQL Plugin 1.0.1025
Query         : select a+1,b,c from data
Sourcefile    : abc.csv
Delimiter     : ,
Number of hits: 3
===================================================================================
Query result:
2,2,3
5,5,6
8,8,9

Or, in words, the first column is incremented by 1.

A second solution uses gawk, which can be downloaded from https://www.klabaster.com/freeware.htm#mawk:

D:\TEMP>type abc.csv
a,b,c
1,2,3
4,5,6
7,8,9

D:\TEMP>gawk  "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print $1+1,$2,$3 }" abc.csv
a,b,c
2,2,3
5,5,6
8,8,9

(g)awk is a tool which reads a file line by line. Each line is then accessible via $0, and its parts via $1, $2, $3, ..., split on a separator.

This separator is set in my example (FS=OFS=\",\";) in the BEGIN section, which runs only once, before the input is read. Do not be confused by the \". The whole script is enclosed in double quotes, and the value assigned to a variable (like OFS) is written in double quotes too, so those inner quotes need to be escaped as \".

The getline; print $0 takes care of the first line of the CSV, which typically holds the column names: getline reads that line and print $0 writes it out unchanged.

Then, for every line, this piece of code print $1+1,$2,$3 will increment the first column, and print the second and third column.
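
If Python is easier to get running than gawk, the same "copy the header, then change each row" pattern can be sketched with its standard csv module. This is only an illustration and not part of the gawk solution: the function name increment_first_column is made up for this example, it assumes the first column always holds an integer, and unlike the gawk one-liner it copies every remaining column rather than just the second and third.

```python
import csv
import io

def increment_first_column(src, dst):
    """Copy CSV rows from src to dst, adding 1 to the first column."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))        # header line, like: getline; print $0
    for row in reader:                   # one row at a time, as awk does
        row[0] = str(int(row[0]) + 1)    # like: $1+1
        writer.writerow(row)

# Small in-memory demo using the sample data from this answer
out = io.StringIO()
increment_first_column(io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n"), out)
print(out.getvalue(), end="")
```

Because the rows are read and written one at a time, the whole file never has to fit in memory.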

To extend this second example:

gawk  "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print ($1<5?$1+1:$1),$2,$3 }" abc.csv

The ($1<5?$1+1:$1) checks whether the value of $1 is less than 5 ($1<5); if so, it returns $1+1, otherwise $1. Or, in words, it will only add 1 if the current value is less than 5.
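
For comparison, the same "only change it when a condition holds" idea looks like this as a Python conditional expression. The helper name bump_if_small and its limit parameter are hypothetical and exist only for this sketch.

```python
def bump_if_small(value, limit=5):
    """Return value + 1 when value is below limit, otherwise value unchanged."""
    return value + 1 if value < limit else value

# First-column values from the sample file: 1 and 4 are below 5, 7 is not
print([bump_if_small(v) for v in (1, 4, 7)])  # [2, 5, 7]
```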

With your data you end up with something like this (untested!):

gawk  "BEGIN{ FS=OFS=\",\"; getline; a=42; print $0 }{ if($4+0==0){ a++ }; print ($4<=0?a:$1),$2,$3 }" input.csv

a=42 sets the initial value of the counter that supplies the new column values (you need to change this to the correct value, for example the largest ID already in the file).
The if($4+0==0){ a++ } increments a whenever the fourth column equals 0 (the $4+0 converts empty values like "" to the numeric value 0).
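
The question asks to start numbering at the largest existing ID + 1. A rough Python sketch of that idea reads the file twice: once to find the largest ID, and once to fill the gaps. It is not the gawk approach above, and it rests on assumptions: the function name fill_missing_ids is invented here, the ID column position (id_col) has to be adjusted to the real file, every row is assumed to contain that column, and existing IDs are assumed to be whole numbers. The column name "name" in the demo is made up as well.

```python
import csv
import io

def fill_missing_ids(src, dst, id_col=0):
    """Give every row with a blank or zero ID the next unused integer.

    src must be seekable (a real file or io.StringIO) because it is read twice.
    id_col is the 0-based index of the ID column (an assumption; adjust it).
    """
    # Pass 1: find the largest ID already present.
    reader = csv.reader(src)
    next(reader)                          # skip the header line
    largest = 0
    for row in reader:
        if row[id_col].strip():
            largest = max(largest, int(row[id_col]))

    # Pass 2: copy the rows, filling blanks with largest+1, largest+2, ...
    src.seek(0)
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))         # header unchanged
    next_id = largest + 1
    for row in reader:
        if not row[id_col].strip() or int(row[id_col]) == 0:
            row[id_col] = str(next_id)
            next_id += 1
        writer.writerow(row)

# Small in-memory demo; a real run would pass files opened with newline=""
sample = "idnumber,name\n3,x\n,y\n7,z\n0,w\n"
out = io.StringIO()
fill_missing_ids(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

The sketch does not sort the rows; it only guarantees that the newly assigned IDs do not collide with the existing ones.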
