Reputation: 31
I have a huge CSV file (~17GB) and I need to create a new file that contains all rows where the value of the last column appears more than once. Unfortunately, the file is not well formatted: there are values that contain commas, like line 6 in the example below:
entity,entity_type,component_id
[email protected],email,1111
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
1327168,phone,4444
fds_213445,device,3333
For the example above, I would expect this new file:
[email protected],email,2222
15158990000,phone,2222
15158112233,phone,3333
hello,[email protected],email,3333
fds_213445,device,3333
I currently use a naive solution of:
But, as I said, this solution is very naive and it has been running for a very long time (almost a week and still running...)
How could I efficiently create, in bash, a new file containing all lines whose component_id appears more than once?
Upvotes: 2
Views: 219
Reputation: 2273
I'm not sure what you mean by a "bash" solution. bash isn't the right tool for this kind of task - not to say that someone won't come up with an elegant all-bash solution... Anyway, since awk answers have already appeared, I figured I'd try a Python solution. This slurps the entire file into memory, but most machines in 2020 should handle a 17GB file just fine, and it does a single pass over the file.
python3 -c "import csv,collections;d=collections.defaultdict(list);[d[r[-1]].append(r) for r in csv.reader(open('hugefile.csv'))];[print(','.join(r)) for (k,v) in d.items() for r in v if len(v)>1]"
Broken down:
import csv,collections
# Create a defaultdict whose values are lists:
d=collections.defaultdict(list)
# For each row in the csv file append the row to the dict with
# the last field (id) as the key:
[d[r[-1]].append(r) for r in csv.reader(open('hugefile.csv'))]
# Print each row of a value in the dict if that value has more than one row in it.
[print(','.join(r)) for (k,v) in d.items() for r in v if len(v)>1]
Output with your example data:
[email protected],email,2222
15158990000,phone,2222
15158112233,phone,3333
hello,[email protected],email,3333
fds_213445,device,3333
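If holding the whole file in memory ever does become a problem, here is a minimal two-pass sketch of the same idea (assuming the same hugefile.csv input and an illustrative newfile.csv output filename) that only keeps the per-id counts in memory:

import csv
from collections import Counter

# First pass: count how many rows share each last-column value.
counts = Counter()
with open('hugefile.csv', newline='') as f:
    for row in csv.reader(f):
        if row:
            counts[row[-1]] += 1

# Second pass: keep only rows whose last-column value appeared more
# than once, writing them out as we go.
with open('hugefile.csv', newline='') as f, \
        open('newfile.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if row and counts[row[-1]] > 1:
            writer.writerow(row)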
Upvotes: 0
Reputation: 204558
Here's a way to do this in awk without reading the whole file into memory at once, while being able to handle input coming from a pipe or a file (so it would work even if the input were the output of some other command, without creating a temp file to hold all of the input) and retaining the original input order and header.
Input from a file:
$ awk 'BEGIN{FS=OFS=","} {print (NR>1), $NF, ++cnt[$NF], NR, $0}' file |
sort -t, -k1,1n -k2,2 -k3,3nr |
awk -F, '$2!=p2{p2=$2; p3=$3} (NR==1) || (p3>1)' |
sort -t, -k4,4n |
cut -d, -f5-
entity,entity_type,component_id
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
fds_213445,device,3333
or input from a pipe:
$ cat file |
awk 'BEGIN{FS=OFS=","} {print (NR>1), $NF, ++cnt[$NF], NR, $0}' |
sort -t, -k1,1n -k2,2 -k3,3nr |
awk -F, '$2!=p2{p2=$2; p3=$3} (NR==1) || (p3>1)' |
sort -t, -k4,4n |
cut -d, -f5-
entity,entity_type,component_id
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
fds_213445,device,3333
Note that this approach allows you to take input from a pipe, not just from a file, so you can pipe the output of another command to it if you like. In the above, only sort has to handle the whole input at once, and it's designed to do so by using demand paging, etc., so it's extremely unlikely to have any issues handling large input.
Here's what the script is doing in steps so you can see how it works:
$ awk 'BEGIN{FS=OFS=","} {print (NR>1), $NF, ++cnt[$NF], NR, $0}' file
0,component_id,1,1,entity,entity_type,component_id
1,1111,1,2,[email protected],email,1111
1,2222,1,3,[email protected],email,2222
1,3333,1,4,15158112233,phone,3333
1,2222,2,5,15158990000,phone,2222
1,3333,2,6,hello,[email protected],email,3333
1,4444,1,7,1327168,phone,4444
1,3333,3,8,fds_213445,device,3333
$ ... | sort -t, -k1,1n -k2,2 -k3,3nr
0,component_id,1,1,entity,entity_type,component_id
1,1111,1,2,[email protected],email,1111
1,2222,2,5,15158990000,phone,2222
1,2222,1,3,[email protected],email,2222
1,3333,3,8,fds_213445,device,3333
1,3333,2,6,hello,[email protected],email,3333
1,3333,1,4,15158112233,phone,3333
1,4444,1,7,1327168,phone,4444
$ ... | awk -F, '$2!=p2{p2=$2; p3=$3} (NR==1) || (p3>1)'
0,component_id,1,1,entity,entity_type,component_id
1,2222,2,5,15158990000,phone,2222
1,2222,1,3,[email protected],email,2222
1,3333,3,8,fds_213445,device,3333
1,3333,2,6,hello,[email protected],email,3333
1,3333,1,4,15158112233,phone,3333
$ ... | sort -t, -k4,4n
0,component_id,1,1,entity,entity_type,component_id
1,2222,1,3,[email protected],email,2222
1,3333,1,4,15158112233,phone,3333
1,2222,2,5,15158990000,phone,2222
1,3333,2,6,hello,[email protected],email,3333
1,3333,3,8,fds_213445,device,3333
$ ... | cut -d, -f5-
entity,entity_type,component_id
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
fds_213445,device,3333
Upvotes: 0
Reputation: 133760
1st solution: Read the Input_file a single time and use arrays to check whether each last-field value appears more than once in the whole Input_file.
awk '
BEGIN{
  FS=","
}
{
  # Count occurrences of the last field (component_id).
  arr[$NF]++
  # Remember the first line seen for each component_id.
  if(!temparr[$NF]++){
    first[$NF]=$0
  }
}
arr[$NF]>1{
  # On the second occurrence, print the stored first line once,
  # then print every duplicate line as it is read.
  if(first[$NF]){
    print first[$NF]
    delete first[$NF]
  }
  print
}
' Input_file
2nd solution: Read the whole Input_file, collect all lines and last-field values into arrays, and process them in the END block of awk once the Input_file has been read.
awk '
BEGIN{
  FS=","
}
{
  # Count occurrences of the last field (component_id).
  arr[$NF]++
  # Record each distinct component_id in input order.
  if(!arr1[$NF]++){
    arr2[++count]=$NF
  }
  # Accumulate all lines for each component_id, separated by ORS.
  val[$NF]=(val[$NF]?val[$NF] ORS:"")$0
}
END{
  # Print the accumulated lines for every component_id seen more than once.
  for(i=1;i<=count;i++){
    if(arr[arr2[i]]>1){
      print val[arr2[i]]
    }
  }
}' Input_file
NOTE: My 3rd solution was to pass the Input_file into awk twice, which is already covered by Shawn in his answer :) so I removed it from here. Also, FYI, these are tested with the shown samples and NOT with a huge data set.
Upvotes: 1
Reputation: 52579
With awk and two passes through the file? The first pass counts the occurrences of the last field; the second prints just the duplicates.
awk -F, 'FNR == NR { ids[$NF]++; next }
ids[$NF] > 1 || FNR == 1' hugefile.csv hugefile.csv > newfile.csv
Upvotes: 3