Shmulik Franko

Reputation: 31

Print all duplicates in a poorly formatted CSV

I have a huge CSV file (~17GB) and I need to create a new file that contains all rows whose last-column value appears more than once. Unfortunately, the file is not well formatted: there are values that contain commas, like line 6 in the example below:

entity,entity_type,component_id
[email protected],email,1111
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
1327168,phone,4444
fds_213445,device,3333

For the example above, I would expect this new file:

[email protected],email,2222
15158990000,phone,2222
15158112233,phone,3333
hello,[email protected],email,3333
fds_213445,device,3333

I currently use a naive solution:

  1. Count the size of each component and store in file A.
  2. Remove from file A all components with size = 1.
  3. Run a script over all component_ids in file A and print the matching lines from the original file to a new result file (a rough sketch of this is shown below).
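
Roughly, the steps look something like this (a simplified sketch with placeholder file names, not my exact code):

# 1. Count how many times each component_id (the last field) appears and store the counts in file A
awk -F, 'NR>1{cnt[$NF]++} END{for(id in cnt) print id","cnt[id]}' original.csv > A

# 2. Keep only the component_ids whose count is greater than 1
awk -F, '$2>1{print $1}' A > A_dups

# 3. For every remaining component_id, scan the original file again and append the matching lines
#    (the ids are plain numbers, so anchoring on ",id" at the end of the line is safe)
while read -r id; do
  grep ",${id}\$" original.csv >> result.csv
done < A_dups

The problem is step 3: it rescans the entire 17GB file once for every duplicated component_id, which is why it runs for so long.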

But, as I said, this solution is very naive and takes a very long time (it has been running for almost a week and still isn't done...).

How can I efficiently create, in bash, a new file containing all lines whose component_id appears more than once?

Upvotes: 2

Views: 219

Answers (4)

keithpjolley

Reputation: 2273

I'm not sure what you mean by a "bash" solution. bash isn't the right tool for this kind of task - not to say that someone won't come up with an elegant all-bash solution... Anyway, since awk answers have already appeared, I figured I'd try a Python solution. It slurps the entire file into memory, but most machines in 2020 should handle a 17GB file just fine, and it only makes a single pass over the file.

python3 -c "import csv,collections;d=collections.defaultdict(list);[d[r[-1]].append(r) for r in csv.reader(open('hugefile.csv'))];[print(','.join(r)) for (k,v) in d.items() for r in v if len(v)>1]"

Broken down:

import csv,collections

# Create a defaultdict whose values default to empty lists:
d=collections.defaultdict(list)

# For each row in the csv file append the row to the dict with
# the last field (id) as the key:
[d[r[-1]].append(r) for r in csv.reader(open('hugefile.csv'))]

# Print each row in the dict for keys whose value list has more than one row in it.
[print(','.join(r)) for (k,v) in d.items() for r in v if len(v)>1]

Output with your example data:

[email protected],email,2222
15158990000,phone,2222
15158112233,phone,3333
hello,[email protected],email,3333
fds_213445,device,3333

Upvotes: 0

Ed Morton

Reputation: 204558

Here's a way to do this in awk without reading the whole file into memory at once, while being able to handle input coming from a pipe or a file (so it would work even if the input were the output of some other command, without creating a temp file to hold all of the input), and while retaining the original input order and the header.

Input from a file:

$ awk 'BEGIN{FS=OFS=","} {print (NR>1), $NF, ++cnt[$NF], NR, $0}' file |
    sort -t, -k1,1n -k2,2 -k3,3nr |
    awk -F, '$2!=p2{p2=$2; p3=$3} (NR==1) || (p3>1)' |
    sort -t, -k4,4n |
    cut -d, -f5-
entity,entity_type,component_id
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
fds_213445,device,3333

or input from a pipe:

$ cat file |
    awk 'BEGIN{FS=OFS=","} {print (NR>1), $NF, ++cnt[$NF], NR, $0}' |
    sort -t, -k1,1n -k2,2 -k3,3nr |
    awk -F, '$2!=p2{p2=$2; p3=$3} (NR==1) || (p3>1)' |
    sort -t, -k4,4n |
    cut -d, -f5-
entity,entity_type,component_id
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
fds_213445,device,3333

Note that this approach allows you to take input from a pipe, not just from a file, so you can pipe the output of another command to it if you like. In the above only sort has to handle the whole input at once and it's designed to do so by using demand paging, etc. so it's extremely unlikely to have any issues handling large input.
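
If sort's memory use or temporary-file space ever becomes a concern with input this size, GNU sort (assumed here; other sort implementations may differ) lets you tune both, e.g. with a bigger in-memory buffer and a temp directory that has enough free space (the values below are just placeholders):

sort -t, -k1,1n -k2,2 -k3,3nr -S 4G -T /path/to/big/tmpdir

-S sets how much memory sort may use before it starts spilling to temporary files and -T tells it where to put those files.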

Here's what the script is doing in steps so you can see how it works:

$ awk 'BEGIN{FS=OFS=","} {print (NR>1), $NF, ++cnt[$NF], NR, $0}' file
0,component_id,1,1,entity,entity_type,component_id
1,1111,1,2,[email protected],email,1111
1,2222,1,3,[email protected],email,2222
1,3333,1,4,15158112233,phone,3333
1,2222,2,5,15158990000,phone,2222
1,3333,2,6,hello,[email protected],email,3333
1,4444,1,7,1327168,phone,4444
1,3333,3,8,fds_213445,device,3333

$ ... | sort -t, -k1,1n -k2,2 -k3,3nr
0,component_id,1,1,entity,entity_type,component_id
1,1111,1,2,[email protected],email,1111
1,2222,2,5,15158990000,phone,2222
1,2222,1,3,[email protected],email,2222
1,3333,3,8,fds_213445,device,3333
1,3333,2,6,hello,[email protected],email,3333
1,3333,1,4,15158112233,phone,3333
1,4444,1,7,1327168,phone,4444

$ ... | awk -F, '$2!=p2{p2=$2; p3=$3} (NR==1) || (p3>1)'
0,component_id,1,1,entity,entity_type,component_id
1,2222,2,5,15158990000,phone,2222
1,2222,1,3,[email protected],email,2222
1,3333,3,8,fds_213445,device,3333
1,3333,2,6,hello,[email protected],email,3333
1,3333,1,4,15158112233,phone,3333

$ ... | sort -t, -k4,4n
0,component_id,1,1,entity,entity_type,component_id
1,2222,1,3,[email protected],email,2222
1,3333,1,4,15158112233,phone,3333
1,2222,2,5,15158990000,phone,2222
1,3333,2,6,hello,[email protected],email,3333
1,3333,3,8,fds_213445,device,3333

$ ... | cut -d, -f5-
entity,entity_type,component_id
[email protected],email,2222
15158112233,phone,3333
15158990000,phone,2222
hello,[email protected],email,3333
fds_213445,device,3333

Upvotes: 0

RavinderSingh13

Reputation: 133760

1st solution: Read the Input_file a single time and use arrays to check whether the last field's value appears more than once across the whole Input_file.

awk '
BEGIN{
  FS=","
}
{
  # Count how many times each last-field value (component_id) has been seen so far.
  arr[$NF]++
  # Remember the first line seen for each component_id.
  if(!temparr[$NF]++){
    first[$NF]=$0
  }
}
arr[$NF]>1{
  # Once an id has been seen more than once, print its stored first line (only once),
  # then print the current line too.
  if(first[$NF]){
    print first[$NF]
    delete first[$NF]
  }
  print
}
' Input_file

2nd solution: Read the whole Input_file, collecting all lines and last-field values into arrays, then process them in awk's END block once the Input_file has been read completely.

awk '
BEGIN{
  FS=","
}
{
  # Count how many times each last-field value (component_id) appears.
  arr[$NF]++
  # Record each distinct component_id once, preserving first-seen order.
  if(!arr1[$NF]++){
    arr2[++count]=$NF
  }
  # Collect all lines for this component_id, separated by ORS (newline).
  val[$NF]=(val[$NF]?val[$NF] ORS:"")$0
}
END{
  # Print the collected lines of every component_id seen more than once.
  for(i=1;i<=count;i++){
    if(arr[arr2[i]]>1){
      print val[arr2[i]]
    }
  }
}' Input_file

NOTE: My 3rd solution was to pass the Input_file to awk twice, which is already covered by Shawn in his answer :) so I removed it from here. Also, FYI, these are tested with the shown samples and NOT with a huge data set.

Upvotes: 1

Shawn

Reputation: 52579

With awk and two passes through the file? The first pass counts the occurrences of the last field; the second prints just the duplicates (plus the header line).

awk -F, 'FNR == NR { ids[$NF]++; next }    # 1st pass: count each last-field id
         ids[$NF] > 1 || FNR == 1          # 2nd pass: print duplicate-id lines plus the header
        ' hugefile.csv hugefile.csv > newfile.csv

Upvotes: 3
