Reputation: 1216
I have a .CSV data file with a ton, and I mean a TON (80+ million lines) of data.
The data is all in two columns, and looks like the following:
src | dst
123123 | 456456
321321 | 654654
987987 | 789789
123123 | 456456
and so on for 80 million lines.
(note: I know that the delimiter should be a ',' in a .CSV, but in this case it's a '|'. The file extension is still .CSV.)
I'm trying to figure out how to write a program that will read in all the data, and print out the number of repeated values in the 'src' field. For example, in my example, the output would look like '123123: showed up 2 times'
I've tried a few solutions, most notably this: How to read the csv file properly if each row contains different number of fields (number quite big)?
I wrote a loop to split the 'src' from the 'dst', where 'data' is the array of lines read from the .CSV file:
//go through each line and split + link the data to src/dst
data.forEach(function (line) {
    let newData = line.split('|'); //note, split returns an array
    let src = newData[0]; //src from data.csv
    let dst = newData[1]; //dst from data.csv
    //test print the data
    //console.log(src, dst);
});
But I'm having issues counting the duplicate values in the newData[0] (src) column.
Upvotes: 2
Views: 3858
Reputation: 577
It can be done in a single loop (an O(N) complexity solution...very important if you have 80 million lines...):
function solution(A)
{
    var lines = A.split(/\r?\n/g);
    var counts = {};
    var multiples = {};
    for (var i = 0, ii = lines.length; i < ii; i++)
    {
        var splt = lines[i].split(/\s*\|\s*/g);
        var val = splt[0];
        if (!counts[val]) {
            counts[val] = 1;
        } else {
            counts[val]++;
            multiples[val] = counts[val];
        }
    }
    return multiples;
}
That returns an object whose keys are the values that appear more than once in the first column, and whose values are how many times each appears. For example, your given string would return the object:
{ '123123': 2 }
because that value is seen twice.
Here is a jsfiddle of it working (it logs it to the console, so open your dev tools): https://jsfiddle.net/x8b7ko3g/
Upvotes: 4
Reputation: 1577
I would try sorting the file first, e.g. using the command line tool "sort". Once sorted, identical "src" values are adjacent, so you can count how often the same "src" repeats until you find another "src".
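As a sketch of this approach with standard Unix tools (assuming GNU coreutils and that the file is named data.csv): cut extracts the first '|'-separated field, sort groups identical values together, uniq -c counts each group, and awk keeps only the values that appear more than once.

```shell
cut -d'|' -f1 data.csv | sort | uniq -c | awk '$1 > 1 { print $2 ": showed up " $1 " times" }'
```

sort is disk-backed, so this also avoids holding all 80 million lines in memory at once.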
Upvotes: 0