Reputation: 11
I have a text file containing a list of skill names (nearly 150 million lines). I sorted it using the command
sort myFile.txt >> SortedFile.txt
To verify the result of this command, I executed the command
grep -n "^JavaScript$" SortedFile.txt >> lineNumbers.txt
I could see that JavaScript occurs in two groups: one from line 27819903 to 28071139 and the other from line 99390179 to 99607141.
This problem is not limited to the skill "JavaScript"; it occurs for many skills. What is the problem with the sort command?
How can I sort myFile.txt correctly using the sort command?
Upvotes: 0
Views: 502
Reputation: 882078
It's a little hard to tell without the test data (a) but, since the matches fall into two sections, my first suggestion would be to change:
sort myFile.txt >> SortedFile.txt
to:
sort myFile.txt > SortedFile.txt
The first of those (>>) simply appends the results to a file that may already exist, whereas the second (>) overwrites it. So, if you run the appending form twice, you will get two separate sorted sections. Ditto for the grep command that you're using to discover the line numbers.
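To see the effect in isolation, here is a minimal sketch using a throwaway directory and a made-up three-line input. The names demo.txt and out.txt are placeholders for illustration, not the files from the question. Running the appending form twice leaves two sorted runs back to back:

```shell
#!/bin/sh
# Hypothetical demo files in a temporary directory; nothing here touches myFile.txt.
tmpdir=$(mktemp -d)
printf '%s\n' JavaScript C JavaScript > "$tmpdir/demo.txt"

# Run the appending form twice: the second sorted run lands after the first.
sort "$tmpdir/demo.txt" >> "$tmpdir/out.txt"
sort "$tmpdir/demo.txt" >> "$tmpdir/out.txt"

# Collect the line numbers on which JavaScript appears.
appended=$(grep -n '^JavaScript$' "$tmpdir/out.txt" | cut -d: -f1 | paste -s -d ' ' -)
echo "appended twice: $appended"    # prints: appended twice: 2 3 5 6

# The truncating form replaces the file, so only one group remains.
sort "$tmpdir/demo.txt" > "$tmpdir/out.txt"
truncated=$(grep -n '^JavaScript$' "$tmpdir/out.txt" | cut -d: -f1 | paste -s -d ' ' -)
echo "truncated: $truncated"        # prints: truncated: 2 3

rm -rf "$tmpdir"
```

The gap between lines 3 and 5 in the appended case is the second run's copy of C, which is the same shape as the two distant JavaScript groups in the question.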
I'd expect that, if you used the same source both times, you'd get two chunks of equal size (which is not the case here), but I have no idea what the file contained before you appended to it.
So, try it without appending and see if you have the same issue.
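One rough way to check whether appending is what happened, sketched on the assumption that both files are still on disk: a cleanly produced output has exactly as many lines as the input, and sort -c confirms it is in order. The helper name is_clean_sort is made up for this example.

```shell
#!/bin/sh
# is_clean_sort INPUT OUTPUT  (hypothetical helper name)
# Succeeds only if OUTPUT has the same number of lines as INPUT
# and OUTPUT is already in sorted order.
is_clean_sort() {
    in_lines=$(wc -l < "$1")
    out_lines=$(wc -l < "$2")
    # A line-count mismatch (typically extra lines) points to earlier appended runs.
    [ "$in_lines" -eq "$out_lines" ] || return 1
    # sort -c only checks the order; it does not write sorted output.
    sort -c "$2" 2>/dev/null
}

# Usage, assuming the file names from the question:
#   sort myFile.txt > SortedFile.txt
#   is_clean_sort myFile.txt SortedFile.txt && echo "looks fine"
```

If the line counts differ, the output file most likely held something before the sort ran.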
The other thing I'd be asking myself is: why are you sorting it? It appears to me (though I've been wrong before, just ask my wife for a comprehensive list) that the only possible use case here is to count the occurrences of each skill. I'm having a hard time envisaging any other possibility but, if you have one, please let us know; there may be an equally good or better way to handle a different use case.
If counting is the goal, there are better ways to do it than sorting. You could simply count the skills without regard to order, with something like:
awk '{count[$1]++}END{for(key in count){print key" "count[key]}}' myFile.txt
See the following command as an example:
pax> ( echo JavaScript; echo C; echo Java; echo JavaScript ) | awk '
...> {count[$1]++}END{for(key in count){print key" "count[key]}}'
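This generates the following (the order of the lines may differ, since awk does not guarantee any particular order for a for-in loop):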
C 1
Java 1
JavaScript 2
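Applied to a real file, a hedged variant might look like the sketch below. It assumes each line holds exactly one skill name and keys on the whole line ($0 rather than $1) so that names containing spaces stay intact. The helper name count_skills and the output file skillCounts.txt are made up for illustration. The result is piped through sort to get a stable order, which is cheap because there is only one line per distinct skill.

```shell
#!/bin/sh
# count_skills FILE  (hypothetical helper name)
# Prints "count<TAB>skill" for each distinct line, most frequent first.
count_skills() {
    awk '{ count[$0]++ }
         END { for (key in count) printf "%d\t%s\n", count[key], key }' "$1" |
        sort -k1,1nr -k2
}

# Usage, assuming the input file from the question:
#   count_skills myFile.txt > skillCounts.txt
```

The awk step makes a single pass over the input and keeps one counter per distinct skill in memory, so the 150 million lines themselves never need sorting.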
(a) And three-odd gig is probably a little too much test data to post :-)
Upvotes: 2