Reputation: 4239
I have a tsv file with 3 columns and 7000 rows. It looks like this:
1341234jh34h123h abc 1
23k4j123j4123h4h abc 1
123j41j234j234jj bbb 1
1234jj1324j123j4 ccc 1
2134j1234j1234jj bbb 1
1324j123j4123j41 abc 1
132j412j34j1234j ddd 1
12j34j1234j4j234 abc 1
12j34j234j123j43 abc 1
123j412j341234jj abc 1
123j4j234j132j4j abc 1
123k41k234123l4l bbb 1
2k134k2134k23k4k abc 2
132k4k132k423k4k ddd 1
k234k123k4k34k34 bbb 1
23k4k34k3k43k43k abc 1
l234k34l3l43;3;4 abc 1
k234k23k42k342k3 bbb 1
q,wmeqwjneqkwjen ddd 1
llqkweqweqjwejqw bbb 1
My goal is to: take out the second column, sort it, and return the unique values in a tsv file.
The code that I wrote on the terminal is:
cut -f 2 input.tsv | sort | uniq > output_final.tsv
It took forever to run in the terminal (note that the file has only 7000 rows; if you run the above code on just the twenty rows of data provided above, it finishes very quickly).
However, if I do it in a naive way like below, it is done very fast:
cut -f 2 input.tsv > output1.tsv
then
sort output1.tsv > output2.tsv
uniq output2.tsv > output_final.tsv
So why is the
cut -f 2 input.tsv | sort | uniq > output_final.tsv
code taking forever to run? Am I writing this wrong?
BIG UPDATE: So I tried the time commands suggested by @paxdiablo. Interestingly, I found that:
time (cut -f 2 input.tsv >/dev/null)
real 0m0.017s
user 0m0.015s
sys 0m0.002s
time (cut -f 2 input.tsv | sort >/dev/null)
real 0m0.025s
user 0m0.021s
sys 0m0.006s
time (cut -f 2 input.tsv | sort | uniq >/dev/null)
real 0m0.027s
user 0m0.026s
sys 0m0.008s
So the jobs take a tiny amount of time. BUT when I run
cut -f 2 input.tsv | sort >/dev/null
the terminal just hangs like below and doesn't return anything at all:
chinegro $ > cut -f 2 input.tsv | sort >/dev/null
Normally, when the job finishes, the terminal should look like this:
chinegro $ > cut -f 2 input.tsv | sort >/dev/null
output blablablalblalblalblalba
chinegro $ >
Upvotes: 1
Views: 109
Reputation: 881463
The pipeline shouldn't make much of a difference.
The first thing I would do is to see which component is causing the problem by running the following commands:
time ( cut -f 2 input.tsv >/dev/null )
time ( cut -f 2 input.tsv | sort >/dev/null)
time ( cut -f 2 input.tsv | sort | uniq >/dev/null)
a few times each and recording the times.
Then you may want to ask a question on a suitable site :-) about how best to do the job you want to do, not pre-supposing that cut, sort and uniq will be necessary. Far too many people limit their solution space unnecessarily by stating the tools they're using. You should state just the problem and only limit the solution space if absolutely required.
For a start, you could ditch the uniq by using sort -u, and there may even be a better way using different tools, such as:
awk '{keys[$2] = 1} END {for (key in keys) { print key } }' input.tsv
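For example, a quick sketch of the sort -u variant (sample.tsv here is a made-up stand-in for the real file):

```shell
# A throwaway sample in the same tab-separated shape as the question's file.
printf '1341234jh34h123h\tabc\t1\n23k4j123j4123h4h\tbbb\t1\n123j41j234j234jj\tabc\t1\n' > sample.tsv

# sort -u replaces the sort | uniq pair with a single process:
cut -f 2 sample.tsv | sort -u
# abc
# bbb
```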
And, after your update:
time (cut -f 2 input.tsv | sort | uniq >/dev/null)
real 0m0.027s
user 0m0.026s
sys 0m0.008s
you can see that it's taking about a thirtieth of a second (user + sys = 0.034s) CPU time.
Hence it's likely that you've gotten something wrong in the original command itself. If it's not returning to the prompt for a long time, that's usually indicative that you've left off the input file name, with something like:
cut -f 2 | sort
and the cut will wait forever until you enter some lines and then press CTRL-D to indicate end of file (you can test this by pressing CTRL-D while it's hanging and seeing if the prompt returns).
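You can see that stdin-reading behaviour without an interactive terminal by feeding cut from a pipe instead: with no file argument it reads standard input, and a pipe closes that input immediately, so the command returns at once rather than waiting for CTRL-D:

```shell
# No file argument given to cut, so it reads standard input.
# The pipe supplies (and then closes) stdin, so there is no hang here.
printf 'k1\tabc\t1\nk2\tbbb\t1\n' | cut -f 2 | sort
# abc
# bbb
```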
So I would urge you to check your actual command, especially in light of the fact that your final one was using output.tsv as the input file. That's wrong and, if it's a typo, you should double-check that the other commands you've shown us are the actual ones you're using.
Upvotes: 2
Reputation: 6228
On my laptop your command on 7000 rows is immediate, but it doesn't produce the right output, since cut -f 2 doesn't behave as expected. This snippet works fast:
while read a b c ; do echo "$b" ; done < input.tsv | sort | uniq >| output_final.tsv
The >| at the end forces the overwrite even when the shell's noclobber option is set; with noclobber off, a plain > overwrites just the same.
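A small sketch of what noclobber changes (demo.txt is just a scratch file for illustration):

```shell
rm -f demo.txt
set -o noclobber
echo first > demo.txt                                # creates the file
echo second > demo.txt 2>/dev/null || echo refused   # noclobber blocks plain >
echo second >| demo.txt                              # >| forces the overwrite
cat demo.txt
set +o noclobber
# prints: refused
#         second
```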
Upvotes: 0