mynameisJEFF

Reputation: 4239

Bash: Piping results out to file not working as expected

I have a tsv file with 3 columns and 7000 rows. It looks like this:

1341234jh34h123h    abc 1
23k4j123j4123h4h    abc 1
123j41j234j234jj    bbb 1
1234jj1324j123j4    ccc 1
2134j1234j1234jj    bbb 1
1324j123j4123j41    abc 1
132j412j34j1234j    ddd 1
12j34j1234j4j234    abc 1
12j34j234j123j43    abc 1
123j412j341234jj    abc 1
123j4j234j132j4j    abc 1
123k41k234123l4l    bbb 1
2k134k2134k23k4k    abc 2
132k4k132k423k4k    ddd 1
k234k123k4k34k34    bbb 1
23k4k34k3k43k43k    abc 1
l234k34l3l43;3;4    abc 1
k234k23k42k342k3    bbb 1
q,wmeqwjneqkwjen    ddd 1
llqkweqweqjwejqw    bbb 1

My goal is to take the second column, sort it, and return the unique values in a TSV file.

The command that I ran in the terminal is: cut -f 2 input.tsv | sort | uniq > output_final.tsv

It took forever to run this in the terminal (note that the file has 7000 rows; if you run the above command on just the twenty rows of sample data shown, it finishes very quickly).

However, if I do it in a naive way like below, it finishes very quickly.

cut -f 2 input.tsv > output1.tsv

then

sort output1.tsv > output2.tsv

uniq output2.tsv > output_final.tsv

So why does cut -f 2 input.tsv | sort | uniq > output_final.tsv take forever to run? Am I writing it wrong?

BIG UPDATE: So I ran the timing commands suggested by @paxdiablo. Interestingly, I found that:

time (cut -f 2 input.tsv >/dev/null)

real    0m0.017s
user    0m0.015s
sys 0m0.002s

time (cut -f 2 input.tsv | sort >/dev/null)

real    0m0.025s
user    0m0.021s
sys 0m0.006s

time (cut -f 2 input.tsv | sort  | uniq >/dev/null)

real    0m0.027s
user    0m0.026s
sys 0m0.008s

So the jobs take a tiny amount of time. BUT when I run cut -f 2 input.tsv | sort >/dev/null on its own, the terminal just hangs like below and never returns anything at all:

 chinegro $ > cut -f 2 input.tsv | sort >/dev/null

Normally, when the job finishes, the terminal should look like this:

chinegro $ > cut -f 2 input.tsv | sort >/dev/null
output blablablalblalblalblalba
chinegro $ > 

Upvotes: 1

Views: 109

Answers (2)

paxdiablo

Reputation: 881463

The pipeline shouldn't make much of a difference.

The first thing I would do is to see which component is causing the problem by running the following commands:

time ( cut -f 2 input.tsv >/dev/null )
time ( cut -f 2 input.tsv  | sort >/dev/null)
time ( cut -f 2 input.tsv  | sort | uniq >/dev/null)

a few times each and recording the times.

Then you may want to ask a question on a suitable site :-) about how best to do the job you want to do, without presupposing that cut, sort and uniq will be necessary. Far too many people limit their solution space unnecessarily by stating the tools they're using. You should state just the problem, and only limit the solution space if absolutely required.

For a start, you could ditch the uniq by using sort -u, and there may even be a better way using different tools, such as:

awk '{keys[$2] = 1} END {for (key in keys) { print key } }' input.tsv
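
For reference, the sort -u variant mentioned above would look something like this (a minimal sketch, keeping the original cut and just dropping the separate uniq step):

cut -f 2 input.tsv | sort -u > output_final.tsv    # sort -u sorts and de-duplicates in one pass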

And, after your update:

time (cut -f 2 input.tsv | sort  | uniq >/dev/null)

real    0m0.027s
user    0m0.026s
sys 0m0.008s

you can see that it's taking about a thirtieth of a second (user + sys = 0.034s) CPU time.

Hence it's likely that you've gotten something wrong in the original command itself. If it's not returning to the prompt for a long time, that's usually indicative that you've left off the input file name, with something like:

cut -f 2 | sort

and cut will wait forever until you enter some lines and then press CTRL-D to indicate end of file (you can test this by entering CTRL-D while it's running and seeing if the prompt returns).
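
As a rough sketch of what that test looks like (hypothetical session; the comments are annotations, not real output):

cut -f 2 | sort      # no input file, so cut sits reading from the terminal
                     # press CTRL-D here: cut sees end-of-file, sort prints whatever was typed, and the prompt returns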

So I would urge you to check your actual command, especially in light of the fact that your final one was using output.tsv as the input file. That's wrong and, if it's a typo, you should double-check that the other commands you've shown us are the actual ones you're using.

Upvotes: 2

Edouard Thiel

Reputation: 6228

On my laptop your command on 7000 rows runs instantly, but it doesn't give the right result because cut -f 2 doesn't work as expected. This snippet works fast:

 while read a b c ; do echo "$b" ; done < input.tsv | sort | uniq >| output_final.tsv

The trailing >| forces the output file to be overwritten (it works even when the shell's noclobber option is set).
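
If the underlying cause is that the columns in input.tsv are separated by spaces rather than a real tab character (just an assumption about why cut -f 2 misbehaves), an awk variant of the same idea would also work, since awk splits on any run of whitespace by default:

awk '{print $2}' input.tsv | sort | uniq >| output_final.tsv    # $2 is the second whitespace-separated field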

Upvotes: 0
