rororo

Reputation: 845

Read lines from a file, grep in a second file, and output a file for each $line

I have the following two files:

sequences.txt

158333741       Acaryochloris_marina_MBIC11017_uid58167 158333741       432     1       432     COG0001 0
158339504       Acaryochloris_marina_MBIC11017_uid58167 158339504       491     1       491     COG0002 0
379012832       Acetobacterium_woodii_DSM_1030_uid88073 379012832       430     1       430     COG0001 0
302391336       Acetohalobium_arabaticum_DSM_5501_uid51423      302391336       441     1       441     COG0003 0
311103820       Achromobacter_xylosoxidans_A8_uid59899  311103820       425     1       425     COG0004 0
332795879       Acidianus_hospitalis_W1_uid66875        332795879       369     1       369     COG0005 0
332796307       Acidianus_hospitalis_W1_uid66875        332796307       416     1       416     COG0005 0

allids.txt

COG0001
COG0002
COG0003
COG0004
COG0005

Now I want to read each line in allids.txt, search sequences.txt for the lines whose column 7 matches it, and write those lines to a file named after that ID ($line.txt).

My approach is to use a simple grep:

while read line; do
  grep "$line" sequences.txt
done <allids.txt

But where do I incorporate the command for the output? If there is a faster command, feel free to suggest it!

My expected output:

COG0001.txt

158333741       Acaryochloris_marina_MBIC11017_uid58167 158333741       432     1       432     COG0001 0
379012832       Acetobacterium_woodii_DSM_1030_uid88073 379012832       430     1       430     COG0001 0

COG0002.txt

158339504       Acaryochloris_marina_MBIC11017_uid58167 158339504       491     1       491     COG0002 0

[and so on]

Upvotes: 3

Views: 3285

Answers (3)

athul.sure

Reputation: 328

Extending your approach, this seemed to work:

while read line; do
  # touching is not necessary as pointed out by @123
  # touch "$line.txt" 
  grep "$line" sequences.txt > "$line.txt"
done <allids.txt

It produces text files with the required output, but I cannot comment on the efficiency of this approach.

EDIT:

As has been pointed out in the comments, this method is slow and would break for any file that violates the unstated assumptions used in the answer. I'm leaving it here for people to see how a quick and hacky solution could backfire.
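
One way the loop above can go wrong: grep "$line" treats the ID as a regular expression and matches it anywhere on the line, so an ID could also hit another column or a longer ID that merely contains it. Below is a minimal sketch of a stricter loop, assuming whitespace-separated columns with the ID in column 7. The scratch directory, the three sample rows and the variable name id are only there for illustration.

```shell
# Illustration only: work in a scratch directory with a few rows from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  158333741 Acaryochloris_marina_MBIC11017_uid58167 158333741 432 1 432 COG0001 0 \
  158339504 Acaryochloris_marina_MBIC11017_uid58167 158339504 491 1 491 COG0002 0 \
  379012832 Acetobacterium_woodii_DSM_1030_uid88073 379012832 430 1 430 COG0001 0 \
  > sequences.txt
printf '%s\n' COG0001 COG0002 > allids.txt

# IFS= and -r keep read from trimming whitespace or interpreting backslashes.
while IFS= read -r id; do
  [ -n "$id" ] || continue   # skip blank lines
  # Test column 7 for equality with the ID instead of searching the whole line with a regex.
  awk -v id="$id" '$7 == id' sequences.txt > "$id.txt"
done < allids.txt
```

This still rescans sequences.txt once per ID, so it is no faster than the grep loop; it only tightens the matching. The single-pass awk answers avoid the repeated scans.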

Upvotes: -2

Ed Morton

Reputation: 204731

I suspect all you really need is:

awk '{print > ($7".txt")}' sequences.txt

That suspicion is based on your IDs file being named allids.txt (note the "all") and on there being no IDs in sequences.txt that don't exist in allids.txt.
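
If sequences.txt contains a large number of distinct IDs, some awk implementations may stop with a "too many open files" error, because the one-liner keeps every output file open until the end. Here is a hedged sketch of a variant that closes each file after writing to it. The scratch directory, the sample rows and the variable name out are assumptions for illustration.

```shell
# Illustration only: work in a scratch directory with a few rows from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  158333741 Acaryochloris_marina_MBIC11017_uid58167 158333741 432 1 432 COG0001 0 \
  332795879 Acidianus_hospitalis_W1_uid66875 332795879 369 1 369 COG0005 0 \
  332796307 Acidianus_hospitalis_W1_uid66875 332796307 416 1 416 COG0005 0 \
  > sequences.txt

# Close each output file right after writing so only one is open at a time.
# Because the file is reopened for every line, >> (append) is needed;
# with > each reopen would truncate the file and keep only the last line.
awk '{ out = $7 ".txt"; print >> out; close(out) }' sequences.txt
```

Since >> appends, output files left over from an earlier run should be removed first. Opening and closing a file per line is also slower, so this is only worth it when the plain one-liner actually hits the limit.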

Upvotes: 3

anubhava

Reputation: 786359

It is quite simple to do it using awk:

awk 'NR==FNR{ids[$1]; next} $7 in ids{print > ($7 ".txt")}' allids.txt sequences.txt

Reference: Effective AWK Programming
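
For readers new to the NR==FNR idiom, here is the same command spread over several lines with comments. It is a sketch run against a few sample rows in a scratch directory (both are assumptions for illustration); the awk program itself is unchanged.

```shell
# Illustration only: work in a scratch directory with a few rows from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  158333741 Acaryochloris_marina_MBIC11017_uid58167 158333741 432 1 432 COG0001 0 \
  158339504 Acaryochloris_marina_MBIC11017_uid58167 158339504 491 1 491 COG0002 0 \
  379012832 Acetobacterium_woodii_DSM_1030_uid88073 379012832 430 1 430 COG0001 0 \
  332795879 Acidianus_hospitalis_W1_uid66875 332795879 369 1 369 COG0005 0 \
  > sequences.txt
printf '%s\n' COG0001 COG0002 > allids.txt   # COG0005 is deliberately left out

awk '
  # NR counts lines across all input files, FNR restarts for each file,
  # so NR == FNR holds only while the first file (allids.txt) is being read.
  NR == FNR { ids[$1]; next }   # referencing ids[$1] creates the key

  # Remaining lines come from sequences.txt: keep those whose column 7 is a known ID.
  $7 in ids { print > ($7 ".txt") }
' allids.txt sequences.txt
```

Each input file is read exactly once. Unlike the grep loop, no file is created for an ID that has no matching rows, and rows whose column 7 is not listed in allids.txt (COG0005 in this sample) are skipped.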

Upvotes: 5
