Reputation: 3522
I'm trying to combine the topics of this and this question, i.e. match each string/line in File2 with its occurrence in File1 (each string occurs only once), print the whole line that it occurs on in File1, and also print the lines between each match and the next header (i.e. the sequence in File1).
File1
>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01000526.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAU
UAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAGUGAAAC
>GAXI01005455.1.1233 Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Chryseobacterium;Tetrodontophora bielanensis (giant springtail)
CUUUCGAAAGGAAGAUUAAUACCCCAUAACAUA
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU
File2
>GAXI01000525.151.1950
>GAXI01006199.29.1525
What I have so far:
awk 'FNR==NR{a[$0];next} $1 in a' file2 file1 > output
which gives:
>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
I would like this:
>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU
The original files contain thousands of rows, so the fastest possible solution is appreciated, whether awk, sed or anything else.
Upvotes: 1
Views: 166
Reputation: 58488
This might work for you (GNU sed):
sed 's:.*:/^&/bb:' file2 | sed -e ':a' -f - -e 'd;:b;n;/^>/ba;bb' file1
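Use two invocations of sed. The first turns each line of file2 into a regexp address followed by a branch command (/^ID/bb). The second reads that generated script from stdin (-f -) and runs it against file1, wrapped in the framework that does the printing: a line that matches one of the IDs branches to label b, where it is printed together with the lines that follow it, up to the next start of record (a line beginning with >) or the end of the file. Lines that match nothing are deleted.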
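To see what the first invocation hands to the second, it can be run on its own. A minimal sketch, assuming a two-line file2 like the one in the question; the temporary directory and the variable name generated are scaffolding of mine, not part of the answer:

```shell
# Scaffolding (an assumption for illustration): recreate file2 in a temp dir.
tmp=$(mktemp -d)
printf '%s\n' '>GAXI01000525.151.1950' '>GAXI01006199.29.1525' > "$tmp/file2"

# First sed only: wrap every ID as /^ID/bb, which reads
# "if the line starts with ID, branch to label b".
generated=$(sed 's:.*:/^&/bb:' "$tmp/file2")
printf '%s\n' "$generated"
```

Note that the dots in the IDs act as regexp wildcards inside the generated addresses, and that each address is anchored only at the start of the line, so an ID could in principle also match a longer header that begins with the same characters.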
Upvotes: 1
Reputation: 133680
@jO: Try:
awk 'FNR==NR{A[$1];next} ($0 ~ /^>/){Q=""} ($1 in A){Q=1} Q{print}' file2 file1
EDIT: Adding an explanation of the solution.
awk '
##### FNR==NR is TRUE only while file2 (the first file) is being read. FNR and NR are awk built-in variables that both count input lines; FNR is reset at the start of each file, while NR keeps increasing until all files have been read.
FNR==NR {
    A[$1]               ##### Create an entry in array A whose index is $1, the first field of the file2 line.
    next                ##### Skip all remaining statements for this line.
}
##### Everything below therefore runs for file1 only.
($0 ~ /^>/) { Q="" }    ##### A line starting with > begins a new record, so clear the flag Q.
($1 in A) { Q=1 }       ##### If the first field of the current line is an index of A, set Q to 1.
Q { print }             ##### Print the line whenever Q is set (non-empty).
' file2 file1           ##### Input files: file2 first, then file1.
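The FNR==NR test is easier to follow when the two counters are printed side by side. A small sketch, assuming two throwaway files (first and second are names I made up for the demonstration):

```shell
# Scaffolding (assumed for illustration): two tiny files in a temp dir.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/first"
printf 'c\nd\ne\n' > "$tmp/second"

# Print NR (running count over all input) and FNR (count within the
# current file), plus whether the FNR==NR test holds for that line.
counters=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR, (FNR == NR ? "first-file" : "later-file") }' first second)
printf '%s\n' "$counters"
```

The two counters agree only on the lines of the first file. One caveat that follows from this: if the first file is empty, FNR==NR also holds while the second file is read, so the idiom relies on file2 containing at least one line.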
Upvotes: 1
Reputation: 8174
You can try this with awk:
awk 'FNR==NR{d[$1]; next}/^>/{f=0}$1 in d{f=1}f' file2 file1
You get:
>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU
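Before running the one-liner on the real data, it may be worth checking it on a shortened copy of the input. The sketch below assumes made-up, abbreviated records (the seqA/seqB/seqC IDs and the short sequences are placeholders of mine, not data from the question):

```shell
# Scaffolding (assumed for illustration): abbreviated stand-ins for file1 and file2.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
>seqA.1.10 Eukaryota;example
CCUGG
GAUUA
>seqB.1.10 Bacteria;example
CUUUC
>seqC.1.10 Bacteria;example
AGAAU
UGCAA
EOF
printf '%s\n' '>seqA.1.10' '>seqC.1.10' > "$tmp/file2"

# Same flag logic as above: f is cleared on every header line, set again
# if the header ID is listed in file2, and a line is printed while f is set.
result=$(awk 'FNR==NR{d[$1]; next} /^>/{f=0} $1 in d{f=1} f' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
```

With this input the seqA and seqC records should be printed in full and the seqB record skipped.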
Upvotes: 1