Reputation: 3522

awk/sed: matching pattern between files and printing everything between matches

I'm trying to combine the topics of two earlier questions: match each string (line) in File2 with its occurrence in File1 (each string occurs only once), print the whole File1 line it occurs on, and also print the lines that follow each match up to the next header (i.e. the sequence in File1).

File1

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01000526.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAU
UAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAGUGAAAC
>GAXI01005455.1.1233 Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Chryseobacterium;Tetrodontophora bielanensis (giant springtail)
CUUUCGAAAGGAAGAUUAAUACCCCAUAACAUA
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU

File2

>GAXI01000525.151.1950
>GAXI01006199.29.1525

What I have so far:

awk 'FNR==NR{a[$0];next} $1 in a' file2 file1 > output

which gives:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)

I would like this:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU

The original files contain thousands of rows so the fastest possible solution is appreciated, either awk, sed or anything else...

Upvotes: 1

Answers (3)

potong

Reputation: 58488

This might work for you (GNU sed):

sed 's:.*:/^&/bb:' file2 | sed -e ':a' -f - -e 'd;:b;n;/^>/ba;bb' file1

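Each line of file2 is turned into a sed command that matches the corresponding header line in file1; file1 lines that belong to no matched record are deleted.

This takes two invocations of sed. The first builds one regexp match per line of file2. The second reads those commands from standard input (-f -) and wraps them in a small framework that prints a matched header and every following line until the next start of record (a line beginning with >) or the end of the file.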
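To make the first stage concrete, here is a minimal sketch that runs only the first sed on the question's File2 and shows the script it generates. The variable name `script` is just for illustration. Note that the IDs are used as regular expressions, so each `.` matches any character and the pattern matches any header that merely starts with the ID; for the sample data in the question that makes no difference.

```shell
# Recreate File2 from the question: the two wanted header IDs.
printf '%s\n' '>GAXI01000525.151.1950' '>GAXI01006199.29.1525' > file2

# First stage only: every line of file2 becomes a sed command "/^ID/bb",
# i.e. "if the line starts with this ID, branch to label b".
script=$(sed 's:.*:/^&/bb:' file2)
printf '%s\n' "$script"
# prints:
# /^>GAXI01000525.151.1950/bb
# /^>GAXI01006199.29.1525/bb
```

The second sed places these commands between the label `:a` and the final `d;:b;n;/^>/ba;bb`, so a header that matches none of them falls through to `d` and is deleted.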

Upvotes: 1

RavinderSingh13

Reputation: 133680

@jO: Try:

awk 'FNR==NR{A[$1];next} ($0 ~ /^>/){Q=""} ($1 in A){Q=1} Q{print}' file2  file1

EDIT: Adding an explanation of the solution:

awk '
FNR==NR {            ##### True only while file2 (the first file) is being read: FNR restarts at 1 for every input file, whereas NR keeps counting across all files.
    A[$1]            ##### Create an entry in array A indexed by $1, the first field of the file2 line.
    next             ##### Skip all remaining rules for this line.
}
                     ##### The rules below are therefore applied to file1 only.
($0 ~ /^>/) {        ##### A line starting with > begins a new record in file1,
    Q = ""           ##### so reset the flag Q to empty.
}
($1 in A) {          ##### If the first field of the current line is an index of A,
    Q = 1            ##### set the flag Q to 1.
}
Q {                  ##### While Q is non-empty,
    print            ##### print the current line.
}
' file2  file1       ##### Input files: file2 first, then file1.
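As a quick check of the flag logic described above, the following sketch runs the same one-liner on a trimmed copy of the question's data. The trimming is an assumption made for brevity: the IDs are the real ones, but the taxonomy is cut to a single word and each record keeps only its first sequence lines. The variable name `result` is illustrative.

```shell
# Trimmed copy of File1: real IDs, shortened taxonomy, fewer sequence lines.
cat > file1 <<'EOF'
>GAXI01000525.151.1950 Eukaryota
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
>GAXI01000526.151.1950 Eukaryota
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAU
>GAXI01005455.1.1233 Bacteria
CUUUCGAAAGGAAGAUUAAUACCCCAUAACAUA
>GAXI01006199.29.1525 Bacteria
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
EOF

# File2 as in the question: the IDs of the records to keep.
printf '%s\n' '>GAXI01000525.151.1950' '>GAXI01006199.29.1525' > file2

# Same logic as the one-liner: remember the IDs, clear the flag on every
# header, set it again on wanted headers, print while the flag is set.
result=$(awk 'FNR==NR{A[$1];next} /^>/{Q=""} ($1 in A){Q=1} Q' file2 file1)
printf '%s\n' "$result"
```

This should print the first record (header plus two sequence lines) and the last record (header plus one sequence line), skipping the two records in between. One caveat: if file2 is empty, FNR==NR stays true while file1 is read, so nothing is printed.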

Upvotes: 1

Jose Ricardo Bustos M.

Reputation: 8174

You can try this with awk:

awk 'FNR==NR{d[$1]; next}/^>/{f=0}$1 in d{f=1}f' file2 file1

which gives:

>GAXI01000525.151.1950 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Hexapoda;Ellipura;Collembola;Tetrodontophora bielanensis (giant springtail)
CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAA
GAUUAAGCCAUGCAUGUCUAAGUUCAAGCAAAAAUAAAG
ACCGCGAAUGGCUCAUUAUAUCAGUUAUGGUUCCUUAGA
ACUUACUACUUGGAUAACUGUGGUAAUUCUAGAGCUAAU
>GAXI01006199.29.1525 Bacteria;Chlamydiae;Chlamydiae;Chlamydiales;Simkaniaceae;Candidatus Rhabdochlamydia;Tetrodontophora bielanensis (giant springtail)
AGAAUUUGAUCUUGGUUCAGAUUGAAUGCUGG
UGCAAGUCGAACGAAGCUAGAGGGCAACCUCU

Upvotes: 1
