Reputation: 421
I have to process a big input file (2.9 GB) to produce output in a particular required format (described below).
A sample of the input file:
GS RSPH14
CC Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS CLXC15
CC Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582
Description: Every line in the input file starts with either GS, CC or FT. I want to ignore the GS lines. For each CC line, I want to split it on : and take index 1 (0-based counting); in my input sample that is chr22 (line 2) and chr3 (line 7). For each FT line, I want to split it on ; and take index 1 and the last index (for line 3 of my sample that is SP1 (Jurkat) and 23144712..23145380, respectively), and process them so that my output file looks like this:
chr22 23144712 23145380 SP1
chr22 23147445 23148194 ER-ALPHA
chr22 23150853 23151108 C/EBPBETA
chr3 23147445 23148194 ER-ALPHA
chr3 23151393 23151582 CTCF
Any help will be much appreciated!
My try: I am able to split the file on ; so that I get my desired columns. What I tried is:
awk -F'[;]' '{print $2 "\t" $4}' sample.txt > output.txt
This gives me the following output:
hg38): chr22:23141092..23152092 (REVERSE)
SP1 (Jurkat) 23144712..23145380
ER-ALPHA (MCF-7) 23147445..23148194
C/EBPBETA (A-549) 23150853..23151108
hg38): chr3:23144021..23155021 (REVERSE)
ER-ALPHA (MCF-7) 23147445..23148194
CTCF (MCF-7) 23151393..23151582
Now from the 1st and 6th line I only want chr22 and chr3, and from the other lines (the ones that did not originally start with GS or CC) I only want the last column, with the corresponding chr prepended. In addition, the first column of those other lines should be split on ( and only the part before it kept.
Upvotes: 2
Views: 77
Reputation: 3089
As requested, using awk:
$ awk '/^CC /{FS=":"; $0=$0; a=$2} /^FT /{FS="[ ;.]+"; $0=$0;print a,$(NF-1),$NF,$5}' file
chr22 23144712 23145380 SP1
chr22 23147445 23148194 ER-ALPHA
chr22 23150853 23151108 C/EBPBETA
chr3 23147445 23148194 ER-ALPHA
chr3 23151393 23151582 CTCF
/^CC /{FS=":"; $0=$0; a=$2}: if the record starts with CC (mind the trailing space), set FS to : . The assignment $0=$0 forces awk to re-split the current record using whatever FS is now. Then a is set to the second field.
/^FT /{FS="[ ;.]+"; $0=$0; print a,$(NF-1),$NF,$5}: if the record starts with FT (again, mind the space), set FS to [ ;.]+ , which makes any run of spaces, ; or . characters a field separator (for example the .. in your last field). $0=$0 re-splits the record as before.
Finally, print the required fields.
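To see what $0=$0 does in isolation, here is a small sketch on one CC line copied from the question. It also shows that with FS=":" the second field keeps the space that follows the first colon. The FS=": *" form is only a possible variation and is not part of the command above; the shell variable names are for illustration.

```shell
# One CC line copied from the question's sample.
line='CC Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)'

# FS=":" as in the answer: after $0=$0 the record is re-split on ":",
# and $2 still carries the space that follows the first colon.
raw=$(printf '%s\n' "$line" | awk '{FS=":"; $0=$0; print "[" $2 "]"}')

# Variation (an assumption, not in the answer): let FS also swallow
# the spaces after a colon, so $2 is the bare chromosome name.
trimmed=$(printf '%s\n' "$line" | awk '{FS=": *"; $0=$0; print "[" $2 "]"}')

printf '%s\n%s\n' "$raw" "$trimmed"
```

If the leading space matters for your output file, the second form could be used in the /^CC / rule instead.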
Upvotes: 1
Reputation: 133528
The following awk may help:
awk '/^CC.*/{match($0,/chr[0-9]+/);val=substr($0,RSTART,RLENGTH);next} /^FT.*/{sub(/\.+/,OFS,$NF);print val,$NF,$5}' OFS="\t" Input_file
Here is the same solution in multi-line form:
awk '
/^CC.*/{
  match($0,/chr[0-9]+/);            # locate the chromosome name
  val=substr($0,RSTART,RLENGTH);    # cut out the matched text
  next}
/^FT.*/{
  sub(/\.+/,OFS,$NF);               # turn start..end into start<OFS>end
  print val,$NF,$5}
' OFS="\t" Input_file
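As a rough illustration of the two building blocks used here, the sketch below runs match()/substr() and sub() on single lines. The first two lines are copied from the question; the chrX line is hypothetical and only there to show that chr[0-9]+ matches nothing for non-numeric names, with chr[^:]+ as one possible alternative. Variable names are illustrative.

```shell
cc='CC Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)'
ft='FT TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582'
# Hypothetical line, not in the sample: a non-numeric chromosome name.
ccx='CC Build HSA_Jul2014 (GRCh38; hg38): chrX:100..200 (REVERSE)'

# match() sets RSTART and RLENGTH; substr() cuts the matched text out.
chrom=$(printf '%s\n' "$cc" | awk '{ match($0, /chr[0-9]+/); print substr($0, RSTART, RLENGTH) }')

# sub() on $NF replaces the first run of dots with OFS (a tab here).
row=$(printf '%s\n' "$ft" | awk -v OFS='\t' '{ sub(/\.+/, OFS, $NF); print $NF, $5 }')

# Possible variation: match up to the next colon so chrX is kept too.
chromx=$(printf '%s\n' "$ccx" | awk '{ match($0, /chr[^:]+/); print substr($0, RSTART, RLENGTH) }')

printf '%s\n%s\n%s\n' "$chrom" "$row" "$chromx"
```

Whether the wider pattern is needed depends on which chromosome names actually occur in the full file.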
Upvotes: 1
Reputation: 246827
Using awk:
awk '
  $1 == "CC" { split($0, a, /:/); key=a[2] }
  $1 == "FT" {
    n = split($0, a, /;/)
    split(a[2], b, FS)
    split(a[n], c, /[.]{2}/)
    print key, c[1],c[2], b[1]
  }
' file | column -t
chr22 23144712 23145380 SP1
chr22 23147445 23148194 ER-ALPHA
chr22 23150853 23151108 C/EBPBETA
chr3 23147445 23148194 ER-ALPHA
chr3 23151393 23151582 CTCF
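Before running this on the full 2.9 GB file, it may be worth checking the logic against the sample from the question. The sketch below is a variation on the script above, under two assumptions of mine: tab-separated output is acceptable (so column -t is dropped and the leading spaces are trimmed explicitly), and /[.][.]/ is used instead of /[.]{2}/ because some older awk implementations do not support {n} intervals.

```shell
# Sample input copied from the question.
sample='GS RSPH14
CC Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS CLXC15
CC Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582'

result=$(printf '%s\n' "$sample" | awk -v OFS='\t' '
  # CC line: the chromosome is the second colon-separated piece
  $1 == "CC" { split($0, a, /:/); key = a[2]; sub(/^ +/, "", key) }
  $1 == "FT" {
    n = split($0, a, /;/)        # n is the number of pieces
    split(a[2], b, FS)           # default FS skips the leading space
    split(a[n], c, /[.][.]/)     # start and end of the interval
    sub(/^ +/, "", c[1])         # trim the space left after the last ";"
    print key, c[1], c[2], b[1]
  }')

printf '%s\n' "$result"
```

Replacing the printf pipeline with the real file name, and redirecting to an output file, would give the same rows for the full input.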
Upvotes: 1