Reputation: 421

File processing with awk or grep

I have to process a large input file (2.9 GB) to produce output in a particular format (described below).

A sample of the input file:

GS  RSPH14
CC  Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)
FT  TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380
FT  TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT  TFBS CHIP: FR029934262; C/EBPBETA (A-549); https://www.encodeproject.org/experiments/ENCSR000DYI/; 23150853..23151108
GS  CLXC15
CC  Build HSA_Jul2014 (GRCh38; hg38): chr3:23144021..23155021 (REVERSE)
FT  TFBS CHIP: FR000643682; ER-ALPHA (MCF-7); PMID:19339991; 23147445..23148194
FT  TFBS CHIP: FR034213319; CTCF (MCF-7); https://www.encodeproject.org/experiments/ENCSR000DMV/; 23151393..23151582

Description: Every line in the input file starts with GS, CC, or FT. I want to ignore the GS lines. For each CC line, I want to split it on : and take index 1 (0-based), which for my sample is chr22 (line 2) and chr3 (line 7). For each FT line, I want to split it on ; and take index 1 and the last index (for line 3 of my sample these are SP1 (Jurkat) and 23144712..23145380, respectively), then process them so that my output file looks like this:

chr22   23144712    23145380    SP1
chr22   23147445    23148194    ER-ALPHA
chr22   23150853    23151108    C/EBPBETA
chr3    23147445    23148194    ER-ALPHA
chr3    23151393    23151582    CTCF

Any help will be much appreciated!

My try: I am able to split the file on ; to get the columns I need. What I tried is: awk -F'[;]' '{print $2 "\t" $4}' sample.txt > output.txt. This gives me the following output:

 hg38): chr22:23141092..23152092 (REVERSE)  
 SP1 (Jurkat)    23144712..23145380
 ER-ALPHA (MCF-7)    23147445..23148194
 C/EBPBETA (A-549)   23150853..23151108

 hg38): chr3:23144021..23155021 (REVERSE)   
 ER-ALPHA (MCF-7)    23147445..23148194
 CTCF (MCF-7)    23151393..23151582

Now, from the 1st and 6th lines of this output (which come from the CC lines) I only want chr22 and chr3. From the other lines (which come from the FT lines) I want the last column, with the corresponding chr prepended. The first column of those lines should also be split on ( so that only the part before it is kept.

Upvotes: 2

Answers (3)

Rahul Verma

Reputation: 3089

As requested, using awk:

$ awk '/^CC /{FS=":"; $0=$0; a=$2} /^FT /{FS="[ ;.]+"; $0=$0;print a,$(NF-1),$NF,$5}' file
 chr22 23144712 23145380 SP1
 chr22 23147445 23148194 ER-ALPHA
 chr22 23150853 23151108 C/EBPBETA
 chr3 23147445 23148194 ER-ALPHA
 chr3 23151393 23151582 CTCF

/^CC /{FS=":"; $0=$0; a=$2} : If the record starts with CC followed by a space, set FS to :.
$0=$0 forces awk to re-split the record using the current FS. Then a is set to the second field. Note that a keeps the space that follows the colon, which is why each output line above starts with a space.

/^FT /{FS="[ ;.]+"; $0=$0; print a,$(NF-1),$NF,$5} : If the record starts with FT followed by a space, set FS to [ ;.]+, which treats any run of spaces, semicolons, or dots as a single separator (for example, the .. in your last field). Finally, print the required fields.
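To see the re-splitting in isolation, here is a minimal sketch that feeds one CC line from the question through awk and prints the second field before and after FS is changed. The function name resplit_demo and the trimming step with sub() are my own additions for illustration, not part of the one-liner above.

```shell
# Hypothetical helper: shows $2 before and after changing FS and re-splitting.
resplit_demo() {
  printf 'CC  Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)\n' |
    awk '{
      before = $2            # split with the default FS (whitespace)
      FS = ":"; $0 = $0      # assigning $0 re-splits the record with the new FS
      raw = $2               # still carries the space that followed the colon
      trimmed = raw
      sub(/^ +/, "", trimmed)   # optional: drop the leading space
      print before "|" raw "|" trimmed
    }'
}
resplit_demo
```

If the leading space in the output is unwanted, the same sub(/^ +/, "", a) call could be applied to a right after a=$2.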

Upvotes: 1

RavinderSingh13

Reputation: 133528

The following awk may help you with this.

awk '/^CC.*/{match($0,/chr[0-9]+/);val=substr($0,RSTART,RLENGTH);next} /^FT.*/{sub(/\.+/,OFS,$NF);print val,$NF,$5}' OFS="\t"  Input_file

Here is the same solution in multi-line form:

awk '
/^CC.*/{
  match($0,/chr[0-9]+/);
  val=substr($0,RSTART,RLENGTH);
  next}
/^FT.*/{
  sub(/\.+/,OFS,$NF);
  print val,$NF,$5}
' OFS="\t"  Input_file
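One thing to watch: /chr[0-9]+/ only matches numbered chromosomes. If the real file also contains names such as chrX or chrY (an assumption on my part, the sample does not show any), a pattern that reads up to the next colon may be safer. A rough sketch, where the function name extract_chr and the chrX input line are made up for illustration:

```shell
# Hypothetical helper: print the chromosome name found on each CC line.
extract_chr() {
  awk '/^CC/ {
    # match() sets RSTART and RLENGTH; take "chr" plus everything up to the next ":"
    if (match($0, /chr[^:]+/))
      print substr($0, RSTART, RLENGTH)
  }'
}

# First line is from the question; the second is an invented chrX example.
printf '%s\n' \
  'CC  Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)' \
  'CC  Build HSA_Jul2014 (GRCh38; hg38): chrX:100..200 (FORWARD)' | extract_chr
```

In the one-liner above this would mean replacing /chr[0-9]+/ with /chr[^:]+/ and leaving the rest unchanged.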

Upvotes: 1

glenn jackman

Reputation: 246827

Using awk:

awk '
    $1 == "CC" { split($0, a, /:/); key=a[2] }
    $1 == "FT" {
        n = split($0, a, /;/)
        split(a[2], b, FS)
        split(a[n], c, /[.]{2}/)
        print key, c[1],c[2], b[1]
    }
' file | column -t

chr22  23144712  23145380  SP1
chr22  23147445  23148194  ER-ALPHA
chr22  23150853  23151108  C/EBPBETA
chr3   23147445  23148194  ER-ALPHA
chr3   23151393  23151582  CTCF
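As far as I know, column -t has to read all of its input before it can align anything, which may matter for a 2.9 GB file. If tab-separated output is acceptable, a variant of the same approach can write tabs directly from awk and skip column. This is only a sketch: the function name to_table is hypothetical, and the two sub() calls are my additions to strip the space that follows each delimiter.

```shell
# Hypothetical helper: same split() approach, tab-separated, no column -t.
to_table() {
  awk -v OFS='\t' '
    $1 == "CC" {
        split($0, a, /:/)
        key = a[2]
        sub(/^ +/, "", key)        # " chr22" -> "chr22"
    }
    $1 == "FT" {
        n = split($0, a, /;/)
        split(a[2], b, FS)         # default FS drops the leading space
        range = a[n]
        sub(/^ +/, "", range)      # " 23144712..23145380" -> "23144712..23145380"
        split(range, c, /[.][.]/)
        print key, c[1], c[2], b[1]
    }
  '
}

# Quick check against three lines from the question.
printf '%s\n' \
  'GS  RSPH14' \
  'CC  Build HSA_Jul2014 (GRCh38; hg38): chr22:23141092..23152092 (REVERSE)' \
  'FT  TFBS CHIP: FR000000873; SP1 (Jurkat); PMID:14980218; 23144712..23145380' | to_table
```

On the real data it would be called as to_table < file > output.txt.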

Upvotes: 1
