user2293045

Reputation: 69

How to split each column of a header row and print only the third value, using bash or awk

I have a file with a header row containing the file path as the name for each column and I'd like to extract and print out just the file name. There are over 100 columns.

E.g. Input header row:

AAF2Y7VM5-8/cnv/F04_reads.tsv    AAF2Y7VM5-7/cnv/D04_reads.tsv    AAF2Y7VM5-6/cnv/E04_reads.tsv

Goal output header row:

F04_reads.tsv D04_reads.tsv E04_reads.tsv

I have:
awk -F '[/|\t]' '{if (NR==1) {for(i=1;i<=NF;i++) printf $i"\t"}}' ZScores.txt

That outputs all three delimited values for every column, but I want just the third value, i.e. the file name, for each column in this row. Awk, bash, or sed solutions appreciated!

Upvotes: 6

Views: 497

Answers (7)

Daweo

Reputation: 36680

I would use GNU AWK for this task in the following way. Let file.txt be a TAB-separated file with the following content:

AAF2Y7VM5-8/cnv/F04_reads.tsv   AAF2Y7VM5-7/cnv/D04_reads.tsv   AAF2Y7VM5-6/cnv/E04_reads.tsv
something   something   something
something   something   something

Then

awk 'BEGIN{FS="/";RS="[\t\n]";ORS="\t"}{print $3}RT=="\n"{exit}' file.txt

gives output

F04_reads.tsv   D04_reads.tsv   E04_reads.tsv   

Explanation: I tell GNU AWK that records are separated by a TAB or newline character, that fields are separated by /, and that each printed value should be suffixed with \t rather than a newline. I then instruct it to print the 3rd field and, if the record terminator (RT) is a newline, to stop (exit). The output will have a trailing TAB and no newline, which is consistent with your original code.

(tested in GNU Awk 5.3.1)

Upvotes: 1

Gilles Quénot

Reputation: 185560

KISS:

$ echo $(head -n1 file | tr ' ' '\n' | cut -d/ -f3)
F04_reads.tsv D04_reads.tsv E04_reads.tsv

or

$ echo $(head -n1 file | tr ' ' '\n'  | awk -F/ 'NF{printf "%s " ,$3}')
F04_reads.tsv D04_reads.tsv E04_reads.tsv

Upvotes: 3

Ed Morton

Reputation: 204259

Using any awk if your fields are tab-separated as they appear to be:

$ awk 'NR==1{gsub("[^\t]+/","")} 1' file
F04_reads.tsv    D04_reads.tsv    E04_reads.tsv

Otherwise, using any POSIX awk:

$ awk 'NR==1{gsub("[^[:space:]]+/","")} 1' file
F04_reads.tsv    D04_reads.tsv    E04_reads.tsv

Change [^[:space:]] to [^ \t] if you don't have a POSIX awk, but in that case you should really get a new awk.

The above assumes your fields cannot contain the whitespace characters that separate your fields. If they can, then you need to edit your question to tell us how to distinguish spaces within fields from spaces between fields.
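
To see why the gsub() above does not depend on there being exactly two directories in each path: the regex is greedy, so within each tab-separated field it consumes everything up to the last slash. A minimal sketch, assuming a made-up header whose paths have different depths (the sample data and the function names sample_header and strip_dirs are hypothetical):

```shell
# Hypothetical sample: a header whose paths have one, two, and three
# directory levels, followed by one data row that must pass through unchanged.
sample_header() {
    printf 'a/F04_reads.tsv\tb/cnv/D04_reads.tsv\tc/d/e/E04_reads.tsv\n'
    printf 'x\ty\tz\n'
}

# Same gsub() as in the tab-separated command above: on line 1 only, delete
# every run of non-tab characters that ends in a slash, then print each line.
strip_dirs() {
    awk 'NR==1{gsub("[^\t]+/","")} 1'
}

sample_header | strip_dirs
```

The header comes out as the three bare file names, still tab-separated, and the data row is printed as-is.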

Upvotes: 12

jhnc

Reputation: 16817

To just extract first line:

Bash (replace tabs):

( IFS=$'\t' read -ra cols <file; echo "${cols[@]##*/}" )
  • load first line of file into array, columns delimited by (any number of) tabs
  • print array after stripping longest prefix that ends with a slash from each element
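
The prefix-stripping expansion described in the bullets can be tried on a single value first. A minimal sketch, with variable names (col, base, rest) made up for illustration, contrasting ## (longest prefix) with # (shortest prefix):

```shell
# col is a hypothetical variable holding one header field from the question.
col='AAF2Y7VM5-8/cnv/F04_reads.tsv'

# '##' removes the LONGEST prefix matching */ (everything through the last slash).
base=${col##*/}

# '#' removes only the SHORTEST such prefix (everything through the first slash).
rest=${col#*/}

printf '%s\n' "$base" "$rest"
```

Here base ends up as F04_reads.tsv and rest as cnv/F04_reads.tsv, which is why the answer uses ## rather than #.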

Bash (retain tabs):

(
    shopt -s extglob
    IFS= read -r cols
    echo "${cols//+([!$'\t'])\/}"
) <file

Sed (replace tabs):

sed -E 's|[^\t]+/||g; y|\t| |; q' file

Sed (retain tabs):

sed -E 's|[^\t]+/||g; q' file

If the intention is to also retain the whole file as tsv:

Bash: append cat after echo in the "retain tabs" version:

(
    shopt -s extglob
    IFS= read -r cols
    echo "${cols//+([!$'\t'])\/}"
    cat
) <file

Sed: prefix s command with 1 and elide the q from "retain tabs" version:

sed -E '1s|[^\t]+/||g' file

Upvotes: 2

RavinderSingh13

Reputation: 133680

1st solution: With your shown samples, please try the following (the three-argument form of match() requires GNU awk).

awk  '
{
  while(match($0,/(\/[^\/]*\/)([^.]*\.tsv)/,arr)){
    val=(val?val OFS:"") arr[2]
    $0=substr($0,RSTART+RLENGTH)
  }
  $0=val
}
1
' Input_file

2nd solution: if you are OK with a perl one-liner:

perl -nle 'print join(" ", /([^\/]+_reads\.tsv)/g)' Input_file

Upvotes: 4

karakfa

Reputation: 67537

a non-awk solution

$ sed 1q file | tr -s ' ' '\n' | cut -d/ -f3 | paste -sd' '

extract first row, transpose to column, cut the 3rd field, serialize back to a row
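
If the header is tab-separated, as the question's own awk attempt suggests, the same four steps can be sketched with tr splitting on tabs instead of spaces. This is only an illustration: the helper names header and names are made up, and the header line is fed in with printf rather than read from a file.

```shell
# Hypothetical tab-separated header line, standing in for `sed 1q file`.
header() {
    printf 'AAF2Y7VM5-8/cnv/F04_reads.tsv\tAAF2Y7VM5-7/cnv/D04_reads.tsv\tAAF2Y7VM5-6/cnv/E04_reads.tsv\n'
}

# Put one column on each line, keep the 3rd /-separated piece of every line,
# then join the lines back into a single row separated by single spaces.
names() {
    tr '\t' '\n' | cut -d/ -f3 | paste -s -d ' ' -
}

header | names
```

The explicit - operand tells paste to read standard input, which some non-GNU implementations require.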

Upvotes: 3

markp-fuso

Reputation: 35146

Tweaking OP's current code to print every 3rd field:

$ awk -F '[/|\t]' '{if (NR==1) {for(i=3;i<=NF;i+=3) printf $i"\t"}}' ZScores.txt
F04_reads.tsv   D04_reads.tsv   E04_reads.tsv

NOTE: there's a trailing \t on that output; also, the line does not end with a \n

Removing the trailing \t, adding a trailing \n, and skipping processing of rest of file:

$ awk -F '[/|\t]' 'NR==1 { for (i=3;i<=NF;i+=3) { printf "%s%s", sep, $i; sep="\t" }; print ""; exit }' ZScores.txt
F04_reads.tsv   D04_reads.tsv   E04_reads.tsv

Where:

  • sep is blank for first pass through loop, then set to \t for remaining passes through the loop
  • print "" - terminate the printf line of output with a \n (default output record separator)
  • exit - to keep from reading (and in this case ignoring) rest of file

NOTE: OP's code places a tab (\t) between output values but the expected output shows a single space between values; if OP wishes to separate the output with single spaces then replace sep="\t" with sep=" "
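
The every-3rd-field loop relies on each column having exactly three /-separated parts. If that cannot be guaranteed, one possible variation (not part of OP's code, just a sketch) is to split on tabs only and take the last /-separated piece of each column with split(). The names sample_rows and last_component are hypothetical:

```shell
# Hypothetical input: header paths of varying depth, plus a data row that is
# never reached because of the exit.
sample_rows() {
    printf 'a/F04_reads.tsv\tb/cnv/D04_reads.tsv\tc/d/e/E04_reads.tsv\n'
    printf '1\t2\t3\n'
}

# Fields are tab-separated; each field is split on "/" and only its last
# piece is printed. sep is empty on the first pass and a tab afterwards.
last_component() {
    awk -F '\t' 'NR==1 {
        for (i = 1; i <= NF; i++) {
            n = split($i, parts, "/")
            printf "%s%s", sep, parts[n]
            sep = "\t"
        }
        print ""
        exit
    }'
}

sample_rows | last_component
```

As in the second command above, the output has no trailing tab and ends with a newline.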

Upvotes: 4
