Reputation: 2157
I have a large tab-delimited .txt file with 4 columns:
col1 col2 col3 col4
name1 1 2 ens|name1,ccds|name2,ref|name3,ref|name4
name2 3 10 ref|name5,ref|name6
... ... ... ...
Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4.
So for this example I would like to have as output:
ref|name3
ref|name4
ref|name5
ref|name6
I thought of using 'sed' for this, but I don't know where to start.
Upvotes: 2
Views: 648
Reputation: 14975
I think awk is better suited for this task:
$ awk '{for (i=1;i<=NF;i++){if ($i ~ /ref\|/){print $i}}}' FS='( )|(,)' infile
ref|name3
ref|name4
ref|name5
ref|name6
FS='( )|(,)' sets a regular-expression field separator, so awk splits each line on both blank spaces and commas. The loop then iterates over the resulting fields and prints each one that contains the ref| pattern.
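Because the question describes the file as tab-delimited, a variant that splits on tabs and then breaks up only the fourth column may be more robust. The following is a minimal sketch, assuming tabs separate the columns and commas separate the entries in col4. The sample file and the names infile, parts and result are illustrative only.

```shell
# Hypothetical sample file reproducing the question's data, tab-delimited.
infile=$(mktemp)
printf 'col1\tcol2\tcol3\tcol4\n' > "$infile"
printf 'name1\t1\t2\tens|name1,ccds|name2,ref|name3,ref|name4\n' >> "$infile"
printf 'name2\t3\t10\tref|name5,ref|name6\n' >> "$infile"

# Split only column 4 on commas and print the pieces that begin with "ref|".
result=$(awk -F'\t' '{
    n = split($4, parts, ",")
    for (i = 1; i <= n; i++)
        if (parts[i] ~ /^ref\|/) print parts[i]
}' "$infile")

printf '%s\n' "$result"
rm -f "$infile"
```

Anchoring the match with ^ means an entry that merely contains ref| somewhere in the middle is not printed.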
Upvotes: 5
Reputation: 58558
This might work for you (GNU sed):
sed 's/\(ref|[^,]*\),/\n\1\n/;/^ref/P;D' file
Surround the required strings by newlines and only print those lines that begin with the start of the required string.
Upvotes: 0
Reputation: 195229
Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4
If you are sure that the pattern is only present in col4, you could use grep:
grep -o 'ref|[^,]*' file
output:
ref|name3
ref|name4
ref|name5
ref|name6
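If it is not certain that the pattern is confined to col4, one option is to cut out that column before matching. This is only a sketch, assuming a tab-delimited file (the default delimiter for cut) and a grep that supports -o, as GNU and BSD grep do. The sample file and the variable names are made up for illustration.

```shell
# Hypothetical sample file reproducing the question's data, tab-delimited.
file=$(mktemp)
printf 'col1\tcol2\tcol3\tcol4\n' > "$file"
printf 'name1\t1\t2\tens|name1,ccds|name2,ref|name3,ref|name4\n' >> "$file"
printf 'name2\t3\t10\tref|name5,ref|name6\n' >> "$file"

# Keep only the 4th tab-separated column, then print each "ref|..." match
# up to (but not including) the next comma.
result=$(cut -f4 "$file" | grep -o 'ref|[^,]*')

printf '%s\n' "$result"
rm -f "$file"
```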
Upvotes: 4
Reputation: 8140
One solution I had was to first use awk to get only the 4th column, then use sed to convert commas into newlines, and then use grep (or awk again) to keep the ones that start with ref:
awk '{print $4}' < data.txt | sed -e 's/,/\n/g' | grep "^ref"
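Note that \n in the replacement part of the sed command is understood by GNU sed but is not guaranteed elsewhere. A sketch of the same pipeline with tr doing the comma-to-newline step, assuming a tab-delimited file, could look like this. The sample data and the names data and result are illustrative.

```shell
# Hypothetical sample file reproducing the question's data, tab-delimited.
data=$(mktemp)
printf 'col1\tcol2\tcol3\tcol4\n' > "$data"
printf 'name1\t1\t2\tens|name1,ccds|name2,ref|name3,ref|name4\n' >> "$data"
printf 'name2\t3\t10\tref|name5,ref|name6\n' >> "$data"

# 1. awk prints the 4th tab-separated column.
# 2. tr turns every comma into a newline, one entry per line.
# 3. grep keeps the lines that begin with "ref|".
result=$(awk -F'\t' '{print $4}' "$data" | tr ',' '\n' | grep '^ref|')

printf '%s\n' "$result"
rm -f "$data"
```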
Upvotes: 2