user1987607

Reputation: 2157

linux: extract pattern from file

I have a large tab-delimited .txt file with 4 columns:

col1    col2    col3    col4
name1   1       2       ens|name1,ccds|name2,ref|name3,ref|name4
name2   3       10      ref|name5,ref|name6
...     ...     ...     ...

Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4

So for this example I would like to have this output:

ref|name3
ref|name4
ref|name5
ref|name6

I thought of using 'sed' for this, but I don't know where to start.

Upvotes: 2

Views: 648

Answers (4)

Juan Diego Godoy Robles

Reputation: 14975

I think awk is better suited for this task:

$ awk -F'[\t,]' '{for (i=1;i<=NF;i++){if ($i ~ /^ref\|/){print $i}}}' infile
ref|name3
ref|name4
ref|name5
ref|name6

-F'[\t,]' sets a field separator that matches either a tab or a comma, so awk splits every line on both. The loop then iterates over the fields and prints each one that starts with ref|.
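When the input is strictly tab-delimited, as described in the question, the search can also be limited to the fourth column by splitting only that column on commas. This is a minimal sketch, not the answer's original command; the function name extract_refs is a hypothetical helper introduced for illustration.

```shell
# Hypothetical helper: print every comma-separated entry of column 4
# that starts with "ref|". Assumes the input file is tab-delimited.
extract_refs() {
    awk -F'\t' '{
        n = split($4, parts, ",")              # break column 4 on commas
        for (i = 1; i <= n; i++)
            if (index(parts[i], "ref|") == 1)  # entry begins with ref|
                print parts[i]
    }' "$1"
}
```

It would be called as extract_refs infile. Using index() instead of a regex avoids having to escape the | character.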

Upvotes: 5

potong

Reputation: 58558

This might work for you (GNU sed):

sed 's/\(ref|[^,]*\),/\n\1\n/;/^ref/P;D' file

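The substitution surrounds a ref| entry with newlines. P then prints the first line of the pattern space only when it begins with ref, and D deletes that first line and restarts the cycle on the remainder, so each entry is printed in turn. Because the substitution expects a comma after the entry, a line whose only ref| entry is the last item in col4 produces no output.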

Upvotes: 0

Kent

Reputation: 195229

Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4

If you are sure that the pattern is only present in col4, you could use grep:

grep -o 'ref|[^,]*' file

output:

ref|name3
ref|name4
ref|name5
ref|name6
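If the pattern might also turn up in another column, one option is to cut out the fourth column before running grep. This is a minimal sketch, assuming a tab-delimited file and a grep that supports -o (GNU grep does); refs_from_col4 is a hypothetical name.

```shell
# Hypothetical helper: restrict the grep -o search to column 4.
refs_from_col4() {
    # cut -f4 keeps only the fourth tab-separated column (tab is cut's default delimiter);
    # grep -o then prints each match of "ref|" plus everything up to the next comma.
    cut -f4 "$1" | grep -o 'ref|[^,]*'
}
```

It would be called as refs_from_col4 file.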

Upvotes: 4

chw21

Reputation: 8140

One solution I had was to first use awk to only get the 4th column, then use sed to convert commas into newlines, and then use grep (or awk again) to get the ones that start with ref:

awk '{print $4}' < data.txt | sed -e 's/,/\n/g' | grep "^ref"
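Treating \n in a sed replacement as a newline is not guaranteed by POSIX sed, so the same three steps can also be sketched with cut and tr. This assumes a tab-delimited file; split_and_filter is a hypothetical name.

```shell
# Hypothetical helper: same three-step idea using cut and tr.
split_and_filter() {
    cut -f4 "$1" |      # keep column 4 only
        tr ',' '\n' |   # put each comma-separated entry on its own line
        grep '^ref|'    # keep the entries that start with ref|
}
```

It would be called as split_and_filter data.txt.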

Upvotes: 2
