SoptikHa

Reputation: 467

Match two patterns on one line and print them in two columns

I have a few hundred CSV files. Their column layouts differ, and I don't want to manually convert them all into one format.

I want to extract two different things from the files, A and B, and I can match each of them with a regex. I want to match both at once, so that only rows containing both are printed. I know how to do that, and I've seen many SO posts answering how to do it.

But I don't know how to print just A B without the rest of the line. I don't know in which order or in which columns the two things will appear, so I don't know how (or even whether) I can use awk.

Example:

(match A[0-9], B[0-9])

A0 B0 C0
B1 C1 D1
E2 C2 A2
C3 F3 F3
B4 F4 A4

Result:

A0 B0
A4 B4

Upvotes: 0

Views: 98

Answers (2)

RavinderSingh13

Reputation: 133770

1st Solution: using awk's match function. It prints the output in A-then-B order, as in the OP's shown examples.

awk '
match($0,/A[0-9]+/){
  val=substr($0,RSTART,RLENGTH)
  if(val && match($0,/B[0-9]+/)){
     print val,substr($0,RSTART,RLENGTH)
  }
}'  Input_file


2nd Solution: This solution does not care about the order of A and B; they are printed in the same order in which they appear on the line.

awk '
{
  for(i=1;i<=NF;i++){
    if($i ~ /A[0-9]+/ || $i ~ /B[0-9]+/){
       val=val?val OFS $i:$i
    }
  }
  if(val ~ /A[0-9]+/ && val ~ /B[0-9]+/){
    print val
  }
  val=""
}'   Input_file


3rd Solution: if you need the output in A-then-B order regardless of where they appear on the line, the following may help.

awk '
{
  for(i=1;i<=NF;i++){
     line=$i
     sub(/[0-9]+/,"",line)
     if($i ~ /A[0-9]+/ || $i ~ /B[0-9]+/){
       array[tolower(line)]=$i
     }
  }
  if(array["a"] ~ /A[0-9]+/ && array["b"] ~ /B[0-9]+/){
     print array["a"],array["b"]
  }
  delete array
}'   Input_file

NOTE: Adding information from the awk man page about the functions and variables used, e.g. match, tolower, RSTART and RLENGTH.

match(s, r [, a]) Returns the position in s where the regular expression r occurs, or 0 if r is not present, and sets the values of RSTART and RLENGTH. Note that the argument order is the same as for the ~ operator: str ~ re. If array a is provided, a is cleared and then elements 1 through n are filled with the portions of s that match the corresponding parenthesized subexpression in r. The 0’th element of a contains the portion of s matched by the entire regular expression r. Subscripts a[n, "start"], and a[n, "length"] provide the starting index in the string and length respectively, of each matching substring.

RSTART The index of the first character matched by match(); 0 if no match. (This implies that character indices start at one.)

RLENGTH The length of the string matched by match(); -1 if no match.

tolower(str) Returns a copy of the string str, with all the upper-case characters in str translated to their corresponding lower-case counterparts. Non-alphabetic characters are left unchanged.
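
To make the RSTART and RLENGTH descriptions above concrete, here is a minimal sketch that prints both values for each input line. The function name show_match and the sample lines are only illustrative assumptions, not part of the solutions above.

```shell
# Hypothetical helper: report where the first A<digits> token sits on each line.
show_match() {
  awk '{
    if (match($0, /A[0-9]+/))
      # RSTART is the 1-based start index, RLENGTH the length of the match
      print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    else
      # on failure match() returns 0 and sets RSTART=0, RLENGTH=-1
      print RSTART, RLENGTH, "(no match)"
  }'
}

printf 'E2 C2 A22\nC3 F3 F3\n' | show_match
# 7 3 A22
# 0 -1 (no match)
```

Because substr($0,RSTART,RLENGTH) is only meaningful after a successful match, the solutions above test the return value of match() before using it.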

Upvotes: 3

oguz ismail

Reputation: 50815

But I don't know how to print just A B without rest of the line.

Well, you need to remove everything but A and B from matching lines and force awk to recompute fields ($1=$1 does that).

awk '/A[0-9]/ && /B[0-9]/ { gsub(/[^AB][0-9]/,""); $1=$1; print }' file
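
As a rough illustration of what $1=$1 contributes, the sketch below runs the same gsub with and without the field rebuild and wraps the result in brackets so the leftover whitespace is visible. The function names strip_only and strip_and_rebuild are made up for this example.

```shell
# Hypothetical helper: remove non-A/B tokens but leave $0 otherwise untouched.
strip_only() {
  awk '/A[0-9]/ && /B[0-9]/ { gsub(/[^AB][0-9]/, ""); print "[" $0 "]" }'
}

# Hypothetical helper: same removal, then $1=$1 makes awk rebuild $0 from its
# fields, joined by OFS (a single space by default).
strip_and_rebuild() {
  awk '/A[0-9]/ && /B[0-9]/ { gsub(/[^AB][0-9]/, ""); $1 = $1; print "[" $0 "]" }'
}

printf 'B4 F4 A4\n' | strip_only         # [B4  A4]
printf 'B4 F4 A4\n' | strip_and_rebuild  # [B4 A4]
```

Note that the tokens stay in input order here, so this line yields B4 A4 rather than A4 B4. The pattern also removes only one digit after the letter, so it assumes single-digit values as in the question's example.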

Upvotes: 1
