Iliass10

Reputation: 21

AWK: join separate words from 2 consecutive lines

I'd like an AWK command that joins words that are split across two consecutive lines:

The 1st part is at the end of a line and ends with "_".

The 2nd part is at the beginning of the next line.

(PS: some lines contain both a 2nd part and a 1st part, as in the example below.)

Example:

Bla bla bla bla SATU_
RDAY bla bla, bla bla
bla bla bla bla bla SUN_
DAY: bla bla bla bla M_
ONDAY. Bla bla bla bla TU_
ESDAY, bla bla bla.

Result:

Line 1: SATURDAY
Line 3: SUNDAY
Line 4: MONDAY
Line 5: TUESDAY

Upvotes: 0

Views: 170

Answers (5)

Ed Morton

Reputation: 203985

With GNU awk for multi-char RS:

$ awk -v RS='[[:alpha:]]+_\n[[:alpha:]]+' 'RT!=""{sub(/_\n/,"",RT); print RT}' file
SATURDAY
SUNDAY
MONDAY
TUESDAY

or with any awk:

$ awk 'w{w=w $1; gsub(/[^[:alpha:]]/,"",w); print w; w=""} /_$/{w=$NF}' file
SATURDAY
SUNDAY
MONDAY
TUESDAY

and if you really want the starting line numbers included then with any awk:

$ awk 'w{w=w $1; gsub(/[^[:alpha:]]/,"",w); printf "Line %d: %s\n", NR-1, w; w=""} /_$/{w=$NF}' file
Line 1: SATURDAY
Line 3: SUNDAY
Line 4: MONDAY
Line 5: TUESDAY

Upvotes: 1

George Vasiliou

Reputation: 6345

With GNU awk:

$ awk 'p{print gensub(/[^A-Z]$/,"","g",$1);p=0}/_$/{printf "%s",gensub("_","","g",$NF);p=1}' file
SATURDAY
SUNDAY
MONDAY
TUESDAY

Upvotes: 0

James Brown

Reputation: 37424

$ awk 'p~/_$/{sub(/_$/,"",p);print "Line " (NR-1) ":", p $1}{p=$NF}' file
Line 1: SATURDAY
Line 3: SUNDAY:
Line 4: MONDAY.
Line 5: TUESDAY,
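
The output above keeps whatever punctuation follows the second part (":", ".", ","). If that is unwanted, one possible variation is to build the joined word in a variable first and strip trailing non-letters from it before printing. This is only a sketch: the variable name w and the file name sample.txt are assumptions made for illustration, and it assumes the words consist of letters only.

```shell
# Recreate the sample input from the question (sample.txt is a made-up name).
cat > sample.txt <<'EOF'
Bla bla bla bla SATU_
RDAY bla bla, bla bla
bla bla bla bla bla SUN_
DAY: bla bla bla bla M_
ONDAY. Bla bla bla bla TU_
ESDAY, bla bla bla.
EOF

out=$(awk '
  p ~ /_$/ {                          # previous line ended in a first part
    sub(/_$/, "", p)                  # drop the trailing underscore
    w = p $1                          # join it with the first field of this line
    sub(/[^[:alpha:]]+$/, "", w)      # strip trailing punctuation, if any
    print "Line " (NR-1) ":", w
  }
  { p = $NF }                         # remember the last field for the next line
' sample.txt)

printf '%s\n' "$out"
```

With the sample input this prints "Line 1: SATURDAY", "Line 3: SUNDAY", "Line 4: MONDAY" and "Line 5: TUESDAY".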

Upvotes: 2

mklement0

Reputation: 439237

A POSIX-compliant solution:

awk '
  firstPart != "" { sub(/[[:punct:]]$/, "", $1); print firstPart $1 } 
  $NF ~ /._$/ { firstPart=substr($NF, 1, length($NF) - 1); next } 
  { firstPart= "" }
' file
  • Pattern (condition) firstPart != "" is true only if a token of interest was found at the end of the previous line; only then is the associated action ({ ... }) executed:

    • sub(/[[:punct:]]$/, "", $1) replaces (sub()) a trailing ($) instance of a punctuation character ([[:punct:]]), if any, in the 1st field ($1) with the empty string, thereby effectively removing it.

    • print firstPart $1 prints the direct concatenation of the token of interest from the previous line with the (modified) 1st field: in awk, placing firstPart and $1 next to each other concatenates them, and the space between them in the source adds nothing to the output.

  • Pattern $NF ~ /._$/ tests if the last field ($NF) ends in ($) _ (preceded by at least 1 other character (.)).

    • firstPart=substr($NF, 1, length($NF) - 1) stores the contents of the last field except for the trailing _ in variable firstPart.
    • next skips processing of the remainder of the script for the line at hand and moves to the next line.
  • Action { firstPart= "" }, because it is not preceded by a pattern, is processed unconditionally - if reached:

    • Here it is only reached if the line at hand does not end in a token of interest (otherwise next would have skipped it).
    • Resetting firstPart signals to the next script cycle that nothing is to be printed for the next line.
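
To see the walkthrough above in action, here is a small sketch that runs the same script against the sample from the question. The file name sample.txt and the shell variable out are assumptions made for illustration only.

```shell
# Recreate the sample input from the question (sample.txt is a made-up name).
cat > sample.txt <<'EOF'
Bla bla bla bla SATU_
RDAY bla bla, bla bla
bla bla bla bla bla SUN_
DAY: bla bla bla bla M_
ONDAY. Bla bla bla bla TU_
ESDAY, bla bla bla.
EOF

out=$(awk '
  # A first part is pending: clean up field 1 and print the joined word.
  firstPart != "" { sub(/[[:punct:]]$/, "", $1); print firstPart $1 }
  # Last field ends in "_": remember it without the underscore, skip the reset.
  $NF ~ /._$/ { firstPart = substr($NF, 1, length($NF) - 1); next }
  # No first part at the end of this line: nothing to join on the next line.
  { firstPart = "" }
' sample.txt)

printf '%s\n' "$out"
```

With the sample input this prints SATURDAY, SUNDAY, MONDAY and TUESDAY, one per line.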

Upvotes: 0

Mischa

Reputation: 2298

Not quite sure of all your requirements, but:

awk 'x  {sub("[^A-Z].*", "", $1); print "Line "n": "x $1; x = ""}
     sub("_$", "", $NF) {x = x $NF; n = NR}' input.txt

hth

Upvotes: 0
