Piodo

Reputation: 616

Linux (Ubuntu), AWK: find the sentences containing the same word twice in a row

I want to print all the sentences that contain two identical words next to each other. A sentence ends with ., ? or !.

For the input:

word ja ba word. Na Na word wdd? Nothing kkk
ok ok! word no this no word. ok ok. notok!

output should be:

Na Na word wdd?

Nothing kkk
ok ok!

ok ok.

This is my code so far:

#!/bin/bash
if [ $# -eq 0 ]
then
    echo "No arguments"
fi

if [[ -f $1 ]]  # if it is a file
then
    cat $1 | awk '{
        for (i=1; i<=NF; i++)
        {

        }
    }'
fi

I don't know how to separate full sentences with AWK. I can't use multiple field separators (this is important!). And once the sentences are separated, how do I check every word inside each one? I need to use AWK.

This is my newest idea:

cat $1 | awk  '{
 for (i=1;i<=NF;i++)         
  {
   a=0;
    if ($i ~ "\?$" || $i ~ "\!$" || $i ~ "\.$")          
    {

  #print $i;
      k='';

    for(j=$i; j!=$a; j--);
    {
      if( $j == $k)
        #print whole sentence

       $k=$j;

    }

    }
}}'

I find the words ending with ?, . or !, and then want to check all the words before them, back to the end of the previous sentence.

Upvotes: 0

Views: 391

Answers (2)

Aaron

Reputation: 24812

grep is enough for this:

grep -Pzo "[^.?!]*\b(\w+) \1[^.?!]*"

Test:

$ echo '''word ja ba word. Na Na word wdd? Nothing kkk  
ok ok! word no this no word. ok ok. notok!''' | grep -Pzo "[^.?!]*\b(\w+) \1[^.?!]*"  
Na Na word wdd  
Nothing kkk  
ok ok  
ok ok

Explanation:

  • the -o flag makes grep print only the matched text, rather than the whole line it appears in
  • the -P flag makes grep use PCRE regular expressions
  • the -z flag makes grep treat the input as NUL-terminated records instead of newline-terminated lines. Since the sample contains no NUL character, grep sees the whole input as one big line, so a sentence can span a newline.
  • [^.?!]* matches the start of the sentence: it matches as many characters as it can, but no sentence terminators (.?!)
  • \b(\w+) matches a run of word characters and captures it as the first group of the regular expression. The word boundary makes sure the group starts at the beginning of a word instead of matching only the end of a longer word (thanks 123!).
  • \1 references this first group, so we must have two identical words separated by a single space
  • [^.?!]* matches the rest of the sentence, up to but not including the terminator
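
The output above loses the sentence terminator, and because the pattern has no word boundary after \1, a pair like "the then" would also count as a repeat. A possible variation that addresses both is sketched below. It is only a sketch, assuming GNU grep built with PCRE support; the function name grep_repeats is made up for illustration. It adds a trailing [.?!] to keep the terminator, a \b after \1, a leading \b so the match starts at the first word, and tr to turn the NUL separators that -z writes into newlines.

```shell
# Hypothetical helper, not part of the original answer.
# Assumes GNU grep with -P (PCRE) support.
grep_repeats() {
  # \b[^.?!]*      start at the first word of the sentence
  # \b(\w+) \1\b   two identical whole words separated by one space
  # [^.?!]*[.?!]   rest of the sentence, terminator included
  # With -z, grep ends each printed match with NUL, so tr converts
  # those NUL bytes to newlines for display.
  grep -Pzo '\b[^.?!]*\b(\w+) \1\b[^.?!]*[.?!]' "$@" | tr '\0' '\n'
}
```

It reads standard input when called without arguments, or the files passed to it, for example grep_repeats file.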

Upvotes: 4

karakfa

Reputation: 67507

With gawk:

$ awk -v RS='[!?.] +' '{for(i=1;i<NF;i++) if($i==$(i+1)) print $0 RT "\n"}' file

Na Na word wdd?

Nothing kkk
ok ok!

ok ok.

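Set the record separator to a sentence terminator ([!?.]) followed by one or more spaces, so that each record is one sentence. Iterate over the words of the sentence looking for adjacent repeats; on a hit, print the sentence together with RT (the text that actually matched RS, a gawk extension) and an extra newline for spacing between sentences.

Here is the same script fed from a here-document: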

awk -v RS='[!?.] +' '{for(i=1;i<NF;i++) if($i==$(i+1)) print $0 RT "\n"}' << EOF
> word ja ba word. Na Na word wdd? Nothing kkk
> ok ok! word no this no word. ok ok. notok!
> EOF

This should give you:

Na Na word wdd?

Nothing kkk
ok ok!

ok ok.
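
The one-liner above relies on RT and a regex RS, which are gawk features. Where awk is not gawk, a rough alternative is to read the whole input and cut it into sentences by hand with match(). The following is only a sketch under that assumption, not the method of the answer above; the function name repeated_sentences is hypothetical. Unlike the gawk version it prints each sentence once even if it has several repeated pairs, it does not require a space after the terminator, and it prints no blank line between sentences.

```shell
# Hypothetical helper: portable awk sketch without gawk-only RT.
repeated_sentences() {
  awk '
    { text = text $0 "\n" }                   # collect the whole input
    END {
      while (match(text, /[.?!]/)) {          # next sentence terminator
        sentence = substr(text, 1, RSTART)    # sentence incl. terminator
        text = substr(text, RSTART + 1)       # remaining input
        body = substr(sentence, 1, length(sentence) - 1)
        n = split(body, w, /[ \t\n]+/)        # words; w[1] may be empty
        for (i = 1; i < n; i++)
          # appending "" forces a string comparison of the two words
          if (w[i] != "" && (w[i] "") == (w[i + 1] "")) {
            sub(/^[ \t\n]+/, "", sentence)    # trim leading whitespace
            print sentence
            break                             # report a sentence once
          }
      }
    }' "$@"
}
```

Call it as repeated_sentences file, or pipe text into it.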

Upvotes: 2
