Mat Fluor

Reputation: 466

GREP and RegEx - find pattern and look for it again

Here's what I want to do:

Search a document for a regex pattern, then check whether the exact text it matched occurs twice within the same line.

Content of file.xml:
(some code) "testen"  (more code)  >testete<
(some code) "bleiben" (more code)  >bleiben<
(some code) "stehen"  (more code)  >stand<
(some code) "hängen"  (more code)  >hängten<
... 

Now I want to match words with .*en and check whether the exact same word occurs twice in the line. So the output should be:

bleiben

Because testen != testete, stehen != stand, and hängen != hängten.

Is there a way to do this?

Upvotes: 5

Views: 6333

Answers (5)

vara

Reputation: 836

Using sed

sed -n  's/[^"]\+"\([^"]\+\)"[^>]\+>\1</\1/p' FileName.txt

Output:

bleiben

Upvotes: 0

kon

Reputation: 554

Using sed:

sed -rn 's/.*\b(\w+en)\b.*\b\1\b.*/\1/gp' input_file

Upvotes: 1

newfurniturey

Reputation: 38456

You can handle this search in a single grep by using the pattern .*en.*en (quoted, so the shell does not try to expand the * characters):

grep '.*en.*en' your_file

This will output only the lines in which en appears at least twice.

If you need to handle it in two back-to-back greps, you could still use the same pattern in a piped version:

grep '.*en' your_file | grep '.*en.*en'

Also, if you ever want to increase the number of instances in the same line, you can take advantage of grep's -P option and use a Perl regex:

grep -P "(.*en){2}" your_file

With this, you can just change the {2} to however-many instances you want it to appear in a single line and it should work.
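
A minimal sketch of what the {2} form does on the question's sample, assuming GNU grep built with -P support. The temp file stands in for file.xml, and the variable names are only for illustration. It counts occurrences of en anywhere in the line, so it does not yet check that the same word is repeated:

```shell
#!/bin/sh
# Sample lines copied from the question; the temp file stands in for file.xml.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "testen"  (more code)  >testete<
(some code) "bleiben" (more code)  >bleiben<
(some code) "stehen"  (more code)  >stand<
(some code) "hängen"  (more code)  >hängten<
EOF

# Keep lines in which "en" occurs at least twice, whatever word it belongs to.
twice=$(grep -P '(.*en){2}' "$sample")
printf '%s\n' "$twice"

rm -f "$sample"
```

On this sample, both the bleiben line and the hängen/hängten line are printed, since each contains en twice.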

EDIT (to find lines with exact same word twice)

This is difficult without an extended pattern that can define the boundaries of a word, and your example output doesn't really help much. For a straight-to-the-point example, we can just assume a "word" is any alphabetical string a-z that ends with en. You can customize this boundary as needed:

grep -P "([a-z]+en).*\1" your_file

This will print any line that has a word ending in en that is found elsewhere in the line (the \1).

One caveat to mention, which relates to the word-boundary issue noted above. In the context of "bleiben" and "bleiben", they are equal. However, in the context of "ben" and "bleiben", this pattern will also match, because it will see the ending "ben" of "bleiben" as the matching pattern (thereby using "ben" = "ben"). If this is not acceptable, you will have to establish a stricter word boundary (e.g. don't allow special characters?).
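
One possible way to tighten the boundary is PCRE's \b anchor. This is a sketch, assuming GNU grep with -P. The "ben" line is a made-up test case mirroring the caveat, not data from the question:

```shell
#!/bin/sh
# Hypothetical two-line sample: the first line only shares the suffix "ben",
# the second repeats the whole word "bleiben".
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "ben"     (more code)  >bleiben<
(some code) "bleiben" (more code)  >bleiben<
EOF

# Without boundaries, "ben" is found again inside "bleiben": both lines match.
loose=$(grep -cP '([a-z]+en).*\1' "$sample")

# With \b on both sides, the repeat must be a whole word: only line 2 matches.
strict=$(grep -cP '\b([a-z]+en)\b.*\b\1\b' "$sample")

echo "loose: $loose, strict: $strict"
rm -f "$sample"
```

Here -c prints the number of matching lines, so the loose pattern reports 2 and the bounded one reports 1.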

Upvotes: 7

Steve

Reputation: 54512

Here's one way using GNU awk. I'm assuming by twice you mean two or more times. Run like:

awk -f script.awk file.xml

Contents of script.awk:

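# For lines containing "en": turn quotes and angle brackets into spaces
# (which re-splits the fields), then count every field that contains "en".
/.*en/ {
    gsub(/["<>]/, " ")
    for (i=1; i<=NF; i++) {
        if ($i ~ /.*en/) {
            array[$i]++
        }
    }
}
# For every line: print each word counted two or more times, then reset
# the counts so the next line starts fresh.
{
    for (j in array) {
        if (array[j]>=2) {
            print j
        }
    }
    delete array
}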

Alternatively, here's the one-liner:

awk '/.*en/ { gsub(/["<>]/, " "); for (i=1; i<=NF; i++) if ($i ~ /.*en/) array[$i]++ } { for (j in array) if (array[j]>=2) print j; delete array }' file.xml

Upvotes: 1

jahroy

Reputation: 22692

You can use grep's -o option to only return the matching portion of the line.
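
For instance, here is a rough sketch combining -o with a Perl regex (GNU grep with -P assumed). It relies on the layout of the question's sample, where the word sits in double quotes and is repeated between > and <. That layout, the temp file, and the variable name are assumptions of this example:

```shell
#!/bin/sh
# Sample lines copied from the question; the temp file stands in for file.xml.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "testen"  (more code)  >testete<
(some code) "bleiben" (more code)  >bleiben<
(some code) "stehen"  (more code)  >stand<
(some code) "hängen"  (more code)  >hängten<
EOF

# \K drops the opening quote from the reported match; the lookahead requires
# the same word (\1) to appear later in the line between > and <.
word=$(grep -oP '"\K(\w+en)(?=".*>\1<)' "$sample")
printf '%s\n' "$word"

rm -f "$sample"
```

Because of -o, only the matched word is printed rather than the whole line, which for this sample is bleiben.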

Here's a link that suggests that awk might be a better tool for the job:

Upvotes: 0
