Reputation: 466
Here's what I want to do:
Search a document for a regex pattern, then check whether the exact same match occurs twice within a line.
Content of file.xml:
(some code) "testen" (more code) >testete<
(some code) "bleiben" (more code) >bleiben<
(some code) "stehen" (more code) >stand<
(some code) "hängen" (more code) >hängten<
...
Now I want to search for .*en and check whether the (exact) same word occurs twice in the line. So the outcome should be:
bleiben
Because testen != testete, stehen != stand, hängen != hängten.
Is there a way to do this?
Upvotes: 5
Views: 6333
Reputation: 836
Using sed
sed -n 's/[^"]\+"\([^"]\+\)"[^>]\+>\1</\1/p' FileName.txt
Output:
bleiben
Upvotes: 0
Reputation: 38456
You can handle this search in the first grep call by using the pattern .*en.*en (quoted, so the shell does not expand it as a glob):
grep '.*en.*en' your_file
This will output only the lines that have en appearing twice in them.
If you need to handle it in two back-to-back greps, you could still use the same pattern in a piped version:
grep '.*en' your_file | grep '.*en.*en'
Also, if you ever want to increase the number of instances in the same line, you can take advantage of grep's -P option and use a Perl regex:
grep -P "(.*en){2}" your_file
With this, you can just change the {2} to however many instances you want to appear in a single line.
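To see what the counting pattern does and does not catch, here is a small sketch. The sample file is made up for illustration (the gehen/gingen line is not from the question), and it uses grep -E rather than -P, since the {2} interval also works in extended regular expressions:

```shell
# Hypothetical sample data: the first two lines follow the question,
# the gehen/gingen line is invented to show a non-identical pair.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "stehen" (more code) >stand<
(some code) "bleiben" (more code) >bleiben<
(some code) "gehen" (more code) >gingen<
EOF

# Lines where "en" occurs at least twice, no matter which words contain it.
twice=$(grep -E '(.*en){2}' "$sample")
printf '%s\n' "$twice"

# Raising the interval to {3} counts lines with three occurrences;
# "|| true" keeps the script going when grep finds nothing.
thrice=$(grep -cE '(.*en){3}' "$sample" || true)
printf '%s\n' "$thrice"

rm -f "$sample"
```

Both the bleiben line and the gehen/gingen line are printed, which shows that counting occurrences alone does not guarantee the two words are identical.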
EDIT (to find lines with exact same word twice)
This is difficult without an extended pattern that can define the boundaries of a word, and your example output doesn't really help much. For a straight-to-the-point example, we can just assume a "word" is any alphabetical string a-z that ends with en. You can customize this boundary as needed:
grep -P "([a-z]+en).*\1" your_file
This will print any line that has a word ending in en that is found elsewhere in the line (the \1).
One caveat to mention, which relates to the word-boundary issue noted above. In the context of "bleiben" and "bleiben", they are equal. However, in the context of "ben" and "bleiben", this pattern will also match, because it will see the ending "ben" from "bleiben" as the matching pattern (thereby using "ben" = "ben"). If this is not acceptable, you will have to establish a stricter word boundary (i.e. don't allow special characters?).
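One possible way to tighten the boundary, sketched under the assumption that the first word is always wrapped in double quotes and the second always sits between > and < as in the question's sample, is to anchor the back-reference on those delimiters. The "ben" line below is made up to trigger the caveat, and both patterns are written as basic regular expressions, so plain grep is enough:

```shell
# Sample data: the first two lines are from the question,
# the "ben" line is hypothetical and added to trigger the caveat.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "testen" (more code) >testete<
(some code) "bleiben" (more code) >bleiben<
(some code) "ben" (more code) >bleiben<
EOF

# Loose pattern: any run of a-z ending in "en" that reappears later in the line.
loose=$(grep -c '\([a-z][a-z]*en\).*\1' "$sample")

# Strict pattern: the captured word must fill the whole "..." and the whole >...<.
strict=$(grep -c '"\([a-z][a-z]*en\)".*>\1<' "$sample")

printf 'loose: %s, strict: %s\n' "$loose" "$strict"
rm -f "$sample"
```

The loose pattern counts the "ben"/"bleiben" line as a match, while the strict one only accepts the line where both delimited words are identical.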
Upvotes: 7
Reputation: 54512
Here's one way using GNU awk. I'm assuming by twice you mean two or more times. Run like:
awk -f script.awk file.xml
Contents of script.awk:
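# For lines containing "en": turn quotes and angle brackets into spaces
# (which re-splits the fields), then count every field that contains "en".
/.*en/ {
    gsub(/["<>]/, " ")
    for (i=1; i<=NF; i++) {
        if ($i ~ /.*en/) {
            array[$i]++
        }
    }
}

# For every line: print the words counted at least twice, then reset the counts.
{
    for (j in array) {
        if (array[j]>=2) {
            print j
        }
    }
    delete array
}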
Alternatively, here's the one-liner:
awk '/.*en/ { gsub(/["<>]/, " "); for (i=1; i<=NF; i++) if ($i ~ /.*en/) array[$i]++ } { for (j in array) if (array[j]>=2) print j; delete array }' file.xml
Upvotes: 1
Reputation: 22692
You can use grep's -o option to return only the matching portion of the line.
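A rough sketch of how -o could be combined with other tools here, assuming a grep that supports -o and -w (GNU and BSD grep both do): extract the words ending in en from each line separately, then let uniq -d keep only the ones that occur more than once in that line. The loop and the dupes variable are illustrative choices, not something prescribed above:

```shell
# Sample lines taken from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
(some code) "testen" (more code) >testete<
(some code) "bleiben" (more code) >bleiben<
(some code) "stehen" (more code) >stand<
EOF

# For each line: print every whole word ending in "en" on its own line (-o, -w),
# then keep only the words that appear more than once (sort | uniq -d).
dupes=$(
  while IFS= read -r line; do
    printf '%s\n' "$line" | grep -owE '[[:alpha:]]*en' | sort | uniq -d
  done < "$sample"
)

printf '%s\n' "$dupes"
rm -f "$sample"
```

Processing one line at a time matters here: running sort and uniq -d over the whole file would also report words that are repeated across different lines.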
Here's a link that suggests that awk might be a better tool for the job:
Upvotes: 0