Reputation: 135
I want to write a grep command that extracts the content between h1 tags, irrespective of class and other attributes.
I tried:
grep -o '>.*</h1>' Email.txt
But it returned only three elements.
Upvotes: 3
Views: 2405
Reputation: 626738
With GNU grep, you may use
grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' Email.txt
The -P option enables the PCRE regex engine, and the pattern matches:
<h1 - a <h1 string
(?:\s[^>]*)? - an optional non-capturing group matching 1 or 0 occurrences of a whitespace (\s) followed with 0+ chars other than >
> - a > char
\K - match reset operator that discards the text matched so far from the match memory buffer
.*? - any 0+ chars other than line break chars, as few as possible
(?=</h1>) - a positive lookahead that matches a location that is immediately followed with a </h1> substring.
Upvotes: 2
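
To see the pattern in action, here is a minimal sketch that runs the same command against a small sample file. The sample lines and the temporary file are made up for illustration (they are not the asker's Email.txt), and the sketch assumes a GNU grep built with PCRE support.

```shell
# Hypothetical sample input, created only for this illustration.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<h1>Plain heading</h1>
<h1 class="title" id="top">Heading with attributes</h1>
<p>Not a heading</p>
<h1>First</h1><h1 style="x">Second</h1>
EOF

# -o prints only the matched text; -P enables PCRE (GNU grep only).
# \K drops the opening tag from the reported match, and the lookahead
# keeps the closing </h1> out of it.
result=$(grep -oP '<h1(?:\s[^>]*)?>\K.*?(?=</h1>)' "$tmpfile")
printf '%s\n' "$result"

rm -f "$tmpfile"
```

Because .*? is lazy, two h1 elements on the same line are reported as two separate matches. Since . does not match line breaks, a heading whose text spans several lines would not be matched by this command.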