Reputation: 772
I want to run grep on HTML files to find lines longer than x characters and truncate what grep displays.
What I know
To find lines of 100 characters or more in HTML files:
find . -name '*.html' -print | xargs grep -on '.\{100\}'
To find lines matching title and limit the display to 40 characters on either side of the match:
find . -name '*.html' -print | xargs grep -onE '.{0,40}title.{0,40}'
What I don't know
How can I find lines that exceed 100 characters and then display only the first 40 characters of each?
MVCE
I have a bunch of html files, which look like
$ cat 1.html
abcdefghijklmnopqrstuv12345675689
12345675689abcdefghijklmnopqrstuv
abcd1234
Now I'd like to find lines longer than 20 characters and cut the display to 15 characters only.
Expected output with favoretti's solution:
$ find . -name '*.html' -print | xargs grep -on '.\{20\}' | cut -c -15
./1.html:1:abcd
./1.html:2:1234
./2.html:1:abcd
./2.html:2:1234
Upvotes: 0
Views: 2552
Reputation: 74685
First of all, it's worth mentioning that unless you're very confident you can treat your "HTML" files as a series of line-separated records, you should probably be using an HTML-aware tool (either standalone or included in a scripting language).
Since you mentioned Awk in an earlier comment:
find . -name '*.html' -exec awk '
length($0) > 20 { print FILENAME, substr($0, 1, 15) }' {} +
This matches lines with length greater than 20 and prints the first 15 characters. I put the file name at the start; you can remove that if you like.
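To get output closer to the `file:line:text` layout shown in the question, awk's built-in FNR (the line number within the current file) can be printed as well. The following is only a sketch on the question's sample data; the scratch directory and the variable names `demo_dir` and `awk_out` are made up for illustration:

```shell
# Hypothetical scratch directory holding the question's sample data.
demo_dir=$(mktemp -d)
printf '%s\n' 'abcdefghijklmnopqrstuv12345675689' \
              '12345675689abcdefghijklmnopqrstuv' \
              'abcd1234' > "$demo_dir/1.html"
cp "$demo_dir/1.html" "$demo_dir/2.html"

# FNR restarts at 1 for every file; sort only makes the file order stable,
# because find does not guarantee one.
awk_out=$(cd "$demo_dir" && find . -name '*.html' -exec awk '
  length($0) > 20 { print FILENAME ":" FNR ":" substr($0, 1, 15) }' {} + | LC_ALL=C sort)

printf '%s\n' "$awk_out"
rm -rf "$demo_dir"
```

Here the 15-character limit applies to the line's text only, so the first line of 1.html is shown as `./1.html:1:abcdefghijklmno`.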
It's not clear whether you need find for a recursive search or not. If not, then you might be fine with letting the shell generate the list of files:
awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' *.html
And with globstar enabled (shopt -s globstar), you can use **/*.html for recursive matching in Bash.
Upvotes: 4
Reputation: 11226
If for some reason you want to use just grep:
find . -name '*.html' -exec grep -oP '^.{40}(?=.{60})' {} /dev/null \;
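A scaled-down sketch of the same idea on the question's sample data, assuming GNU grep built with PCRE support (-P). The pattern is anchored with ^ so that -o prints at most one match per line, -n is added to show line numbers, and the lookahead means "at least 20 characters in total" rather than strictly more than 20. The scratch directory and variable names are made up for illustration:

```shell
# Hypothetical scratch directory holding the question's sample data.
demo_dir=$(mktemp -d)
printf '%s\n' 'abcdefghijklmnopqrstuv12345675689' \
              '12345675689abcdefghijklmnopqrstuv' \
              'abcd1234' > "$demo_dir/1.html"

# ^.{15}     the part that is displayed (first 15 characters)
# (?=.{5})   lookahead: at least 5 more characters must follow (20 in total)
# /dev/null  extra file operand so that grep prints the file name
grep_out=$(cd "$demo_dir" && find . -name '*.html' \
  -exec grep -onP '^.{15}(?=.{5})' {} /dev/null \;)

printf '%s\n' "$grep_out"
rm -rf "$demo_dir"
```

The short line `abcd1234` produces no output because the lookahead fails for it.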
Upvotes: 2
Reputation: 30197
The first grep works OK, I suppose, so if you want to print out just 40 chars, why not pipe it through cut?
find . -name '*.html' -print | xargs grep -on '.\{100\}' | cut -c 1-40
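One thing to keep in mind is that cut counts the `file:line:` prefix toward the limit. A minimal sketch on the question's sample data, with grep reading the file contents through xargs as in the question; the scratch directory and variable names are made up for illustration:

```shell
# Hypothetical scratch directory holding the question's sample data.
demo_dir=$(mktemp -d)
printf '%s\n' 'abcdefghijklmnopqrstuv12345675689' \
              '12345675689abcdefghijklmnopqrstuv' \
              'abcd1234' > "$demo_dir/1.html"
cp "$demo_dir/1.html" "$demo_dir/2.html"

# grep prints "file:line:match"; cut then keeps the first 15 characters of
# that whole string, prefix included. sort only makes the file order stable.
cut_out=$(cd "$demo_dir" && find . -name '*.html' -print \
  | xargs grep -on '.\{20\}' | cut -c 1-15 | LC_ALL=C sort)

printf '%s\n' "$cut_out"
rm -rf "$demo_dir"
```

With these file names the prefix `./1.html:1:` takes 11 of the 15 characters, which is why only `abcd` or `1234` remains of each line, as in the question's expected output.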
Upvotes: 0