piroot

Reputation: 772

Find lines longer than x characters and truncate for display

I want to run grep on HTML files to find lines longer than x characters and truncate what is displayed for each match.

What I know

To find lines with at least 100 characters in HTML files:

find . -name '*.html' -print | xargs grep -on '.\{100\}'

To find lines matching title and limit the display to at most 40 characters on either side of the match:

find . -name '*.html' -print | xargs grep -onE '.{0,40}title.{0,40}'

What I don't know

How can I find lines that exceed 100 characters and then display those lines truncated to 40 characters?


MCVE

I have a bunch of HTML files, which look like this:

$ cat 1.html
abcdefghijklmnopqrstuv12345675689
12345675689abcdefghijklmnopqrstuv
abcd1234

Now, I'd like to find lines longer than 20 characters, and then cut the display to 15 characters only.

Expected output, using favoretti's solution:

$ find . -name '*.html' -print | xargs grep -on '.\{20\}' | cut -c -15
./1.html:1:abcd
./1.html:2:1234

./2.html:1:abcd
./2.html:2:1234

Upvotes: 0

Views: 2552

Answers (3)

Tom Fenech

Reputation: 74685

First of all, it's worth mentioning that unless you're very confident that you can treat your "HTML" files as a series of line-separated records, you should probably be using an HTML-aware tool (either standalone or included in a scripting language).

Since you mentioned Awk in an earlier comment:

find . -name '*.html' -exec awk '
    length($0) > 20 { print FILENAME, substr($0, 1, 15) }' {} +

This matches lines with length greater than 20 and prints the first 15 characters. I put the file name at the start; you can remove that if you like.
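As a quick check against the sample file from the question, here is a minimal sketch of the same Awk filter that also prints the line number (Awk's FNR) in the file:line:text layout of the expected output. The scratch directory and the min and width variable names are assumptions made for illustration, not part of the command above.

```shell
# Recreate the question's sample file in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
printf '%s\n' \
    'abcdefghijklmnopqrstuv12345675689' \
    '12345675689abcdefghijklmnopqrstuv' \
    'abcd1234' > "$workdir/1.html"

# min and width are hypothetical names for the two limits.
# FNR is the line number within the current file.
out=$(cd "$workdir" && find . -name '*.html' -exec awk -v min=20 -v width=15 '
    length($0) > min { print FILENAME ":" FNR ":" substr($0, 1, width) }' {} +)
printf '%s\n' "$out"
```

Unlike piping through cut, the 15-character limit here applies to the line text only, so the file name and line number do not eat into it.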

It's not clear whether you need find for a recursive search or not - if not, then you might be fine with letting the shell generate the list of files:

awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' *.html

And with globstar enabled (shopt -s globstar), you can use **/*.html for recursive matching in Bash.
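A minimal sketch of the globstar variant, assuming Bash 4 or later is installed. The directory layout (a sub directory holding 2.html) is made up for illustration, and the command runs in an explicit bash process only so that the sketch behaves the same when pasted into another shell.

```shell
# Illustrative layout: one file at the top level, one in a subdirectory.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
printf '%s\n' 'abcdefghijklmnopqrstuv12345675689' 'abcd1234' > "$workdir/1.html"
printf '%s\n' '12345675689abcdefghijklmnopqrstuv' > "$workdir/sub/2.html"

# With globstar on, **/*.html matches .html files in the current
# directory and in every directory below it.
out=$(cd "$workdir" && bash <<'EOF'
shopt -s globstar
awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' **/*.html
EOF
)
printf '%s\n' "$out"
```

In an interactive Bash session the two lines inside the here-document can be typed directly.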

Upvotes: 4

123

Reputation: 11226

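If for some reason you want to use only grep, match the first 40 characters of a line only when at least 60 more follow (-P needs GNU grep built with PCRE support):

find . -name '*.html' -exec grep -oP '^.{40}(?=.{60})' {} /dev/null \;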
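A scaled-down sketch of this lookahead idea on the sample lines from the question, assuming GNU grep with PCRE support. It uses the question's smaller limits (longer than 20, show 15), and the pattern is anchored with ^ so that -o reports at most one match per line.

```shell
# Sample lines from the question; only the first two are longer than 20 characters.
sample='abcdefghijklmnopqrstuv12345675689
12345675689abcdefghijklmnopqrstuv
abcd1234'

# ^.{15} matches the first 15 characters; (?=.{6}) requires at least 6 more
# to follow (21 characters or more in total) without adding them to the match.
out=$(printf '%s\n' "$sample" | grep -oP '^.{15}(?=.{6})')
printf '%s\n' "$out"
```

The short third line fails the lookahead and is not printed.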

Upvotes: 2

favoretti

Reputation: 30197

The first grep works OK, I suppose, so if you want to print just 40 characters, pipe it through cut:

find . -name '*.html' -print | xargs grep -on '.\{100\}' | cut -c 1-40

Upvotes: 0
