VileTouch

Reputation: 15

Use of grep + sed based on a pattern file?

Here's the problem: I have ~35k files that may or may not contain one or more of the strings in a list of 300 lines, each containing a regex.

If I run grep -rnwl 'C:\out\' --include=*.txt -E --file='comp.log', I see that a few thousand files contain a match.

Now, how do I get sed to delete each line in these files that matches one of the patterns in comp.log used before?

Edit: comp.log contains a simple regex on each line, but for the most part each string to be matched is unique.

This is an example of how it is structured:

server[0-9]\/files\/bobba fett.stw
[a-z]+ mochaccino
[2-9] CheeseCakes
...

etc. Silly examples aside, it goes to show that each line is unique save for a few variations, so it shouldn't affect what I really want: to see whether any of these patterns match the lines in the file being worked on. It's no different from 's/pattern/replacement/', except that I want to use the patterns in the file instead of inline.


OK, here's an update (S.O. gets impatient if I don't declare the question answered after a few days). After MUCH fiddling with the @Kenavoz/@Fischer approach, I found a totally different solution, but first things first: creating a modified pattern list for sed to work with does work,

as does @werkritter's approach of dropping sed altogether (this one I find the most... err... "least convoluted" way around the problem).

I couldn't make @Mklement's answer work under Windows/Cygwin (it did work under Ubuntu, so... not sure what that means. Figures.)

What ended up solving the problem in a more long-term, reusable form was a wonderful program pointed out by a colleague, called PowerGrep. It really blows every other option out of the water. Unfortunately, it's Windows only AND it's not free. (Not even advertising here; the thing is not cheap, but it does solve the problem.)

So, considering that @werkritter's reply was not a "proper" answer and I can't choose both @Lars Fischer's and @Kenavoz's answers as the solution (they complement each other), I am awarding @Kenavoz the tickmark for being first.

Final thoughts: I was hoping for a simpler, universal and free solution, but apparently there isn't one.

Upvotes: 1

Views: 746

Answers (3)

mklement0

Reputation: 440677

Both Kenavoz's answer and Lars Fischer's answer use the same ingenious approach:
transform the list of input regexes into a list of sed match-and-delete commands, passed as a file acting as the script to sed via -f.

To complement these answers with a single command that puts it all together, assuming you have GNU sed and your shell is bash, ksh, or zsh (to support <(...)):

find 'c:/out' -name '*.txt' -exec sed -i -r -f <(sed 's#.*#/\\<&\\>/d#' comp.log) {} +
  • find 'c:/out' -name '*.txt' matches all *.txt files in the subtree of dir. c:/out

    • -exec ... + passes as many matching files as will fit on a single command line to the specified command, typically resulting only in a single invocation.
  • sed -i updates the input files in-place (conceptually speaking - there are caveats); append a suffix (e.g., -i.bak) to save backups of the original files with that suffix.

  • sed -r activates support for extended regular expressions, which is what the input regexes are.

  • sed -f reads the script to execute from the specified filename, which in this case, as explained in Kenavoz's answer, uses a process substitution (<(...)) to make the enclosed sed command's output act like a [transient] file.

    • The s/// sed command - which uses alternative delimiter # to facilitate use of literal / - encloses each line from comp.log in /\<...\>/d to yield the desired deletion command; enclosing the input regex in \<...\> ensures matching as a word, as grep -w does.
      This is the primary reason why GNU sed is required, because neither POSIX EREs (extended regular expressions) nor BSD/OSX sed support \< and \>.
      • However, you could make it work with BSD/OSX sed by replacing -r with -E, and \< / \> with [[:<:]] / [[:>:]]
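
To see what the inner sed command actually produces, it can help to write the generated script to a file and look at it before touching any real data. The following is only a sketch under stated assumptions: it needs GNU sed, uses two of the sample patterns from the question, and the file names sample.txt, comp.sed, and result.txt are made up for the demonstration.

```shell
# Sketch: inspect the generated sed script, then apply it to a sample file.
# Assumes GNU sed (needed for \< and \>); all file names are hypothetical.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two of the sample patterns from the question.
cat > "$tmp/comp.log" <<'EOF'
[a-z]+ mochaccino
[2-9] CheeseCakes
EOF

cat > "$tmp/sample.txt" <<'EOF'
one large mochaccino
3 CheeseCakes
13 CheeseCakes
keep this line
EOF

# Same transformation as in the one-liner, but written to a file
# instead of being passed through a process substitution.
sed 's#.*#/\\<&\\>/d#' "$tmp/comp.log" > "$tmp/comp.sed"
cat "$tmp/comp.sed"

# Apply the generated script without -i, so the sample file stays intact.
sed -E -f "$tmp/comp.sed" "$tmp/sample.txt" > "$tmp/result.txt"
cat "$tmp/result.txt"
```

The line "13 CheeseCakes" survives because \< does not match between the 1 and the 3, which mirrors what grep -w would do with the same pattern.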

Upvotes: 0

SLePort

Reputation: 15481

You can try this:

sed -E -f <(sed 's/^/\//;s/$/\/d/' comp.log) file > outputfile

Each regex in comp.log is turned into a sed address with a d command: /regex/d. The resulting script deletes every line that matches one of the patterns. The -E option is there because the patterns are extended regexes (the question uses grep -E); without it, sed would treat the + in [a-z]+ as a literal character.

The output of the inner sed is passed as a file (via process substitution) to the -f option of the outer sed, which is applied to file.

To delete only the string matching the patterns (not the whole line):

sed -E -f <(sed 's/^/s\//;s/$/\/\/g/' comp.log) file > outputfile

Update :

The command output is redirected to outputfile.
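
As a rough illustration of the difference between the two variants, the sketch below generates both scripts from two of the question's sample patterns and applies them to a small sample file. The file names (sample.txt, delete_lines.sed, and so on) are made up, and the scripts are written to temporary files rather than passed via process substitution so they can be inspected. It runs the outer sed with -E on the assumption that the patterns are extended regexes, as the grep -E in the question suggests.

```shell
# Sketch: compare "delete the whole line" with "delete only the match".
# All file names are hypothetical; patterns are taken from the question.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

cat > "$tmp/comp.log" <<'EOF'
server[0-9]\/files\/bobba fett.stw
[2-9] CheeseCakes
EOF

cat > "$tmp/sample.txt" <<'EOF'
see server1/files/bobba fett.stw for details
order: 4 CheeseCakes today
nothing to remove here
EOF

# Variant 1: each pattern becomes /regex/d
sed 's/^/\//;s/$/\/d/' "$tmp/comp.log" > "$tmp/delete_lines.sed"
# Variant 2: each pattern becomes s/regex//g
sed 's/^/s\//;s/$/\/\/g/' "$tmp/comp.log" > "$tmp/delete_matches.sed"
cat "$tmp/delete_lines.sed" "$tmp/delete_matches.sed"

sed -E -f "$tmp/delete_lines.sed" "$tmp/sample.txt" > "$tmp/lines_deleted.txt"
sed -E -f "$tmp/delete_matches.sed" "$tmp/sample.txt" > "$tmp/matches_deleted.txt"
cat "$tmp/lines_deleted.txt" "$tmp/matches_deleted.txt"
```

Note that both variants rely on any / inside a pattern already being escaped as \/, as it is in the question's comp.log; an unescaped / would end the sed address or pattern early.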

Upvotes: 2

Lars Fischer

Reputation: 10229

Some ideas, but not a complete solution, as it requires some adapting to your script (not shown in the question).

  1. I would convert comp.log into a sed script containing the necessary deletes:

    sed -r "s+(.*)+/\1/ d;+" comp.log > comp.sed
    

    That would make your example comp.sed look like:

    /server[0-9]\/files\/bobba fett.stw/ d;
    /[a-z]+ mochaccino/ d;
    /[2-9] CheeseCakes/ d;
    
  2. Then I would apply the comp.sed script to each file reported by grep (because of -l, grep prints only the names of matching files, so -n has no effect and the output can be used as a file list):

    sed -r -i.bak -f comp.sed "$AFileReportedByGrep"
    

    The -r option is needed because the patterns are extended regexes. If you have GNU sed, you can use -i for in-place replacement, creating a .bak backup; otherwise, redirect the output to a temporary file.
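
The two steps above could be tied together roughly as follows. This is a sketch rather than a finished script: it assumes GNU sed and GNU grep, uses a temporary directory named out with made-up files in place of the real C:\out tree, and reads one file name per line from grep, so it would not cope with file names that contain newlines.

```shell
# Sketch: build comp.sed once, then run it on every file that grep reports.
# Assumes GNU grep and GNU sed; directory layout and file names are hypothetical.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/out/sub"

cat > "$tmp/comp.log" <<'EOF'
[a-z]+ mochaccino
[2-9] CheeseCakes
EOF

printf '%s\n' 'a small mochaccino' 'keep me' > "$tmp/out/a.txt"
printf '%s\n' 'no match in this file' > "$tmp/out/sub/b.txt"

# Step 1: convert the pattern list into a sed script of delete commands.
sed -E 's+(.*)+/\1/ d;+' "$tmp/comp.log" > "$tmp/comp.sed"

# Step 2: grep -l lists only the files with a match; edit each one in place,
# keeping the original as <name>.bak.
grep -rlwE --include='*.txt' -f "$tmp/comp.log" "$tmp/out" |
while IFS= read -r file; do
    sed -E -i.bak -f "$tmp/comp.sed" "$file"
done

cat "$tmp/out/a.txt"
```

Restricting sed to the files grep reports means untouched files get no .bak copy. One difference to keep in mind is that grep -w matches whole words only, whereas the plain /regex/ addresses in comp.sed do not, so sed may delete some lines that grep -w would not have counted as matches.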

Upvotes: 2
