David Parks

Reputation: 32081

Piping unfiltered text to an awk system command's stdin

I have a gawk script that has accumulated a bunch of HTML in a variable, and now needs to pipe it to lynx via a system command.

(Feel free to tell me AWK is a bad solution... `while read LINE;` was wildly slow, so this is take 2.)

I tried this in awk:

    cmd = sprintf( "bash -c \'lynx -dump -force_html -stdin <<< \"%s\"\'", html )
    system ( cmd )

Bad idea. Simple test cases work, but with raw HTML, special-character and string-termination issues abound, and the escapes-within-escapes-within-escapes are getting mind-bogglingly complex.

lynx handles whatever I throw at it on stdin just fine; I just can't get data onto its stdin from awk without routing it through the command line, which seems like an unwieldy solution.


Edit (adding detail about my end goal, in case awk isn't a good approach):

What I want is to parse HTML out of a large text file that has delimiters between blocks of HTML. I need to pass each block to lynx to be formatted, and dump the result into a new, big text file.


Example input (a dump from another system):

**********URL: http://some/url
<html>
<head><title>Any 'ol HTML document</title></head>
<body>
<p>With pretty much any character you can imagine at some point</p>
<p>I'm using lynx to strip off the HTML and give me a nice format</p>
</body>
</html>
**********URL: http://another/url
<html><head><title>My input file provides a few 100,000 such html documents</title></head>
<body/></html>

Each HTML document should be fed through lynx -dump. Lynx can read the HTML from a file (a named pipe works too) or from stdin (with the -stdin option).
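For example, this is the kind of invocation I mean (guarded so the snippet still runs if lynx isn't installed):

```shell
# Quick sanity check of lynx -dump -force_html -stdin
# (falls back to a message when lynx is not installed).
html='<html><body><p>Hello from stdin</p></body></html>'
if command -v lynx >/dev/null 2>&1; then
  printf '%s\n' "$html" | lynx -dump -force_html -stdin
else
  echo 'lynx not installed'
fi
```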

My output is then:

**********URL: http://some/url
  Any 'ol HTML document

  With pretty much any character you can imagine at some point
  I'm using lynx to strip off the HTML and give me a nice format
**********URL: http://another/url
  My input file provides a few 100,000 such html documents

Upvotes: 0

Views: 1306

Answers (2)

David Parks

Reputation: 32081

To add to n0741337's answer, here's an example using gawk coprocesses that I put together after reading his answer. It takes "aline" from stdin, pipes it to a cat coprocess, captures the coprocess's output, and prints it:

printf "aline" | awk '
  BEGIN{cmd="cat"} 
  {
    print $0 |& cmd; 
    close(cmd, "to"); 
    while ((cmd |& getline line) > 0) { 
      print "got", line 
    }; 
    close (cmd);
  }'

result: got aline

The gawk manual has a more extensive discussion of this feature: http://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html#Two_002dway-I_002fO

Upvotes: 0

n0741337

Reputation: 2514

Try |& in gawk, which I found out about here. It lets you send output from gawk to the stdin of another command running as a coprocess.

Upvotes: 1
