Reputation: 3781

Awk pattern matching

I want to print

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

from my data

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

How can I do this with AWK (or whatever)? Assume that my data is stored in the "$info" variable (single line data).

Edit: by "single line data" I mean that all the data is on one line, like this:

messss...<input name="userId" value="1234" type="hidden">messsss...<input ....>messssssss

So I can't use grep to extract the section of interest.

Upvotes: 1

Answers (8)

digerydoo

Reputation: 1

Here is a short awk one-liner using bash:

awk 'BEGIN { FS = "\""; RS = "<" } /=/ { print $2 " = " $4 }' <(printf '%s' "$info")

Explanation:

RS="<" -- break the text into records (instead of lines) at each <

FS="\"" -- break records into fields at each "

/=/ -- choose records containing =

{ print $2 " = " $4 } -- print the 2nd and 4th fields, joined by " = "
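To see what the record and field splitting actually produces, here is a small sketch that dumps fields 2 and 4 of every record. The contents of $info are made-up sample data, and a plain pipe is used instead of process substitution so it also runs in sh.

```shell
# Made-up single-line sample standing in for the real $info.
info='mess<input name="userId" value="1234" type="hidden">mess<input name="js" value="abc" type="hidden">mess'

# RS="<" starts a new record at every tag; FS="\"" splits each record on
# double quotes, so $2 holds the name and $4 holds the value.
result=$(printf '%s' "$info" |
    awk 'BEGIN { FS = "\""; RS = "<" }
         { printf "record %d: $2=[%s] $4=[%s]\n", NR, $2, $4 }')

printf '%s\n' "$result"
```

The first record is the text before the first tag, so its fields come out empty. That is why the one-liner needs a pattern to filter records before printing.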

Upvotes: 0

Mark Edgar

Reputation: 4827

Tools like awk and sed can be used together with XMLStarlet and HTML Tidy to parse HTML.

Upvotes: 0

Markus Jarderot

Reputation: 89241

AWK:

BEGIN {
  # Use record separator "<", instead of "\n".
  RS = "<"
  first = 1
}

# Skip the first record, as that begins before the first tag
first {
  first = 0
  next
}

/^input[^>]*>/ { #/
  # make sure we don't match outside of the tag
  end = match($0,/>/)

  # locate the name attribute
  pos = match($0,/name="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  name = substr($0,RSTART+6,RLENGTH-7)

  # locate the value attribute
  pos = match($0,/value="[^"]*"/)
  if (pos == 0 || pos > end) { next }
  value = substr($0,RSTART+7,RLENGTH-8)

  # print out the result
  print name " = " value
}

Upvotes: 0

paxdiablo

Reputation: 882596

I'm not sure I understand your "single line data" comment but if this is in a file, you can just do something like:

cat file |
    grep '^<input ' |
    sed 's/^<input name="//' |
    sed 's/" value="/ = /' |
    sed 's/".*$//'

Here's the cut'n'paste version:

cat file | grep '^<input ' | sed 's/^<input name="//' | sed 's/" value="/ = /' | sed 's/".*$//'

This turns:

messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
<input name="userId" value="1234" type="hidden"> messsssssssssssssssssss
<input name="userid" value="12345" type="hidden"> messssssssssssssssssss
<input name="timestamp" value="88888888" type="hidden"> messssssssssssss
<input name="js" value="abc" type="hidden"> messssssssssssssssssssssssss
messssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss

quite happily into:

userId = 1234
userid = 12345
timestamp = 88888888
js = abc

The grep simply extracts the lines you want, while the sed commands, respectively:

strip off up to the first quote.
replace the section between the name and value with an "=".
remove everything following the value closing quote (including that quote).
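If the data really is one long line, as the question's edit describes, the grep step has no line start to anchor on. One possible workaround, sketched here on made-up sample data and assuming the attributes always come in the order name, value, is to turn every < into a newline first so each tag starts its own line:

```shell
# Made-up single-line sample standing in for the real $info.
info='mess<input name="userId" value="1234" type="hidden">mess<input name="js" value="abc" type="hidden">mess'

# tr puts each tag at the start of its own line (without the "<");
# sed -n with the p flag prints only lines where the substitution matched.
result=$(printf '%s\n' "$info" |
    tr '<' '\n' |
    sed -n 's/^input name="\([^"]*\)" value="\([^"]*\)".*/\1 = \2/p')

printf '%s\n' "$result"
```

With the sample above this prints "userId = 1234" and "js = abc" on separate lines.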

Upvotes: 4

glenn jackman

Reputation: 247200

IMO, parsing HTML should be done with a proper HTML/XML parser. For example, Ruby has an excellent package, Nokogiri, for parsing HTML/XML:

ruby -e '
    require "rubygems"
    require "nokogiri"
    doc = Nokogiri::HTML.parse(ARGF.read)
    doc.search("//input").each do |node|
        atts = node.attributes
        puts "%s = %s" % [atts["name"], atts["value"]]
    end
' mess.html

This produces the output you're after.

Upvotes: 1

johnB

Reputation: 21

Using Perl:

cat file | perl -ne 'print "$1 = $2\n" while /name="(.*?)".*?value="(.*?)"/g'

Upvotes: 2

Cascabel

Reputation: 497602

This part should probably be a comment on Pax's answer, but it got a bit long for that little box. I'm thinking 'single line data' means you don't have any newlines in your variable at all? Then this will work:

echo "$info" | sed -n -r '/<input/s/<input +name="([^"]+)" +value="([^"]+)"[^>]*>[^<]*/\1 = \2\n/gp'

Notes on interesting bits:

-n means don't print by default - we'll say when to print with that p at the end.
-r means extended regex
/<input/ at the beginning makes sure we don't even bother to work on lines that don't contain the desired pattern
That \n at the end is there to ensure all records end up on separate lines - any original newlines will still be there, and the fastest way to get rid of them is to tack on a '| grep .' on the end - you could use some sed magic but you wouldn't be able to understand it thirty seconds after you typed it in.

I can think of ways to do this in awk, but this is really a job for sed (or perl!).
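If the one-liner above is hard to follow, a two-step variant might be easier to read: pull each tag out onto its own line with grep -o, then rewrite it with a basic sed substitution. This is only a sketch on made-up sample data; -o is supported by GNU and BSD grep but is not in POSIX, and the attribute order name, value is assumed.

```shell
# Made-up single-line sample standing in for the real $info.
info='mess<input name="userId" value="1234" type="hidden">mess<input name="js" value="abc" type="hidden">mess'

# grep -o prints every match of the tag pattern on a separate line,
# so sed only ever sees one complete <input ...> tag at a time.
result=$(printf '%s\n' "$info" |
    grep -o '<input[^>]*>' |
    sed 's/^<input name="\([^"]*\)" value="\([^"]*\)".*/\1 = \2/')

printf '%s\n' "$result"
```

Because grep has already dropped everything outside the tags, no leftover blank lines need cleaning up afterwards.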

Upvotes: 3

soulmerge

Reputation: 75774

To process variables that contain more than one line, you need to put the variable name in double quotes:

echo "$info"|sed 's/^\(<input\( \)name\(=\)"\([^"]*\)" value="\([^"]*\)"\)\?.*/\4\2\3\2\5/'
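A quick way to see why the quotes matter, using a made-up two-line value (this assumes a Bourne-style shell such as sh or bash, where unquoted expansions are word-split):

```shell
# Made-up two-line sample standing in for the real $info.
info='<input name="a" value="1">
<input name="b" value="2">'

# Unquoted: the shell splits the value on whitespace and echo rejoins
# the pieces with single spaces, so the newline is lost.
unquoted=$(echo $info)

# Quoted: the value is passed to echo as-is, newline included.
quoted=$(echo "$info")

printf 'unquoted:\n%s\n\nquoted:\n%s\n' "$unquoted" "$quoted"
```

The unquoted form collapses both tags onto one line, which breaks any line-anchored pattern like the ^ in the sed command above.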

Upvotes: 2
