Amit

Reputation: 43

Need to grep a specific string using curl

I am trying to get the language code from web pages using curl.

I wrote the command below and it works:

curl -Ls yahoo.com | grep "lang=" | head -1 | cut -d ' ' -f 3 | cut -d"\"" -f 2

But sometimes the markup is different, for example with:

 curl -Ls stick-it.app | grep "lang=" | head -1 | cut -d ' ' -f 3 | cut -d"\"" -f 2

where the tag is written like this:

<html dir="rtl" lang="he-IL">

I just need to get he-IL.

If there is any other way to do this, I would appreciate it.

Upvotes: 4

Views: 6362

Answers (4)

The fourth bird

Reputation: 163207

Another variation uses GNU awk, whose match function accepts an optional third argument, an array that receives the capture groups:

match(string, regexp [, array])

curl -Ls yahoo.com | awk 'match($0, /<html [^<>]*lang="([^"]*)"/, a) {print a[1]}'

Output

en-US

The pattern matches:

  • <html followed by a space: matched literally
  • [^<>]* matches zero or more characters other than < or >
  • lang=" is matched literally
  • ([^"]*) is capture group 1 (a[1] in the example code), matching zero or more characters other than "
  • " matches the closing double quote
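To check the pattern without hitting the network, the sample line from the question can be piped into the same awk program. This is only a minimal sketch: the function name extract_lang_gawk is made up for illustration, and it assumes GNU awk is installed as gawk, since the three-argument form of match is a gawk extension.

```shell
# Hypothetical helper (not from the answer): print the lang value of the
# first <html ...> tag found on stdin.
# Assumes GNU awk; the third argument to match() is a gawk extension.
extract_lang_gawk() {
  gawk 'match($0, /<html [^<>]*lang="([^"]*)"/, a) { print a[1]; exit }'
}

# Sample line taken from the question.
printf '%s\n' '<html dir="rtl" lang="he-IL">' | extract_lang_gawk
```

With curl, the call would look like curl -Ls stick-it.app | extract_lang_gawk. The added exit stops after the first matching line, which is an assumption about the desired behavior.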

Upvotes: 0

RavinderSingh13

Reputation: 133428

With awk's match function, one could also try the following.

your_curl_command | awk '
match($0,/^<html.*lang="[^"]*/){
  val=substr($0,RSTART,RLENGTH)
  sub(/.*lang="/,"",val)
  print val
}
'

Explanation: a commented version of the above.

your_curl_command | awk '          ## Start of the awk program.
match($0,/^<html.*lang="[^"]*/){   ## Match from <html at the start of the line through lang=" and up to, but not including, the next ".
  val=substr($0,RSTART,RLENGTH)    ## Store the matched substring in val.
  sub(/.*lang="/,"",val)           ## Delete everything through lang=" from val, leaving only the value.
  print val                        ## Print val.
}
'
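As a quick offline check, the program above can be wrapped in a shell function and fed a few sample lines. This is a sketch under stated assumptions: the name extract_lang_awk and the sample input are illustrative, and the exit statement is an addition so that only the first matching line is printed.

```shell
# Hypothetical helper wrapping the awk program above.
# Uses only match/RSTART/RLENGTH, so any POSIX awk should do.
extract_lang_awk() {
  awk '
  match($0, /^<html.*lang="[^"]*/) {
    val = substr($0, RSTART, RLENGTH)   # from <html through the value, closing quote excluded
    sub(/.*lang="/, "", val)            # drop everything through lang="
    print val
    exit                                # added here: stop after the first matching line
  }'
}

# Sample input modeled on the tag shown in the question.
printf '%s\n' '<!DOCTYPE html>' '<html dir="rtl" lang="he-IL">' '<head>' | extract_lang_awk
```

Because the regex is anchored with ^, a page whose <html tag is indented or shares a line with the doctype would produce no output.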

Upvotes: 2

anubhava

Reputation: 784948

If you have GNU grep, you can use -P (Perl-compatible regex):

curl -Ls stick-it.app | grep -oP '\slang="\K[^"]+'

he-IL
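In this pattern, \K discards everything matched so far from the reported match, so -o prints only the attribute value. Since -o prints every match, a page with several lang attributes yields several lines. A minimal sketch that keeps only the first one, assuming GNU grep built with PCRE support (the function name first_lang_grep and the second sample line are made up for illustration):

```shell
# Hypothetical helper: first lang="..." value found on stdin.
# Assumes GNU grep with -P (PCRE) support.
first_lang_grep() {
  grep -oP '\slang="\K[^"]+' | head -n 1
}

# The second line is an invented example of a later lang attribute on the page.
printf '%s\n' '<html dir="rtl" lang="he-IL">' '<a href="/en" lang="en">English</a>' | first_lang_grep
```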

Upvotes: 2

Ed Morton

Reputation: 203209

Using any sed in any shell on every Unix box:

$ curl -Ls yahoo.com | sed -n 's/^<html.* lang="\([^"]*\).*/\1/p'
en-US
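The same command can be tried offline against both attribute layouts. This is only a sketch: the function name extract_lang_sed is hypothetical, the first sample line comes from the question, and the second is an invented stand-in for a page that puts other attributes before lang.

```shell
# Hypothetical helper wrapping the sed command above.
extract_lang_sed() {
  sed -n 's/^<html.* lang="\([^"]*\).*/\1/p'
}

# Tag from the question.
printf '%s\n' '<html dir="rtl" lang="he-IL">' | extract_lang_sed

# Invented sample with a different attribute before lang.
printf '%s\n' '<html id="atomic" lang="en-US">' | extract_lang_sed
```

Note that the expression is anchored at the start of the line, so an indented <html tag would not match.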

Upvotes: 6
