Reputation: 43
I am trying to get the language code from web pages using curl.
I wrote the command below, and it works:
curl -Ls yahoo.com | grep "lang=" | head -1 | cut -d ' ' -f 3 | cut -d"\"" -f 2
but on some pages the markup is different, for example
curl -Ls stick-it.app | grep "lang=" | head -1 | cut -d ' ' -f 3 | cut -d"\"" -f 2
where the tag is written like this:
<html dir="rtl" lang="he-IL">
I just need to get he-IL.
If there is any other way to do this, I would appreciate it.
Upvotes: 4
Views: 6362
Reputation: 163207
Another variation, using GNU awk and a pattern with a capture group, via the three-argument form of match:
match(string, regexp [, array])
curl -Ls yahoo.com | awk 'match($0, /<html [^<>]*lang="([^"]*)"/, a) {print a[1]}'
Output
en-US
The pattern matches:
  <html      the literal text <html followed by a space
  [^<>]*     0+ of any character except < or >
  lang="     the literal text lang="
  ([^"]*)    capture group 1 (a[1] in the example code): 0+ of any character except "
  "          the closing double quote
Upvotes: 0
Reputation: 133428
With awk's match function, one could try the following too.
your_curl_command | awk '
match($0,/^<html.*lang="[^"]*/){
val=substr($0,RSTART,RLENGTH)
sub(/.*lang="/,"",val)
print val
}
'
Explanation: the same program with a comment on each step.
your_curl_command | awk ' ##Starting awk program from here.
match($0,/^<html.*lang="[^"]*/){ ##using match function to match regex starting from <html till lang=" till next 1st occurrence of "
val=substr($0,RSTART,RLENGTH) ##Creating val which has substring of matched values.
sub(/.*lang="/,"",val) ##Substituting everything till lang=" with NULL in val here.
print val ##printing val here.
}
'
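To try the program above without a network connection, it can be wrapped in a shell function and fed the sample tag from the question. This is only a minimal sketch: the function name extract_lang is hypothetical, and the exit after the first match is an added assumption that only the first matching line is wanted.

```shell
# extract_lang (hypothetical name): read HTML on stdin and print the lang value
# of the first line that starts with <html and carries a lang="..." attribute.
extract_lang() {
  awk '
    match($0, /^<html.*lang="[^"]*/) {
      val = substr($0, RSTART, RLENGTH)   # matched text, from <html to the end of the value
      sub(/.*lang="/, "", val)            # drop everything up to and including lang="
      print val
      exit                                # assumption: only the first match is wanted
    }
  '
}

# Offline check with the tag quoted in the question:
printf '%s\n' '<!DOCTYPE html>' '<html dir="rtl" lang="he-IL">' | extract_lang
# Live use would look like: curl -Ls stick-it.app | extract_lang
```

Because match, RSTART and RLENGTH are standard awk features, this form should not need GNU awk.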
Upvotes: 2
Reputation: 784948
If you have GNU grep, you can use -P (Perl regex):
curl -Ls stick-it.app | grep -oP '\slang="\K[^"]+'
he-IL
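Since the pattern is not tied to the html tag, it can also match lang attributes on later tags, so on some pages the command may print more than one line. The sketch below keeps only the first match. It assumes GNU grep built with PCRE support, and the sample HTML is made up for illustration.

```shell
# Sample input (made up): the second lang attribute sits on a nested tag.
html='<html dir="rtl" lang="he-IL">
<body><span lang="en">hello</span></body>
</html>'

# -o prints only the matched text; \K discards everything matched before it.
printf '%s\n' "$html" | grep -oP '\slang="\K[^"]+'

# Keep only the first match, which is normally the one on the <html> tag.
first_lang=$(printf '%s\n' "$html" | grep -oP '\slang="\K[^"]+' | head -n 1)
printf '%s\n' "$first_lang"
```

With a live page the same idea would be curl -Ls stick-it.app | grep -oP '\slang="\K[^"]+' | head -n 1.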
Upvotes: 2
Reputation: 203209
Using any sed in any shell on every Unix box:
$ curl -Ls yahoo.com | sed -n 's/^<html.* lang="\([^"]*\).*/\1/p'
en-US
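The expression needs the lang attribute to be on the same line as an <html tag that starts the line, with at least one space before lang; where lang sits among the other attributes does not matter. A small offline check follows. The helper name get_lang is hypothetical, and the second sample tag is made up for illustration.

```shell
# get_lang (hypothetical helper name): the same sed expression, reading stdin.
get_lang() {
  sed -n 's/^<html.* lang="\([^"]*\).*/\1/p'
}

# lang as the last attribute (the tag quoted in the question)
printf '%s\n' '<html dir="rtl" lang="he-IL">' | get_lang

# lang as the first attribute (made-up tag)
printf '%s\n' '<html lang="en-US" class="no-js">' | get_lang
```

Lines that do not start with <html print nothing, because -n suppresses default output and p prints only when the substitution succeeds.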
Upvotes: 6