vishal_P
vishal_P

Reputation: 81

Bash sed command issue

I'm trying to further parse an output file I generated using an additional grep command. The code that I'm currently using is:

##!/bin/bash

# fetches the links of the movie's imdb pages for a given actor

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ];
code="nm0000122"
then
code="nm0000050"
fi


curl "https://www.imdb.com/name/$code/#actor" | grep -Eo 
'href="/title/[^"]*' | sed 's#^.*href=\"/#https://www.imdb.com/#g' | 
sort -u > imdb_links.txt

#parses each of the link in the link text file and gets the details for 
each of the movie. THis is followed by the cleaning process
for i in $(cat imdb_links.txt) 
do 
   curl $i | 
   html2text | 
   sed -n '/Sign_In/,$p'|  
   sed -n '/YOUR RATING/q;p' | 
   head -n-1 | 
   tail -n+2 
done > imdb_all.txt

The sample generated output is:

EN
⁰
    * Fully supported
    * English (United States)
    * Partially_supported
    * Français (Canada)
    * Français (France)
    * Deutsch (Deutschland)
    * हिंदी (भारत)
    * Italiano (Italia)
    * Português (Brasil)
    * Español (España)
    * Español (México)
****** Duck Soup ******
    * 19331933
    * Not_RatedNot Rated
    * 1h 9m
IMDb RATING
7.8/10

How do I change the code to further parse the output to get only the data from the title of the movie up until the imdb rating ( in this case, the line that contains the title 'Duck Soup' up until the end.

Upvotes: 0

Views: 98

Answers (2)

Reino
Reino

Reputation: 3423

Please(!) have a look at the following urls on why it's a really bad idea to parse HTML with sed:

The thing you're trying to do can be done with the HTML/XML/JSON parser and with just 1 call!
In this example I'll use the IMDB of Charlie Chaplin as source.

Extract all 94 "Actor" IMDB movie urls:

$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
  //div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94

There's no need to save these to a text-file. Just use -f (--follow) instead of -e and xidel will open all of them.


For the individual movie urls you could parse the HTML to get the text-nodes you want...

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  //h1,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
  (//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10

...but with those class-names I'd say that's a rather fragile endeavor. Instead I'd recommend to parse the JSON at the top of the HTML-source within the <script>-node:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    datePublished,
    duration,
    aggregateRating/ratingValue
  )
'
A Countess from Hong Kong
1967-03-15
PT2H
6

...or to get a similar output as above:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    year-from-date(date(datePublished)),
    substring(lower-case(duration),3),
    format-number(aggregateRating/ratingValue,"#.0")||"/10"
  )
'
A Countess from Hong Kong
1967
2h
6.0/10

All combined:

$ xidel -s "https://www.imdb.com/name/nm0000122" \
  -f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
  -e '
    parse-json(//script[@type="application/ld+json"])/(
      name,
      year-from-date(date(datePublished)),
      substring(lower-case(duration),3),
      format-number(aggregateRating/ratingValue,"#.0")||"/10"
    )
  '
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10

Upvotes: 1

SergA
SergA

Reputation: 1174

Here is the code:

#!/bin/bash

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ]; then
  code="nm0000122"
else
  code="nm0000050"
fi

rm -f imdb_links.txt

curl "https://www.imdb.com/name/$code/#actor" |
  grep -Eo 'href="/title/[^"]*' |
  sed 's#^href="#https://www.imdb.com#g' |
  sort -u |
while read link; do
   # uncomment the next line to save links into file:
   #echo "$link" >>imdb_links.txt

   curl "$link" |
     html2text -utf8 |
     sed -n '/Sign_In/,/YOUR RATING/ p' |
     sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt

Upvotes: 1

Related Questions