Reputation: 307
I am trying to write a program using the lynx command on this page "http://www.rottentomatoes.com/movie/box_office.php" and I can't seem to wrap my head around a certain problem.... getting the title by itself. My problem is a title can contain special characters, numbers, and all titles are variable in length. I want to write a regex that could parse the entire page and find lines like this.... (I added spaces between the title and the next number, which is how many weeks it has been out, to distinguish between title and weeks released)
1 -- 30% The Vow 1 $41.2M $41.2M $13.9k 2958
2 -- 53% Safe House 1 $40.2M $40.2M $12.9k 3119
3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470
4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655
5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908
the regex I have started out with is:
/(\d+)\s(\d+|\-\-)\s(\d+\%)\s
If someone can help me figure out how to grab the title successfully that would be much appreciated! Thanks in advanced.
Upvotes: 1
Views: 114
Reputation: 93157
Capture all the things!!
^(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+(.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\d+)$
Explained:
^ <- Start of the line
(\d+)\s+ <- Numbers (captured) followed by as many spaces as you want
(\d+|\-\-)\s+ <- Numbers [or "--"] (captured) followed by as many spaces as you want
(\d+\%)\s+ <- Numbers [with '%'] (captured) followed by as many spaces as you want
(.*)\s+ <- Anything you can match [don't be greedy] (captured) followed by as many spaces as you want
(\d+)\s+ <- Numbers (captured) followed by as many spaces as you want
(\$\d+(?:.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\$\d+(?:.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\$\d+(?:.\d+)?[Mk])\s+ <- "$" and Numbers [with floating point] and "M or k" (captured) followed by as many spaces as you want
(\d+) <- Numbers (captured)
$ <- End of the line
So to be serious this is what I've done, I cheated a bit and captured everything (as I think you'll do in the end) to get a lookahead for the title capture.
In a non-greedy regex (.*)
[or (.*?)
if you want to force the "ungreedyness"] will capture the least possible characters, and the end of the regex tries to capture everything else.
Your regex ends up capturing only the title (the only thing left).
What you can do is using an actual lookahead and make assertions.
Resources:
Upvotes: 2