roman28

Reputation: 51

Regex to extract C/C++ code from a webpage

I am new to web design and am currently building a website for my college project. I have run into the following problem:

I am using Perl to extract C/C++ code from the page at a given link. I fetch the page with:

my $req = HTTP::Request->new( GET => $link );
my $response = $ua->request($req);
my $results = $response->content;

to load the whole webpage into the $results variable. Then I remove the JavaScript using:

while($results=~s/<script.*?>.*?<\/script>//gsi){};

Finally, to print the output, I am using:

pos($results)=0;
$delim='{}';
while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/\{/s)
{
  $code=$1 . extract_codeblock($results,$delim);
  print Dumper( "$code" . "\n" . "\n");
}

where my regex is:

my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

But this code does not produce the desired output; I think my regex is incorrect. Can somebody suggest a good regex for extracting the C++ code? The idea is to extract anything and everything between "{" and "}" on the webpage.

Upvotes: 1

Views: 90

Answers (1)

Miller

Reputation: 35208

For reading and parsing a webpage, I'd recommend that you use Mojo::UserAgent and Mojo::DOM. Both come with Mojolicious.

For a tutorial on using both of them, I'd recommend watching the 8-minute video at Mojocast Episode 5.

Ideally, when working with a webpage, the type of content should be irrelevant. Instead, where it is placed on the page should be the only information you need to extract the data you want.
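
As a rough sketch of that idea, assuming the C/C++ snippets on the target page sit inside <pre> elements (an assumption; the real page may use different markup, in which case the selector needs adjusting), Mojo::DOM can pick them out by position rather than by matching braces. The sample HTML and the helper name extract_snippets are made up for illustration:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the text of every <pre> element.
# The 'pre' selector is an assumption about the page's markup.
sub extract_snippets {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    # all_text collects text from the element and its descendants,
    # so <pre><code>...</code></pre> is covered as well.
    return $dom->find('pre')->map('all_text')->each;
}

# Made-up sample page standing in for the fetched content.
my $html = <<'HTML';
<html><body>
<script>var x = { a: 1 };</script>
<p>Example:</p>
<pre><code>int main() { return 0; }</code></pre>
</body></html>
HTML

# Print each extracted snippet, separated by a blank line.
print "$_\n\n" for extract_snippets($html);
```

Because only <pre> elements are selected, the braces inside the <script> block are never considered, so no separate JavaScript-stripping step is needed. For a live page, the DOM could instead come from Mojo::UserAgent->new->get($link)->result->dom, with $link being the URL from the question.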

Upvotes: 1
