Michael d

Reputation: 301

Get url value in querystring of each entry link in Google News RSS XML for Facebook Sharer

Hi, I'm using SimpleXML to display a news.google.com feed.

The displayed entries link to the original article in this way:

http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEcqhcp4AfUzgxc2l1gumydaxQ-KQ&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778832126843&ei=keFLVfiHGvDVmQL5_4GgBg&url=http://WEBSITEWITHNEWS.COM/ARTICLEURLHERE

I need the entries to link to this instead: http://WEBSITEWITHNEWS.COM/ARTICLEURLHERE

The reason is that Facebook Sharer cannot interpret the following link:

https://www.facebook.com/sharer/sharer.php?u=http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEcqhcp4AfUzgxc2l1gumydaxQ-KQ&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778832126843&ei=keFLVfiHGvDVmQL5_4GgBg&url=http://WEBSITEWITHNEWS.COM/ARTICLEURLHERE

Facebook Sharer needs it to look like this:

https://www.facebook.com/sharer/sharer.php?u=http://WEBSITEWITHNEWS.COM/ARTICLEURLHERE

Is there a way to use str_replace or a regex (preg_match) to remove the Google redirect URL so that social sharing sites can recognize the link?

The Google redirect URL is dynamic, so it will be slightly different each time; I need something that handles every variant.

My working, functional code:

<?php
$feed = file_get_contents("https://news.google.com/news/feeds?q=KEYWORD&output=rss");
$xml = new SimpleXmlElement($feed);

// HTML output
foreach ($xml->channel->item as $entry){
  $date = $entry->pubDate;
  $date = strftime("%m/%d/%y %I:%M:%S%P", strtotime($date));
  $desc = $entry->description;
  $desc = str_replace("and more »", "", "$desc");
  $desc = str_replace("font-size:85%", "font-size:100%", "$desc");
  ?>
  <div class="item"></div>
  <?php echo $desc; ?>
  <div class="date">
  <?php echo $date; ?></div>
  <?php } ?>

<?php
// Plain output variant of the same loop
foreach ($xml->channel->item as $entry){
  $desc = $entry->description;
  $date = $entry->pubDate;
  $date = strftime("%A, %m/%d/%Y, %H:%M:%S", strtotime($date));
  // Apply the replacement to $desc, not to a string literal
  $desc = str_replace("and more »", "x", $desc);
  echo $date;
  echo $desc;
}

I'm using $desc rather than $link to display the link, but the article URL with the Google redirect prefix is also available in $link, if you would rather apply str_replace or preg_match to $link instead of $desc.

Link to working Google News feed below: https://news.google.com/news/feeds?q=KEYWORD&output=rss

Upvotes: 0

Views: 276

Answers (2)

mhall

Reputation: 3701

You could use the built-in PHP functions parse_url (split URL into components) and parse_str (get parameter values from query string) for this:

$feed = file_get_contents(
    "https://news.google.com/news/feeds?q=KEYWORD&output=rss"
);
$xml = new SimpleXmlElement($feed);

foreach ($xml->channel->item as $entry){
    // Get query part of link
    $query = parse_url($entry->link, PHP_URL_QUERY);

    // Parse query parameters into $params array
    parse_str($query, $params);

    // Get URL from parameters
    $url = $params['url'];

    // Just output in this example
    echo "URL: $url", PHP_EOL;

    // ... Do some more stuff
}

Output:

URL: http://www.gamasutra.com/blogs/JonathanRaveh/20150506/242840/Death_of_the_app_keyword__whats_next.php
URL: http://www.business2community.com/online-marketing/8-keyword-optimization-tips-perfect-ppc-campaigns-01222200
URL: http://searchengineland.com/marry-keywords-compelling-content-218174
...
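
To tie this back to the Facebook Sharer part of the question, here is a minimal sketch of the same parse_url / parse_str approach wrapped in two helpers. The function names extractTargetUrl and facebookShareUrl are made up for illustration, and the sample link is a shortened version of the one in the question. It assumes the target URL carries no unencoded & of its own, since parse_str would otherwise split it into separate parameters.

```php
<?php
// Hypothetical helper (name is an assumption, not from the answer above):
// returns the value of the "url" query parameter of a Google News redirect
// link, or null if the link has no such parameter.
function extractTargetUrl(string $link): ?string
{
    $query = parse_url($link, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or the link could not be parsed
    }
    parse_str($query, $params);
    return (isset($params['url']) && is_string($params['url']))
        ? $params['url']
        : null;
}

// Hypothetical helper: builds a Facebook Sharer link and URL-encodes the
// target so that its own ":" and "/" characters stay inside the u parameter.
function facebookShareUrl(string $targetUrl): string
{
    return 'https://www.facebook.com/sharer/sharer.php?u=' . rawurlencode($targetUrl);
}

// Shortened sample link based on the one in the question.
$link = 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&cid=52778832126843'
      . '&url=http://WEBSITEWITHNEWS.COM/ARTICLEURLHERE';

$target = extractTargetUrl($link);
if ($target !== null) {
    echo $target, PHP_EOL;
    echo facebookShareUrl($target), PHP_EOL;
}
```

Inside the feed loop you would pass (string) $entry->link to extractTargetUrl and fall back to the original link when it returns null.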

Upvotes: 1

chris85

Reputation: 23880

The approach from my first comment uses the regex in the code below.

<?php
date_default_timezone_set('America/New_York');
$feed = file_get_contents("https://news.google.com/news/feeds?q=KEYWORD&output=rss");
$xml = new SimpleXmlElement($feed);
foreach ($xml->channel->item as $entry) {
    $date = $entry->pubDate;
    $date = strftime("%m/%d/%y %I:%M:%S%P", strtotime($date));
    $desc = $entry->description;
    $desc = str_replace("and more&nbsp;&raquo;", "","$desc");
    $desc = str_replace("font-size:85%", "font-size:100%","$desc"); /*
    ?>
    <div class="item"></div>
    <?php // echo $desc; ?>
    <div class="date"><?php echo $date; ?></div>
    <?php
    */
    $desc = $entry->description;
    $desc = preg_replace('~href=".*?&amp;url=(.*?)"~', 'href="https://www.facebook.com/sharer/sharer.php?u=$1"', $desc);
    $date = $entry->pubDate; 
    $date = strftime("%A, %m/%d/%Y, %H:%M:%S", strtotime($date));
    //$desc = str_replace("and more »","x","and more »");
    echo $date . "\n" . $desc;
    die('1 pass');
}
?>

Output (format altered for displaying):

<table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;">
    <tr>
        <td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"></font></td>
        <td valign="top" class="j"><font style="font-size:85%;font-family:arial,sans-serif"><br>
            <div style="padding-top:0.8em;"><img alt="" height="1" width="1"></div>
            <div class="lh"><a href="https://www.facebook.com/sharer/sharer.php?u=http://www.gamasutra.com/blogs/JonathanRaveh/20150506/242840/Death_of_the_app_keyword__whats_next.php"><b>Death of the app <b>keyword</b> – what&#39;s next?</b></a><br>
                <font size="-1"><b><font color="#6f6f6f">Gamasutra (blog)</font></b></font><br>
                <font size="-1">Yes, app <b>keywords</b> are dying. If you search the web you may find insightful stories about apps that gained massive recognition due to the clever use of <b>keywords</b>. Many companies and services (such as Sensor Tower) offer developers tools to help them&nbsp;...</font><br>
                <font size="-1" class="p"></font><br>
                <font class="p" size="-1"><a class="p" href="http://news.google.com/news/more?ncl=d4b6j-gMxFN1VKM&amp;authuser=0&amp;ned=us"><nobr><b>and more&nbsp;&raquo;</b></nobr></a></font></div>
            </font></td>
    </tr>
</table>
1 pass

This regex, ".*?&amp;url=(.*?)", looks between the first and last double quote of an href and captures everything after &amp;url=. In the examples I've seen, every instance has the URL as the last parameter. This regex will NOT work if the URL is not the last parameter, because the capture runs all the way to the closing double quote. Handling that case would need a check that stops at either the closing double quote or an entitied ampersand; that'd be ("|&amp;). I could see that cutting off parameters from the article URLs, though, if they had additional GET parameters of their own. That said, I never saw any of the article URLs in these links use GET parameters. Take out the die('1 pass'); and give it a try, or keep the die in if you want a small sample at first.
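
To make the trade-off concrete, here is a small sketch comparing the pattern from the code above with the ("|&amp;) variant. The sample href is made up: example.com and the trailing foo=bar parameter are assumptions for illustration, not something observed in the feed.

```php
<?php
// Made-up href in the entitied form used by the feed description.
// The trailing "&amp;foo=bar" is hypothetical and only there to show the difference.
$html = '<a href="http://news.google.com/news/url?sa=t'
      . '&amp;url=http://example.com/article&amp;foo=bar">Title</a>';

// Pattern from the answer: capture everything after &amp;url= up to the closing quote.
$toQuote = '~href=".*?&amp;url=(.*?)"~';

// Variant: stop at the closing quote OR at the next entitied ampersand.
$toQuoteOrAmp = '~href=".*?&amp;url=(.*?)(?:"|&amp;)~';

preg_match($toQuote, $html, $m1);
preg_match($toQuoteOrAmp, $html, $m2);

echo $m1[1], PHP_EOL; // http://example.com/article&amp;foo=bar
echo $m2[1], PHP_EOL; // http://example.com/article
```

Note that the variant is shown with preg_match only. Dropped into preg_replace as-is, it would leave the text after the ampersand (foo=bar") behind in the markup, so the pattern would also have to consume the rest of the attribute.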

Upvotes: 1
