Reputation: 95
I'm trying to get the plain text from this webpage: https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp, which on inspection is a callback function that inserts HTML. I'm trying to scrape the page and reformat the output so that it actually shows the HTML instead of unreadable plain text.
PHP:
echo file_get_contents("https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
The returned text is a complete mess:
����X321-5db7e88872.jsonp�Y]n�6���E�ıH�;��E�@���b�PM��%�f#K�H��}�;�z���:�eG"e��:@�E����j��XޖdJ���$�&$~����>a�8#��p�ӥy��X��8�r��(#kZ���85�j�A�%��������Ȇ�...
Whereas it should look like this:
"<div class=\"newpage\" id=\"page319\" style=\"width: 902px; height:1167px\">\n<div class=text_layer style=\"z-index:2\"><div class=ie_fix>\n \n<div class=\"ff81\" style=\"font-size:114px\">\n<span class=a style=\"left:331px;top:75px;color:#ffffff\">1<span class=w9></span>3</span></div>...
Although I could manually copy and paste the text from the webpage into a text editor, I would like to eliminate that step, as I'll need to do this for 320 pages.
Is there some workaround for .jsonp URLs? Or is the data encrypted by the server? (I just don't know.)
Upvotes: 2
Views: 52
Reputation: 9957
The response is gzip'd. You can see it in the response headers:
Content-Encoding: gzip
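You can confirm this from PHP itself: after a file_get_contents() call over HTTP, PHP populates the special $http_response_header variable with the response headers. A quick check, just as a sketch:
file_get_contents("https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
print_r($http_response_header); // look for a "Content-Encoding: gzip" line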
So you need to decompress it. You can do this either by changing your whole approach and using cURL, or by using the stream wrapper compress.zlib://. Just prepend that to the URL:
echo file_get_contents("compress.zlib://https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
That will get you the correct response. Note that this is still a JSONP response, so it comes in the form of a callback; you need to decide what to do with it.
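For example, if the callback argument is a single JSON-encoded string of HTML (as your expected output suggests), you could strip the wrapper with a regex and decode what's left. A rough sketch; the pattern assumes the usual someCallback("...");  shape and isn't tested against every page:
$jsonp = file_get_contents("compress.zlib://https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
// Keep only the argument between the first "(" and the final ")"
if (preg_match('/^[^(]*\((.*)\)\s*;?\s*$/s', $jsonp, $matches)) {
    echo json_decode($matches[1]); // decodes the JSON string into plain HTML
}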
Upvotes: 2