sapbucket

Reputation: 7195

How to extract JSON serialization from HTML? C#

First, my apologies for being a JSON newbie. I suppose my ignorance makes it difficult for me to ask an accurate question. No worries, I will edit and clean up this post once things are clearer.

I have some HTML from a 3rd party website that contains JSON data I would like to extract. I've written unit tests that serialize/deserialize the JSON data to a C# class. However, my offline test input file is generated by a manual copy/paste operation: I opened the *.html source, found the JSON serialization data string, and copy-pasted it into an offline file. I then used it as input to my unit test. It works great.

I'd like to automate this manual copy/paste operation.

Currently the URL that I am using returns HTML - and the JSON data is buried in the HtmlDocument somewhere - and I haven't the foggiest how to determine what the direct JSON query url might be, or how to discover it. It would be ideal to know how to obtain this.

With this background information explained I'll now ask my question(s).

Conceptually I think there might be two questions to ask. There should be only one, but therein lies my ignorance: I'm not certain which question is the better one to ask, or whether the two I post below are even in the ballpark. My hope is that you understand what I'm asking from a conceptual viewpoint, and that after I reach some understanding I can rewrite it from a more technical/accurate viewpoint. Please bear with me.

Q1: When working with a 3rd party website, how does one determine what the GET string should be to directly request the JSON object?

This seems like the ideal solution, but I do not understand how to determine the way the GET request should be constructed. I have barely scratched the surface of using the Inspector tool in Firefox to investigate HTML. Using this tool to find the JSON request URL string (for a GET) is a mystery to me.

Q2: When working with a 3rd party website, how does one navigate the HTML to find the node where the JSON string can be extracted?

This is a backup question. If the answer is "no, you cannot directly determine the JSON URL GET string", the backup is to traverse the HTML and locate the element that contains the JSON data string.

Example of the html: (heavily truncated to fit here in this post)

...lots of html, followed by:

    <script>
      window.dataLayer = window.dataLayer || [];
         function gtag(){dataLayer.push(arguments);}
         gtag('js', new Date());
         gtag('config', 'UA-6441790-1');
    </script>
    <script>
      var result = {"teams":["tigers","sharks","destroyers","nerfs"]};
    </script>

...lots more html, followed by EOF.

The JSON data is the object literal assigned to var result.

Upvotes: 1

Views: 1472

Answers (2)

Wanton

Reputation: 840

Get the HTML as text, then use HtmlAgilityPack to parse it and find the script tags. You then need your own code to pick the correct script tag out of many, for example by checking whether its content starts with var result =. Next, pull the JSON out of that script text with your own code; taking everything after var result = and trimming the trailing ; may be enough here. Finally, you can use JSON.NET to deserialize that JSON if needed.
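The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the HtmlAgilityPack and Newtonsoft.Json (JSON.NET) NuGet packages are installed, and the names ResultData and EmbeddedJsonExtractor are made up for illustration. It also assumes the script tag contains nothing but the single var result = ...; statement, as in the question's example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using Newtonsoft.Json;

// Hypothetical class shaped like the question's sample: {"teams":[...]}
public class ResultData
{
    [JsonProperty("teams")]
    public List<string> Teams { get; set; }
}

// Hypothetical helper; the name is not part of any library.
public static class EmbeddedJsonExtractor
{
    private const string Prefix = "var result =";

    // Returns the JSON text assigned to "var result", or null if no script matches.
    public static string ExtractJson(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts == null) return null;

        foreach (var script in scripts)
        {
            string body = script.InnerText.Trim();
            if (!body.StartsWith(Prefix, StringComparison.Ordinal)) continue;

            // Assumption: the script holds only this one statement, so
            // everything after the prefix, minus the trailing ';', is the JSON.
            return body.Substring(Prefix.Length).Trim().TrimEnd(';').Trim();
        }
        return null;
    }

    // Extracts the JSON and deserializes it with JSON.NET.
    public static ResultData Parse(string html)
    {
        string json = ExtractJson(html);
        return json == null ? null : JsonConvert.DeserializeObject<ResultData>(json);
    }
}
```

If the real page puts other statements in the same script tag, the prefix/trim approach will hand invalid JSON to the deserializer, so the matching logic would need to be tightened for that page.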

Upvotes: 1

M G

Reputation: 154

If I have understood you correctly:

  1. If you want to obtain the HTML dynamically (in code), you can use HttpClient https://learn.microsoft.com/pl-pl/dotnet/api/system.net.http.httpclient?view=netframework-4.7.2 and make a simple GET request. It returns the response body, which in this case is HTML. On the client side you can use jQuery's load http://api.jquery.com/load/
  2. Regarding Q1: if someone wants to provide data to others, they simply expose an API (e.g. using REST). In that case they also provide an API reference, e.g. https://api.stackexchange.com/docs . APIs provide data in a convenient way, using a standardized format such as JSON or XML. A common practice is to let the requester specify the format they would like to receive (using the Accept header with a MIME type).
  3. Regarding Q2: You cannot simply navigate the HTML to obtain the data. An API's documentation specifies URLs, formats, and expected output. When you request HTML and traverse it to obtain data, on the other hand, you have no guarantee about the result. One day the HTML may change, and the data you want to extract will no longer be found. So extracting data from a response's HTML is not good practice. If someone does not expose an API, that may mean they do not want to. Of course, if you really want that data you can extract it as described, but that approach carries the errors and inconveniences mentioned above.
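As a rough illustration of points 1 and 2, here is a minimal sketch of a GET request with HttpClient that also sets the Accept header. The PageFetcher class and its method names are hypothetical, and whether a given site honors Accept: application/json is entirely up to that site; many will return HTML regardless.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical helper; the name is not part of any library.
public static class PageFetcher
{
    // HttpClient is intended to be created once and reused.
    private static readonly HttpClient Client = new HttpClient();

    // Builds a GET request that asks for the given media type via the Accept header.
    public static HttpRequestMessage BuildRequest(string url, string mediaType)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue(mediaType));
        return request;
    }

    // Sends the request and returns the response body as a string
    // (HTML for a normal page, JSON if the server exposes it at that URL).
    public static async Task<string> GetBodyAsync(string url, string mediaType = "text/html")
    {
        using (var request = BuildRequest(url, mediaType))
        using (var response = await Client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode(); // throws on non-2xx status codes
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

The returned string can then be used wherever the manually copied HTML was used before.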

Upvotes: 0
