How to extract JSON serialization from HTML? C#

Question

First, my apologies for being a JSON newbie. I suppose my ignorance makes it difficult for me to ask an accurate question. No worries, I will edit and clean up this post once things are clear.

I have some HTML from a 3rd party website that contains JSON data I would like to extract. I've written unit tests that serialize/deserialize the JSON data to a C# class. However, my offline test input file is generated by a manual copy/paste operation: I opened the *.html source, found the JSON serialization data string, and copied and pasted it into an offline file. I then used that file as input to my unit test. It works great.

This manual copy/paste operation I'd like to make automatic.

Currently the URL that I am using returns HTML - and the JSON data is buried in the HtmlDocument somewhere - and I haven't the foggiest how to determine what the direct JSON query url might be, or how to discover it. It would be ideal to know how to obtain this.

With this background information explained I'll now ask my question(s).

Conceptually I think there might be two questions to ask. There should be only one, but therein lies my ignorance: I'm not certain which question is the better one to ask, or if the two I post below are even in the ballpark. My hope is that you understand what I'm asking from a conceptual viewpoint, and after I reach some understanding I can rework it into a more technical/accurate form. Please bear with me.

Q1: When working with a 3rd party website, how does one determine what the GET string should be to directly request the JSON object?

This seems like the ideal solution, but I do not understand the process of determining how the GET request should be constructed. I have barely scratched the surface of using the Inspector tool in Firefox to investigate HTML. Using this tool to find the JSON request URL string (for a GET) is a mystery to me.

Q2: When working with a 3rd party website, how does one navigate the HTML to find the node from which the JSON string can be extracted?

This is a backup question. If the answer is "no, you cannot directly determine the JSON URL GET string", the backup is to traverse the HTML and locate the element that contains the JSON data string.

Example of the html: (heavily truncated to fit here in this post)

...lots of html, followed by:

    
    

...lots more html, followed by EOF.

The JSON data is encapsulated in the var result string.

Wanton · Accepted Answer

Get the HTML as text, then use HtmlAgilityPack to parse it and find the script tags. You will then need to write your own code to pick the correct script tag out of many, perhaps by checking whether its content starts with var result =. Next, extract the JSON text with your own code; taking everything after var result = and trimming off the trailing ; may be enough here. Finally, you can use JSON.NET to deserialize that JSON if needed.
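The steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name EmbeddedJsonExtractor, the method ExtractJson, and the sample HTML in Main are all made up for the example, and it assumes the JSON assignment is the last statement in its script tag. It relies on the HtmlAgilityPack and Newtonsoft.Json (JSON.NET) NuGet packages.

```csharp
using System;
using HtmlAgilityPack;        // NuGet package: HtmlAgilityPack
using Newtonsoft.Json.Linq;   // NuGet package: Newtonsoft.Json (JSON.NET)

public static class EmbeddedJsonExtractor
{
    // Assumption: the page assigns the JSON inside a <script> tag as "var result = {...};".
    private const string Marker = "var result =";

    // Returns the raw JSON text, or null when no script tag contains the marker.
    public static string ExtractJson(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts == null)
        {
            return null;
        }

        foreach (var script in scripts)
        {
            string text = script.InnerText;
            int start = text.IndexOf(Marker, StringComparison.Ordinal);
            if (start < 0)
            {
                continue; // not the script tag we are looking for
            }

            // Keep everything after the marker, then drop whitespace and the trailing semicolon.
            // This only works if nothing else follows the assignment in the same script tag.
            return text.Substring(start + Marker.Length).Trim().TrimEnd(';').Trim();
        }

        return null;
    }

    public static void Main()
    {
        // Hypothetical sample input; the real page would be downloaded as text first.
        string html =
            "<html><body><script>var result = {\"name\":\"demo\",\"count\":2};</script></body></html>";

        string json = ExtractJson(html);
        if (json == null)
        {
            Console.WriteLine("No matching script tag found.");
            return;
        }

        // JObject.Parse gives loosely typed access; JsonConvert.DeserializeObject<T>
        // could be used instead to map onto an existing C# class.
        JObject data = JObject.Parse(json);
        Console.WriteLine(data["name"]);
    }
}
```

If the script tag contains more statements after the assignment, the simple trim will not be enough and the end of the JSON object would have to be located some other way.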
