Reputation: 1
I’m currently doing a Data & AI internship. My job is to build a product database by retrieving information (product name, image, description, part number/SKU, technical specifications, datasheet, etc.) from the manufacturers' websites.
The challenge is that there are over 300 different manufacturers, each with its own website and structure, making traditional web scraping impractical and hard to maintain. To overcome this, I’m considering using AI and machine learning to make my scraping agent adaptable to changes in the HTML structure of each page.
I have downloaded and manually labeled 50 product pages. Here’s what my dataset looks like:
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   text                52 non-null     object
 1   product_name        49 non-null     object
 2   html_product_name   51 non-null     object
 3   image_url           50 non-null     object
 4   html_image_url      50 non-null     object
 5   description         32 non-null     object
 6   html_description    51 non-null     object
 7   part_number         35 non-null     object
 8   html_part_number    36 non-null     object
 9   html_specification  44 non-null     object
 10  datasheet_url       40 non-null     object
 11  html_datasheet_url  41 non-null     object
 12  specification       2 non-null      object
The text column contains the cleaned HTML of the product pages, while the other columns represent the target fields—the specific sections of the HTML that need to be identified and extracted.
This problem seems very similar to Named Entity Recognition (NER). How can I train a machine learning model to successfully extract these fields from raw HTML? What would be the best approach (e.g., fine-tuning a transformer model, sequence labeling, or another method)?
Thanks in advance!
Upvotes: 0
Views: 98
Reputation: 1
Just scrape their websites and extract the information with an LLM prompt. That is the simplest and most practical solution I can suggest for this problem.
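A minimal sketch of that idea, assuming you send the prompt to any chat-style LLM API (the field names are taken from the question's dataset columns; `build_prompt` and `parse_response` are hypothetical helpers, not part of any library):

```python
import json
import re

# Field names taken from the question's dataset columns.
FIELDS = ["product_name", "image_url", "description",
          "part_number", "specification", "datasheet_url"]

def build_prompt(html: str) -> str:
    """Build an extraction prompt asking the LLM for strict JSON output."""
    keys = ", ".join(f'"{f}"' for f in FIELDS)
    return (
        "Extract the following fields from the product page HTML below "
        f"and return ONLY a JSON object with the keys {keys}. "
        "Use null for any field that is missing.\n\n"
        f"HTML:\n{html}"
    )

def parse_response(raw: str) -> dict:
    """Parse the model's reply, tolerating ```json fences around the object."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM response")
    return json.loads(match.group(0))
```

You would send `build_prompt(page_html)` to whatever model you use and run `parse_response` on the reply; the regex fallback is there because models sometimes wrap the JSON in a code fence despite instructions.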
Upvotes: 0
Reputation: 1574
You can do it in several stages.
As mentioned in the previous answer, you can also use an LLM (e.g. ChatGPT) to get what you want. But be careful to write your prompt precisely: state exactly the output format you want (JSON, for example) and the JSON keys you need (for example "product_name", etc.).
As I do not have all the details, my answer might be incomplete.
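Once the model replies, it is worth validating that the JSON really contains exactly the keys you asked for. A small sketch of that check (the key set is an assumption based on the question's columns):

```python
import json

# Keys the prompt is assumed to have requested (from the question's dataset).
EXPECTED_KEYS = {"product_name", "image_url", "description",
                 "part_number", "specification", "datasheet_url"}

def validate_extraction(raw_json: str) -> dict:
    """Parse the LLM reply and verify it matches the requested key set."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = EXPECTED_KEYS - data.keys()
    extra = data.keys() - EXPECTED_KEYS
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return data
```

Rejecting malformed replies early lets you retry the prompt instead of silently loading bad rows into the database.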
Upvotes: 0
Reputation: 3
Training your own model for this would be very hard and probably goes beyond the scope of your internship: you would need large amounts of training data, because the results depend on many factors. One idea I got from reading your question is to use a big LLM, such as ChatGPT or Google Gemini, and provide it the full webpage you want to scrape in the prompt. These LLMs are capable of returning structured output (for example JSON), so you can basically describe the structure of the desired output and what it should fill into each field using the information from the webpage.
Here is a link from Google showing how to make Gemini return JSON: https://ai.google.dev/gemini-api/docs/structured-output?hl=de&lang=python
Edit (for clarification): LLM input/prompt = task + webpage HTML => model output = structured JSON.
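A rough sketch of that flow with Gemini's structured output, assuming the `google-genai` SDK and a `GEMINI_API_KEY` in the environment; the schema, the model name, and the helper functions are illustrative assumptions, so check the linked docs for the current API shape:

```python
import json

# Assumed schema covering a few of the question's fields; extend as needed.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "part_number": {"type": "string", "nullable": True},
        "datasheet_url": {"type": "string", "nullable": True},
    },
    "required": ["product_name"],
}

def to_record(response_text: str) -> dict:
    """Gemini returns the structured output as JSON text; parse it."""
    return json.loads(response_text)

def extract_with_gemini(page_html: str) -> dict:
    """Untested sketch of a structured-output call (pip install google-genai)."""
    from google import genai
    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # example model name, an assumption
        contents="Extract the product fields from this HTML:\n" + page_html,
        config={
            "response_mime_type": "application/json",
            "response_schema": PRODUCT_SCHEMA,
        },
    )
    return to_record(response.text)
```

With `response_schema` set, the model is constrained to emit JSON matching the schema, so `to_record` should never see free-form prose.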
Upvotes: 0