Reputation: 1
I’m currently doing a Data & AI internship. My job is to build a product database by retrieving information (product name, image, description, part number/SKU, technical specifications, datasheet, etc.) from the manufacturers' websites.
The challenge is that there are over 300 different manufacturers, each with its own website and structure, making traditional web scraping impractical and hard to maintain. To overcome this, I’m considering using AI and machine learning to make my scraping agent adaptable to changes in the HTML structure of each page.
I have downloaded and manually labeled 50 product pages. Here’s what my dataset looks like:
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   text                52 non-null     object
 1   product_name        49 non-null     object
 2   html_product_name   51 non-null     object
 3   image_url           50 non-null     object
 4   html_image_url      50 non-null     object
 5   description         32 non-null     object
 6   html_description    51 non-null     object
 7   part_number         35 non-null     object
 8   html_part_number    36 non-null     object
 9   html_specification  44 non-null     object
 10  datasheet_url       40 non-null     object
 11  html_datasheet_url  41 non-null     object
 12  specification       2 non-null      object
The text column contains the cleaned HTML of the product pages, while the other columns represent the target fields—the specific sections of the HTML that need to be identified and extracted.
This problem seems very similar to Named Entity Recognition (NER). How can I train a machine learning model to successfully extract these fields from raw HTML? What would be the best approach (e.g., fine-tuning a transformer model, sequence labeling, or another method)?
Thanks in advance!
Upvotes: 0
Views: 98
Reputation: 1
Just scrape their websites and extract the information with an LLM prompt. That is the simplest and most practical solution I can suggest for this problem.
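A minimal sketch of that idea, assuming you send the prompt to any chat-style LLM API (the field names are taken from the question's dataset columns; `build_prompt` and `parse_response` are hypothetical helpers, not part of any library):

```python
import json
import re

# Field names taken from the question's dataset columns.
FIELDS = ["product_name", "image_url", "description",
          "part_number", "specification", "datasheet_url"]

def build_prompt(html: str) -> str:
    """Build an extraction prompt asking the LLM for strict JSON output."""
    keys = ", ".join(f'"{f}"' for f in FIELDS)
    return (
        "Extract the following fields from the product page HTML below "
        f"and return ONLY a JSON object with the keys {keys}. "
        "Use null for any field that is missing.\n\n"
        f"HTML:\n{html}"
    )

def parse_response(raw: str) -> dict:
    """Parse the model's reply, tolerating ```json fences around the object."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in LLM response")
    return json.loads(match.group(0))
```

You would send `build_prompt(page_html)` to whatever model you use and run `parse_response` on the reply; the regex fallback is there because models sometimes wrap the JSON in a code fence despite instructions.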
Upvotes: 0
Reputation: 1574
You can do it in several stages.
As mentioned in the previous answer, you can also use an LLM (e.g. ChatGPT) to get what you want. But be careful to write your prompt precisely: state exactly the output format you want (JSON, for example) and the JSON keys you need (for example "product_name", etc.).
As I do not have all the details, my answer might be incomplete.
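Once the model replies, it is worth validating that the JSON really contains exactly the keys you asked for. A small sketch of that check (the key set is an assumption based on the question's columns):

```python
import json

# Keys the prompt is assumed to have requested (from the question's dataset).
EXPECTED_KEYS = {"product_name", "image_url", "description",
                 "part_number", "specification", "datasheet_url"}

def validate_extraction(raw_json: str) -> dict:
    """Parse the LLM reply and verify it matches the requested key set."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = EXPECTED_KEYS - data.keys()
    extra = data.keys() - EXPECTED_KEYS
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return data
```

Rejecting malformed replies early lets you retry the prompt instead of silently loading bad rows into the database.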
Upvotes: 0
Reputation: 3
Training your own model for this would be very hard and probably goes beyond the scope of your internship: you would need large amounts of training data, because the results depend on many factors. One idea I got from reading your question is to use a big LLM, such as ChatGPT or Google Gemini, and provide it the full webpage you want to scrape in the prompt. These LLMs are capable of returning structured output (for example JSON), so you can basically describe the structure of the desired output and what it should fill into each field using the information from the webpage.
Here is a link from Google showing how to make Gemini return JSON: https://ai.google.dev/gemini-api/docs/structured-output?hl=de&lang=python
Edit (for clarification): LLM input/prompt = task + webpage HTML => model output = structured JSON.
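A rough sketch of that flow with Gemini's structured output, assuming the `google-genai` SDK and a `GEMINI_API_KEY` in the environment; the schema, the model name, and the helper functions are illustrative assumptions, so check the linked docs for the current API shape:

```python
import json

# Assumed schema covering a few of the question's fields; extend as needed.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "part_number": {"type": "string", "nullable": True},
        "datasheet_url": {"type": "string", "nullable": True},
    },
    "required": ["product_name"],
}

def to_record(response_text: str) -> dict:
    """Gemini returns the structured output as JSON text; parse it."""
    return json.loads(response_text)

def extract_with_gemini(page_html: str) -> dict:
    """Untested sketch of a structured-output call (pip install google-genai)."""
    from google import genai
    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # example model name, an assumption
        contents="Extract the product fields from this HTML:\n" + page_html,
        config={
            "response_mime_type": "application/json",
            "response_schema": PRODUCT_SCHEMA,
        },
    )
    return to_record(response.text)
```

With `response_schema` set, the model is constrained to emit JSON matching the schema, so `to_record` should never see free-form prose.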
Upvotes: 0