LLM Web Scraping - Webpage to LLM Friendly Text - Fully Open Source


LLMs are good at extracting data from text. So to scrape any webpage, we provide the webpage text to the LLM in a format that makes the data easy to extract.

 
We usually use libraries like Selenium and BeautifulSoup to get the page source HTML and pull the plain text out of it. That may be enough for some information, but plain text cannot capture image links or website links for the products or items we are extracting. For example, while scraping an e-commerce website, if along with details like product title and price you also want the product image and the product's main page link, then preprocessing the HTML becomes important.
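As a minimal sketch of why that matters (the product URL below is a placeholder), the usual Selenium + BeautifulSoup flow looks roughly like this:

```python
# Minimal sketch of the usual flow (placeholder URL): fetch the rendered
# page source with Selenium, then flatten it to plain text with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.example.com/product/123")  # placeholder product page
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator="\n", strip=True)

# `text` keeps visible content such as the product title, price and
# description, but href/src attributes (product page links, image links)
# are stripped out, so an LLM reading this text cannot return them.
print(text[:500])
```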

Below I share an open-source repository that converts a webpage into LLM-friendly text, from which you can extract any data, including website and image links.
 
APIs like the Jina Reader API and the Firecrawl API can be used to get clean text from any webpage. If you want a fully open-source option and the ability to modify the code to your needs (some websites may require such modifications), take a look at this repo: https://github.com/m92vyas/llm-reader.git
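For reference on the hosted options, Jina Reader can be called by simply prefixing its reader endpoint to the target URL; here is a rough sketch with a placeholder target URL (Firecrawl has its own SDK and API key setup, not shown here):

```python
# Sketch of fetching clean, LLM-friendly text through Jina Reader by
# prefixing the reader endpoint to the target URL (placeholder URL below).
import requests

target_url = "https://www.example.com/product/123"
response = requests.get("https://r.jina.ai/" + target_url, timeout=30)
response.raise_for_status()

llm_ready_text = response.text  # markdown-style text of the page
print(llm_ready_text[:500])
```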
 
The repo has details and examples on how to use the library, so I will not repeat them here.
 
So use the repo to convert any webpage into LLM-ready text, then design your prompts to extract any data or perform any task on that text. Fully open source!
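As a purely illustrative sketch of that prompting step (the model name, field names and client library are my own choices, not something the repo prescribes):

```python
# Illustrative sketch: prompt an LLM to pull structured fields out of the
# LLM-ready page text. The model, fields and client library are example
# choices, not part of llm-reader itself.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

llm_ready_text = "...LLM-friendly text from llm-reader or a reader API..."

prompt = f"""Extract the following fields from the webpage text below and
return them as a JSON object: product_title, price, image_link,
product_page_link. Use null for any field that is not present.

Webpage text:
{llm_ready_text}
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)
data = json.loads(response.choices[0].message.content)
print(data)
```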

In the future I will add some examples explaining how to modify the open-source llm-reader code for webpages where the existing code, Jina Reader, or the Firecrawl API does not perform well, and I will also add an easy-to-use scraper built on top of llm-reader.

Consider leaving feedback and giving the GitHub repo a star if it was of any help to you.
