Posts

A Data Leakage mistake often made while using GridSearchCV / RandomizedSearchCV

Using a scikit-learn Pipeline lets cross-validation fit transformers on the train split only, avoiding data leakage during hyper-parameter tuning. We all know the importance of keeping separate train and test sets to avoid data leakage. We use the statistics of the train data alone to transform our data, i.e. in scikit-learn terms we ‘fit’ any data transformation only on the train data and not on the test/validation data. E.g.: if we want to standardize a given feature, we calculate the mean and standard deviation of the train data only and use them to standardize the test/validation data. I have noticed that people, after doing the above transformation, pass the transformed dataset directly to GridSearchCV or RandomizedSearchCV, which internally performs similar train-test (validation) cross-validation splits of the transformed data and has no mechanism to calculate the statistics of the train split only and leave the test/validation split out. i.e...
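
A minimal sketch of the leak-free setup, assuming a StandardScaler and an SVC purely for illustration (the parameter grid is hypothetical):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is part of the pipeline, so each CV fold fits it on that
# fold's train split only -- no statistics leak from the validation split.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC()),
])

param_grid = {"clf__C": [0.1, 1, 10]}  # illustrative grid
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```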

Stable Diffusion 3 on Colab (Run the Full model without quantization)

Run the full Stable Diffusion 3 model on Colab (T4 GPU) without quantization, with long prompts / extended context length & prompt weighting. The Stable Diffusion 3 Hugging Face page states “SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using fp16 precision.” and gives some options like using a quantized version of the T5 text encoder or dropping it. CPU offload does not work in the free version of Colab, and sequential CPU offload takes a long time to generate an image. Fortunately, the T4 GPU on Colab has enough memory to load all three text encoders at once without any quantization, get the text embeddings, and then free just enough GPU space to load the transformer and VAE and perform the remaining steps of image generation. So, the basic steps to prepare the pipeline will look like: load all the 3 text encoders with their tokenizer o...
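
A rough sketch of the two-stage idea, assuming the diffusers StableDiffusion3Pipeline; the exact keyword arguments and encode_prompt signature may differ between diffusers versions:

```python
import gc
import torch
from diffusers import StableDiffusion3Pipeline

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"

# Stage 1: load only the text encoders (no transformer / vae) and
# compute the prompt embeddings on the GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id, transformer=None, vae=None, torch_dtype=torch.float16
).to("cuda")

with torch.no_grad():
    prompt_embeds, neg_embeds, pooled, neg_pooled = pipe.encode_prompt(
        prompt="a photo of an astronaut riding a horse",
        prompt_2=None,
        prompt_3=None,
    )

# Stage 2: free the text encoders, then load the transformer and VAE
# into the reclaimed GPU memory and denoise using the cached embeddings.
del pipe
gc.collect()
torch.cuda.empty_cache()

pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder=None, text_encoder_2=None, text_encoder_3=None,
    tokenizer=None, tokenizer_2=None, tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=neg_embeds,
    pooled_prompt_embeds=pooled,
    negative_pooled_prompt_embeds=neg_pooled,
).images[0]
```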

vLLM Parameter Tuning for Better Performance

vLLM Parameters. We all know that vLLM is a fast and easy-to-use library for LLM inference and serving. We shall go through tuning some parameters to get better performance out of vLLM. The vLLM engine parameters we shall discuss are:

- --max-num-batched-tokens
- --max-model-len
- --gpu-memory-utilization
- --enable-prefix-caching
- --enable-chunked-prefill
- --enforce-eager

max-model-len: TL;DR set it as per your max token usage (input + output).

- By default, the max model length is the max context length of the model you are using, e.g. for the Llama 3 8B Instruct model it would be 8192.
- If you can determine the max context length for your use case and it is less than the max context length of your model, it is better to set this parameter to that value.
- Along with preventing out-of-memory errors while loading or using the model, it will also help when setting other parameters like max-num-batched-tokens and gpu-memory-utilization.
- The value includes both input and output...
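
A minimal sketch of setting these engine parameters through vLLM's Python API; the model name and values below are illustrative assumptions, not tuned recommendations:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=4096,            # cap at your real input+output budget
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may use
    enable_prefix_caching=True,    # reuse KV cache for shared prompt prefixes
    enforce_eager=False,           # keep CUDA graphs enabled
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain the KV cache in one paragraph."], params)
print(outputs[0].outputs[0].text)
```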

Understanding LSTM

[Image: The LSTM cell, by Guillaume Chevalier — File:The_LSTM_Cell.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=109362147]
The Cell State: Since an RNN finds it difficult to carry earlier information across a long input to the final state, an LSTM keeps a separate state, called the cell state, where previously learned information remains available to the model. How this cell state is maintained, and how the model still learns from new inputs, is discussed below. Selectively Removing Information from the Cell State: the Forget Gate Mechanism. At time step t, we have a previous cell state vector (or matrix) c_{t-1}, which has encoded features from all inputs before it. We have a previous hidden state vector h_{t-1}, which encodes the influence of the last input on the long-term cell state. We have a new input vector x_t, which should make the necessary changes to the encoding done so far. Both these vectors are transformed to the same vector space using two (W and U) lea...
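
A small NumPy sketch of the forget gate described above, with illustrative dimensions and random weights (not taken from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, inputs = 4, 3
W_f = np.random.randn(hidden, inputs)    # transforms the new input x_t
U_f = np.random.randn(hidden, hidden)    # transforms the previous hidden state h_{t-1}
b_f = np.zeros(hidden)

x_t    = np.random.randn(inputs)
h_prev = np.random.randn(hidden)         # h_{t-1}
c_prev = np.random.randn(hidden)         # c_{t-1}

# Forget gate: values near 0 erase a cell-state component, values near 1 keep it.
f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
c_after_forget = f_t * c_prev            # cell state after selective forgetting
```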

LLM Web Scraping - Webpage to LLM Friendly Text - Fully Open Source

LLM Web Scraping: Webpage to LLM-Friendly Text. LLMs are good at extracting data from text. So, to scrape any webpage, we provide the webpage text to the LLM in a format that makes the data easy to extract. We use libraries like Selenium and BeautifulSoup to get the page source HTML and extract text from it. This may help extract certain information, but it can't extract image links or website links for the product or information we are extracting. E.g.: while scraping an e-commerce website, if along with details like product title and price you also want the image and the product's main page link, then preprocessing the HTML becomes important. Below I have shared an open-source repository for getting LLM-friendly text from a webpage, which can extract any data including website and image links. APIs like the Jina Reader API and the Firecrawl API can be used to get clean text from any webpage. If you want a completely open-source option and the ability to modify the code as per your need (some webs...
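
A minimal sketch of the preprocessing idea, keeping link and image URLs visible in the text handed to the LLM; this is an illustration, not the linked repository's actual code:

```python
import requests
from bs4 import BeautifulSoup

def page_to_llm_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop non-content elements that only add noise for the LLM.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Keep anchor targets inline so product/page links survive text extraction.
    for a in soup.find_all("a", href=True):
        a.replace_with(f"{a.get_text(strip=True)} ({a['href']})")

    # Keep image sources as explicit placeholders.
    for img in soup.find_all("img", src=True):
        img.replace_with(f"[image: {img['src']}]")

    return soup.get_text(separator="\n", strip=True)

print(page_to_llm_text("https://example.com"))
```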