Anand Singh, Author at AlfaTechLab

Pretrained Models (VGG, ResNet, Inception, BERT)

July 11, 2025 by Anand Singh

Now let's look at pretrained models in deep learning, which are the backbone of transfer learning.
These models have already been trained on very large datasets and can be reused across a variety of tasks.

🔶 1. What are Pretrained Models?

Pretrained models are deep learning architectures that have already been trained on a large dataset (such as ImageNet or Wikipedia).
You can reuse them to:

Perform feature extraction
Fine-tune them for a new task
Perform zero-shot tasks (some models only)

🎯 Why do they matter?

✅ Save time and computation
✅ Better performance, especially on small datasets
✅ Standardize common architectures
✅ Serve as the basis for foundation models

🔷 A. Pretrained Models in Computer Vision

1. VGGNet

🧠 Developed by: Visual Geometry Group, Oxford
📆 Year: 2014
📐 Architecture: Simple CNNs with 3×3 convolutions
🧱 Versions: VGG-16, VGG-19
⚠️ Downside: Large number of parameters, slow

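from torchvision import models
# torchvision >= 0.13 selects weights via `weights`; older releases used pretrained=True
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)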

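2. ResNet (Residual Network)

🧠 By: Microsoft Research
📆 Year: 2015
🔁 Key Idea: Residual (skip) connections that let very deep networks train
🧱 Versions: ResNet-18, 34, 50, 101, 152

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)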
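ResNet's defining idea is the residual (skip) connection listed in the comparison table. A minimal sketch of one residual block follows; `BasicResidualBlock` is a hypothetical, simplified class written for illustration and is not the exact block used inside torchvision's `resnet50`.

```python
import torch
import torch.nn as nn


class BasicResidualBlock(nn.Module):
    """Hypothetical minimal residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                              # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # add the input back in


block = BasicResidualBlock(channels=16)
x = torch.randn(2, 16, 32, 32)
y = block(x)
print(y.shape)  # same shape as the input
```

Because the input is added back unchanged, gradients have a direct path around the convolutions, which is what makes very deep stacks of such blocks trainable.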

3. Inception (GoogLeNet)

🧠 By: Google
📆 Year: 2014
🔄 Inception Module: Multiple filter sizes in parallel
🧠 Deep but Efficient
📊 Version: Inception-v1, v2, v3, v4

inception = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
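The "multiple filter sizes in parallel" idea can be sketched as a small module whose branches run side by side and are concatenated along the channel dimension. `TinyInceptionModule` and its channel counts are hypothetical and heavily simplified (no activations or batch norm); the real Inception-v3 blocks are more elaborate.

```python
import torch
import torch.nn as nn


class TinyInceptionModule(nn.Module):
    """Hypothetical simplified Inception-style module with four parallel branches."""

    def __init__(self, in_channels, branch_channels):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1),  # cheap channel reduction
            nn.Conv2d(branch_channels, branch_channels, kernel_size=3, padding=1),
        )
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_channels, branch_channels, kernel_size=1),
            nn.Conv2d(branch_channels, branch_channels, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, branch_channels, kernel_size=1),
        )

    def forward(self, x):
        branches = [
            self.branch1x1(x),
            self.branch3x3(x),
            self.branch5x5(x),
            self.branch_pool(x),
        ]
        # Every branch keeps the spatial size, so they can be stacked channel-wise.
        return torch.cat(branches, dim=1)


module = TinyInceptionModule(in_channels=32, branch_channels=8)
x = torch.randn(2, 32, 28, 28)
y = module(x)
print(y.shape)  # torch.Size([2, 32, 28, 28])
```

The 1×1 convolutions in front of the larger filters are what keep the module "deep but efficient": they shrink the channel count before the expensive 3×3 and 5×5 convolutions run.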

🔷 B. Pretrained Models in Natural Language Processing (NLP)

4. BERT (Bidirectional Encoder Representations from Transformers)

🧠 By: Google AI
📆 Year: 2018
🔍 Key Idea: Bidirectional context + Masked Language Modeling
🌍 Trained On: Wikipedia + BookCorpus
✅ Used for: Text classification, Q&A, NER, etc.
🔁 Fine-tuned separately for each downstream task

from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
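To see what a BERT encoder returns, the sketch below runs a forward pass and inspects the output shapes. So that it runs offline, it builds a deliberately tiny, randomly initialised model from a `BertConfig` with made-up sizes and feeds it random token ids; these sizes and inputs are assumptions for illustration. With the pretrained `bert-base-uncased` model above, the ids would come from the tokenizer and the hidden size would be 768.

```python
import torch
from transformers import BertConfig, BertModel

# A tiny, randomly initialised BERT (hypothetical sizes) so the example
# needs no download; real use would call BertModel.from_pretrained(...).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)
model.eval()

# Two fake "sentences" of 8 token ids each, standing in for tokenizer output.
input_ids = torch.randint(0, config.vocab_size, (2, 8))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

print(outputs.last_hidden_state.shape)  # one vector per token: (2, 8, 32)
print(outputs.pooler_output.shape)      # one vector per sequence: (2, 32)
```

For text classification a small linear layer is typically placed on top of the per-sequence vector, while token-level tasks such as NER use the per-token vectors.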

📊 Comparison Table

Model	Domain	Strengths	Weakness
VGG	Vision	Simplicity	Too many parameters
ResNet	Vision	Deep + residual connections	Slightly complex
Inception	Vision	Multi-scale processing	Harder to modify
BERT	NLP	Powerful language understanding	Large memory usage

🧠 Use Cases of Pretrained Models

Task	Model
Image Classification	ResNet, VGG
Object Detection	Faster R-CNN with ResNet
Semantic Segmentation	DeepLab, U-Net
Sentiment Analysis	BERT
Machine Translation	T5, mBART
Question Answering	BERT, RoBERTa

📝 Practice Questions

What is a pretrained model?
What is the difference between VGG and ResNet?
What is the purpose of the Inception module?
How does BERT capture context?
Which pretrained models are common in vision and NLP?

🧠 Summary

Feature	Vision	NLP
Basic CNN	VGG	–
Deep Network	ResNet	BERT
Advanced Structure	Inception	Transformer variants
Library	`torchvision`	`transformers` (HuggingFace)

Feature Extraction vs Fine-Tuning

July 11, 2025 by Anand Singh

Transfer learning has two main approaches:
👉 Feature Extraction
👉 Fine-Tuning
Let's look at both in detail so that you can pick the right approach for the right situation.

🔶 1. What is Feature Extraction?

Feature extraction means freezing the earlier layers of a pretrained model (convolutional or transformer blocks) and using them as a fixed feature extractor.

📌 Only the final classification layer is trained on the new data.

✅ Use when:

You have little training data
The pretrained features are sufficient for your new task
You need to train the model quickly

🧱 Example:

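from torchvision import models
import torch.nn as nn

# torchvision >= 0.13 selects weights via `weights`; older releases used pretrained=True
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze every pretrained parameter (the whole backbone)
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer; the new layer is trainable by default
model.fc = nn.Linear(model.fc.in_features, 2)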

🔶 2. What is Fine-Tuning?

In fine-tuning, we unfreeze some or all layers of the pretrained model and retrain them as well, so that they can adapt to the new task.

📌 This approach is more compute-intensive but often gives better results.

✅ Use when:

You have more training data
The new task differs substantially from the original task
You want high accuracy

🧱 Example:

# Assumes `model` is the ResNet-18 from the feature-extraction example
# Train only the last residual stage (layer4) and the classifier (fc)
for name, param in model.named_parameters():
    if "layer4" in name or "fc" in name:
        param.requires_grad = True
    else:
        param.requires_grad = False

🔁 3. Comparison: Feature Extraction vs Fine-Tuning

Feature	Feature Extraction	Fine-Tuning
Layers Frozen	All convolutional layers	None or only some (the rest are trained)
Train Time	Fast	Slow
Data Requirement	Low	High
Flexibility	Limited	High
Performance (accuracy)	Moderate	Better (especially for complex tasks)
When to use	Little data, similar task	More data, different task

🧠 Visual Diagram:

[Pretrained Model]
   ↓
 ┌─────────────┐      ┌───────────────┐
 │ Conv Layers │ ───► │ Classifier    │  → Output
 └─────────────┘      └───────────────┘
 ↑        ↑
 |        └─ Trained (Fine-tuning)
 └─ Frozen (Feature Extraction)

📝 Practice Questions:

What is the difference between feature extraction and fine-tuning?
When should you use feature extraction?
When should you avoid fine-tuning?
In PyTorch, how can you check which layers of a model are frozen or unfrozen?
Which technique requires more computation?

🎯 Summary

Point	Feature Extraction	Fine-Tuning
Training Layers	Classifier only	Some or all layers
Data Need	Low	High
Accuracy	Moderate	Better
Computation	Low	High
Flexibility	Low	High

What is Transfer Learning?

July 11, 2025 by Anand Singh

Now let's look at one of the most powerful and useful concepts in deep learning:
🔄 Transfer Learning, which has made model training faster, easier, and often more accurate.

🔶 1. Definition:

Transfer learning is an approach in which a model is first trained on one task
and then reused for a different task, often with far less training data.

🎯 “Applying previously learned knowledge to a new task.”

🧠 2. Traditional Learning vs Transfer Learning

Traditional Learning	Transfer Learning
The model is trained from scratch for every new task	A pretrained model is fine-tuned on the new task
Requires lots of data	Requires less data
Time & compute heavy	Fast training
Prone to overfitting when data is scarce	Often reaches high accuracy with less effort

🔁 3. How does it work?

Step-by-Step:

Train a model on a large dataset (for example, a CNN on ImageNet)
Take that trained model (for example ResNet, VGG, or BERT)
Replace the final layer to match your new task
Fine-tune the model (train it briefly on the new data)

📊 4. Visual Example:

[Train on ImageNet] → [ResNet with learned weights]  
                             ↓
          Remove final layer (1000 classes)  
                             ↓
        Add new final layer (e.g., 2 classes: Dog vs Cat)  
                             ↓
                 Train on small new dataset

🔧 5. PyTorch Example (CNN):

from torchvision import models
import torch.nn as nn

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # torchvision >= 0.13

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for binary classification
model.fc = nn.Linear(model.fc.in_features, 2)

🎯 6. Use Cases of Transfer Learning

Domain	Example
Computer Vision	Image classification, object detection
NLP	Text classification, question answering (e.g., BERT)
Audio	Speech recognition
Medical	Cancer detection from X-rays
Robotics	Control from simulation to real-world

🔍 7. Why is it so powerful?

✅ Saves training time
✅ Reduces need for large datasets
✅ Achieves high accuracy
✅ Pre-trained models often generalize well
✅ Enables democratization of deep learning

🧪 8. Famous Pre-trained Models

Domain	Model	Pre-trained On
Vision	ResNet, VGG, MobileNet	ImageNet
NLP	BERT, GPT, RoBERTa	Large text corpora (e.g., Wikipedia, BookCorpus)
Audio	Wav2Vec, Whisper	Large speech corpora

📝 Practice Questions:

What is transfer learning, and what are its benefits?
How is transfer learning better than traditional learning?
Give real-world use cases of transfer learning.
How do you modify a pretrained model in PyTorch?
Which pretrained models are well known?

📌 Summary

Concept	Description
Transfer Learning	Reusing a previously trained model for a new task
Advantage	Good results with little data
Process	Pre-train → Modify → Fine-tune
Use Cases	Vision, NLP, Audio, Healthcare

LSTM and GRU Networks

July 11, 2025 by Anand Singh

(Networks with long memory: the evolution of RNNs)

Now let's look at two powerful upgrades to Recurrent Neural Networks (RNNs):
👉 LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit), which address RNN problems such as the vanishing gradient.

🔶 1. Why LSTM and GRU?

RNNs cannot process very long sequences well because the gradients vanish.
To overcome this problem, architectures with gating mechanisms were developed:

Problem	Solution
Memory fades	Memory Cells (LSTM)
Gradient vanishes	Gates control flow (LSTM/GRU)

🧠 2. LSTM (Long Short-Term Memory)

📌 Introduced by: Hochreiter & Schmidhuber (1997)

LSTM is a special RNN architecture that uses a memory cell.
It has three main gates that control how much information to keep, how much to forget, and how much to output.

🔹 LSTM Cell Diagram:

         ┌────────────┐
         │  Forget    │   → decides what to forget
x_t ──►──┤   Gate     ├──┐
         └────────────┘  │
                         ▼
         ┌────────────┐
         │  Input     │   → decides what new info to store
         │   Gate     ├──┐
         └────────────┘  │
                         ▼
         ┌────────────┐
         │ Cell State │  ← updated memory
         └────────────┘
                         ▲
         ┌────────────┐  │
         │  Output    │  └──→ h_t (output/hidden)
         │   Gate     ├──────►
         └────────────┘

🔹 LSTM Equations:

Let’s denote:
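$x_t$ as the input at time step $t$, $h_{t-1}$ and $c_{t-1}$ as the previous hidden and cell states, $\sigma$ as the sigmoid function, and $\odot$ as element-wise multiplication. The weight matrices $W_*$, $U_*$ and biases $b_*$ are notation assumed here; the equations below follow the commonly used LSTM formulation and are a sketch rather than this post's own derivation:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state / output)}
\end{aligned}
```

The cell state update is additive: when $f_t$ is close to 1, $c_{t-1}$ passes through almost unchanged, which is what lets information (and gradients) survive across many time steps.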

🧠 3. GRU (Gated Recurrent Unit)

📌 Introduced by: Cho et al. (2014)

GRU was designed to be simpler and faster than LSTM. It has only two gates:

Update Gate (z)
Reset Gate (r)

GRU has no separate memory cell; the hidden state itself serves as the memory.

🔹 GRU Equations:
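A commonly used form of the GRU update is sketched below. The symbols $W_*$, $U_*$, $b_*$ for weights and biases are notation assumed here, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

Conventions differ between sources: some, including PyTorch's `nn.GRU`, swap the roles of $z_t$ and $1 - z_t$ in the last line and apply the reset gate after the matrix multiplication. The behaviour is equivalent in spirit, since the update gate interpolates between the old state and the candidate.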

🔄 4. LSTM vs GRU – Comparison Table

Feature	LSTM	GRU
Gates	3 (Forget, Input, Output)	2 (Reset, Update)
Cell State	Yes (separate from h)	No (merged with h)
Complexity	Higher	Lower
Speed	Slower (more parameters)	Faster
Performance	Good for longer sequences	Comparable on many tasks
Use Case	Text, speech, time-series	Similar, but with simpler models

🔧 5. PyTorch Example: LSTM and GRU

LSTM:

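import torch
import torch.nn as nn

# batch_first=True: inputs are shaped (batch, seq_len, features)
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
inputs = torch.randn(5, 8, 10)   # batch of 5 sequences, 8 steps, 10 features
h0 = torch.zeros(1, 5, 20)       # (num_layers, batch, hidden_size)
c0 = torch.zeros(1, 5, 20)

output, (hn, cn) = lstm(inputs, (h0, c0))   # output: (5, 8, 20)

GRU:

gru = nn.GRU(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
output, hn = gru(inputs, h0)     # GRU has no cell state, so only h0 is passed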
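The "more parameters" entry in the comparison table can be checked directly by counting the parameters of the two layers with the same sizes as above. `count_parameters` is a hypothetical helper introduced for this sketch.

```python
import torch.nn as nn


def count_parameters(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
gru = nn.GRU(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

lstm_params = count_parameters(lstm)
gru_params = count_parameters(gru)
print(lstm_params, gru_params)  # 2560 1920
```

An LSTM layer holds four weight blocks (three gates plus the candidate memory) and a GRU holds three, so for equal sizes the GRU has three quarters of the LSTM's parameters.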

📈 6. When to Use Which?

Scenario	Use
Complex dependencies	LSTM
Faster training needed	GRU
Simpler datasets	GRU
Large vocabulary/text	LSTM
Low memory environment	GRU

📝 Practice Questions:

What is the difference between LSTM and GRU?
What does the forget gate do in an LSTM?
Why does GRU have no cell state?
How is the output of an LSTM computed?
What are the advantages and disadvantages of LSTM and GRU?

🎯 Summary

Model	Memory	Gates	Use Case
RNN	Short	None	Simple sequences
LSTM	Long	3	Complex dependencies
GRU	Medium	2	Fast + good performance

Vanishing Gradient Problem in RNNs

July 11, 2025 by Anand Singh

(The vanishing gradient in RNNs: causes and solutions)

Now we will look at the biggest problem with RNNs, the one that makes deep RNNs hard to train:
🧨 Vanishing Gradient Problem

🔶 1. What is the Vanishing Gradient Problem?

When an RNN is trained, we use backpropagation through time (BPTT) to compute the gradient at every time step.

But as the sequence gets longer and we propagate gradients backward through more steps,
the gradient's magnitude becomes very small (near zero).
👉 This is called the vanishing gradient.
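The effect can be observed with autograd. The sketch below runs a plain `nn.RNN` over a 50-step random sequence, backpropagates from the final time step only, and compares how much gradient reaches the first and the last input step. The layer sizes, sequence length, and seed are arbitrary assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len = 50
rnn = nn.RNN(input_size=4, hidden_size=16, batch_first=True)  # tanh by default

inputs = torch.randn(1, seq_len, 4, requires_grad=True)
outputs, _ = rnn(inputs)

# Use only the final time step as the "loss", then backpropagate through time.
outputs[:, -1].sum().backward()

# Gradient magnitude of that loss with respect to each input time step.
grad_per_step = inputs.grad.norm(dim=2).squeeze(0)
print(f"gradient norm at first step: {grad_per_step[0].item():.3e}")
print(f"gradient norm at last step:  {grad_per_step[-1].item():.3e}")
```

With PyTorch's default initialization, the gradient that reaches the first step is typically many orders of magnitude smaller than the one at the last step, so early inputs barely influence the weight updates.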

🧮 2. Technical Explanation

In an RNN, the hidden state is updated as:
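In the standard vanilla (Elman) RNN formulation, with $W_{hh}$, $W_{xh}$ and $b_h$ as notation assumed here, the update and the gradient that BPTT carries from step $T$ back to an earlier step $k$ can be written as:

```latex
\begin{aligned}
h_t &= \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \\
\frac{\partial h_T}{\partial h_k}
  &= \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}
   = \prod_{t=k+1}^{T} \operatorname{diag}\!\big(1 - h_t^{2}\big)\, W_{hh}
\end{aligned}
```

Here $h_t^{2}$ is taken element-wise, and $1 - h_t^{2}$ is the derivative of $\tanh$, which never exceeds 1. When each factor in the product has norm below 1, the product shrinks exponentially in $T - k$, which is the vanishing gradient; when the factors are larger than 1, it grows exponentially instead (the exploding gradient).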

⚠️ 3. Effects of Vanishing Gradient

Effect	Description
No learning	Nothing is learned from older inputs
Short memory	The RNN relies only on recent inputs
Shallow reasoning	Cannot capture long-term dependencies
Poor performance	Especially in long sequences (e.g. paragraph-level text)

📉 4. Visualization

Imagine a per-step gradient factor of 0.8
→ Backprop through 50 steps: 0.8^50 ≈ 1.4 × 10^-5

The gradient gets very close to 0
→ The model forgets older words/steps.
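The arithmetic behind this can be reproduced in a few lines. The extra factors 0.5, 1.0 and 1.2 are added here as assumptions for contrast; they are not part of the original example.

```python
steps = 50

# Per-step gradient factors: below 1 shrinks, exactly 1 is stable, above 1 grows.
factors = [0.5, 0.8, 1.0, 1.2]
gradient_after = {factor: factor ** steps for factor in factors}

for factor, value in gradient_after.items():
    print(f"{factor} ** {steps} = {value:.3e}")
```

A factor of 0.8 leaves roughly 1.4e-5 of the original gradient after 50 steps, while a factor slightly above 1 grows into the thousands, which is the exploding-gradient case.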

🧪 5. Real-life Example

Suppose you are given this sentence:

“The movie was long, but in the end, it was incredibly good.”

We want the model to predict the word “good”.

A vanilla RNN may latch onto a few words such as “long” or “but” and guess a negative word,
because it cannot carry the full context from the start of the sentence once the gradient vanishes.

🧯 6. How to Solve Vanishing Gradient?

Solution	Description
✅ LSTM (Long Short-Term Memory)	Introduces gates to control memory
✅ GRU (Gated Recurrent Unit)	Simpler than LSTM, effective
🔁 Gradient Clipping	Caps the gradient norm (mainly guards against exploding gradients, the opposite failure)
⏫ ReLU Activations	Less prone to vanishing than tanh
🧠 Better Initialization	Xavier/He initialization
🧱 Skip Connections	As in ResNet

🧠 7. Summary Table

Feature	Normal RNN	LSTM/GRU
Memory	Short-term only	Long + short term
Gradient stability	Poor	Better
Sequence length handling	Weak	Strong
Complexity	Low	Medium to High

🔧 PyTorch: Gradient Clipping Example

from torch.nn.utils import clip_grad_norm_

# Call after loss.backward() and before optimizer.step()
clip_grad_norm_(model.parameters(), max_norm=1.0)
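For context, here is where the clipping call sits inside a full training step. This is a minimal sketch: the LSTM, the random data, and the deliberately large targets (used to provoke large gradients) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)

model = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(5, 8, 10)
targets = torch.randn(5, 8, 20) * 100  # large targets to provoke large gradients

optimizer.zero_grad()
outputs, _ = model(inputs)
loss = nn.functional.mse_loss(outputs, targets)
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
norm_before = clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
optimizer.step()

print(f"norm before clipping: {norm_before.item():.3f}")
print(f"norm after clipping:  {norm_after.item():.3f}")
```

Clipping only rescales gradients that are too large; it does not enlarge gradients that have already vanished, so it complements gated architectures rather than replacing them.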

📝 Practice Questions:

What is the vanishing gradient?
Why does this problem occur in RNNs?
How does it affect the model's memory?
How can this problem be addressed?
How do LSTM and GRU combat this problem?

🎯 Summary

Concept	Explanation
Vanishing Gradient	The gradient becomes very small
Result	The model forgets older information
Main Cause	Repeated multiplication of factors smaller than 1 across time steps
Solutions	LSTM, GRU, ReLU, better initialization (clipping for exploding gradients)