Anand Singh, Author at AlfaTechLab

Linear Regression

July 12, 2025 by Anand Singh

🔷 Introduction:

Linear Regression is one of the simplest and most widely used supervised learning algorithms.
Its goal is to predict a continuous value, such as:

House price
Student marks
Employee salary

🔶 Formula:

🎯 Prediction Function:

$$\hat{y} = w \cdot x + b$$

where:

x = input
w = weight
b = bias
$\hat{y}$ = predicted output

🔧 Uses:

Domain	Example
Real Estate	Predicting house prices
Education	Estimating marks
Health	Disease severity score

🔢 Cost Function (Loss):

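Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where n is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.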
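As a minimal sketch, the prediction function ŷ = w·x + b and the MSE loss can be computed by hand in plain Python. The parameter values below (w=2.0 and w=1.0) are illustrative assumptions, not values from a trained model.

```python
def predict(x, w, b):
    """Linear prediction: y_hat = w * x + b."""
    return w * x + b


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Hypothetical parameters: one perfect fit and one poor fit.
perfect = [predict(x, w=2.0, b=0.0) for x in xs]
poor = [predict(x, w=1.0, b=0.0) for x in xs]

print(mse(ys, perfect))  # 0.0
print(mse(ys, poor))     # 7.5
```

Training amounts to searching for the w and b that push this number as low as possible.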

🔬 Linear Regression in PyTorch

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Dummy dataset
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]], dtype=torch.float32)
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]], dtype=torch.float32)

# Linear Regression Model
model = nn.Linear(1, 1)

# Loss and Optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop
epochs = 1000
for epoch in range(epochs):
    y_pred = model(X)
    loss = criterion(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

# Prediction
with torch.no_grad():
    test = torch.tensor([[5.0]])
    pred = model(test)
    print("Prediction for 5.0:", pred.item())

# Visualize
predicted = model(X).detach()
plt.scatter(X, y, label='Original')
plt.plot(X, predicted, label='Fitted line', color='red')
plt.legend()
plt.show()

📊 Summary Table:

Element	Description
Model Type	Regression
Input	Continuous/Real number
Output	Continuous value
Loss Function	Mean Squared Error (MSE)
Optimizer	SGD, Adam
Library Used	PyTorch

📝 Practice Questions:

What is the goal of Linear Regression?
Explain what the model formula $\hat{y} = w \cdot x + b$ means.
Why is MSE (Mean Squared Error) used?
What does nn.Linear() do in PyTorch?
What is the role of the optimizer?

Introduction to Supervised Learning Algorithms

July 12, 2025 by Anand Singh

Supervised Learning is a technique in which a model is trained on data that contains both the inputs and the correct output (label) for each example.
Example:

Input (Features)	Output (Label)
Age = 30, Salary = ₹40k	Loan approved (Yes)

Next, we look at the main algorithms most commonly used in Supervised Learning.

🔷 🔹 Why Supervised Algorithms?

Feature	Benefit
Input-output mapping defined	Easy to train and evaluate
Works for both Classification & Regression	Many versatile models are available
Scalability	Applies to datasets from small to large

🔶 The two main types of Supervised Learning algorithms:

Type	Purpose	Examples
Classification	Predicting a label	Email Spam, Disease Detection
Regression	Predicting a value	House Price, Stock Prediction

🔷 1. Linear Regression

📌 Use:

Continuous value prediction
(e.g., house price, temperature)

🧮 Formula:

y = w*x + b

✅ Python Example:

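from sklearn.linear_model import LinearRegression

# Like the other snippets in this post, this assumes X_train, y_train and X_test
# already exist (e.g., produced by sklearn's train_test_split)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)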
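For a version that runs end to end, here is a minimal sketch on synthetic data. The data (a noise-free line y = 3x + 5) and the split settings are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): y = 3x + 5 with no noise.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 5.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# coef_ holds w and intercept_ holds b from y = w*x + b.
print("w:", model.coef_[0], "b:", model.intercept_)
print("R^2 on the test set:", model.score(X_test, y_test))
```

Because the data lies exactly on a line, the fitted w and b should come out very close to 3 and 5; real data would include noise and give a less perfect fit.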

🔷 2. Logistic Regression

📌 Use:

Binary Classification (Yes/No)

✅ Output:

A probability between 0 and 1, which is then compared against a threshold to make the decision

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
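The probability-then-threshold idea can be sketched as follows. The toy data (hours studied versus pass/fail) and the stricter 0.8 threshold are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): hours studied -> fail (0) / pass (1).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Column 1 of predict_proba holds the estimated P(class == 1).
proba = model.predict_proba(X)[:, 1]

# For a binary problem, predict() corresponds to a 0.5 threshold.
default_labels = (proba >= 0.5).astype(int)

# A stricter, hypothetical threshold labels fewer samples as positive.
strict_labels = (proba >= 0.8).astype(int)

print(proba.round(3))
print(default_labels, strict_labels)
```

Raising the threshold is one way to trade recall for precision when false positives are costly.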

🔷 3. Decision Tree

📌 Use:

Works for both Classification and Regression
Makes decisions by repeatedly splitting the data.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
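To see the repeated splitting, one option is to print a small tree's learned rules. This sketch assumes scikit-learn's bundled Iris dataset and a depth limit of 2, both chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 is an illustrative choice that keeps the printed tree small.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# export_text renders each split as an indented if/else style rule.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one split on a feature threshold, and the leaves show the predicted class.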

🔷 4. Random Forest

📌 What is it?

An ensemble of multiple Decision Trees
Produces its output by voting (classification) or averaging (regression).

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
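A minimal runnable sketch of the ensemble idea, assuming the Iris dataset and a 70/30 split for illustration. Note that scikit-learn's classifier combines the trees' probability estimates rather than counting hard votes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# The fitted forest exposes its individual trees.
n_trees = len(model.estimators_)
accuracy = model.score(X_test, y_test)

print("Number of trees:", n_trees)
print("Test accuracy:", accuracy)
print("Feature importances:", model.feature_importances_)
```

The feature importances sum to 1 and give a rough view of which inputs the trees relied on most.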

🔷 5. Support Vector Machine (SVM)

📌 Use:

Well suited to classification on high-dimensional datasets

from sklearn.svm import SVC

model = SVC(kernel='linear')
model.fit(X_train, y_train)

🔷 6. K-Nearest Neighbors (KNN)

📌 Use:

Instance-based learning: no explicit model is built during training; at prediction time it looks at the K nearest neighbors.

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
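One common way to pick K is to compare a few candidates with cross-validation. The sketch below assumes the Iris dataset, 5 folds, and an arbitrary candidate list; none of these come from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate K values (an illustrative choice); odd values help avoid ties.
scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("Best K:", best_k)
```

Because KNN relies on distances, scaling the features first usually matters on real data.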

🔷 7. Naive Bayes

📌 Use:

Text classification such as spam detection
(based on statistical probability)

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)
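GaussianNB expects continuous numeric features. For text such as spam detection, the multinomial variant over word counts is the more usual choice. A minimal sketch, using four made-up messages as training data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages: 1 = spam, 0 = not spam.
texts = [
    "win a free prize now",
    "free money click now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
labels = [1, 1, 0, 0]

# CountVectorizer turns text into word counts; MultinomialNB models those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["free prize money", "team meeting tomorrow"])
print(pred)  # [1 0]
```

Swapping this pipeline in place of GaussianNB is an assumption about what suits text data, not something the original snippet does.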

📊 Summary Table:

Algorithm	Type	Strengths	Use Case
Linear Regression	Regression	Simple, fast	Price prediction
Logistic Regression	Classification	Probabilistic output	Spam detection
Decision Tree	Both	Interpretability	Credit approval
Random Forest	Both	Accuracy, less prone to overfitting than a single tree	Medical diagnosis
SVM	Classification	Works in high dimensions	Face recognition
KNN	Classification	No explicit training phase, easy to implement	Pattern recognition
Naive Bayes	Classification	Fast, good for text	Sentiment analysis

📝 Practice Questions:

What is the difference between Linear Regression and Logistic Regression?
Why is Random Forest often considered better than a single Decision Tree?
How does SVM perform classification?
How is K chosen in KNN?
When does Naive Bayes perform well, and when does it perform poorly?

Feature Selection & Feature Extraction

July 12, 2025 by Anand Singh

In machine learning, choosing the right features and creating new, useful ones can substantially improve a model's efficiency and accuracy. The process has two parts:
🔹 Feature Selection
🔹 Feature Extraction

🔷 Why Feature Selection & Extraction?

Reason	Benefit
Less Complexity	The model is simpler and faster
Avoids overfitting	Removing unnecessary features can improve accuracy on unseen data
Better Performance	Keeping relevant features gives better results
Easier visualization	Reducing dimensionality makes the data easier to understand

🔶 1. Feature Selection

📌 What is it?

Choosing the most important and useful features in the data and removing the rest. This tends to make the model faster, more accurate, and simpler.

✅ Main methods:

Method	Description
Filter Methods	Select features using statistics such as correlation or chi-square
Wrapper Methods	Train a model on candidate feature subsets and keep the best one (e.g., RFE)
Embedded Methods	The model selects features itself during training (e.g., Lasso, Decision Trees)

🛠️ Python Code Example (Correlation Method):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns (assumes df is an existing pandas DataFrame)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
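Beyond a correlation heatmap, the filter and wrapper approaches from the table can be sketched with scikit-learn. This assumes the Iris dataset, an ANOVA F-test as the filter score, and logistic regression as the wrapped model, all chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score each feature independently, keep the top 2.
filt = SelectKBest(score_func=f_classif, k=2).fit(X, y)
filter_mask = filt.get_support()

# Wrapper method: RFE repeatedly fits the model and drops the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
wrapper_mask = rfe.support_

print("Filter keeps: ", filter_mask)
print("Wrapper keeps:", wrapper_mask)
```

The two masks need not agree, since a filter scores features one at a time while a wrapper judges them through a model.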

🔶 2. Feature Extraction

📌 What is it?

Creating new, meaningful features from the existing ones, or compressing the features into fewer dimensions.

Examples:
Image data → raw pixels are converted into CNN features
Text data → TF-IDF vectors or word embeddings are built

✅ Main methods:

Method	Description
PCA (Principal Component Analysis)	Variance-preserving compressed representation
LDA (Linear Discriminant Analysis)	Reduces features to improve class separation
Autoencoders	Deep-learning-based compressed features
TF-IDF / Word2Vec	Builds semantic features from text

🛠️ Python Code Example (PCA):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1: Scaling (assumes df is an existing DataFrame with only numeric columns)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Step 2: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Reduced Features:\n", X_pca)
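A self-contained variant of the same two steps, assuming the Iris dataset as input, which also reports how much variance the two components keep:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so no feature dominates because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Compress 4 features into 2 principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

kept = pca.explained_variance_ratio_.sum()
print("Shape after PCA:", X_pca.shape)
print("Variance kept per component:", pca.explained_variance_ratio_)
print("Total variance kept:", kept)
```

Checking explained_variance_ratio_ is a simple way to judge whether the chosen number of components is enough.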

🔍 Feature Selection vs Feature Extraction

Comparison	Feature Selection	Feature Extraction
Goal	Choose the best existing features	Create new, meaningful features
Feature Count	Reduced (a subset of the originals)	A different set of features is produced
Technique Examples	Correlation, RFE, Lasso	PCA, Autoencoders, Word2Vec
Easy to interpret	Yes	Not always (e.g., with PCA)

📊 Summary Table:

Task	Tool/Technique
Selection	Correlation, Chi-square, RFE
Embedded	Lasso, Decision Tree
Extraction	PCA, LDA, Autoencoder
Text Extraction	TF-IDF, Word2Vec, BERT

📝 Practice Questions:

What is the difference between Feature Selection and Feature Extraction?
What is PCA used for, and when is it applied?
What is the difference between a Wrapper method and a Filter method?
How is an Autoencoder used for feature extraction?
Give an example of an Embedded Method.

Data Cleaning and Normalization

July 12, 2025 by Anand Singh

The success of any machine learning model depends on how clean and well-balanced the data you give it is.
Dirty data = wrong model
So we first clean the data and then normalize it.

🔷 🔹 Why Clean & Normalize Data?

Reason	Benefit
Removing missing/erroneous values	Improves performance during training
Balancing feature scales	Lets the model learn from all features on an equal footing
Reducing bias	Keeps any single feature from having an outsized influence

🔶 1. Data Cleaning

🧹 What is it?

Removing or correcting wrong, incomplete, or inconsistent information in the data.

✅ Main tasks:

Technique	Use Case
Missing Value Handling	Fill or drop Null/NaN values
Outlier Removal	Remove extremely high or low values
Duplicate Removal	Remove repeated rows
Type Conversion	Convert String → Int/Float
Inconsistent Label Fixing	e.g., unify "Male", "male", "MALE" into one label

🛠️ Python/Pandas Code:

import pandas as pd

df = pd.read_csv("data.csv")

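# Fill null values in numeric columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Remove invalid values (assumes an "age" column; ages must be positive)
df = df[df["age"] > 0]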
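Two tasks from the table, type conversion and inconsistent label fixing, can be sketched as below. The small DataFrame is made up for illustration, and filling with the median is just one reasonable choice.

```python
import pandas as pd

# Made-up data with inconsistent labels and numbers stored as strings.
df = pd.DataFrame({
    "gender": ["Male", "male", "MALE ", "Female"],
    "age": ["25", "31", "unknown", "40"],
})

# Inconsistent label fixing: trim whitespace, then lower-case.
df["gender"] = df["gender"].str.strip().str.lower()

# Type conversion: unparseable entries such as "unknown" become NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Missing value handling: fill the gap with the column median (31.0 here).
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```

Using errors="coerce" keeps the conversion from failing on a single bad entry and turns the problem into an ordinary missing value.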

🔶 2. Normalization

📌 What is it?

Bringing all numerical features onto a common scale (such as 0 to 1) so that no single feature dominates.

Example: if one feature ranges from 1 to 10 and another from 1,000 to 100,000, the second will influence scale-sensitive models (such as KNN or models trained with gradient descent) far more. Normalization removes this imbalance.

✅ Main methods:

Technique	Description	Formula
Min-Max Scaling	Scales values to the range 0 to 1	`X' = (X - min) / (max - min)`
Z-Score Standardization	Sets the mean to 0 and the standard deviation to 1	`X' = (X - μ) / σ`
Robust Scaling	Based on the median and IQR	`X' = (X - median) / IQR`

🛠️ Sklearn Code:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[['age', 'salary']])
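To make the three formulas concrete, here is a minimal NumPy sketch applied by hand to a made-up array containing one outlier (100.0):

```python
import numpy as np

# Made-up values; 100.0 is a deliberate outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-Max: X' = (X - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: X' = (X - mean) / std
z_score = (x - x.mean()) / x.std()

# Robust: X' = (X - median) / IQR, where IQR = Q3 - Q1
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(min_max.round(3))
print(z_score.round(3))
print(robust)  # [-1.  -0.5  0.   0.5 48.5]
```

With Min-Max scaling the outlier squeezes the other four values close to 0, while Robust Scaling keeps them evenly spread, which is why it is often preferred when outliers are present.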

📊 Summary Table:

Step	Purpose	Tool/Technique
Missing Values	Fill or drop NaN	fillna(), dropna()
Outliers	Protect model accuracy	IQR, Z-score
Scaling	Put all features on an equal footing	MinMaxScaler, StandardScaler
Duplicates	Remove repeated data	drop_duplicates()
Type Conversion	Bring data into the right format	astype(), to_numeric()

📝 Practice Questions:

What are the different ways to handle missing values?
What is the difference between Z-score and Min-Max scaling?
When is Robust Scaling useful?
Explain the difference between fillna() and dropna().
Why is normalization needed?

Types of Data

July 12, 2025 by Anand Singh

The success of any machine learning model depends on the kind of data it is given.
Data comes in many types, such as numerical, categorical, image, or text. Each type calls for different techniques and modeling approaches.

🔷 🔹 Why Understand Data Types?

Reason	Benefit
Choosing the right preprocessing	Correct methods for encoding, scaling, etc.
Model compatibility	Knowing which model works better with which data
Visualization & analysis	Makes it possible to extract the right insights

🔶 1. Structured Data

Organized as a table (rows and columns)
Comes from sources such as Excel, CSV, and SQL databases

✅ Example:

Name	Age	Gender	Salary
Raj	25	Male	₹30,000

🔶 2. Semi-structured Data

Has some degree of structure
But no rigid format
Often in key-value form

✅ Examples:

XML, JSON, YAML

{
  "name": "Raj",
  "age": 25,
  "salary": 30000
}
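Semi-structured records like this are often flattened into a table before modeling. A minimal sketch, where the second record ("Asha", with no salary) is a made-up addition to show that fields may be missing:

```python
import json

import pandas as pd

# Two JSON records; the second one is hypothetical and lacks "salary".
raw = '[{"name": "Raj", "age": 25, "salary": 30000}, {"name": "Asha", "age": 31}]'

records = json.loads(raw)

# Each key becomes a column; missing keys become NaN.
df = pd.DataFrame(records)
print(df)
```

The NaN in the resulting table is exactly the kind of gap that the data cleaning step then has to handle.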

🔶 3. Unstructured Data

Does not follow any fixed format
Hard for a machine to interpret directly

✅ Examples:

Text (e.g. tweets, reviews)
Images
Audio / Video

🔶 4. Types of data by statistical nature:

Data type	Description	Examples
🔹 Numerical	Numeric values	Age, salary
🔹 Categorical	Categories	Gender, City
🔹 Ordinal	Ordered categories	Rank (High, Medium, Low)
🔹 Time Series	Time-based	Stock prices
🔹 Text	Word-based	Chat messages
🔹 Image	Picture-based	Face detection
🔹 Audio	Sound-based	Voice command
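The data type often decides the encoding. As a rough sketch with made-up values, an unordered categorical column can be one-hot encoded, while an ordinal column is mapped to integers that preserve its order:

```python
import pandas as pd

# Made-up data: "city" has no natural order, "rank" does.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi"],
    "rank": ["Low", "High", "Medium"],
})

# Categorical (nominal): one column per category.
city_encoded = pd.get_dummies(df["city"], prefix="city")

# Ordinal: explicit mapping so that Low < Medium < High is preserved.
order = {"Low": 0, "Medium": 1, "High": 2}
df["rank_code"] = df["rank"].map(order)

print(city_encoded)
print(df)
```

Mapping an unordered column such as city to 0, 1, 2 would suggest an order that does not exist, which is why the two cases are treated differently.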

📊 Summary Table:

Type	Format	Example	ML Techniques
Structured	Tables	CSV, Excel	Supervised Learning
Semi-Structured	Key-Value	JSON/XML	NLP, API Parsing
Unstructured	Free-form	Text, Image	Deep Learning
Numerical	Numbers	Salary, Height	Regression
Categorical	Labels	Gender, City	Classification
Ordinal	Ordered Labels	Low < Medium < High	Ranking Models
Time Series	Indexed by time	Stock, Sensor	RNN, LSTM
Text	Sentence/word	Reviews, Chat	NLP (BERT, RNN)
Image	Pixels	Photos	CNN
Audio	Frequency	Voice	Audio Processing (WaveNet, etc.)

📝 Practice Questions:

What is the difference between Structured and Unstructured data?
Give 2 examples of Semi-structured data.
What is the difference between Numerical and Ordinal data?
Which kinds of models is Time Series data suited to?
What kind of data do models like ChatGPT or Alexa work on?