AI Logic School

Empowering Students with AI & Computational Thinking

Class 12 AI | Unit 2: Data Science Methodology – An Analytic Approach to Capstone Project | Free Notes & Q&A

📚 CBSE Class XII | Subject Code 843

Unit 2: Data Science Methodology
An Analytic Approach

Complete Notes, Q&A, MCQs & Practicals | Session 2025–26 | KVS & CBSE Schools

Topics: Data Science, Model Validation, Confusion Matrix, Precision & Recall, F1-Score, MSE & RMSE

  • Theory Marks: 8
  • Practical Marks: 12
  • Key Steps: 10
  • Metrics: 4
  • Validation Types: 2
  • Subject Code: 843
  • Class: XII
  • Chapter: Unit 2
  • Duration: 20 hrs
  • KV Schedule: Apr–May

Introduction to Data Science Methodology

Data Science Methodology is a structured, repeatable framework that guides the process of solving real-world problems using data. Without a methodology, data science projects often fail due to unclear goals, messy data, or poorly evaluated models.

🎯 Why Do We Need a Methodology?

  • Provides a systematic approach to problem-solving
  • Reduces chances of errors and omissions
  • Aligns technical work with business goals
  • Makes the process repeatable and scalable
  • Helps teams collaborate effectively

10 Steps of Data Science Methodology

The IBM Data Science Methodology (based on CRISP-DM) follows these 10 stages:

1. Business Understanding

Clearly define the problem. Understand what the client needs. Translate business goals into data science objectives.

Example: A hospital wants to reduce patient readmissions. Goal = predict which patients are at risk.

2. Analytic Approach

Decide how to solve the problem based on the question type.

Question Type   | Approach
Yes/No question | Classification
How much/many?  | Regression
Find groups     | Clustering

3. Data Requirements

Identify what data is needed, in what format, and from which sources.

4. Data Collection

Gather data from primary sources (surveys, sensors) and secondary sources (databases, APIs).

5. Data Understanding

Explore data — check types, missing values, outliers, and distributions. Visualize to understand patterns.
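The exploration checks named in this step can be sketched with pandas. This is only an illustration: the table of student records, its column names, and its values are invented, and the 1.5 × IQR rule is one common rule of thumb for spotting outliers, not the only one.

```python
import pandas as pd

# Hypothetical student records; column names and values are illustrative only.
df = pd.DataFrame({
    "attendance": [92, 85, None, 78, 88],
    "marks": [81, 74, 69, 300, 77],   # 300 is a deliberately implausible value
    "section": ["A", "B", "A", "B", "A"],
})

print(df.dtypes)          # data type of each column
print(df.isna().sum())    # number of missing values per column
print(df.describe())      # count, mean, min, max and quartiles of numeric columns

# Flag marks lying more than 1.5 * IQR outside the quartiles.
q1, q3 = df["marks"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["marks"] < q1 - 1.5 * iqr) | (df["marks"] > q3 + 1.5 * iqr)]
print(outliers)
```

Here the missing attendance value and the mark of 300 are the kinds of problems this step is meant to surface before any model is built.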

6. Data Preparation ⭐

Often the most time-consuming step (commonly cited as 70–80% of total project time). Clean, transform, and engineer features.
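A minimal sketch of the three activities (clean, transform, engineer features), assuming a small made-up table. The column names, the median fill, and the "below 90" threshold are illustrative choices, not prescribed ones.

```python
import pandas as pd

# Hypothetical raw data with a missing value, a duplicate row and text categories.
raw = pd.DataFrame({
    "attendance": [92.0, 85.0, None, 85.0],
    "marks": [81, 74, 69, 74],
    "section": ["A", "B", "A", "B"],
})

# Clean: remove exact duplicate rows, then fill the gap with the column median.
clean = raw.drop_duplicates().copy()
clean["attendance"] = clean["attendance"].fillna(clean["attendance"].median())

# Transform: turn text categories into numbers a model can use.
clean["section_code"] = clean["section"].map({"A": 0, "B": 1})

# Engineer: derive a new feature from an existing column.
clean["low_attendance"] = clean["attendance"] < 90

print(clean)
```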

Remember: "Garbage In = Garbage Out". Data quality determines model quality.

7. Modelling

Build the ML model. Select algorithm, train on training data, tune hyperparameters. First model is rarely final — iterate!
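One way to sketch "select algorithm, train, tune hyperparameters" is scikit-learn's GridSearchCV, which tries each combination of candidate settings and scores it with cross-validation. The Iris dataset, the decision tree, and the candidate values below are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is used only as a convenient built-in example dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate hyperparameter values; these particular numbers are arbitrary.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]}

# Try every combination, scoring each with 5-fold cross-validation
# on the training set only.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best settings:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```

The test set is kept out of the search so that the final accuracy still reflects unseen data.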

8. Evaluation

Test model on unseen data. Measure performance using accuracy, precision, recall, F1, MSE, RMSE.

9. Deployment

Integrate the model into real-world systems. Create APIs or dashboards for end users.

10. Feedback

Monitor model, collect feedback, watch for data drift, retrain periodically with new data.


Model Validation Techniques

Model validation checks how well a model generalizes to new, unseen data.

Technique 1: Train-Test Split

Dataset is split into Training Set (70–80%) and Test Set (20–30%).

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Aspect       | Details
Split Ratio  | 80:20 or 70:30 (Train:Test)
Advantage    | Simple and fast
Disadvantage | Results depend on how data is split
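The disadvantage can be seen by repeating the same 80:20 split with different shuffles. This is a sketch under assumptions: the Iris dataset, the decision tree, and the seed values are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # example dataset, chosen only for convenience

scores = []
for seed in [0, 1, 2, 3, 4]:
    # A different random_state shuffles the rows differently before splitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print(scores)   # one test accuracy per split
```

The accuracies typically differ from one seed to the next, which is why a single split gives a less reliable estimate.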

Technique 2: K-Fold Cross Validation

Data is split into K equal folds. Model trains and tests K times; each fold serves as test set once. Final score = average of K evaluations.

With K=5: data splits into 5 folds. Each fold takes turns as the test set. Final score = average of 5 scores.

Criteria    | Train-Test Split | K-Fold CV
Speed       | Faster           | Slower
Reliability | Lower            | Higher
Data Usage  | Partial          | All data
Best For    | Large datasets   | Small datasets
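A minimal sketch of 5-fold cross-validation with scikit-learn. The dataset and model are assumptions for illustration, and shuffling the rows before forming the folds is a choice made here, not a requirement of the technique.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # example dataset, chosen only for convenience

# shuffle=True mixes the rows before cutting them into 5 folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Trains and tests the model 5 times; each fold is the test set exactly once.
fold_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=kfold)

print("Score per fold:", fold_scores)
print("Final score (average):", fold_scores.mean())
```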

Evaluation Metrics – Classification

The Confusion Matrix

A table comparing actual vs. predicted values for a classification model.

                 | Predicted: Positive                     | Predicted: Negative
Actual: Positive | True Positive (TP): correctly said YES  | False Negative (FN): missed a YES
Actual: Negative | False Positive (FP): wrong alarm        | True Negative (TN): correctly said NO
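The same table can be produced with scikit-learn's confusion_matrix. The two label lists below are made up for illustration. Note that scikit-learn orders the classes 0 then 1 by default, so labels=[1, 0] is passed to put the positive class first, as in these notes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for 10 cases: 1 = positive (YES), 0 = negative (NO).
y_actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_predicted = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes.
# labels=[1, 0] lists the positive class first in both directions.
matrix = confusion_matrix(y_actual, y_predicted, labels=[1, 0])
(tp, fn), (fp, tn) = matrix

print(matrix)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```

For these lists the counts are TP = 4, FN = 1, FP = 1, TN = 4.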

Key Metrics

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Overall correctness. Best for balanced datasets; on imbalanced data it can look high even when the model misses most of the rare class.

Precision

Precision = TP / (TP + FP)

Of all predicted positives, how many were correct? Use when False Positives are costly (e.g., spam filter).

Recall (Sensitivity)

Recall = TP / (TP + FN)

Of all actual positives, how many did we catch? Use when False Negatives are costly (e.g., disease detection).

F1-Score ⭐

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of Precision and Recall. Best for imbalanced datasets. Range: 0 to 1 (higher is better).
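The four formulas can be applied directly to confusion-matrix counts. The sketch below uses a hypothetical imbalanced case to show why accuracy alone can mislead. Returning 0.0 when a denominator is zero is a convention assumed here, not part of the formulas.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Apply the Accuracy, Precision, Recall and F1 formulas to the counts."""
    accuracy = safe_divide(tp + tn, tp + tn + fp + fn)
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced case: 95 healthy patients and 5 sick ones,
# with a model that predicts "healthy" for everyone.
lazy_model = classification_metrics(tp=0, tn=95, fp=0, fn=5)
print(lazy_model)
```

Accuracy comes out at 0.95 while recall and F1 are 0.0, because the model never finds a single positive case.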

Evaluation Metrics – Regression

For problems that predict a continuous value (house price, marks, temperature).

MAE – Mean Absolute Error

MAE = (1/n) × Σ |Actual − Predicted|

Average of absolute differences. Less sensitive to outliers than MSE, because errors are not squared.

MSE – Mean Squared Error

MSE = (1/n) × Σ (Actual − Predicted)²

Penalizes large errors more. Sensitive to outliers.

RMSE – Root Mean Squared Error ⭐

RMSE = √MSE

Square root of MSE. Easier to interpret than MSE because it is in the same unit as the target variable.
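A short sketch comparing the three regression metrics on made-up house prices, where one prediction is badly wrong on purpose. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up house prices (in lakhs); the last prediction misses by 30 on purpose.
actual    = np.array([50, 60, 70, 80, 90])
predicted = np.array([52, 58, 71, 79, 60])

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)   # RMSE is the square root of MSE

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
```

MAE is 7.2 while RMSE is about 13.5: the single large miss is squared before averaging, so it pulls MSE and RMSE up much more than it pulls MAE.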


Question & Answer Bank

Very Short Answer (1 Mark)

Q1. What is Data Science Methodology?
A structured, step-by-step framework used to solve real-world problems using data, from understanding the business problem to deploying and monitoring a model.
Q2. What does TP mean in a confusion matrix?
TP = True Positive. It means the model correctly predicted a positive case (predicted YES, and it was actually YES).
Q3. What is RMSE?
RMSE = Root Mean Squared Error = √MSE. It measures the average prediction error for regression models in the same units as the target variable.
Q4. What is overfitting?
Overfitting occurs when a model learns training data too well and performs poorly on new/unseen data. It shows high training accuracy but low test accuracy.
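Overfitting can be observed by comparing training and test accuracy. The sketch below is an illustration under assumptions: the data is synthetic with some deliberately mislabeled rows, and a depth limit of 3 is an arbitrary choice for the second tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y mislabels a share of the rows to act as noise.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can keep splitting until it memorizes the training rows.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting the depth is one simple way to restrain it.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(name, "train:", tree.score(X_train, y_train), "test:", tree.score(X_test, y_test))
```

The unrestricted tree scores perfectly on the training set but noticeably lower on the test set, which is the gap described in the answer to Q4.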

Short Answer (2–3 Marks)

Q5. Explain the difference between Precision and Recall with formulas.
Precision = TP/(TP+FP) — measures what fraction of predicted positives were actually positive. It minimizes false alarms. Use when False Positives are costly (e.g., spam filter).

Recall = TP/(TP+FN) — measures what fraction of actual positives were correctly found. It minimizes missed cases. Use when False Negatives are costly (e.g., disease detection).
Q6. Explain Business Understanding with an example.
Business Understanding is the first step of Data Science Methodology where we clearly define the problem.

Example: A school wants to identify students at risk of failing exams. Business goal = predict which students need extra support. We translate this into: "Classify students as at-risk or not-at-risk based on attendance, marks, and participation."

Long Answer (5 Marks)

Q7. What is the Confusion Matrix? Calculate Accuracy, Precision, Recall and F1-Score from the given data: TP=60, TN=30, FP=10, FN=10.
A confusion matrix is a table that shows TP, TN, FP, FN for a classification model.

Given: TP=60, TN=30, FP=10, FN=10, Total=110

Accuracy = (60+30)/(60+30+10+10) = 90/110 = 81.8%
Precision = 60/(60+10) = 60/70 = 85.7%
Recall = 60/(60+10) = 60/70 = 85.7%
F1-Score = 2×(0.857×0.857)/(0.857+0.857) = 85.7%
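The answer can be cross-checked with scikit-learn by rebuilding label lists that give exactly these counts. The way the lists are built is an illustrative device; only the four counts come from the question.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Rebuild label lists that produce exactly TP=60, TN=30, FP=10, FN=10.
tp, tn, fp, fn = 60, 30, 10, 10
y_actual    = [1] * tp + [0] * tn + [0] * fp + [1] * fn
y_predicted = [1] * tp + [0] * tn + [1] * fp + [0] * fn

accuracy = accuracy_score(y_actual, y_predicted)
precision = precision_score(y_actual, y_predicted)
recall = recall_score(y_actual, y_predicted)
f1 = f1_score(y_actual, y_predicted)

# Print each metric as a percentage rounded to one decimal place.
for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1-Score", f1)]:
    print(name, "=", round(value * 100, 1), "%")
```

The printed values are 81.8, 85.7, 85.7 and 85.7, matching the hand calculation.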

MCQ Practice Bank

1. The first step of Data Science Methodology is:
  • a) Data Collection
  • b) Business Understanding ✓
  • c) Modelling
  • d) Evaluation
2. Precision is defined as:
  • a) TP/(TP+FN)
  • b) TP/(TP+FP) ✓
  • c) (TP+TN)/Total
  • d) TN/(TN+FP)
3. F1-Score is the harmonic mean of:
  • a) Accuracy and Precision
  • b) Precision and Recall ✓
  • c) Recall and Accuracy
  • d) MSE and RMSE
4. Which step of Data Science Methodology takes maximum time?
  • a) Business Understanding
  • b) Modelling
  • c) Data Preparation ✓
  • d) Deployment
5. RMSE equals:
  • a) √MAE
  • b) √MSE ✓
  • c) MSE²
  • d) MAE/n

Practical Activities

Activity 1: Calculate MSE and RMSE (MS Excel)

Student | Actual | Predicted | Error | Error²
1       | 85     | 80        | 5     | 25
2       | 70     | 75        | -5    | 25
3       | 90     | 88        | 2     | 4
4       | 65     | 68        | -3    | 9
5       | 78     | 76        | 2     | 4

Sum of Error² = 67
MSE = 67/5 = 13.4
RMSE = √13.4 ≈ 3.66
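The spreadsheet result can be cross-checked in Python. This sketch mirrors the Error and Error² columns of the activity using the same five students; the variable names are illustrative.

```python
import math

actual    = [85, 70, 90, 65, 78]
predicted = [80, 75, 88, 68, 76]

errors = [a - p for a, p in zip(actual, predicted)]   # the Error column
squared_errors = [e ** 2 for e in errors]             # the Error² column

mse = sum(squared_errors) / len(squared_errors)       # mean of the Error² column
rmse = math.sqrt(mse)

print("Errors        :", errors)
print("Squared errors:", squared_errors)
print("MSE :", mse)
print("RMSE:", round(rmse, 2))
```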

Activity 2: Python – Train-Test Split

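from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Example data (an assumption: any feature matrix X and label vector y will do)
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                   # learn from the training set
y_pred = model.predict(X_test)                # predict labels for the unseen rows
print(classification_report(y_test, y_pred))  # precision, recall and F1 per class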
