CBSE Class 12 AI Unit 2: Data Science Methodology | AI Logic School

📚 CBSE Class XII | Subject Code 843

Unit 2: Data Science Methodology
An Analytic Approach

Complete Notes, Q&A, MCQs & Practicals | Session 2025–26 | KVS & CBSE Schools

Subject Code: 843 | Class: XII | Chapter: Unit 2 | Duration: 20 hrs | KV Schedule: Apr–May

Section 1

Introduction to Data Science Methodology

Data Science Methodology is a structured, repeatable framework that guides the process of solving real-world problems using data. Without a methodology, data science projects often fail due to unclear goals, messy data, or poorly evaluated models.

🎯 Why Do We Need a Methodology?

Provides a systematic approach to problem-solving
Reduces chances of errors and omissions
Aligns technical work with business goals
Makes the process repeatable and scalable
Helps teams collaborate effectively

Section 2

10 Steps of Data Science Methodology

The IBM Data Science Methodology (based on CRISP-DM) follows these 10 stages:

Business Understanding

Clearly define the problem. Understand what the client needs. Translate business goals into data science objectives.

Example: A hospital wants to reduce patient readmissions. Goal = predict which patients are at risk.

Analytic Approach

Decide how to solve the problem based on the question type.

Question Type	Approach
Yes/No question	Classification
How much/many?	Regression
Find groups	Clustering

Data Requirements

Identify what data is needed, in what format, and from which sources.

Data Collection

Gather data from primary sources (surveys, sensors) and secondary sources (databases, APIs).

Data Understanding

Explore data — check types, missing values, outliers, and distributions. Visualize to understand patterns.
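The checks named above can be tried on a small list before moving to a full dataset. The sketch below is only an illustration: the marks are hypothetical, `None` stands for a missing value, and the 1.5 × IQR rule is one common (assumed) way to flag outliers.

```python
import statistics

# Hypothetical exam marks; None marks a missing entry.
marks = [72, 85, None, 64, 90, 78, None, 15, 81, 69]

# Missing values: count the entries that are None.
present = [m for m in marks if m is not None]
missing_count = len(marks) - len(present)

# Distribution: basic summary statistics of the known values.
mean_mark = statistics.mean(present)
median_mark = statistics.median(present)

# Outliers: flag values more than 1.5 * IQR outside the quartiles (assumed rule).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [m for m in present if m < lower or m > upper]

print("Missing values:", missing_count)
print("Mean:", round(mean_mark, 2), "Median:", median_mark)
print("Outliers:", outliers)
```

Here the mark 15 is flagged because it lies far below the rest; whether it is a data-entry error or a genuine score is a question for the Data Preparation step.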

Data Preparation ⭐

Usually the most time-consuming step (commonly cited as 70–80% of total project time). Clean and transform the data, and engineer new features.

Remember: "Garbage In = Garbage Out" — data quality determines model quality.
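A minimal sketch of the three actions in this step (clean, transform, engineer features), using only plain Python. The attendance values, the choice of mean imputation, and the 75% cutoff are all assumptions made for illustration.

```python
# Hypothetical raw attendance percentages; None marks a missing entry.
attendance = [92.0, None, 75.0, 60.0, None, 88.0]

# Step 1 (clean): replace missing values with the mean of the known values.
known = [a for a in attendance if a is not None]
mean_known = sum(known) / len(known)
cleaned = [a if a is not None else mean_known for a in attendance]

# Step 2 (transform): min-max scale every value into the range 0 to 1.
lowest, highest = min(cleaned), max(cleaned)
scaled = [(a - lowest) / (highest - lowest) for a in cleaned]

# Step 3 (engineer a feature): flag low attendance; the 75% cutoff is assumed.
low_attendance = [a < 75.0 for a in cleaned]

print("Cleaned:", cleaned)
print("Scaled:", [round(s, 2) for s in scaled])
print("Low attendance flag:", low_attendance)
```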

Modelling

Build the ML model. Select an algorithm, train it on the training data, and tune its hyperparameters. The first model is rarely final, so iterate!

Evaluation

Test the model on unseen data. Measure performance using accuracy, precision, recall and F1 for classification, or MAE, MSE and RMSE for regression.

Deployment

Integrate the model into real-world systems. Create APIs or dashboards for end users.

Feedback

Monitor model, collect feedback, watch for data drift, retrain periodically with new data.
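One simple way to watch for data drift is to compare a feature's recent values with the values seen during training. The sketch below is an assumption-heavy illustration, not a standard procedure: the ages are hypothetical, the helper `mean_shift_in_std` is made up for this example, and the threshold of 2 standard deviations is an arbitrary choice.

```python
import statistics

def mean_shift_in_std(train_values, new_values):
    """Return how far the new mean has moved, in training standard deviations."""
    train_mean = statistics.mean(train_values)
    train_std = statistics.stdev(train_values)
    return abs(statistics.mean(new_values) - train_mean) / train_std

# Hypothetical patient ages: at training time vs. in recent production data.
train_ages = [34, 45, 52, 61, 48, 39, 57, 50]
recent_ages = [68, 72, 75, 70, 66, 74, 71, 69]

DRIFT_THRESHOLD = 2.0  # assumed cutoff, chosen only for this example
shift = mean_shift_in_std(train_ages, recent_ages)
needs_retraining = shift > DRIFT_THRESHOLD

print("Shift (in std devs):", round(shift, 2))
print("Retrain suggested:", needs_retraining)
```

If the recent patients are much older than the ones the model learned from, its predictions may no longer be reliable, which is why periodic retraining is part of this step.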

Section 3

Model Validation Techniques

Model validation checks how well a model generalizes to new, unseen data.

Technique 1: Train-Test Split

Dataset is split into Training Set (70–80%) and Test Set (20–30%).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Aspect	Details
Split Ratio	80:20 or 70:30 (Train:Test)
Advantage	Simple and fast
Disadvantage	Results depend on how data is split
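To see what a train-test split does, here is a plain-Python sketch of the idea: shuffle the rows, then cut off a fraction for testing. The function `simple_train_test_split` is a hypothetical helper written for this note; it is not scikit-learn's implementation.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # stand-in for 10 rows of data
train, test = simple_train_test_split(records, test_size=0.2)

print("Train rows:", len(train))
print("Test rows:", len(test))
```

Changing `seed` changes which rows land in the test set, which is exactly the disadvantage listed in the table: the score depends on how the data happened to be split.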

Technique 2: K-Fold Cross Validation

Data is split into K equal folds. The model is trained and tested K times: each round trains on K−1 folds and tests on the remaining fold, so every fold serves as the test set exactly once. Final score = average of the K evaluations.

With K=5: data splits into 5 folds. Each fold takes turns as the test set. Final score = average of 5 scores.

Criteria	Train-Test Split	K-Fold CV
Speed	Faster	Slower
Reliability	Lower	Higher
Data Usage	Partial	All data
Best For	Large datasets	Small datasets
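The fold rotation can be sketched in plain Python. `k_fold_indices` is a hypothetical helper for illustration only; it does not shuffle the data, which a real project would usually do first, and the per-fold scores are made-up numbers used just to show the averaging step.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for round_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Round {round_no}: test on {test_idx}, train on {len(train_idx)} rows")

# Hypothetical accuracy from each of the 5 rounds; the final score is their mean.
fold_scores = [0.80, 0.85, 0.75, 0.90, 0.80]
final_score = sum(fold_scores) / len(fold_scores)
print("Final score:", round(final_score, 2))
```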

Section 4

Evaluation Metrics – Classification

The Confusion Matrix

A table comparing actual vs. predicted values for a classification model.

	Predicted: Positive	Predicted: Negative
Actual: Positive	True Positive (TP) Correctly said YES	False Negative (FN) Missed a YES
Actual: Negative	False Positive (FP) Wrong alarm	True Negative (TN) Correctly said NO
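The four cells of the table can be counted directly from two lists of labels. In this sketch 1 means positive and 0 means negative; the labels and the helper name `confusion_counts` are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN by comparing actual and predicted labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # correctly said YES
        elif a != positive and p == positive:
            fp += 1   # wrong alarm
        elif a == positive and p != positive:
            fn += 1   # missed a YES
        else:
            tn += 1   # correctly said NO
    return tp, fp, fn, tn

actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```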

Key Metrics

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Overall correctness. Best for balanced datasets; on imbalanced data a model can score high accuracy simply by always predicting the majority class.

Precision

Precision = TP / (TP + FP)

Of all predicted positives, how many were correct? Use when False Positives are costly (e.g., spam filter).

Recall (Sensitivity)

Recall = TP / (TP + FN)

Of all actual positives, how many did we catch? Use when False Negatives are costly (e.g., disease detection).

F1-Score ⭐

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of Precision and Recall. Best for imbalanced datasets. Range: 0 to 1 (higher is better).
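The four formulas can be put into one small function. The counts below describe a hypothetical imbalanced dataset (only 8 of 100 cases are actually positive), and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced data: 8 actual positives, 92 actual negatives.
metrics = classification_metrics(tp=5, tn=90, fp=2, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out high because the many true negatives dominate, while F1 is much lower because it looks only at how well the positive class is handled. This is why F1 is preferred on imbalanced data.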

Section 5

Evaluation Metrics – Regression

For problems that predict a continuous value (house price, marks, temperature).

MAE – Mean Absolute Error

MAE = (1/n) × Σ |Actual − Predicted|

Average of absolute differences. Less sensitive to outliers than MSE, because errors are not squared.

MSE – Mean Squared Error

MSE = (1/n) × Σ (Actual − Predicted)²

Penalizes large errors more. Sensitive to outliers.

RMSE – Root Mean Squared Error ⭐

RMSE = √MSE

Square root of MSE. Expressed in the same unit as the target variable, which makes it easier to interpret than MSE.
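The three formulas translate directly into Python. The house-price-style numbers below are hypothetical; the second set of predictions contains one large miss to show how each metric reacts to an outlier.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average of |Actual - Predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean Squared Error: average of (Actual - Predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of MSE."""
    return math.sqrt(mse(actual, predicted))

actual       = [50, 60, 70, 80]
close        = [52, 58, 71, 79]   # every prediction is close
with_outlier = [52, 58, 71, 40]   # last prediction is off by 40

for label, preds in [("close", close), ("with outlier", with_outlier)]:
    print(f"{label}: MAE={mae(actual, preds):.2f}, "
          f"MSE={mse(actual, preds):.2f}, RMSE={rmse(actual, preds):.2f}")
```

A single bad prediction raises MAE from 1.5 to 11.25, but raises MSE from 2.5 to 402.25, because squaring magnifies large errors.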

Section 6

Question & Answer Bank

Very Short Answer (1 Mark)

Q1. What is Data Science Methodology?

A structured, step-by-step framework used to solve real-world problems using data, from understanding the business problem to deploying and monitoring a model.

Q2. What does TP mean in a confusion matrix?

TP = True Positive. It means the model correctly predicted a positive case (predicted YES, and it was actually YES).

Q3. What is RMSE?

RMSE = Root Mean Squared Error = √MSE. It measures the average prediction error for regression models in the same units as the target variable.

Q4. What is overfitting?

Overfitting occurs when a model learns the training data too closely, including its noise, and therefore performs poorly on new/unseen data. It shows high training accuracy but low test accuracy.

Short Answer (2–3 Marks)

Q5. Explain the difference between Precision and Recall with formulas.

Precision = TP/(TP+FP): the fraction of predicted positives that were actually positive. High precision means few false alarms. Use when False Positives are costly (e.g., spam filter).

Recall = TP/(TP+FN): the fraction of actual positives that were correctly found. High recall means few missed cases. Use when False Negatives are costly (e.g., disease detection).

Q6. Explain Business Understanding with an example.

Business Understanding is the first step of Data Science Methodology where we clearly define the problem.

Example: A school wants to identify students at risk of failing exams. Business goal = predict which students need extra support. We translate this into: "Classify students as at-risk or not-at-risk based on attendance, marks, and participation."

Long Answer (5 Marks)

Q7. What is the Confusion Matrix? Calculate Accuracy, Precision, Recall and F1-Score from the given data: TP=60, TN=30, FP=10, FN=10.

A confusion matrix is a table that shows TP, TN, FP, FN for a classification model.

Given: TP=60, TN=30, FP=10, FN=10, Total=110

Accuracy = (60+30)/(60+30+10+10) = 90/110 = 81.8%
Precision = 60/(60+10) = 60/70 = 85.7%
Recall = 60/(60+10) = 60/70 = 85.7%
F1-Score = 2×(0.857×0.857)/(0.857+0.857) = 85.7%
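The working above can be cross-checked with a few lines of Python. The last line uses the equivalent form F1 = 2TP / (2TP + FP + FN), which follows from substituting the Precision and Recall formulas into the F1 formula.

```python
# Counts given in Q7.
tp, tn, fp, fn = 60, 30, 10, 10
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form of F1 written directly in terms of the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"Total = {total}")
print(f"Accuracy = {accuracy:.1%}")
print(f"Precision = {precision:.1%}")
print(f"Recall = {recall:.1%}")
print(f"F1-Score = {f1:.1%} (from counts: {f1_from_counts:.1%})")
```

Because FP and FN are equal here, Precision and Recall are equal, and their harmonic mean (F1) is the same value.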

Section 7

MCQ Practice Bank

1. The first step of Data Science Methodology is:

a) Data Collection
b) Business Understanding ✓
c) Modelling
d) Evaluation

2. Precision is defined as:

a) TP/(TP+FN)
b) TP/(TP+FP) ✓
c) (TP+TN)/Total
d) TN/(TN+FP)

3. F1-Score is the harmonic mean of:

a) Accuracy and Precision
b) Precision and Recall ✓
c) Recall and Accuracy
d) MSE and RMSE

4. Which step of Data Science Methodology takes maximum time?

a) Business Understanding
b) Modelling
c) Data Preparation ✓
d) Deployment

5. RMSE equals:

a) √MAE
b) √MSE ✓
c) MSE²
d) MAE/n

Section 8

Practical Activities

Activity 1: Calculate MSE and RMSE (MS Excel)

Student	Actual	Predicted	Error	Error²
1	85	80	5	25
2	70	75	-5	25
3	90	88	2	4
4	65	68	-3	9
5	78	76	2	4
Sum				67

MSE = 67/5 = 13.4
RMSE = √13.4 ≈ 3.66
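The spreadsheet result can be verified in Python by rebuilding the Error and Error² columns row by row. This is only a cross-check of the table above; the variable names are chosen for this example.

```python
import math

# Rows copied from the Activity 1 table: (student, actual, predicted).
rows = [(1, 85, 80), (2, 70, 75), (3, 90, 88), (4, 65, 68), (5, 78, 76)]

print("Student\tActual\tPredicted\tError\tError^2")
sum_squared = 0
for student, actual, predicted in rows:
    error = actual - predicted      # Error column
    squared = error ** 2            # Error^2 column
    sum_squared += squared
    print(f"{student}\t{actual}\t{predicted}\t{error}\t{squared}")

mse_value = sum_squared / len(rows)
rmse_value = math.sqrt(mse_value)

print(f"Sum of squared errors = {sum_squared}")
print(f"MSE = {mse_value:.1f}")
print(f"RMSE = {rmse_value:.2f}")
```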

Activity 2: Python – Train-Test Split

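from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Load a sample dataset (Iris is an assumption, used only so the script runs as-is)
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree on the training set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the unseen test set and report precision, recall and F1 per class
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))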
