Class 12 AI | Unit 2: Data Science Methodology – An Analytic Approach to Capstone Project | Free Notes & Q&A
Unit 2: Data Science Methodology
An Analytic Approach
Complete Notes, Q&A, MCQs & Practicals | Session 2025–26 | KVS & CBSE Schools
Introduction to Data Science Methodology
Data Science Methodology is a structured, repeatable framework that guides the process of solving real-world problems using data. Without a methodology, data science projects often fail due to unclear goals, messy data, or poorly evaluated models.
🎯 Why Do We Need a Methodology?
- Provides a systematic approach to problem-solving
- Reduces chances of errors and omissions
- Aligns technical work with business goals
- Makes the process repeatable and scalable
- Helps teams collaborate effectively
10 Steps of Data Science Methodology
The IBM Data Science Methodology (based on CRISP-DM) follows these 10 stages:
Business Understanding
Clearly define the problem. Understand what the client needs. Translate business goals into data science objectives.
Analytic Approach
Decide how to solve the problem based on the question type.
| Question Type | Approach |
|---|---|
| Yes/No question | Classification |
| How much/many? | Regression |
| Find groups | Clustering |
Data Requirements
Identify what data is needed, in what format, and from which sources.
Data Collection
Gather data from primary sources (surveys, sensors) and secondary sources (databases, APIs).
Data Understanding
Explore data — check types, missing values, outliers, and distributions. Visualize to understand patterns.
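As a rough illustration of this step, the sketch below counts missing values and prints basic statistics for each column using only the Python standard library. The `records` list, the column names `attendance` and `marks`, and the helper `summarize` are all made up for this example; a real project would more likely use a library such as pandas.

```python
from statistics import mean, median

# Hypothetical toy data: None marks a missing value.
records = [
    {"attendance": 92, "marks": 78},
    {"attendance": 85, "marks": None},
    {"attendance": None, "marks": 64},
    {"attendance": 40, "marks": 35},
    {"attendance": 88, "marks": 81},
]


def summarize(rows, column):
    """Count missing values and compute simple statistics for one column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


summary = {col: summarize(records, col) for col in ("attendance", "marks")}
for col, stats in summary.items():
    print(col, stats)
```

A large gap between the mean and the median (as for `attendance` here) is one quick hint that a column may contain an outlier worth inspecting.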
Data Preparation ⭐
Usually the most time-consuming step (often cited as 70–80% of total project time). Clean the data, transform it into a usable format, and engineer new features.
Modelling
Build the ML model. Select algorithm, train on training data, tune hyperparameters. First model is rarely final — iterate!
Evaluation
Test the model on unseen data. Measure performance using accuracy, precision, recall, and F1 for classification, or MAE, MSE, and RMSE for regression.
Deployment
Integrate the model into real-world systems. Create APIs or dashboards for end users.
Feedback
Monitor model, collect feedback, watch for data drift, retrain periodically with new data.
Model Validation Techniques
Model validation checks how well a model generalizes to new, unseen data.
Technique 1: Train-Test Split
The dataset is split into a Training Set (70–80%), used to fit the model, and a Test Set (20–30%), held back to evaluate it.
| Aspect | Details |
|---|---|
| Split Ratio | 80:20 or 70:30 (Train:Test) |
| Advantage | Simple and fast |
| Disadvantage | Results depend on how data is split |
Technique 2: K-Fold Cross Validation
Data is split into K equal folds. The model is trained and tested K times: each time it trains on K−1 folds and is tested on the remaining fold, so every fold serves as the test set exactly once. Final score = average of the K evaluations.
| Criteria | Train-Test Split | K-Fold CV |
|---|---|---|
| Speed | Faster | Slower |
| Reliability | Lower | Higher |
| Data Usage | Partial | All data |
| Best For | Large datasets | Small datasets |
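The fold rotation described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper names `k_fold_indices` and `cross_val_mean` are made up for this example, and the data is not shuffled before splitting (scikit-learn's `KFold` in `sklearn.model_selection` offers shuffling and is the usual choice in practice).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_val_mean(scores):
    """Final K-Fold score: the average of the K per-fold scores."""
    return sum(scores) / len(scores)


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In a full run, a model would be trained on `train_idx` and scored on `test_idx` in each iteration, and the K scores passed to `cross_val_mean`.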
Evaluation Metrics – Classification
The Confusion Matrix
A table comparing actual vs. predicted values for a classification model.
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP): correctly said YES | False Negative (FN): missed a YES |
| Actual: Negative | False Positive (FP): wrong alarm | True Negative (TN): correctly said NO |
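To see how the four cells are filled in, the sketch below counts them from two lists of labels. The function name `confusion_counts` and the sample labels are invented for illustration, and the positive class is assumed to be encoded as `1`.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN by comparing each actual label with its prediction."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # correctly said YES
        elif a != positive and p == positive:
            fp += 1  # wrong alarm
        elif a == positive and p != positive:
            fn += 1  # missed a YES
        else:
            tn += 1  # correctly said NO
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels: 1 = positive, 0 = negative.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```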
Key Metrics
Accuracy
Overall correctness. Best for balanced datasets.
Precision
Of all predicted positives, how many were correct? Use when False Positives are costly (e.g., spam filter).
Recall (Sensitivity)
Of all actual positives, how many did we catch? Use when False Negatives are costly (e.g., disease detection).
F1-Score ⭐
Harmonic mean of Precision and Recall. Best for imbalanced datasets. Range: 0 to 1 (higher is better).
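For reference, the four metrics above can be written in terms of the confusion matrix counts as follows. These are the standard definitions; the Precision and Recall forms match the ones used in the Q&A and MCQ sections of these notes.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```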
Evaluation Metrics – Regression
For problems that predict a continuous value (house price, marks, temperature).
MAE – Mean Absolute Error
Average of the absolute differences between actual and predicted values. Less sensitive to outliers than MSE.
MSE – Mean Squared Error
Penalizes large errors more. Sensitive to outliers.
RMSE – Root Mean Squared Error ⭐
Square root of MSE. Easy to interpret because it is in the same unit as the target variable.
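The three regression metrics can be written as formulas. The notation here is an assumption, not taken from the notes: $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of observations.

```latex
\begin{align*}
\text{MAE}  &= \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\text{MSE}  &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \\
\text{RMSE} &= \sqrt{\text{MSE}}
\end{align*}
```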
Question & Answer Bank
Very Short Answer (1 Mark)
Short Answer (2–3 Marks)
Recall = TP/(TP+FN) measures what fraction of actual positives were correctly found. A high recall means few positive cases are missed. Use it when False Negatives are costly (e.g., disease detection).
Example: A school wants to identify students at risk of failing exams. Business goal = predict which students need extra support. We translate this into: "Classify students as at-risk or not-at-risk based on attendance, marks, and participation."
Long Answer (5 Marks)
Given: TP=60, TN=30, FP=10, FN=10, Total=110
Accuracy = (60+30)/(60+30+10+10) = 90/110 = 81.8%
Precision = 60/(60+10) = 60/70 = 85.7%
Recall = 60/(60+10) = 60/70 = 85.7%
F1-Score = 2×(0.857×0.857)/(0.857+0.857) = 85.7%
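The calculation above can be checked with a few lines of Python. The function name `classification_metrics` is made up for this sketch, and it assumes none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts.

    Assumes every denominator is non-zero (no zero-division handling).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1,
    }


# Counts from the worked example: TP=60, TN=30, FP=10, FN=10.
metrics = classification_metrics(tp=60, tn=30, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because Precision and Recall are equal in this example, the F1-Score equals both of them.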
MCQ Practice Bank
- a) Data Collection
- b) Business Understanding ✓
- c) Modelling
- d) Evaluation
- a) TP/(TP+FN)
- b) TP/(TP+FP) ✓
- c) (TP+TN)/Total
- d) TN/(TN+FP)
- a) Accuracy and Precision
- b) Precision and Recall ✓
- c) Recall and Accuracy
- d) MSE and RMSE
- a) Business Understanding
- b) Modelling
- c) Data Preparation ✓
- d) Deployment
- a) √MAE
- b) √MSE ✓
- c) MSE²
- d) MAE/n
Practical Activities
Activity 1: Calculate MSE and RMSE (MS Excel)
| Student | Actual | Predicted | Error | Error² |
|---|---|---|---|---|
| 1 | 85 | 80 | 5 | 25 |
| 2 | 70 | 75 | -5 | 25 |
| 3 | 90 | 88 | 2 | 4 |
| 4 | 65 | 68 | -3 | 9 |
| 5 | 78 | 76 | 2 | 4 |
| Sum | | | | 67 |

MSE = 67/5 = 13.4
RMSE = √13.4 ≈ 3.66
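The same activity can be cross-checked in Python using the actual and predicted marks from the table. This is a minimal sketch with the standard library only; the variable names are chosen for this example, and MAE is included for comparison even though the activity does not ask for it.

```python
import math

# Values copied from the Activity 1 table.
actual = [85, 70, 90, 65, 78]
predicted = [80, 75, 88, 68, 76]

# Error = Actual - Predicted, matching the "Error" column.
errors = [a - p for a, p in zip(actual, predicted)]
squared_errors = [e ** 2 for e in errors]

n = len(errors)
mae = sum(abs(e) for e in errors) / n
mse = sum(squared_errors) / n
rmse = math.sqrt(mse)

print("Errors:", errors)
print("Sum of squared errors:", sum(squared_errors))
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```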
Activity 2: Python – Train-Test Split
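from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
# Example data (assumption: the Iris dataset stands in for your own X and y)
X, y = load_iris(return_X_y=True)
# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a decision tree on the training set only
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
# Predict on the unseen test set and print precision, recall and F1 per class
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))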