Data Science with Python
Data Science is the process of collecting, cleaning, analysing and visualising data to find useful insights and make decisions. Python is one of the most popular languages for Data Science, largely because of libraries such as NumPy, Pandas and Matplotlib.
NumPy — Numerical Python
You recorded the temperature in your city for 7 days: 38, 40, 37, 42, 39, 41, 36°C. NumPy finds the average, maximum and minimum temperature with one short function call each, like a super calculator for lists of numbers!
import numpy as np
# 7 days temperature data (in Celsius)
temperature = np.array([38, 40, 37, 42, 39, 41, 36])
print("All temperatures:", temperature)
print("Average temp: ", np.mean(temperature)) # 39.0
print("Maximum temp: ", np.max(temperature)) # 42
print("Minimum temp: ", np.min(temperature)) # 36
print("Std Deviation: ", np.std(temperature)) # spread of data
# Array operations - add 2 degrees to all values
print("If +2 degrees: ", temperature + 2)
| Function | What it does | Example |
|---|---|---|
| np.array() | Create an array | np.array([1,2,3]) |
| np.mean() | Calculate average | np.mean([10,20,30]) → 20 |
| np.median() | Find middle value | np.median([1,3,5]) → 3 |
| np.std() | Standard deviation | np.std([2,4,4,4,5,5,7,9]) → 2 |
| np.var() | Variance | np.var(data) |
| np.arange() | Create number sequence | np.arange(1,10,2) → [1,3,5,7,9] |
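The functions in the table that the temperature example did not use can be tried on the same readings. This is a minimal sketch that reuses the seven values from above; the variable names are chosen here for illustration only.

```python
import numpy as np

# Same 7-day temperature readings as in the example above (Celsius)
temperature = np.array([38, 40, 37, 42, 39, 41, 36])

median_temp = np.median(temperature)   # middle value of the sorted readings
variance = np.var(temperature)         # average squared distance from the mean
std_dev = np.std(temperature)          # square root of the variance

print("Median temp:", median_temp)     # 39.0
print("Variance:   ", variance)        # 4.0
print("Std dev:    ", std_dev)         # 2.0

# np.arange(start, stop) stops just before the stop value
day_numbers = np.arange(1, 8)          # days 1 to 7
print("Day numbers:", day_numbers)
```

Note that np.std() and np.var() divide by the number of values by default, which is why the variance here is exactly 4.0.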
Pandas — Data Analysis Library
Think of Pandas like a super Excel sheet in Python. Your school has data of 500 students — name, marks, attendance, class. Pandas lets you load this data, filter students who scored above 80%, find the class average, and sort by marks — all with just a few lines of code!
import pandas as pd
# Create a student DataFrame (like a table)
data = {
    'Name': ['Aarav', 'Priya', 'Rahul', 'Sneha', 'Arjun'],
    'Marks': [85, 92, 78, 95, 88],
    'Subject': ['AI', 'AI', 'AI', 'AI', 'AI'],
    'Class': [11, 11, 11, 11, 11]
}
df = pd.DataFrame(data)
print(df) # Show full table
print("\nAverage marks:", df['Marks'].mean())
print("Top scorer:", df.loc[df['Marks'].idxmax(), 'Name'])  # row with the highest marks
# Filter students with marks above 85
toppers = df[df['Marks'] > 85]
print("\nToppers:\n", toppers)
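Sorting by marks, mentioned in the introduction to this section, is not shown in the example above. A short sketch using the same five students could look like this; the names ranked, top_three and class_average are illustrative, not part of Pandas.

```python
import pandas as pd

# Same students and marks as in the example above
df = pd.DataFrame({
    'Name': ['Aarav', 'Priya', 'Rahul', 'Sneha', 'Arjun'],
    'Marks': [85, 92, 78, 95, 88],
})

# Sort from highest to lowest marks and renumber the rows
ranked = df.sort_values('Marks', ascending=False).reset_index(drop=True)
print(ranked)

# Names of the three highest scorers
top_three = ranked['Name'].head(3).tolist()
print("Top three:", top_three)          # ['Sneha', 'Priya', 'Arjun']

class_average = df['Marks'].mean()
print("Class average:", class_average)  # 87.6
```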
CSV (Comma Separated Values) is a simple file format for storing data — like a spreadsheet saved as plain text. Example: student_data.csv contains hundreds of rows of student information.
import pandas as pd
# Load data from CSV file
df = pd.read_csv('student_data.csv')
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
print(df.shape) # (rows, columns)
print(df.describe()) # Statistics summary
print(df.isnull().sum()) # Check missing values
# Save to new CSV
df.to_csv('cleaned_data.csv', index=False)
| Function | Purpose |
|---|---|
| df.head(n) | Show first n rows (default 5) |
| df.describe() | Show count, mean, std, min, quartiles and max of numeric columns |
| df.shape | Returns (number of rows, number of columns) |
| df.isnull() | Find missing/empty values in data |
| df.dropna() | Remove rows with missing values |
| df.fillna(value) | Fill missing values with a specific value |
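To see the missing-value functions from the table in action, the sketch below reads a tiny CSV from a string instead of a file. The CSV content is hypothetical and only stands in for student_data.csv; the choice to fill missing marks with the column average is one option among many.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for student_data.csv
csv_text = """Name,Marks,Attendance
Aarav,85,90
Priya,,95
Rahul,78,
Sneha,95,88
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: remove every row that has a missing value
dropped = df.dropna()
print(dropped.shape)    # (2, 3)

# Option 2: fill the gaps, using a different value per column
filled = df.fillna({'Marks': df['Marks'].mean(), 'Attendance': 0})
print(filled)
```

Dropping rows is simple but throws data away, while filling keeps every student at the cost of inserting values that were never recorded.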
Matplotlib — Data Visualization
Your teacher wants to show how the class performed in 5 subjects. Instead of showing a boring table of numbers, Matplotlib creates a bar graph or pie chart — making it easy to see which subject students did best in!
import matplotlib.pyplot as plt
months = ['Jan','Feb','Mar','Apr']
sales = [200, 350, 300, 450]
plt.plot(months, sales, marker='o')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
import matplotlib.pyplot as plt
subjects = ['Maths', 'AI', 'English', 'Science', 'Hindi']
marks = [85, 92, 78, 88, 76]
plt.bar(subjects, marks, color='teal')
plt.title('Subject-wise Marks')
plt.ylabel('Marks')
plt.show()
import matplotlib.pyplot as plt
labels = ['Pass','Fail','Absent']
sizes = [75, 15, 10]
colors = ['green','red','gray']
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%')
plt.title('Exam Results')
plt.show()
import matplotlib.pyplot as plt
marks = [45, 55, 60, 65, 70, 70, 75, 80, 80, 80, 85, 90, 90, 95]
plt.hist(marks, bins=5, color='steelblue')
plt.title('Marks Distribution')
plt.xlabel('Marks Range')
plt.ylabel('Number of Students')
plt.show()
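A histogram groups values into ranges (bins) and counts how many fall into each. To check the numbers behind the bars without drawing anything, the same binning can be done with np.histogram, which plt.hist uses internally. This is a small sketch on the marks list above.

```python
import numpy as np

# Same marks as in the histogram example above
marks = [45, 55, 60, 65, 70, 70, 75, 80, 80, 80, 85, 90, 90, 95]

# counts = students per bin, edges = where each bin starts and ends
counts, edges = np.histogram(marks, bins=5)

for count, low, high in zip(counts, edges[:-1], edges[1:]):
    print(f"{low:.0f} to {high:.0f}: {count} students")
```

With bins=5 the range from 45 to 95 is split into five equal ranges of 10 marks, and the counts come out as 1, 2, 3, 4 and 4.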
import matplotlib.pyplot as plt
study_hours = [2,3,4,5,6,7,8]
exam_marks = [50,55,65,70,78,85,92]
plt.scatter(study_hours, exam_marks, color='purple')
plt.title('Study Hours vs Marks')
plt.xlabel('Hours Studied')
plt.ylabel('Marks Scored')
plt.show()
Statistics for Data Science
Virat Kohli scored these runs in 7 matches: 45, 82, 67, 23, 95, 56, 38. Let's calculate the key statistics to understand his performance!
import numpy as np
from scipy import stats
# Virat's runs in 7 matches
runs = np.array([45, 82, 67, 23, 95, 56, 38])
mean = np.mean(runs) # Average: 58.0
median = np.median(runs) # Middle value: 56.0
mode_val = stats.mode(runs) # Most frequent value (every score appears once here, so the mode is not meaningful)
std_dev = np.std(runs) # How spread out: ~23.4
variance = np.var(runs) # Variance: ~546.3
print(f"Mean (Average): {mean:.1f} runs")
print(f"Median (Middle): {median:.1f} runs")
print(f"Standard Deviation: {std_dev:.1f} runs")
print(f"Variance: {variance:.1f}")
# Interpretation:
# High std deviation = inconsistent player
# Low std deviation = consistent player
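The interpretation above is easier to see with two sets of scores side by side. In this sketch the second player's runs are invented for illustration and were chosen so that both players have the same average.

```python
import numpy as np

virat_runs = np.array([45, 82, 67, 23, 95, 56, 38])    # from the example above
steady_runs = np.array([55, 60, 58, 62, 57, 59, 55])   # hypothetical player

for name, runs in [("Virat", virat_runs), ("Steady player", steady_runs)]:
    # Same mean for both, very different spread
    print(f"{name}: mean = {np.mean(runs):.1f}, std = {np.std(runs):.1f}")
```

Both players average 58 runs, but the hypothetical player's standard deviation is far smaller, so the mean alone does not show who is more consistent.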
Mean: Sum of all values ÷ count. Like sharing pizza equally among friends.
Median: Middle value when sorted. Used for salaries to avoid the effect of very high earners.
Mode: Most frequent value. Like the most popular shoe size in a shop.
Standard Deviation: How spread out data is from the mean. Low = consistent, High = spread out.
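These four measures can also be computed with Python's built-in statistics module, which handles the mode directly. The shoe sizes below are made-up values chosen so that one size clearly repeats.

```python
import statistics

# Hypothetical shoe sizes sold in a shop
shoe_sizes = [7, 8, 8, 9, 8, 10, 7]

mean_size = sum(shoe_sizes) / len(shoe_sizes)   # sum of all values ÷ count
median_size = statistics.median(shoe_sizes)     # middle value when sorted
mode_size = statistics.mode(shoe_sizes)         # most frequent value
spread = statistics.pstdev(shoe_sizes)          # standard deviation (divides by count, like np.std)

print(f"Mean: {mean_size:.2f}")
print("Median:", median_size)   # 8
print("Mode:", mode_size)       # 8
print(f"Standard deviation: {spread:.2f}")
```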
- NumPy = for numerical operations on arrays and matrices
- Pandas = for tabular data (rows and columns) — like Excel in Python
- Matplotlib = for creating graphs and charts
- CSV = Comma Separated Values, one of the most common data file formats
- df.head() shows first 5 rows | df.describe() shows statistics
- Always check for missing values with df.isnull().sum() before analysis