This page contains the NCERT Informatics Practices class 12 chapter 7 Project Based Learning from the book Informatics Practices. You can find the solutions for the chapter 7 of NCERT class 12 Informatics Practices, for the Project Related Details in this page. So is the case if you are looking for NCERT class 12 Informatics Practices related topic Project Based Learning project solutions.
PROJECTS
Project I : Online Shopping Platform
Description
Murugan plans to launch an online shopping platform— ‘APPAREL EASY’. He plans to have two broad categories of merchandise—Men, Women. Under both the categories— Clothing, Footwear and Accessories will be the sub-categories. Also, on his shopping platform, he is planning to launch two mega events— Festive Sale (a month before Diwali to Christmas), End of Season Sale (February and August). Murugan also wants to keep a record of his monthly revenue generation sales and category wise sales, with special focus on mega events. A record should also be kept on discounts being offered by the manufacturers, payment sites or any discount offered as a promotional campaign by the APPAREL EASY portal.
Specification
The details of the Men and Women apparels should be stored in a data file with fields as Apparel Code, Name, Category, Size, Price, Customer Name, Payment Mode, Discount Code, etc.
If the Category is Men, then apparels can be Men’s Trousers, Men’s Shirt, Men’s Jeans, Men’s T-shirt. If the Category is Women, then the apparels can be Skirts, Top, Pants, Jeans, Kurta, etc.
If the Payment Mode is Credit or Debit Card, then the Credit card number, name, CVV and validity should be entered.
If the Payment Mode is Cash on Delivery(COD), then no details to be asked.
Randomly select the merchandise to be put on sale. The selected merchandise should not be more than 70 per cent of the total merchandise.
The discount code can be either FEST (for Festive) or EOS (for End of Season). The discounts for FEST will be 10 per cent and for EOS will be 15 per cent.
You need to visualise the data structure, keeping all the requirements of Murugan in mind, and then implement it using Python Pandas. Thereafter, you need to design a software to store details of the merchandise to put them online for sale. At the same time, records of customers visiting the e-commerce site and the number of customers placing the order also have to be maintained. The data collected should be plotted appropriately to help Murugan make decisions for future marketing and promotion strategies.
1) What I planned to build (as a student)
Objectives
•
Store merchandise details in a data file (CSV) with fields like Apparel Code, Name, Category, Size, Price, Customer Name, Payment Mode, Discount Code etc.
•
Maintain records of:
•
customers visiting the site
•
customers placing orders
•
Compute:
•
monthly revenue generation
•
category-wise sales
•
special focus on mega events (Festive Sale, End of Season Sale)
•
Use Python Pandas to implement data handling and plot appropriately for decisions.
2) Data structure (CSV files I used)
I created 3 simple CSV files:
A)
merchandise.csv (Master list of items)
Columns:
•
ApparelCode, Name, GenderCategory, SubCategory, Size, Price
B)
visits.csv (Customer visits record)
Columns:
•
VisitID, VisitDate, CustomerName, GenderViewed, Device, DidOrder (Y/N)
C)
orders.csv (Orders placed)
Columns (as per requirements + a few useful ones):
•
OrderID, OrderDate, Month, Event, ApparelCode, Name, GenderCategory, SubCategory, Size, Price, PaymentMode, DiscountCode, DiscountPercent, FinalAmount
Discount rules:
•
FEST = 10%
•
EOS = 15%
Payment Mode rule:
•
If Credit/Debit Card → card details to be entered
•
If COD → no details asked
(In my project file, I stored only a masked card like XXXX-XXXX-XXXX-1234 for safety.)
3) Sample data I prepared (presented as a student)
A) Sample
merchandise.csv (10 rows shown)ApparelCode
Name
GenderCategory
SubCategory
Size
Price
MTS101
Men’s T-shirt
Men
Clothing
M
799
MSH201
Men’s Shirt
Men
Clothing
L
1,299
MJE301
Men’s Jeans
Men
Clothing
32
1,899
MTR401
Men’s Trousers
Men
Clothing
34
1,599
MFW501
Men’s Sneakers
Men
Footwear
9
2,499
MAC601
Men’s Wallet
Men
Accessories
NA
999
WKT101
Women’s Kurta
Women
Clothing
M
1,499
WTP201
Women’s Top
Women
Clothing
S
899
WFW301
Women’s Heels
Women
Footwear
6
2,199
WAC401
Women’s Handbag
Women
Accessories
NA
1,999
B) Random merchandise selection for sale (≤ 70%)
Suppose total merchandise in my file = 20 items.
Maximum allowed on sale = 70% of 20 = 14 items.
I randomly selected 14 items for the sale list (valid).
C) Sample
visits.csv (8 rows shown)VisitID
VisitDate
CustomerName
GenderViewed
Device
DidOrder
1
2025-02-02
Riya
Women
Mobile
Y
2
2025-02-02
Aman
Men
Laptop
N
3
2025-02-05
Neha
Women
Mobile
Y
4
2025-08-10
Kabir
Men
Mobile
Y
5
2025-08-11
Sara
Women
Laptop
N
6
2025-11-20
Rahul
Men
Mobile
Y
7
2025-12-05
Ananya
Women
Mobile
Y
8
2025-12-06
Mohit
Men
Laptop
N
From this, I can calculate:
•
Total visitors
•
Orders placed
•
Conversion rate = orders / visitors
D) Sample
orders.csv (10 rows shown)OrderID
OrderDate
Month
Event
ApparelCode
GenderCategory
SubCategory
Price
PaymentMode
DiscountCode
DiscountPercent
FinalAmount
101
2025-02-02
Feb
EOS
WKT101
Women
Clothing
1,499
COD
EOS
15
1,274.15
102
2025-02-05
Feb
EOS
MFW501
Men
Footwear
2,499
Card
EOS
15
2,124.15
103
2025-08-10
Aug
EOS
WAC401
Women
Accessories
1,999
Card
EOS
15
1,699.15
104
2025-08-11
Aug
EOS
MSH201
Men
Clothing
1,299
COD
EOS
15
1,104.15
105
2025-11-20
Nov
FEST
MJE301
Men
Clothing
1,899
Card
FEST
10
1,709.10
106
2025-11-21
Nov
FEST
WTP201
Women
Clothing
899
COD
FEST
10
809.10
107
2025-12-05
Dec
FEST
WFW301
Women
Footwear
2,199
Card
FEST
10
1,979.10
108
2025-12-06
Dec
FEST
MAC601
Men
Accessories
999
COD
FEST
10
899.10
109
2025-03-12
Mar
None
MTS101
Men
Clothing
799
COD
None
0
799.00
110
2025-04-18
Apr
None
WKT101
Women
Clothing
1,499
Card
None
0
1,499.00
4) What analysis I performed using Pandas
A) Monthly revenue generation
I grouped orders by
Month and summed FinalAmount.Sample output (from my sample orders above):
Month
Total Revenue (₹)
Feb
3,398.30
Aug
2,803.30
Nov
2,518.20
Dec
2,878.20
Mar
799.00
Apr
1,499.00
B) Category-wise sales (Men/Women + Clothing/Footwear/Accessories)
I grouped by:
•
GenderCategory•
SubCategoryand summed
FinalAmount.GenderCategory
SubCategory
Sales (₹)
Men
Clothing
3,312.25
Men
Footwear
2,124.15
Men
Accessories
899.10
Women
Clothing
3,582.25
Women
Footwear
1,979.10
Women
Accessories
1,699.15
C) Mega event focus (Festive Sale vs End of Season Sale)
Grouped by
Event and summed FinalAmount.Event
Revenue (₹)
EOS
6,201.60
FEST
5,396.40
None
2,298.00
D) Discounts record (FEST/EOS/None)
Grouped by
DiscountCode.Sample output:
DiscountCode
Orders Count
Total Discount Given (₹ approx.)
EOS
4
(1499×0.15 + 2499×0.15 + 1999×0.15 + 1299×0.15) = 1,050.00
FEST
4
(1899×0.10 + 899×0.10 + 2199×0.10 + 999×0.10) = 599.60
None
2
0
(This helps Murugan track which sale gives more discount and which gives more revenue.)
5) Python (Pandas) code
import pandas as pd
import matplotlib.pyplot as plt
mer = pd.read_csv("merchandise.csv")
vis = pd.read_csv("visits.csv")
ord = pd.read_csv("orders.csv")
# 1) Monthly revenue generation
monthly_rev = ord.groupby("Month")["FinalAmount"].sum().sort_index()
print(monthly_rev)
# 2) Category wise sales
cat_sales = ord.groupby(["GenderCategory","SubCategory"])["FinalAmount"].sum()
print(cat_sales)
# 3) Event-wise revenue (mega events focus)
event_rev = ord.groupby("Event")["FinalAmount"].sum()
print(event_rev)
# 4) Conversion rate (customers visiting vs placing order)
total_visits = len(vis)
total_orders = (vis["DidOrder"] == "Y").sum()
conversion_rate = (total_orders / total_visits) * 100
print("Conversion Rate:", round(conversion_rate, 2), "%")
# Plots (as asked: “plotted appropriately”)
monthly_rev.plot(kind="bar")
plt.title("Monthly Revenue Generation")
plt.ylabel("Revenue (₹)")
plt.show()
event_rev.plot(kind="bar")
plt.title("Revenue During Mega Events")
plt.ylabel("Revenue (₹)")
plt.show()
6) Conclusion
From my analysis, I can clearly see:
•
Which month earns the highest revenue (monthly revenue generation).
•
Which category (Men/Women) and sub-category (Clothing/Footwear/Accessories) sells more.
•
How much revenue comes during mega events like Festive Sale and End of Season Sale.
•
How discounts (FEST 10%, EOS 15%) affect revenue and sales.
Bar Chart depicting the APPAREL EASY: Monthly Revenue Generation (Sample Data)
Bar Chart depicting the APPAREL EASY: Revenue by Mega Events (Sample Data)
Bar Chart depicting the APPAREL EASY: Category-wise Sales (Gender + Sub-category)
Bar Chart depicting the APPAREL EASY: Customers Visiting vs Customers Placing Orders
Project II: Automating a Books Donation Camp
Description
Realising the importance of Reduce, Reuse and Recycle, the Bookworm club every year organises a Book Donation Camp. The Book Donation camp collects books and notebooks. The volunteers assess the condition of the books and categorise them as Fit, Needs mending, or Unfit. The unfit books’ pages are used to create paper bags and envelopes. The other categories of books are resold at half the price. They accept notebooks that have pages left in them. The pages are torn from the notebooks, and are attractively bound to create a new notebook and sold. They create a variety of recycled objects and sell them. They want to create a software for this purpose and store details about the camp. To be able to efficiently store, retrieve and visualise data, they need to implement the following using Pandas.
Specification
The details of collections are stored in a CSV file with column headings as Item category, Item ID, Item name, Item type, Condition.
If the Item Category is Book, then the Item Type can be either Academic or Non Academic, and Item Id shall be prefixed with a ‘B’. In case of Academic, class shall be entered.
If the Item Category is Notebook then the Item can be Single line, Four Line, Five Line, and Item Id shall be prefixed with an ‘N’.
Condition can only be Fit, Needs Mending or Unfit. After the items are refurbished, the data are stored in another CSV file containing the following column headings: Item id, Item name, Item Category, Quantity, Price. Item Category can be Paper bags, Notebook, Books. In case of books Class is also to be entered.
Another CSV file to store orders is created that stores Item Category, Item name, Quantity and Price. In case of an order, the refurbished CSV shall update the quantity.
To ensure effective decision making, it is required that different data are plotted using appropriate plots to show sales, items refurbished, and items collected.
1) Project idea and purpose
I am a volunteer in the Bookworm club. Every year we organise a Book Donation Camp to support Reduce, Reuse and Recycle. In this camp, we collect books and notebooks. Then volunteers check the condition of each book and categorise it as Fit, Needs mending, or Unfit. The unfit books’ pages are used to create paper bags and envelopes. The other books (Fit / Needs mending) are resold at half the price. For notebooks that still have usable pages, we tear the pages and attractively bind them to make a new notebook and sell it.
To store, retrieve, and visualise this data efficiently, we have to implement the work using Python Pandas.
2) What the book asks us to create (as a student)
According to the project specification, we need 3 CSV files:
A) Collections CSV (raw collection data)
Column headings must be: Item category, Item ID, Item name, Item type, Condition.
Rules:
•
If Item Category = Book:
•
Item Type = Academic or Non Academic
•
Item ID must start with “B”
•
If Academic, class shall be entered
•
If Item Category = Notebook:
•
Item type can be Single line / Four Line / Five Line
•
Item ID must start with “N”
•
Condition can only be Fit / Needs Mending / Unfit
B) Refurbished CSV (after refurbishing)
After refurbishment, store data in another CSV with headings:
Item id, Item name, Item Category, Quantity, Price.
•
Item Category here can be Paper bags, Notebook, Books
•
For books, Class is also to be entered
C) Orders CSV (orders placed by customers)
Another CSV for orders must store:
Item Category, Item name, Quantity, Price.
And when an order happens, the refurbished CSV shall update the quantity.
Plots/Diagrams (important!)
We must plot data using appropriate plots to show:
•
sales
•
items refurbished
•
items collected
3) My sample data (as a student)
File 1:
collections.csv (sample)Item category
Item ID
Item name
Item type
Condition
Book
B001
Mathematics (Class 10)
Academic
Fit
Book
B002
English Literature
Non Academic
Needs mending
Book
B003
Science (Class 9)
Academic
Unfit
Notebook
N001
Classwork Notebook
Single line
Fit
Notebook
N002
Rough Notebook
Four Line
Fit
Notebook
N003
Project Notebook
Five Line
Needs mending
Book
B004
Social Science (Class 8)
Academic
Fit
Book
B005
Short Stories
Non Academic
Unfit
Note (student assumption): The “class shall be entered” for academic books, so I wrote it in the Item name like “(Class 10)”. (In a real project, we can also add a separate Class column.)
File 2:
refurbished.csv (sample)(After refurbishment: books for resale, rebound notebooks, and paper bags/envelopes from unfit pages)
Item id
Item name
Item Category
Quantity
Price (₹)
Class
B001
Mathematics (Class 10)
Books
3
120
10
B002
English Literature
Books
2
80
B004
Social Science (Class 8)
Books
4
100
8
N001
Rebound Single line Notebook
Notebook
6
40
N002
Rebound Four Line Notebook
Notebook
5
45
PB01
Paper Bags (from unfit pages)
Paper bags
20
5
ENV1
Envelopes (from unfit pages)
Paper bags
30
3
Note (student assumption): Prices are just sample values. For books, I assumed “half price” selling as mentioned in the description.
File 3:
orders.csv (sample)Item Category
Item name
Quantity
Price (₹)
Books
Mathematics (Class 10)
1
120
Notebook
Rebound Single line Notebook
2
40
Paper bags
Paper Bags (from unfit pages)
5
5
Books
Social Science (Class 8)
2
100
Notebook
Rebound Four Line Notebook
1
45
Paper bags
Envelopes (from unfit pages)
10
3
4) Outputs I generated from the sample data
A) Items collected
Collected (count) by item category
•
Book = 5
•
Notebook = 3
Collected (count) by condition
•
Fit = 4
•
Needs mending = 2
•
Unfit = 2
B) Items refurbished (stock available)
Refurbished stock (total quantity) by category
•
Paper bags = 50
•
Notebook = 11
•
Books = 9
C) Sales summary (Revenue = Quantity × Price)
Sales (total revenue) by category
•
Books = ₹ 320
•
Notebook = ₹ 125
•
Paper bags = ₹ 55
5) Output diagrams (as required in Project 2)
As per the requirement, I plotted items collected, items refurbished, and sales using bar charts.
The diagrams I generated are:
1.
Items Collected (Count) by Item Category
2.
Items Refurbished (Total Quantity) by Category
3.
Sales (Total Revenue) by Category
4.
(Extra helpful plot) Collected Items by Condition (Fit / Needs mending / Unfit)
6) Python (Pandas) code I used
import pandas as pd
import matplotlib.pyplot as plt
# ---------- Step 1: Read CSV files ----------
collections = pd.read_csv("collections.csv")
refurbished = pd.read_csv("refurbished.csv")
orders = pd.read_csv("orders.csv")
# ---------- Step 2: Basic validation (as per specification) ----------
allowed_conditions = {"Fit", "Needs mending", "Needs Mending", "Unfit"}
# Condition validation
bad_conditions = collections[~collections["Condition"].isin(allowed_conditions)]
if not bad_conditions.empty:
print("Invalid conditions found:\n", bad_conditions)
# ID prefix validation
bad_book_ids = collections[(collections["Item category"] == "Book") & (~collections["Item ID"].astype(str).str.startswith("B"))]
bad_notebook_ids = collections[(collections["Item category"] == "Notebook") & (~collections["Item ID"].astype(str).str.startswith("N"))]
if not bad_book_ids.empty:
print("Invalid Book IDs:\n", bad_book_ids)
if not bad_notebook_ids.empty:
print("Invalid Notebook IDs:\n", bad_notebook_ids)
# ---------- Step 3: Items collected (counts) ----------
collected_by_category = collections.groupby("Item category")["Item ID"].count()
collected_by_condition = collections.groupby("Condition")["Item ID"].count()
print("Collected by category:\n", collected_by_category)
print("Collected by condition:\n", collected_by_condition)
# ---------- Step 4: Items refurbished (quantity) ----------
refurbished_qty = refurbished.groupby("Item Category")["Quantity"].sum()
print("Refurbished quantity by category:\n", refurbished_qty)
# ---------- Step 5: Sales (revenue) ----------
orders["Revenue"] = orders["Quantity"] * orders["Price"]
sales_by_category = orders.groupby("Item Category")["Revenue"].sum()
print("Sales (revenue) by category:\n", sales_by_category)
# ---------- Step 6: Update refurbished stock after orders ----------
# We update by matching (Item Category + Item name)
# (In real software, Item id would be best for matching.)
stock = refurbished.copy()
stock_key = stock["Item Category"].astype(str) + "||" + stock["Item name"].astype(str)
order_key = orders["Item Category"].astype(str) + "||" + orders["Item name"].astype(str)
stock["key"] = stock_key
orders["key"] = order_key
ordered_qty = orders.groupby("key")["Quantity"].sum().reset_index()
# Merge ordered qty into stock and subtract
stock = stock.merge(ordered_qty, on="key", how="left", suffixes=("", "_ordered"))
stock["Quantity_ordered"] = stock["Quantity_ordered"].fillna(0).astype(int)
stock["Quantity"] = stock["Quantity"] - stock["Quantity_ordered"]
# Ensure quantity does not go negative
stock.loc[stock["Quantity"] < 0, "Quantity"] = 0
# Save updated stock
stock.drop(columns=["key", "Quantity_ordered"]).to_csv("refurbished_updated.csv", index=False)
# ---------- Step 7: Plots (as required) ----------
plt.figure()
collected_by_category.plot(kind="bar")
plt.title("Items Collected (Count) by Item Category")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
plt.figure()
refurbished_qty.plot(kind="bar")
plt.title("Items Refurbished (Total Quantity) by Category")
plt.ylabel("Quantity")
plt.tight_layout()
plt.show()
plt.figure()
sales_by_category.plot(kind="bar")
plt.title("Sales (Total Revenue) by Category")
plt.ylabel("Revenue (₹)")
plt.tight_layout()
plt.show()
# Optional: by condition
plt.figure()
collected_by_condition.plot(kind="bar")
plt.title("Collected Items by Condition")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
7) Conclusion
This project helps the Bookworm club run the Book Donation Camp in a systematic way. We can store collection data, refurbish items, take orders, and automatically update quantity after each order. Most importantly, the plots for sales, items refurbished, and items collected help in effective decision making for the next camp.
Bar Chart depicting the Books Donation Camp: Items Collected (Count) by Item Category
Bar Chart depicting the Books Donation Camp: Items Refurbished (Total Quantity) by Category
Bar Chart depicting the Books Donation Camp: Sales (Total Revenue) by Category
Bar Chart depicting the Books Donation Camp: Collected Items by Condition
Project III: A survey of the effect of social networking sites on behaviour of teenagers
Description
With the Internet revolution everyone today is now connected. Teenagers spend a good amount of time on social networking sites, and it plays a vital role in their behaviour. It is considered that excessive use of social networking sites has sometimes a serious impact on the mental health of individuals. A well-crafted survey questionnaire can help in exploring and finding many facts.
Specifications
1.
Create a survey questionnaire using any of the freely available online tools (such as google forms) and store the responses in a CSV file.
2.
Prepare some data analysis questions that you expect them to answer
3.
Import the CSV file in Pandas DataFrame
4.
Perform statistical computation such as mean, median, etc., with respect to the identified questions
5.
Visualise the findings of the survey using appropriate charts.
1) Aim of the project
Because of the Internet revolution, most of us are connected all the time. Teenagers spend a good amount of time on social networking sites, and it can affect our behaviour. It is also considered that excessive use can sometimes have an impact on mental health. So, a well-crafted survey questionnaire can help us explore facts and patterns.
As per the project description, we must:
1.
Create a survey questionnaire using an online tool (like Google Forms) and store responses in a CSV
2.
Prepare data analysis questions
3.
Import CSV into Pandas DataFrame
4.
Do statistical computation like mean, median
5.
Visualise findings using appropriate charts
2) Google Survey (Google Forms) details
Form Title
“Survey: Effect of Social Networking Sites on Behaviour of Teenagers”
Form Description (what I wrote in the form)
“This is a short survey to understand how social networking sites affect teenagers’ behaviour, study routine, and mental health. The survey is anonymous and only for academic project work.”
Settings I selected (recommended)
✅
Collect responses (default)
✅
Limit to 1 response (optional, if school emails exist)
✅
Edit after submit = OFF (to keep data consistent)
✅
Show progress bar = ON (optional)
✅
Required questions = ON for key questions
Sections (simple structure)
Section A:
Basic Information
Section B:
Social Media Usage
Section C:
Behaviour / Study / Well-being
Section D:
Optional Comments
Questions included (with types + options)
Section A: Basic Information
1. Age (Short answer, response validation = number 12–19)
2. Class (Dropdown: 8, 9, 10, 11, 12)
Section B: Social Networking Sites Usage
3. How much time do you spend daily on social networking sites? (Multiple choice)
•
0–1 hour
•
1–2 hours
•
2–3 hours
•
3–4 hours
•
4–5 hours
•
5+ hours
4. Which social networking site do you use the most? (Multiple choice)
•
WhatsApp / Instagram / YouTube / Snapchat / Facebook / X (Twitter) / Other
5. Main purpose of using social networking sites (Multiple choice)
•
Chatting, Entertainment, Learning, Study groups, News, Socialising, Other
Section C: Behaviour / Study / Mental health
6. Do you use social networking sites just before sleeping? (Yes/No)
7. How many hours do you sleep per night (average)? (Short answer, number 3–10)
8. Mood after using social networking sites (Linear scale 1–5)
•
1 = Very bad, 5 = Very good
9. Concentration while studying (Linear scale 1–5)
•
1 = Very low, 5 = Very high
10. How often do you feel anxiety/stress due to social networking? (Multiple choice)
•
Never, Rarely, Sometimes, Often
11. Physical activity per day (minutes) (Short answer, number 0–180)
Section D: Optional comments
12. One positive and one negative effect you have noticed (Paragraph)
How I got the CSV file from Google Forms
•
Open the form → Responses tab
•
Click Download responses (.csv) (or open in Sheets → File → Download → CSV)
3) Sample responses (data I used)
I created a sample dataset of 30 students (like a real class survey) and used it for Pandas analysis + charts (you can see the “Project III – Sample Survey Responses” table shown above in chat).
Main columns in my CSV/DataFrame:
•
Age, Class
•
DailyHours (numeric hours/day)
•
MainPlatform, Purpose
•
SleepHours
•
MoodScore (1–5), ConcentrationScore (1–5)
•
AnxietyFrequency (Never/Rarely/Sometimes/Often)
•
ScreenBeforeBed (Yes/No)
•
PhysicalActivityMin
4) Data analysis questions (what I expected to answer)
These are the “analysis questions” based on the survey (as required):
1.
What is the average (mean) and median daily time spent on social networking sites?
2.
Which platform is used the most?
3.
Do students who spend more time tend to sleep less?
4.
Compare mood and concentration between:
•
High usage group (≥ 3 hours/day)
•
Low usage group (< 3 hours/day)
5.
How is anxiety/stress frequency distributed?
5) My results (from the sample data)
A) Mean / Median (statistical computation)
•
Mean DailyHours = 2.75 hours/day
•
Median DailyHours = 2.7 hours/day
•
Mean SleepHours = 6.98 hours/night
•
Median SleepHours = 7.0 hours/night
•
Mean MoodScore = 3.20 / 5
•
Mean ConcentrationScore = 3.03 / 5
B) Platform most used (counts)
•
Instagram: 10
•
WhatsApp: 9
•
YouTube: 8
•
Snapchat: 3
C) High vs Low usage comparison (mean values)
Usage group
Avg DailyHours
Avg SleepHours
Avg MoodScore
Avg Concentration
High (≥3h/day)
4.01
6.12
2.31
2.23
Low (<3h/day)
1.79
7.64
3.88
3.65
Observation (student conclusion): In this sample, the high-usage group sleeps less and has lower mood + concentration on average.
D) Anxiety frequency (counts)
•
Never: 7
•
Rarely: 7
•
Sometimes: 8
•
Often: 8
6) Charts/Graphs I generated (as required)
I generated and displayed these charts (using Pandas + Matplotlib):
1.
Daily hours distribution (bar chart)
2.
Most used platform (bar chart)
3.
Daily hours vs sleep hours (scatter plot)
4.
Mood score by usage group (bar chart of averages)
5.
Anxiety frequency distribution (bar chart)
7) Python program used for analysis + graphs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# -----------------------------
# Step 1: Read CSV into DataFrame
# -----------------------------
# Example filename exported from Google Forms:
df = pd.read_csv("social_media_survey_responses.csv")
# (Optional) If DailyHours is stored as ranges like "2–3 hours",
# convert it into numeric hours (approx midpoint) before analysis.
# -----------------------------
# Step 2: Statistical computations (mean, median, etc.)
# -----------------------------
print("Mean DailyHours:", df["DailyHours"].mean())
print("Median DailyHours:", df["DailyHours"].median())
print("Mean SleepHours:", df["SleepHours"].mean())
print("Median SleepHours:", df["SleepHours"].median())
# -----------------------------
# Step 3: Prepare analysis groups
# -----------------------------
df["UsageGroup"] = np.where(df["DailyHours"] >= 3, "High (>=3h/day)", "Low (<3h/day)")
group_stats = df.groupby("UsageGroup")[["DailyHours","SleepHours","MoodScore","ConcentrationScore"]].mean()
print("\nHigh vs Low usage comparison:\n", group_stats)
# -----------------------------
# Step 4: Visualise findings using charts
# -----------------------------
# Chart 1: Daily hours distribution (bins)
bins = [0, 1, 2, 3, 4, 5, 6]
labels = ["0-1", "1-2", "2-3", "3-4", "4-5", "5-6"]
df["HoursBin"] = pd.cut(df["DailyHours"], bins=bins, labels=labels, include_lowest=True, right=False)
hours_dist = df["HoursBin"].value_counts().reindex(labels).fillna(0)
plt.figure()
hours_dist.plot(kind="bar")
plt.title("Daily Time on Social Networking Sites (Hours/Day)")
plt.xlabel("Hours (binned)")
plt.ylabel("Number of students")
plt.tight_layout()
plt.show()
# Chart 2: Most used platform
plt.figure()
df["MainPlatform"].value_counts().plot(kind="bar")
plt.title("Most Used Social Networking Platform")
plt.xlabel("Platform")
plt.ylabel("Number of students")
plt.tight_layout()
plt.show()
# Chart 3: Daily hours vs sleep hours (scatter)
plt.figure()
plt.scatter(df["DailyHours"], df["SleepHours"])
plt.title("Daily Social Media Hours vs Sleep Hours")
plt.xlabel("Daily social media time (hours)")
plt.ylabel("Sleep per night (hours)")
plt.tight_layout()
plt.show()
# Chart 4: Mood by usage group
plt.figure()
df.groupby("UsageGroup")["MoodScore"].mean().plot(kind="bar")
plt.title("Average Mood Score by Usage Group")
plt.xlabel("Usage group")
plt.ylabel("Average Mood Score (1-5)")
plt.tight_layout()
plt.show()
# Chart 5: Anxiety frequency distribution
order = ["Never","Rarely","Sometimes","Often"]
plt.figure()
df["AnxietyFrequency"].value_counts().reindex(order).fillna(0).plot(kind="bar")
plt.title("Anxiety Frequency (Self-reported)")
plt.xlabel("Frequency")
plt.ylabel("Number of students")
plt.tight_layout()
plt.show()
The following is the content of the csv file (social_media_survey_responses.csv) used in the above code:
RespondentID
Age
Class
DailyHours
MainPlatform
Purpose
SleepHours
MoodScore
ConcentrationScore
AnxietyFrequency
ScreenBeforeBed
PhysicalActivityMin
1
15
10
2.0
WhatsApp
Chatting
7.5
4
4
Rarely
No
45
2
16
11
3.5
Instagram
Entertainment
6.5
3
3
Sometimes
Yes
30
3
14
9
1.0
YouTube
Learning
8.0
4
4
Never
No
60
4
17
12
4.5
Instagram
Socialising
6.0
2
2
Often
Yes
20
5
16
11
2.5
WhatsApp
Study groups
7.0
4
3
Rarely
Yes
40
6
15
10
3.0
YouTube
Entertainment
6.5
3
3
Sometimes
Yes
25
7
17
12
5.0
Snapchat
Socialising
5.5
2
2
Often
Yes
15
8
14
9
1.5
WhatsApp
Chatting
8.0
4
4
Never
No
55
9
16
11
2.0
Instagram
Entertainment
7.0
3
3
Sometimes
Yes
35
10
15
10
0.8
WhatsApp
Chatting
8.5
5
4
Never
No
70
11
17
12
4.0
Instagram
Entertainment
6.0
2
2
Often
Yes
10
12
16
11
3.2
YouTube
Learning
6.8
3
3
Sometimes
Yes
30
13
14
9
1.2
YouTube
Learning
8.2
4
4
Never
No
50
14
15
10
2.8
Instagram
Socialising
7.0
3
3
Sometimes
Yes
25
15
16
11
3.8
Snapchat
Socialising
6.2
2
2
Often
Yes
20
16
17
12
2.2
WhatsApp
Study groups
7.2
4
3
Rarely
No
40
17
15
10
1.7
YouTube
Learning
7.8
4
4
Rarely
No
55
18
16
11
4.8
Instagram
Entertainment
5.8
2
2
Often
Yes
15
19
14
9
2.1
WhatsApp
Chatting
7.6
4
4
Rarely
No
60
20
17
12
3.6
Instagram
Socialising
6.1
3
2
Sometimes
Yes
20
21
16
11
1.0
YouTube
Learning
8.0
4
4
Never
No
65
22
15
10
2.9
WhatsApp
Chatting
7.0
3
3
Sometimes
Yes
30
23
14
9
0.5
WhatsApp
Chatting
8.7
5
5
Never
No
80
24
17
12
5.5
Instagram
Entertainment
5.2
1
1
Often
Yes
5
25
16
11
3.1
Snapchat
Socialising
6.6
3
3
Sometimes
Yes
25
26
15
10
2.3
YouTube
Entertainment
7.2
3
3
Rarely
Yes
35
27
14
9
1.4
WhatsApp
Study groups
8.1
4
4
Never
No
55
28
17
12
4.2
Instagram
Socialising
6.0
2
2
Often
Yes
15
29
16
11
2.6
YouTube
Learning
7.1
4
3
Rarely
No
45
30
15
10
3.9
Instagram
Entertainment
6.3
2
2
Often
Yes
20
Bar Chart depicting the Survey: Daily Time on Social Networking Sites (Hours/Day)
Bar Chart depicting the Survey: Most used Social Networking Platform
Survey: Daily Social Media Hours vs Sleep Hours
Bar Chart depicting the Survey: Average Mood Score by Usage Group
Bar Chart depicting the Survey: Anxiety Frequency (Self-reported)
Project IV: Utilising an open data source to use a national, state or district level Dataset
Description
Open Government Data (OGD) Platform India www.data.gov.in is a platform for supporting open data initiative of Government of India. From this platform, let us consider the dataset “Special Tabulation on Adolescent and youth population classified by various parameters for India, States and Union Territories, 2011”. The dataset was contributed by the Ministry of Home Affairs, Government of India, and released under National Data Sharing and Accessibility Policy (NDSAP). The dataset was published on portal on 07/09/2015.
Statistics of the Data Set:
Number of rows: 12168
Number of columns: 123
Descriptions of some of the columns are given below:
State: Serial numbers given to states
Area Name: Name of the states and union territories
Total/Rural/Urban: Data about the total, rural or urban areas of a state or UT.
Adolescent and youth: Data for different age groups
Total Male: Total number of males
Total Female: Total number of females
SC-M: Total number of males of Scheduled Castes(SC)
SC-F: Total number of females of Scheduled Castes(SC)
ST-M: total number of males of Scheduled Tribes(ST)
ST-F: total number of females of Scheduled Tribes(ST)
Literates-M: total number of literate males
Literates-F: total number of literate females
LiteratesSC-M: total number of literate males of Scheduled Castes(SC)
LiteratesSC-F: total number of literate females of Scheduled Castes(SC)
LiteratesST-M: total number of literate males of Scheduled Tribes(ST)
LiteratesST-F: total number of literate females of Scheduled Tribes(ST)
Illiterates-M: total number of illiterate males
Illiterates-F: total number of illiterate females
IlliteratesSC-M: total number of illiterate males of Scheduled Castes(SC)
IlliteratesSC-F: total number of illiterate females of Scheduled Castes(SC)
IlliteratesST-M: total number of illiterate males of Scheduled Tribes(ST)
IlliteratesST-F: total number of illiterate females of Scheduled Tribes(ST)
MainWorker-M:total number of main worker males
MainWorker-F:total number of main worker females
MainWorkerSC-M: total number of main worker males of Scheduled Castes(SC)
MainWorkerSC-F: total number of main worker females of Scheduled Castes(SC)
MainWorkerST-M: total number of main worker males of Scheduled Tribes(ST)
MainWorkerST-F: total number of main worker females of Scheduled Tribes(ST)
MarginalWorker-M: total number of marginal worker males
MarginalWorker-F: total number of marginal worker females
MarginalWorkerSC-M: total number of marginal worker males of Scheduled Castes(SC)
MarginalWorkerSC-F: total number of marginal worker females of Scheduled Castes(SC)
MarginalWorkerST-M: total number of marginal worker males of Scheduled Tribes(ST)
MarginalWorkerST-F: total number of marginal worker females of Scheduled Tribes(ST)
Specifications
On such a large dataset, various types of questions can be answered by doing different analysis of data. Following is a list of some of the possible queries that can be answered by analysing the dataset:
1.
What is the total population, total male population and total female population aged 10 to 24 in India?
2.
Which State or Union Territory in India has the maximum number of illiterates in the youth ages?
3.
What is the percentage of people working as a marginal worker?
4.
List the top 5 states or union territories which have the maximum population working as a marginal worker.
5.
Compare the sex ratio of urban areas and rural areas using appropriate graph.
6.
Which state has the highest and the lowest percentage of literate Scheduled Tribes and Scheduled Castes?
7.
For each state, compare the no. of female marginal workers with no. of male marginal workers. Use appropriate graphs.
8.
What percentage of Scheduled Tribes lives in urban areas? Draw a pie chart showing the proportion of literate and illiterates scheduled tribes living in urban areas.
9.
What is the state wise ratio of literates vs. illiterates in all age groups?
10.
Which state is home to the maximum no. of ST in India? Which state has the minimum no. of ST in India?
11.
For each state, find the no. of literate females and no. of literate males. Draw a bar graph for the same.
Which state has the highest ratio of literate female vs literate male and which state has the minimum?
Which state has the highest ratio of literate female vs literate male and which state has the minimum?
A project work can be carried out by taking any 4–5 of the above questions and any other similar questions, and solving them step-by-step, with detailed explanation and documentation. As an example, in the following pages, we will solve the first question. This will give us an idea about how the other questions are to be answered.
Task 1: What is the total population, total male population and total female population aged 10 to 24 in India?
Solution:
Prerequisite: we need to first download the CSV file through the QR code given at the beginning of this chapter.
Step 1:
Read the CSV file in a DataFrame
Step 2:
Check the shape of the DataFrame
Step 3:
View the columns
Step 4:
Filter data
a.
Identify the columns that you wants to use for plotting
b.
Identify the number of rows required for plotting
Step 5:
Create a new DataFrame containing the filtered data
Step 6:
Rename the columns for ease of use
Step 7:
Group data as per the requirement
Step 8:
Plot data as a barchart for the DataFrame obtained in Step 7.
Task 1: What is the total population, total male population and total female population aged 10 to 24 in India?
Step 0: Import required libraries.
import pandas as pd
import matplotlib.pyplot as plt
Step 1: Read the CSV file in a DataFrame
Read the csv file in a DataFrame.
# add correct path on your computer
data = pd.read_csv("PCA_AY_2011_Revised.csv")
df = pd.DataFrame(data)
Step 2: Check the shape of the DataFrame
Check the shape of the DataFrame.
print(df.shape)
Output:
12168 rows and 123 columns.
Step 3: Display the columns.
View the coloumns
print(df.columns.values)
Output:
['Table No.'
'State Code'
'District Code'
'Area Name'
'Total/ Rural/ Urban'
'Adolescent and youth categories'
'Total Population - Persons'
'Total Population - Males'
'Total Population - Females'
'Scheduled Caste - Persons'
'Scheduled Caste - Males'
........................
........................
........................
'Scheduled Tribe Marginal Worker - Household Industry - Males'
'Scheduled Tribe Marginal Worker - Household Industry - Females'
'Scheduled Tribe Marginal Worker - Other Workers - Persons'
'Scheduled Tribe Marginal Worker - Other Workers - Males'
'Scheduled Tribe Marginal Worker - Other Workers - Females']
Note: Output contains 123 column names. Only few are shown here.
Step 4: Filter data
4 (a) Identify columns required for plotting: We consider only these columns:
•
Area Name•
Total/ Rural/ Urban•
Adolescent and youth categories•
Total Population - Persons (and related male/female columns are used later)4 (b) Rows Choice: We check values in
Area Name to decide filtering (example output shows repeated INDIA and then district names).Code:
print(df['Area Name'])
Output:
0 INDIA
1 INDIA
2 INDIA
3 INDIA
4 INDIA
...................
...................
...................
12163 District - South Andaman (03)
12164 District - South Andaman (03)
12165 District - South Andaman (03)
12166 District - South Andaman (03)
12167 District - South Andaman (03)
Name: Area Name, Length: 12168, dtype: object
Step 5: Create a new DataFrame containing the filtered data.
Suppose, we want to consider data for
‘Area Name’ = ‘INDIA’ only, Therefore, we shall create a new DataFrame df1 containing only the filtered data, using the following syntax:df.loc[row selection, column selection]df1 = df.loc[(df["Area Name"] == "INDIA"),
"Area Name": "Total Population - Females"]
Step 6: Rename the columns for ease of use
Column names are too long, so rename them.
Code:
df1.columns = ["Area",
"Class",
"Category",
"TotalPop",
"MalePop",
"FemalePop"]
Step 7: Group data as per the requirement
We want TotalPop, MalePop, FemalePop with respect to Category.
But Category has 6 types (like 10-14, 15-19, 20-24, etc.), so we group by Category and sum. Then we keep only 10-14, 15-19, 20-24 and remove the categories ‘Adolescent (10-19)’, ‘All Ages’, ‘Youth (15-24)’
Code:
d = df1.groupby("Category").sum(numeric_only=True)
# keep only the age groups needed for 10–24
d = d.drop(["Adolescent (10-19)", "All Ages", "Youth (15-24)"], axis=0)
The following data is obtained after Step 7 and this is the data used for graph.
The following is the final grouped data for the required age categories:
| Category | TotalPop | MalePop | FemalePop |
| -------- | ----------- | ----------- | ----------- |
| 10-14 | 26,54,18,424 | 13,88,37,670 | 12,65,80,754 |
| 15-19 | 24,10,52,898 | 12,79,64,792 | 11,30,88,106 |
| 20-24 | 22,28,48,444 | 11,51,69,386 | 10,76,79,058 |
Step 8: Plot data as a barchart
Plot the grouped DataFrame from Step 7 as a bar chart.
Code:
d.plot(kind="bar")
plt.show()
Python program used (complete, clean version)
All the above steps are compiled into a single pythong program below:
import pandas as pd
import matplotlib.pyplot as plt
# Step 1
data = pd.read_csv("PCA_AY_2011_Revised.csv") # add correct path on your computer
df = pd.DataFrame(data)
# Step 2
print(df.shape)
# Step 3
print(df.columns.values)
# Step 4 (understanding data)
print(df["Area Name"])
# Step 5 (filter India rows + required columns)
df1 = df.loc[(df["Area Name"] == "INDIA"), "Area Name":"Total Population - Females"]
# Step 6 (rename)
df1.columns = ["Area", "Class", "Category", "TotalPop", "MalePop", "FemalePop"]
# Step 7 (group and keep only 10–24 categories)
d = df1.groupby("Category").sum(numeric_only=True)
d = d.drop(["Adolescent (10-19)", "All Ages", "Youth (15-24)"], axis=0)
print(d)
# Step 8 (plot)
d.plot(kind="bar")
plt.xlabel("Adolescent and youth categories")
plt.ylabel("Population (in crores)")
plt.title("Population (10–24 years) in India: Total, Male, Female")
plt.tight_layout()
plt.show()
Bar Chart depicting Population (10–24 years) in India: Total, Male, Female
Task 2: Which State or Union Territory in India has the maximum number of illiterates in the youth ages?
Step 0: Import required libraries
Code:
import pandas as pd
Step 1: Read the data file into a DataFrame
Code:
# add correct path on your computer
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Filter only the rows needed for this question
What we need (logic):
•
Only State/UT totals → District Code == 0
•
Only Total (not rural/urban split) → Total/ Rural/ Urban == “Total”
•
Only Youth (15-24) → Adolescent and youth categories == “Youth (15-24)”
•
Exclude overall INDIA total → Area Name != “INDIA”
Code:
df1 = df.loc[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA"),
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "Youth (15-24)")
].copy()
Step 3: Compute “Illiterates” for Youth (15–24)
Illiterates (Persons) = Total Population – Persons − Literates – Total – Persons
Code:
df1["Illiterates - Youth - Persons"] = (
df1["Total Population - Persons"] - df1["Literates - Total - Persons"]
)
Step 4: Find the State/UT with the maximum illiterates
Code:
state_ill = df1.groupby("Area Name")["Illiterates - Youth - Persons"].sum().sort_values(ascending=False)
max_state = state_ill.index[0]
max_value = int(state_ill.iloc[0])
max_state, max_value
Final Answer (from your file)
✅
State/UT with maximum illiterates (Youth 15–24): State – UTTAR PRADESH (09)
✅
Number of illiterates (persons): 7,484,152
Task 3: What is the percentage of people working as a marginal worker?
Step 0: Import required libraries:
Code:
import pandas as pd
Step 1: Step 1: Read the data file in a DataFrame
Code:
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Check the shape of the DataFrame
Code:
print(df.shape)
Output:
12168 rows × 123 columns
Step 3: View the columns
View the columns to identify what we need.
Step result (analysis)
For marginal worker, we will use these columns (they exist in the dataset and match the “MarginalWorker” concept listed in the chapter).
•
Area Name
•
Total/ Rural/ Urban
•
Adolescent and youth categories
•
Main Worker – Total – Persons
•
Marginal Worker – Total – Persons
Step 4: Filter data (choose correct rows)
Step explanation
Filter the data by selecting required columns and required rows.
Step result (analysis + code)
Because the question asks for “people” (overall), we take INDIA, Total, All Ages.
Because the question asks for “people” (overall), we take INDIA, Total, All Ages.
df_india = df[
(df["Area Name"] == "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Main Worker - Total - Persons", "Marginal Worker - Total - Persons"]].copy()
Step 5–7: Compute the required percentage (final calculation)
Step explanation
After filtering, compute the required result from the selected data. (In Task 1 the book groups/sums; here we just compute the percentage from the INDIA totals.)
Step result (calculation + answer)
Data extracted (INDIA, Total, All Ages):
•
Main workers (Persons) = 36,25,65,571
•
Marginal workers (Persons) = 11,93,23,297
Now, “percentage of people working as a marginal worker” (most meaningful in this context) =
\dfrac{\text{Marginal Workers}}{\text{Main Workers} + \text{Marginal Workers}} × 100
main_workers = int(df_india.iloc[0]["Main Worker - Total - Persons"])
marg_workers = int(df_india.iloc[0]["Marginal Worker - Total - Persons"])
total_workers = main_workers + marg_workers
marginal_worker_percent = (marg_workers / total_workers) * 100
print(round(marginal_worker_percent, 2))
✅ Final Answer (INDIA):
Percentage of people (workers) working as marginal workers = 24.76%
import pandas as pd
df = pd.read_csv("PCA_AY_2011_Revised.csv")
# Filter INDIA, Total, All Ages
df_india = df[
(df["Area Name"] == "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Main Worker - Total - Persons", "Marginal Worker - Total - Persons"]].copy()
main_workers = int(df_india.iloc[0]["Main Worker - Total - Persons"])
marg_workers = int(df_india.iloc[0]["Marginal Worker - Total - Persons"])
total_workers = main_workers + marg_workers
marginal_worker_percent = (marg_workers / total_workers) * 100
print("Main workers (Persons):", main_workers)
print("Marginal workers (Persons):", marg_workers)
print("Total workers (Persons):", total_workers)
print("Marginal worker percentage:", round(marginal_worker_percent, 2), "%")
Task 4: List the top 5 states or union territories which have the maximum population working as a marginal worker.
Step 0: Import required libraries
Code:
import pandas as pd
import matplotlib.pyplot as plt
Step 1: Read the dataset into a DataFrame
Step explanation
Read the file in a DataFrame.
Code:
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Filter data (only what’s needed for Task 4)
Step explanation
Filter the data by choosing the correct rows/columns required for the question.
Step result (analysis + code)
To find “maximum population working as marginal worker” across states/UTs, we take:
•
State/UT totals only →
District Code == 0•
Exclude INDIA (because we want states/UTs) →
Area Name != "INDIA"•
Total (not rural/urban split) →
Total/ Rural/ Urban == "Total"•
Overall population →
Adolescent and youth categories == "All Ages"•
Required column →
Marginal Worker - Total - Personsdf_state = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Area Name", "Marginal Worker - Total - Persons"]].copy()
Step 3: Sort and select Top 5 (final list)
Step explanation
Group/arrange the data as per the requirement. (Here we just sort to get top 5.)
Step result (answer)
✅ Top 5 States/UTs by Marginal Workers (Persons):
•
State – UTTAR PRADESH (09) — 2,11,79,223
•
State – BIHAR (10) — 1,33,65,376
•
State – WEST BENGAL (19) — 90,69,725
•
State – MADHYA PRADESH (23) — 88,72,014
•
State – RAJASTHAN (08) — 88,28,287
Step 4: Plot a graph (useful for comparison)
Step explanation
Plot the required output using an appropriate graph.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("PCA_AY_2011_Revised.csv")
df_state = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Area Name", "Marginal Worker - Total - Persons"]].copy()
top5 = df_state.sort_values("Marginal Worker - Total - Persons", ascending=False).head(5)
print("Top 5 States/UTs by Marginal Workers (Persons):")
print(top5)
# Bar chart (helpful visual)
plt.figure()
plt.bar(top5["Area Name"], top5["Marginal Worker - Total - Persons"])
plt.title("Top 5 States/UTs: Marginal Workers (All Ages, Total)")
plt.xlabel("State / Union Territory")
plt.ylabel("Marginal Workers (Persons)")
plt.xticks(rotation=60, ha="right")
plt.tight_layout()
plt.show()
Bar Chart depicting the Top 5 States/UTs: Population Working as Marginal Workers (All Ages, Total)
Task 5: Compare the sex ratio of urban areas and rural areas using appropriate graph.
Sex Ratio formula used
Sex Ratio
{= \dfrac{\text{Male Population}}{\text{Female Population}} × 1000}
(= females per 1000 males)
Step 0: Import required libraries
Code:
import pandas as pd
import matplotlib.pyplot as plt
Step 1: Read the dataset in a DataFrame
Code:
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Filter only the required rows and columns
What we need (logic)
To compare Urban vs Rural sex ratio (India), we take:
•
Area Name = INDIA
•
Adolescent and youth categories = All Ages
•
Total/ Rural/ Urban = Rural and Urban
•
Columns used: Total Population – Males, Total Population – Females
Code:
df_ru = df[
(df["Area Name"] == "INDIA") &
(df["Adolescent and youth categories"] == "All Ages") &
(df["Total/ Rural/ Urban"].isin(["Rural", "Urban"]))
][["Total/ Rural/ Urban", "Total Population - Males", "Total Population - Females"]].copy()
df_ru.columns = ["Class", "MalePop", "FemalePop"]
Step 3: Compute sex ratio for Rural and Urban
Result (calculation + output data points):
g = df_ru.groupby("Class")[["MalePop", "FemalePop"]].sum()
g["SexRatio"] = (g["FemalePop"] / g["MalePop"]) * 1000
result = g.reset_index()
print(result)
Data points used for comparison
Class
MalePop
FemalePop
SexRatio
(Females per 1000 Males)
(Females per 1000 Males)
Rural
42,77,81,058
40,59,67,794
949.01
Urban
19,54,89,200
18,16,16,925
929.04
✅ Conclusion (from your dataset):
•
Rural sex ratio: 949.01
•
Urban sex ratio: 929.04
Step 4: Plot the comparison graph (required by the question)
Code:
plt.figure()
plt.bar(result["Class"], result["SexRatio"])
plt.title("Sex Ratio (Females per 1000 Males): Urban vs Rural (INDIA, All Ages)")
plt.xlabel("Area type")
plt.ylabel("Sex Ratio")
plt.tight_layout()
plt.show()
Here is the consolidated code:
# Project IV - Task 5 (Consolidated Code)
# Compare sex ratio of Urban vs Rural areas (INDIA) using a graph
import pandas as pd
import matplotlib.pyplot as plt
# Read the dataset (use the .xlsx-converted version for reliable reading)
df = pd.read_csv("PCA_AY_2011_Revised.csv")
# Filter: INDIA + All Ages + Rural/Urban only
df_ru = df[
(df["Area Name"] == "INDIA") &
(df["Adolescent and youth categories"] == "All Ages") &
(df["Total/ Rural/ Urban"].isin(["Rural", "Urban"]))
][["Total/ Rural/ Urban", "Total Population - Males", "Total Population - Females"]].copy()
# Rename columns for ease
df_ru.columns = ["Class", "MalePop", "FemalePop"]
# Group (safety) and compute Sex Ratio = (Female/Male)*1000
g = df_ru.groupby("Class")[["MalePop", "FemalePop"]].sum()
g["SexRatio"] = (g["FemalePop"] / g["MalePop"]) * 1000
# Final data points
result = g.reset_index()
print("Data points used (Urban vs Rural):")
print(result)
# Plot (required by the question)
plt.figure()
plt.bar(result["Class"], result["SexRatio"])
plt.title("Sex Ratio (Females per 1000 Males): Urban vs Rural (INDIA, All Ages)")
plt.xlabel("Area type")
plt.ylabel("Sex Ratio")
plt.tight_layout()
plt.show()
Bar Chart depicting the Sex Ratio (Females per 1000 Males): Urban vs Rural (INDIA, All Ages)
Task 6: Which state has the highest and the lowest percentage of literate Scheduled Tribes and Scheduled Castes?
Step 0: Import required libraries
Step explanation
Import the required libraries.
Code:
import pandas as pd
import numpy as np
Step 1: Read the dataset into a DataFrame
Step explanation
Read the file into a DataFrame.
Code:
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Filter the required rows (State/UT + Total + All Ages)
Step explanation
Filter data by selecting the correct rows and columns needed for the query.
Step result (logic + code)
For this question we need:
•
State/UT totals only → District Code == 0
•
Exclude INDIA → Area Name != “INDIA”
•
Total (not Rural/Urban separately) → Total/ Rural/ Urban == “Total”
•
All Ages → Adolescent and youth categories == “All Ages”
And these columns:
•
Scheduled Tribe – Persons, Literates – Scheduled Tribe – Persons
•
Scheduled Caste – Persons, Literates – Scheduled Caste – Persons
state_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][[
"Area Name",
"Scheduled Tribe - Persons",
"Literates - Scheduled Tribe - Persons",
"Scheduled Caste - Persons",
"Literates - Scheduled Caste - Persons"
]].copy()
Step 3: Compute literacy percentage for ST and SC
Step explanation
After filtering, compute the required values from the selected columns.
Step result (formula + code)
\text{Percentage literate} = \left(\dfrac{\text{Literate persons}}{\text{Total persons}}\right) \times 100
state_df["ST_Literate_Pct"] = np.where(
state_df["Scheduled Tribe - Persons"] > 0,
(state_df["Literates - Scheduled Tribe - Persons"] / state_df["Scheduled Tribe - Persons"]) * 100,
np.nan
)
state_df["SC_Literate_Pct"] = np.where(
state_df["Scheduled Caste - Persons"] > 0,
(state_df["Literates - Scheduled Caste - Persons"] / state_df["Scheduled Caste - Persons"]) * 100,
np.nan
)
(If a state/UT has 0 ST or 0 SC population, the percentage is not defined, so we skip it.)
Step 4: Find the highest and lowest state/UT (final answer)
Step explanation
Arrange results to identify the maximum and minimum values.
Step result (code)
st_valid = state_df.dropna(subset=["ST_Literate_Pct"])
sc_valid = state_df.dropna(subset=["SC_Literate_Pct"])
st_high = st_valid.loc[st_valid["ST_Literate_Pct"].idxmax()]
st_low = st_valid.loc[st_valid["ST_Literate_Pct"].idxmin()]
sc_high = sc_valid.loc[sc_valid["SC_Literate_Pct"].idxmax()]
sc_low = sc_valid.loc[sc_valid["SC_Literate_Pct"].idxmin()]
Final Answer (from our dataset)
Scheduled Tribes (ST) – Literate %
•
Highest: State – LAKSHADWEEP (31) → 81.11%
(Total ST persons = 61,120; Literate ST persons = 49,576)
•
Lowest: State – JAMMU & KASHMIR (01) → 40.29%
(Total ST persons = 14,93,299; Literate ST persons = 6,01,703)
Scheduled Castes (SC) – Literate %
•
Highest: State – DAMAN & DIU (25) → 83.33%
(Total SC persons = 6,124; Literate SC persons = 5,103)
•
Lowest: State – BIHAR (10) → 38.42%
(Total SC persons = 1,65,67,325; Literate SC persons = 63,65,150)
Consolidated code (Task 6 in one place)
import pandas as pd
import numpy as np
df = pd.read_csv("PCA_AY_2011_Revised.csv")
state_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][[
"Area Name",
"Scheduled Tribe - Persons",
"Literates - Scheduled Tribe - Persons",
"Scheduled Caste - Persons",
"Literates - Scheduled Caste - Persons"
]].copy()
state_df["ST_Literate_Pct"] = np.where(
state_df["Scheduled Tribe - Persons"] > 0,
(state_df["Literates - Scheduled Tribe - Persons"] / state_df["Scheduled Tribe - Persons"]) * 100,
np.nan
)
state_df["SC_Literate_Pct"] = np.where(
state_df["Scheduled Caste - Persons"] > 0,
(state_df["Literates - Scheduled Caste - Persons"] / state_df["Scheduled Caste - Persons"]) * 100,
np.nan
)
st_valid = state_df.dropna(subset=["ST_Literate_Pct"])
sc_valid = state_df.dropna(subset=["SC_Literate_Pct"])
st_high = st_valid.loc[st_valid["ST_Literate_Pct"].idxmax()]
st_low = st_valid.loc[st_valid["ST_Literate_Pct"].idxmin()]
sc_high = sc_valid.loc[sc_valid["SC_Literate_Pct"].idxmax()]
sc_low = sc_valid.loc[sc_valid["SC_Literate_Pct"].idxmin()]
print("ST Highest:", st_high["Area Name"], round(st_high["ST_Literate_Pct"], 2))
print("ST Lowest :", st_low["Area Name"], round(st_low["ST_Literate_Pct"], 2))
print("SC Highest:", sc_high["Area Name"], round(sc_high["SC_Literate_Pct"], 2))
print("SC Lowest :", sc_low["Area Name"], round(sc_low["SC_Literate_Pct"], 2))
Task 7: For each state, compare the no. of female marginal workers with no. of male marginal workers. Use appropriate graphs.
Step 0: Import required libraries
Result (code):
import pandas as pd
import matplotlib.pyplot as plt
Step 1: Read the dataset into a DataFrame
Result (action/code):
df = pd.read_csv("PCA_AY_2011_Revised.csv" )
Step 2: Filter the required rows (State/UT + Total + All Ages)
Result (logic + code):
To compare for each State/UT, we take:
•
State/UT totals: District Code == 0
•
Exclude INDIA total: Area Name != “INDIA”
•
Overall totals: Total/ Rural/ Urban == “Total”
•
All ages: Adolescent and youth categories == “All Ages”
state_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][[
"Area Name",
"Marginal Worker - Total - Males",
"Marginal Worker - Total - Females"
]].copy()
Step 3: Arrange data “state-wise” for comparison
Result (analysis):
I prepared the full state-wise table (this is the data used for the graph).
Quick view (Top 5 by total marginal workers)
(These 5 rows appeared in the output right after the chart.)
•
State – UTTAR PRADESH (09): Male 1,24,26,463, Female 87,52,760
•
State – BIHAR (10): Male 79,51,499, Female 54,13,877
•
State – WEST BENGAL (19): Male 50,37,768, Female 40,31,957
•
State – MADHYA PRADESH (23): Male 37,84,905, Female 50,87,109
•
State – RAJASTHAN (08): Male 30,53,539, Female 57,74,748
Step 4: Plot the comparison graph (required)
Result: ✅ I generated and displayed the graph in the chat:
•
Stacked bar chart for all States/UTs (Male + Female marginal workers per state)
This is appropriate because it lets us compare male and female marginal workers across many states in one view, as asked.
Consolidated Python code (Task 7 in one place)
import pandas as pd
import matplotlib.pyplot as plt
# Read dataset (converted from the uploaded .xls)
df = pd.read_csv("PCA_AY_2011_Revised.csv")
# Filter: State/UT totals, Total, All Ages; exclude INDIA
state_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][[
"Area Name",
"Marginal Worker - Total - Males",
"Marginal Worker - Total - Females"
]].copy()
state_df = state_df.rename(columns={
"Area Name": "State/UT",
"Marginal Worker - Total - Males": "MaleMarginalWorkers",
"Marginal Worker - Total - Females": "FemaleMarginalWorkers"
})
# Sort for cleaner plotting
state_df["TotalMarginalWorkers"] = state_df["MaleMarginalWorkers"] + state_df["FemaleMarginalWorkers"]
state_df = state_df.sort_values("TotalMarginalWorkers", ascending=False).reset_index(drop=True)
# Plot (stacked bar, good for many states/UTs)
plt.figure(figsize=(16, 7))
x = range(len(state_df))
plt.bar(x, state_df["MaleMarginalWorkers"], label="Male marginal workers")
plt.bar(
x,
state_df["FemaleMarginalWorkers"],
bottom=state_df["MaleMarginalWorkers"],
label="Female marginal workers"
)
plt.title("State/UT-wise Marginal Workers: Male vs Female (All Ages, Total)")
plt.xlabel("State / Union Territory")
plt.ylabel("Number of marginal workers (persons)")
plt.xticks(list(x), state_df["State/UT"], rotation=90)
plt.tight_layout()
plt.show()
# Optional: print top 10 for quick discussion
print(state_df.head(10)[["State/UT","MaleMarginalWorkers","FemaleMarginalWorkers","TotalMarginalWorkers"]])
The following is the data that will be ultimately used by the graph:
State/UT
MaleMarginalWorkers
FemaleMarginalWorkers
TotalMarginalWorkers
State – UTTAR PRADESH (09)
1,24,26,463
87,52,760
2,11,79,223
State – BIHAR (10)
79,51,499
54,13,877
1,33,65,376
State – WEST BENGAL (19)
50,37,768
40,31,957
90,69,725
State – MADHYA PRADESH (23)
37,84,905
50,87,109
88,72,014
State – RAJASTHAN (08)
30,53,539
57,74,748
88,28,287
State – ODISHA (21)
31,08,242
37,25,804
68,34,046
State – ANDHRA PRADESH (28)
27,25,514
36,60,014
63,85,528
State – JHARKHAND (20)
31,90,327
30,89,352
6,279,679
State – MAHARASHTRA (27)
26,27,561
30,37,427
56,64,988
State – TAMIL NADU (33)
24,73,784
24,68,716
49,42,500
State – KARNATAKA (29)
19,20,279
25,55,137
44,75,416
State – GUJARAT (24)
14,33,219
29,69,154
44,02,373
State – CHHATTISGARH (22)
15,36,412
24,02,099
39,38,511
State – ASSAM (18)
15,06,918
17,75,649
32,82,567
State – KERALA (32)
12,71,741
10,17,575
22,89,316
State – HARYANA (06)
9,46,036
9,55,189
19,01,225
State – JAMMU & KASHMIR (01)
8,89,302
7,89,262
16,78,564
State – HIMACHAL PRADESH (02)
6,04,384
8,92,537
14,96,921
State – PUNJAB (03)
8,09,526
6,36,900
14,46,426
State – UTTARAKHAND (05)
4,81,161
5,20,490
10,01,651
State – TRIPURA (16)
1,57,445
2,35,057
3,92,502
State – MANIPUR (14)
1,24,273
2,06,174
3,30,447
State – NCT OF DELHI (07)
1,99,316
80,404
2,79,720
State – MEGHALAYA (17)
1,18,189
1,45,855
2,64,044
State – NAGALAND (13)
1,05,153
1,27,790
2,32,943
State – ARUNACHAL PRADESH (12)
49,164
59,772
1,08,936
State – GOA (30)
62,569
38,626
1,01,195
State – SIKKIM (11)
33,845
43,896
77,741
State – MIZORAM (15)
27,435
44,240
71,675
State – PUDUCHERRY (34)
26,522
18,757
45,279
State – DADRA & NAGAR HAVELI (26)
10,168
16,694
26,862
State – ANDAMAN & NICOBAR ISLANDS (35)
17,270
9,355
26,625
State – CHANDIGARH (04)
10,969
7,238
18,207
State – LAKSHADWEEP (31)
6,181
1,768
7,949
State – DAMAN & DIU (25)
2,820
2,016
4,836
Bar Chart depicting the State/UT-wise Marginal Workers: Male vs Female (All Ages, Total)
Task 8: What percentage of Scheduled Tribes lives in urban areas? Draw a pie chart showing the proportion of literate and illiterates scheduled tribes living in urban areas.
Step 0: Import required libraries
Result (code):
import pandas as pd
import matplotlib.pyplot as plt
Step 1: Read the dataset in a DataFrame
Result (code):
df = pd.read_excel("PCA_AY_2011_Revised.csv")
Step 2: Filter the required rows (INDIA, All Ages, Total + Urban)
We need:
•
Area Name = INDIA
•
Adolescent and youth categories = All Ages
•
Total/ Rural/ Urban = Total and Urban
•
Columns needed: Scheduled Tribe – Persons, Literates – Scheduled Tribe – Persons
Result (code):
df_india = df[
(df["Area Name"] == "INDIA") &
(df["Adolescent and youth categories"] == "All Ages") &
(df["Total/ Rural/ Urban"].isin(["Total", "Urban"]))
][[
"Total/ Rural/ Urban",
"Scheduled Tribe - Persons",
"Literates - Scheduled Tribe - Persons"
]].copy()
df_india.columns = ["Class", "ST_Persons", "ST_Literate"]
df_india["ST_Illiterate"] = df_india["ST_Persons"] - df_india["ST_Literate"]
Step 3: Calculate the percentage of Scheduled Tribes living in urban areas
Formula:
{\text{\% ST in Urban} = \dfrac{\text{Urban ST Persons}}{\text{Total ST Persons}} × 100}
Data (from our file):
•
Total ST Persons (INDIA, Total, All Ages) = 10,45,45,716
•
Urban ST Persons (INDIA, Urban, All Ages) = 1,04,61,872
Result:
{\text{\%} = \dfrac{1,04,61,872}{10,45,45,716} × 100}
✅ Answer (Part 1): 10.01% of Scheduled Tribes live in urban areas.
Step 4: Pie chart for Literate vs Illiterate ST living in urban areas
Task says: Draw a pie chart showing the proportion of literate and illiterates ST living in urban areas.
Urban ST (INDIA, Urban, All Ages):
•
Literate ST (Urban) = 70,03,778
•
Illiterate ST (Urban) = 34,58,094
So the pie chart proportions are:
•
Literate: 66.9%
•
Illiterate: 33.1%
✅ Pie chart has been generated and shown in the chat above.
Result (code):
urban_lit = 7003778
urban_illit = 3458094
plt.figure()
plt.pie(
[urban_lit, urban_illit],
labels=["Literate ST (Urban)", "Illiterate ST (Urban)"],
autopct="%1.1f%%"
)
plt.title("Scheduled Tribes in Urban Areas (INDIA, All Ages): Literate vs Illiterate")
plt.tight_layout()
plt.show()
Consolidated Python code (Task 8 in one place)
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("PCA_AY_2011_Revised.csv")
# Filter: INDIA, All Ages, Total + Urban
df_india = df[
(df["Area Name"] == "INDIA") &
(df["Adolescent and youth categories"] == "All Ages") &
(df["Total/ Rural/ Urban"].isin(["Total", "Urban"]))
][[
"Total/ Rural/ Urban",
"Scheduled Tribe - Persons",
"Literates - Scheduled Tribe - Persons"
]].copy()
df_india.columns = ["Class", "ST_Persons", "ST_Literate"]
df_india["ST_Illiterate"] = df_india["ST_Persons"] - df_india["ST_Literate"]
pivot = df_india.set_index("Class")
# Part 1: % ST living in Urban areas
st_total = float(pivot.loc["Total", "ST_Persons"])
st_urban = float(pivot.loc["Urban", "ST_Persons"])
urban_pct = (st_urban / st_total) * 100
print("Total ST (INDIA):", int(st_total))
print("Urban ST (INDIA):", int(st_urban))
print("% ST living in Urban areas:", round(urban_pct, 2), "%")
# Part 2: Pie chart (Urban ST literate vs illiterate)
urban_lit = float(pivot.loc["Urban", "ST_Literate"])
urban_illit = float(pivot.loc["Urban", "ST_Illiterate"])
plt.figure()
plt.pie([urban_lit, urban_illit],
labels=["Literate ST (Urban)", "Illiterate ST (Urban)"],
autopct="%1.1f%%")
plt.title("Scheduled Tribes in Urban Areas (INDIA, All Ages): Literate vs Illiterate")
plt.tight_layout()
plt.show()
Bar Chart depicting the Scheduled Tribes in Urban Areas (INDIA, All Ages): Literate vs Illiterate
Task 9: What is the state wise ratio of literates vs. illiterates in all age groups?
Note: A graph isn’t strictly mandatory (the question doesn’t say “draw”), but because it’s state-wise (many states/UTs), a bar graph makes the comparison much clearer, so I’m including it.
Step 0: Import required libraries (as per the book’s method)
Result (code):
import pandas as pd
import matplotlib.pyplot as plt
Step 1: Read the data into a DataFrame
Result (code):
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Filter the required rows (State/UT totals + Total + All Ages)
What we select (student logic):
•
Only State/UT totals (not districts) → `District Code = 0`
•
Only Total (not Rural/Urban separately)
•
Only All Ages
•
Use these columns (dataset has literates/illiterates info):
•
`Total Population – Persons`
•
`Literates – Total – Persons`
Result (code):
state_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][[
"Area Name",
"Total Population - Persons",
"Literates - Total - Persons"
]].copy()
Step 3: Compute illiterates and the ratio (Literate vs Illiterate)
Formula used:
•
Illiterates (Persons) = Total Persons − Literates (Persons)
•
Ratio (L vs I) = Literates / Illiterates
Result (code):
state_df["IlliteratePersons"] = state_df["Total Population - Persons"] - state_df["Literates - Total - Persons"]
state_df["Literate_to_Illiterate_Ratio"] = state_df["Literates - Total - Persons"] / state_df["IlliteratePersons"]
Step 4: Arrange state-wise ratios and identify extremes
✅
Highest L:I ratio: Kerala = 5.34
•
Literates: 2,81,35,824
•
Illiterates: 52,70,237
✅
Lowest L:I ratio: Bihar = 1.02
•
Literates: 5,25,04,553
•
Illiterates: 5,15,94,899
(Interpretation: Kerala has ~5.34 literates for every 1 illiterate, while Bihar is close to 1:1.)
Quick view
•
Top 5 ratios: Kerala (5.34), Lakshadweep (4.41), Goa (3.98), Daman & Diu (3.44), Andaman & Nicobar (3.41)
•
Bottom 5 ratios: Jammu & Kashmir (1.29), Rajasthan (1.26), Jharkhand (1.25), Arunachal Pradesh (1.24), Bihar (1.02)
Step 5: Graph (recommended for state-wise comparison)
Data used to generate the graph is as follows:
State/UT
TotalPersons
LiteratePersons
IlliteratePersons
Literate_to_Illiterate_Ratio
Literate_per_100_Illiterates
State – KERALA (32)
3,34,06,061
2,81,35,824
52,70,237
5.34
533.86
State – LAKSHADWEEP (31)
64,473
52,553
11,920
4.41
440.88
State – GOA (30)
14,58,545
11,65,487
2,93,058
3.98
397.70
State – DAMAN & DIU (25)
2,43,247
1,88,406
54,841
3.44
343.55
State – ANDAMAN & NICOBAR ISLANDS (35)
3,80,581
2,94,281
86,300
3.41
341.00
State – MIZORAM (15)
10,97,206
8,48,175
2,49,031
3.41
340.59
State – PUDUCHERRY (34)
12,47,953
9,57,309
2,90,644
3.29
329.38
State – TRIPURA (16)
36,73,917
28,04,783
8,69,134
3.23
322.71
State – CHANDIGARH (04)
10,55,450
8,05,438
2,50,012
3.22
322.16
State – NCT OF DELHI (07)
1,67,87,941
1,27,37,767
40,50,174
3.14
314.50
State – HIMACHAL PRADESH (02)
68,64,602
50,39,736
18,24,866
2.76
276.17
State – SIKKIM (11)
6,10,577
4,44,952
1,65,625
2.69
268.65
State – MAHARASHTRA (27)
11,23,74,333
8,15,54,290
3,08,20,043
2.65
264.61
State – TAMIL NADU (33)
7,21,47,030
5,18,37,507
2,03,09,523
2.55
255.24
State – UTTARAKHAND (05)
1,00,86,292
68,80,953
32,05,339
2.15
214.67
State – GUJARAT (24)
6,04,39,692
4,10,93,358
1,93,46,334
2.12
212.41
State – NAGALAND (13)
19,78,502
13,42,434
6,36,068
2.11
211.05
State – PUNJAB (03)
2,77,43,338
1,87,07,137
90,36,201
2.07
207.02
State – WEST BENGAL (19)
9,12,76,115
6,15,38,281
2,97,37,834
2.07
206.94
State – MANIPUR (14)
28,55,794
19,08,476
9,47,318
2.01
201.46
State – KARNATAKA (29)
6,10,95,297
4,06,47,322
2,04,47,975
1.99
198.78
State – HARYANA (06)
2,53,51,462
1,65,98,988
87,52,474
1.90
189.65
State – DADRA & NAGAR HAVELI (26)
3,43,709
2,23,230
1,20,479
1.85
185.29
State – ODISHA (21)
4,19,74,218
2,67,42,595
1,52,31,623
1,76
175.57
State – ASSAM (18)
3,12,05,576
1,91,77,977
1,20,27,599
1.59
159.45
State – CHHATTISGARH (22)
2,55,45,198
1,53,79,922
1,01,65,276
1.51
151.30
State – MEGHALAYA (17)
29,66,889
17,85,005
11,81,884
1.51
151.03
State – ANDHRA PRADESH (28)
8,45,80,777
5,05,56,760
3,40,24,017
1.49
148.59
State – MADHYA PRADESH (23)
7,26,26,809
4,28,51,169
2,97,75,640
1.44
143.91
State – UTTAR PRADESH (09)
19,98,12,341
11,43,97,555
8,54,14,786
1.34
133.93
State – JAMMU & KASHMIR (01)
1,25,41,302
70,67,233
54,74,069
1.29
129.10
State – RAJASTHAN (08)
6,85,48,437
3,82,75,282
3,02,73,155
1.26
126.43
State – JHARKHAND (20)
3,29,88,134
1,83,28,069
1,46,60,065
1.25
125.02
State – ARUNACHAL PRADESH (12)
13,83,727
7,66,005
6,17,722
1.24
124.00
State – BIHAR (10)
10,40,99,452
5,25,04,553
5,15,94,899
1.02
101.76
The following code is used to generate the graph:
# Sort by ratio for a clean comparison chart
state_df_sorted = state_df.sort_values("Literate_to_Illiterate_Ratio", ascending=False).reset_index(drop=True)
# Bar chart (state-wise ratios)
plt.figure(figsize=(16, 7))
x = range(len(state_df_sorted))
plt.bar(x, state_df_sorted["Literate_to_Illiterate_Ratio"])
plt.title("State/UT-wise Ratio of Literates to Illiterates (All Ages, Total)")
plt.xlabel("State / Union Territory (sorted by ratio)")
plt.ylabel("Literate : Illiterate (L/I) ratio")
plt.xticks(list(x), state_df_sorted["State/UT"], rotation=90)
plt.tight_layout()
plt.show()
Consolidated Code:
# State-wise ratio of literates vs illiterates in all age groups (+ graph)
import pandas as pd
import matplotlib.pyplot as plt
# Step 1: Read dataset
df = pd.read_csv("PCA_AY_2011_Revised.csv")
# Step 2: Filter State/UT totals, Total, All Ages; exclude INDIA
state_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Area Name", "Total Population - Persons", "Literates - Total - Persons"]].copy()
# Rename for ease
state_df = state_df.rename(columns={
"Area Name": "State/UT",
"Total Population - Persons": "TotalPersons",
"Literates - Total - Persons": "LiteratePersons"
})
# Step 3: Compute Illiterates + Literate:Illiterate ratio
state_df["IlliteratePersons"] = state_df["TotalPersons"] - state_df["LiteratePersons"]
state_df["Literate_to_Illiterate_Ratio"] = state_df["LiteratePersons"] / state_df["IlliteratePersons"]
# Step 4: Sort for reporting + find highest/lowest
state_df_sorted = state_df.sort_values("Literate_to_Illiterate_Ratio", ascending=False).reset_index(drop=True)
highest = state_df_sorted.iloc[0]
lowest = state_df_sorted.iloc[-1]
print("Highest L:I ratio:", highest["State/UT"], round(highest["Literate_to_Illiterate_Ratio"], 3))
print(" Literates:", int(highest["LiteratePersons"]), "Illiterates:", int(highest["IlliteratePersons"]))
print("Lowest L:I ratio:", lowest["State/UT"], round(lowest["Literate_to_Illiterate_Ratio"], 3))
print(" Literates:", int(lowest["LiteratePersons"]), "Illiterates:", int(lowest["IlliteratePersons"]))
# Step 5: Graph (recommended for state-wise comparison)
plt.figure(figsize=(16, 7))
x = range(len(state_df_sorted))
plt.bar(x, state_df_sorted["Literate_to_Illiterate_Ratio"])
plt.title("State/UT-wise Ratio of Literates to Illiterates (All Ages, Total)")
plt.xlabel("State / Union Territory (sorted by ratio)")
plt.ylabel("Literate : Illiterate (L/I) ratio")
plt.xticks(list(x), state_df_sorted["State/UT"], rotation=90)
plt.tight_layout()
plt.show()
# (Optional) Save output for reuse
state_df_sorted.to_csv("task9_literate_vs_illiterate_ratio_statewise.csv", index=False)
print("Saved: task9_literate_vs_illiterate_ratio_statewise.csv")
Bar Chart depicting the State/UT-wise Ratio of Literates to Illiterates (All Ages, Total)
Task 10: Which state is home to the maximum no. of ST in India? Which state has the minimum no. of ST in India?
This task does not need a graph (it’s just “maximum” and “minimum”).
Step 1: Read the dataset in a DataFrame
Result (code):
import pandas as pd
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Filter only State/UT totals (Total, All Ages)
Result (code):
st_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Area Name", "Scheduled Tribe - Persons"]].copy()
st_df.columns = ["State/UT", "ST_Persons"]
Step 3: Find maximum and minimum ST population
Result (code):
st_sorted = st_df.sort_values("ST_Persons", ascending=False).reset_index(drop=True)
max_state = st_sorted.iloc[0]["State/UT"]
max_value = int(st_sorted.iloc[0]["ST_Persons"])
min_state = st_sorted.iloc[-1]["State/UT"]
min_value = int(st_sorted.iloc[-1]["ST_Persons"])
Final Answer (from your dataset)
✅
Maximum ST population: State – MADHYA PRADESH (23) → 1,53,16,784
✅
Minimum ST population (including 0): State – PUDUCHERRY (34) → 0
Note (student-style): In this dataset, Puducherry has 0 ST population, so it becomes the “minimum”.
If your teacher expects “minimum but not zero”, then:
✅
Minimum ST population (non-zero): State – DAMAN & DIU (25) → 15,363
Consolidated code
import pandas as pd
df = pd.read_csv("PCA_AY_2011_Revised.csv")
st_df = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Area Name", "Scheduled Tribe - Persons"]].copy()
st_df.columns = ["State/UT", "ST_Persons"]
st_sorted = st_df.sort_values("ST_Persons", ascending=False).reset_index(drop=True)
max_state = st_sorted.iloc[0]["State/UT"]
max_value = int(st_sorted.iloc[0]["ST_Persons"])
min_state = st_sorted.iloc[-1]["State/UT"]
min_value = int(st_sorted.iloc[-1]["ST_Persons"])
print("Maximum ST:", max_state, max_value)
print("Minimum ST (including 0):", min_state, min_value)
# Optional: Minimum non-zero ST
nonzero = st_sorted[st_sorted["ST_Persons"] > 0]
if len(nonzero) > 0:
min_nz_state = nonzero.iloc[-1]["State/UT"]
min_nz_value = int(nonzero.iloc[-1]["ST_Persons"])
print("Minimum ST (non-zero):", min_nz_state, min_nz_value)
Tabular Data (for reference)
State/UT
ST_Persons
State – MADHYA PRADESH (23)
1,53,16,784
State – MAHARASHTRA (27)
1,05,10,213
State – ODISHA (21)
95,90,756
State – RAJASTHAN (08)
92,38,534
State – GUJARAT (24)
89,17,174
State – JHARKHAND (20)
86,45,042
State – CHHATTISGARH (22)
78,22,902
State – ANDHRA PRADESH (28)
59,18,073
State – WEST BENGAL (19)
52,96,953
State – KARNATAKA (29)
42,48,987
State – ASSAM (18)
38,84,371
State – MEGHALAYA (17)
25,55,861
State – NAGALAND (13)
17,10,973
State – JAMMU & KASHMIR (01)
14,93,299
State – BIHAR (10)
13,36,573
State – MANIPUR (14)
11,67,422
State – TRIPURA (16)
11,66,813
State – UTTAR PRADESH (09)
11,34,273
State – MIZORAM (15)
10,36,115
State – ARUNACHAL PRADESH (12)
9,51,821
State – TAMIL NADU (33)
7,94,697
State – KERALA (32)
4,84,839
State – HIMACHAL PRADESH (02)
3,92,126
State – UTTARAKHAND (05)
2,91,903
State – SIKKIM (11)
2,06,360
State – DADRA & NAGAR HAVELI (26)
1,78,564
State – GOA (30)
1,49,275
State – LAKSHADWEEP (31)
61,120
State – ANDAMAN & NICOBAR ISLANDS (35)
28,530
State – DAMAN & DIU (25)
15,363
State – PUNJAB (03)
0
State – CHANDIGARH (04)
0
State – HARYANA (06)
0
State – NCT OF DELHI (07)
0
State – PUDUCHERRY (34)
0
Task 11: For each state, find the no. of literate females and no. of literate males. Draw a bar graph for the same. Which state has the highest ratio of literate female vs literate male and which state has the minimum?
Step 1: Read the dataset
Result (code):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("PCA_AY_2011_Revised.csv")
Step 2: Filter the data (State/UT totals, Total, All Ages)
What I filter (student logic):
•
State/UT totals only → `District Code = 0`
•
Only “Total” (not Rural/Urban separately)
•
Only “All Ages”
•
Exclude “INDIA” because we need state-wise data
Result (code):
state_lit = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Area Name", "Literates - Total - Males", "Literates - Total - Females"]].copy()
state_lit = state_lit.rename(columns={
"Area Name": "State/UT",
"Literates - Total - Males": "LiterateMales",
"Literates - Total - Females": "LiterateFemales"
})
Step 3: Calculate the ratio (Literate Females vs Literate Males)
Formula used:
{\text{Female-to-Male Literate Ratio} = \dfrac{\text{Literate Females}}{\text{Literate Males}}}
Result (code):
state_lit["Female_to_Male_Literate_Ratio"] = (
state_lit["LiterateFemales"] / state_lit["LiterateMales"]
)
state_lit_sorted = state_lit.sort_values(
"Female_to_Male_Literate_Ratio", ascending=False
).reset_index(drop=True)
Step 4: Identify the highest and lowest ratio states
Result (final answer from our dataset):
✅ Highest female:male literate ratio: State – KERALA (32)
•
Literate Females = 1,44,30,921
•
Literate Males = 1,37,04,903
•
Ratio = 1.0530
✅ Lowest female:male literate ratio: State – DAMAN & DIU (25)
•
Literate Females = 63,763
•
Literate Males = 1,24,643
•
Ratio = 0.5116
Step 5: Draw the bar graph (required)
✅ I generated and displayed the bar graph (male vs female literates for each state/UT, sorted by female/male ratio).
Result (graph code):
plt.figure(figsize=(18, 7))
x = range(len(state_lit_sorted))
bar_width = 0.45
plt.bar([i - bar_width / 2 for i in x], state_lit_sorted["LiterateMales"],
width=bar_width, label="Literate males")
plt.bar([i + bar_width / 2 for i in x], state_lit_sorted["LiterateFemales"],
width=bar_width, label="Literate females")
plt.title("State/UT-wise Literates (All Ages, Total): Male vs Female")
plt.xlabel("State / Union Territory (sorted by Female/Male ratio)")
plt.ylabel("Number of literates (persons)")
plt.xticks(list(x), state_lit_sorted["State/UT"], rotation=90)
plt.legend()
plt.tight_layout()
plt.show()
Data used for Task 11 (CSV)
Here is the extracted state-wise dataset used for Task 11:
State/UT
LiterateMales
LiterateFemales
Female_to_Male_Literate_Ratio
State – KERALA (32)
1,37,04,903
1,44,30,921
1.05
State – MEGHALAYA (17)
9,13,879
8,71,126
0.95
State – MIZORAM (15)
4,38,529
4,09,646
0.93
State – PUDUCHERRY (34)
4,97,378
4,59,931
0.92
State – GOA (30)
6,15,823
5,49,664
0.89
State – LAKSHADWEEP (31)
28,023
24,530
0.88
State – TRIPURA (16)
15,01,369
1,30,34,14
0.87
State – NAGALAND (13)
7,23,957
6,18,477
0.85
State – TAMIL NADU (33)
2,80,40,491
2,37,97,016
0.85
State – MANIPUR (14)
10,39,858
8,68,618
0.84
State – HIMACHAL PRADESH (02)
27,52,590
22,87,146
0.83
State – WEST BENGAL (19)
3,38,18,810
2,77,19,471
0.82
State – ASSAM (18)
1,05,68,639
86,09,338
0.81
State – KARNATAKA (29)
2,25,08,471
1,81,38,851
0.81
State – MAHARASHTRA (27)
4,52,57,584
3,62,96,706
0.80
State – PUNJAB (03)
1,04,36,056
82,71,081
0.79
State – ANDAMAN & NICOBAR ISLANDS (35)
1,64,377
1,29,904
0.79
State – ANDHRA PRADESH (28)
2,82,51,243
2,23,05,517
0.79
State – UTTARAKHAND (05)
38,63,708
30,17,245
0.78
State – ODISHA (21)
1,50,89,681
1,16,52,914
0.77
State – SIKKIM (11)
2,51,269
1,93,683
0.77
State – NCT OF DELHI (07)
71,94,856
55,42,911
0.77
State – GUJARAT (24)
2,34,74,873
1,76,18,485
0.75
State – CHHATTISGARH (22)
88,07,893
65,72,029
0.75
State – ARUNACHAL PRADESH (12)
4,39,868
3,26,137
0.74
State – CHANDIGARH (04)
4,65,346
3,40,092
0.73
State – MADHYA PRADESH (23)
2,51,74,328
1,76,76,841
0.70
State – HARYANA (06)
97,94,067
68,04,921
0.69
State – JHARKHAND (20)
1,08,82,519
74,45,550
0.68
State – UTTAR PRADESH (09)
6,82,34,964
4,61,62,591
0.68
State – BIHAR (10)
3,16,08,023
2,08,96,530
0.66
State – JAMMU & KASHMIR (01)
42,64,671
28,02,562
0.66
State – RAJASTHAN (08)
2,36,88,412
1,45,86,870
0.62
State – DADRA & NAGAR HAVELI (26)
1,42,521
80,709
0.57
State – DAMAN & DIU (25)
1,24,643
63,763
0.51
Consolidated code for Task 11 (one place)
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("PCA_AY_2011_Revised.csv")
state_lit = df[
(df["District Code"] == 0) &
(df["Area Name"] != "INDIA") &
(df["Total/ Rural/ Urban"] == "Total") &
(df["Adolescent and youth categories"] == "All Ages")
][["Area Name", "Literates - Total - Males", "Literates - Total - Females"]].copy()
state_lit = state_lit.rename(columns={
"Area Name": "State/UT",
"Literates - Total - Males": "LiterateMales",
"Literates - Total - Females": "LiterateFemales"
})
state_lit["Female_to_Male_Literate_Ratio"] = (
state_lit["LiterateFemales"] / state_lit["LiterateMales"]
)
state_lit_sorted = state_lit.sort_values(
"Female_to_Male_Literate_Ratio", ascending=False
).reset_index(drop=True)
highest = state_lit_sorted.iloc[0]
lowest = state_lit_sorted.iloc[-1]
print("Highest ratio:", highest["State/UT"],
"Females:", int(highest["LiterateFemales"]),
"Males:", int(highest["LiterateMales"]),
"Ratio:", round(float(highest["Female_to_Male_Literate_Ratio"]), 4))
print("Lowest ratio:", lowest["State/UT"],
"Females:", int(lowest["LiterateFemales"]),
"Males:", int(lowest["LiterateMales"]),
"Ratio:", round(float(lowest["Female_to_Male_Literate_Ratio"]), 4))
plt.figure(figsize=(18, 7))
x = range(len(state_lit_sorted))
bar_width = 0.45
plt.bar([i - bar_width / 2 for i in x], state_lit_sorted["LiterateMales"],
width=bar_width, label="Literate males")
plt.bar([i + bar_width / 2 for i in x], state_lit_sorted["LiterateFemales"],
width=bar_width, label="Literate females")
plt.title("State/UT-wise Literates (All Ages, Total): Male vs Female")
plt.xlabel("State / Union Territory (sorted by Female/Male ratio)")
plt.ylabel("Number of literates (persons)")
plt.xticks(list(x), state_lit_sorted["State/UT"], rotation=90)
plt.legend()
plt.tight_layout()
plt.show()
state_lit_sorted.to_csv("task11_statewise_literate_male_female.csv", index=False)
print("Saved: task11_statewise_literate_male_female.csv")
State/UT-wise Literates (All Ages, Total): Male vs Female