Plotting Data using Matplotlib

This page contains the solutions to Chapter 4, Plotting Data using Matplotlib, of the NCERT Class 12 Informatics Practices textbook. It covers the Short Answer Questions, Long Answer Questions and Projects/Assignments Questions from the chapter exercise.


EXERCISE

What is the purpose of the Matplotlib library?

The purpose of the Matplotlib library is to enable data visualization in Python. It’s a powerful tool used for creating static, interactive, and animated visualizations in Python. Matplotlib allows users to generate a wide range of plots and graphs that are highly customizable. It helps in visually interpreting and presenting data in a more understandable and appealing format.

The key points to understand about Matplotlib are:

– Matplotlib for Visualization: It is used to create 2D graphs and plots using Python scripts.

– Interactivity: The library provides an interactive environment across platforms.

– Customization: Users can customize every aspect of a plot.

– Types of Plots: Matplotlib can produce a variety of plots, including line, bar, scatter, and histogram.

What are some of the major components of any graphs or plot?

The major components of any graphs or plots are as follows:

Title: It provides a brief, clear indication of what the graph or plot is about. It’s essential for quickly identifying the purpose of the visualization.

Axis Labels: These labels on the x-axis and y-axis clarify what each axis represents, ensuring the data can be accurately interpreted.

Data Points: In various forms like dots in scatter plots or bars in bar charts, these represent the actual data values.

Legend: This component is crucial when multiple datasets are plotted on the same graph. It helps in distinguishing between the different data series.

Plotting Area: The space where the data is plotted, showing how data values vary along the axes.

Whiskers (in Box Plots): These extend from the box to the highest and lowest values in the dataset, excluding outliers. They provide a visual representation of the range of the data.

Outliers: These are data points that are significantly different from the rest of the data. They are visually represented and are important for understanding the data’s variability.

Name the function which is used to save the plot.

The function used to save a plot in Matplotlib is `savefig()`. This function is part of the Matplotlib library, which is extensively used in Python for creating a variety of plots and graphs. The `savefig()` function allows you to save the current figure created by your plot commands to a file. This is extremely useful for keeping a record of your visual data analysis or for preparing figures for reports or presentations.

The typical usage of the `savefig()` function involves specifying the filename and, optionally, parameters such as the file format (e.g., PNG, JPEG, PDF) and the resolution (`dpi`). The dimensions of the saved image follow from the figure size and the resolution. For instance, to save a plot as a PNG file, you might use the command `plt.savefig('filename.png')`.
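As a minimal sketch of this usage (the data values and the file name `sample_plot.png` are made up for illustration), the figure can be saved with an explicit resolution like this:

```python
import os
import matplotlib.pyplot as plt

# Hypothetical data, used only so that there is something to save
x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

plt.plot(x, y)
plt.title('Sample Plot')
plt.xlabel('X values')
plt.ylabel('Y values')

# The file format is inferred from the extension; dpi sets the resolution
output_file = 'sample_plot.png'
plt.savefig(output_file, dpi=150)
plt.close()

# Confirm that the file was written
print(os.path.exists(output_file))
```

In a script it is generally safer to call `savefig()` before `show()`, because the figure may no longer be available once the plot window has been closed.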

Write short notes on different customisation options available with any plot.

There are various customization options available in Matplotlib for enhancing plots and graphs. Here are some key customization features:

Changing Line Width and Style: The linewidth property can be adjusted to change the width of lines in a line chart. Similarly, the linestyle property allows choosing different styles like solid, dotted, dashed, or dashdot for the lines.

Customizing Histograms: The edgecolor, linestyle, and linewidth of histograms can be altered. Histograms can also be customized with properties like fill (to fill bars with color) and hatch (to fill bars with patterns like ‘-’, ‘+’, ‘x’, etc.).

Customizing Bar Charts: For bar charts, the edgecolor of bars, linestyle, linewidth, and color can be customized.

Color Customization: A variety of color options are available to enhance the visual appeal of plots. This includes standard color codes and names.

Markers: In line charts and scatter plots, different markers can be used to represent data points. These include symbols like circles, stars, triangles, etc.

Adding Titles and Labels: Titles can be added to graphs for better understanding. Axis labels for both x-axis and y-axis can also be customized to make the data representation clearer.

Legend Placement: When a plot contains multiple datasets, legends can be used for distinction. The placement and style of legends can be adjusted as per the requirement.

These customization options play a crucial role in making the plots more informative, visually appealing, and easier to understand. The flexibility to customize various aspects of a plot is one of the key strengths of Matplotlib, making it a widely used library for data visualization in Python.
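Several of these options can be seen together in a short sketch. The month names and sales figures are hypothetical; the keyword arguments (`color`, `linewidth`, `linestyle`, `marker`, `edgecolor`, `hatch`) are the Matplotlib properties named above:

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration
months = ['Jan', 'Feb', 'Mar', 'Apr']
sales = [120, 150, 90, 180]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line chart: colour, line width, line style and marker
ax1.plot(months, sales, color='green', linewidth=2,
         linestyle='--', marker='*', label='Sales')
ax1.set_title('Customised Line Chart')
ax1.set_xlabel('Month')
ax1.set_ylabel('Sales')
ax1.legend(loc='upper left')

# Bar chart: fill colour, edge colour, edge width and hatch pattern
bars = ax2.bar(months, sales, color='lightblue',
               edgecolor='red', linewidth=1.5, hatch='x')
ax2.set_title('Customised Bar Chart')
ax2.set_xlabel('Month')
ax2.set_ylabel('Sales')

plt.tight_layout()
plt.show()
```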

What is the purpose of a legend?

The purpose of a legend in a plot or graph is to provide a clear and understandable explanation of the various data represented on the graph. A legend is particularly crucial when a plot contains more than one dataset or various types of data. It helps in distinguishing between these different datasets by associating each one with a specific color, shape, or style represented in the legend.

Key points about the purpose of a legend in a plot are:

– Identification: It helps in identifying what each color, pattern, or shape in the plot represents.

– Clarity: Enhances clarity, especially in complex graphs with multiple datasets or variables.

– Ease of Understanding: Makes it easier for the viewer to understand the data being presented, thereby improving the communication of information.

The legend is an essential component for effective data visualization, ensuring that the viewer can accurately interpret the information presented in the graph.
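A small sketch of a legend in use, assuming two hypothetical datasets plotted on the same axes. The `label` given to each line is what the legend displays:

```python
import matplotlib.pyplot as plt

# Hypothetical screen-time values (hours) for two groups
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
boys = [3, 4, 2, 5, 4]
girls = [2, 3, 4, 4, 3]

fig, ax = plt.subplots()
ax.plot(days, boys, color='blue', marker='o', label='Boys')
ax.plot(days, girls, color='orange', marker='s', label='Girls')

ax.set_title('Daily Screen Time')
ax.set_xlabel('Day')
ax.set_ylabel('Hours')

# loc controls the placement, title adds a heading to the legend box
legend = ax.legend(loc='upper right', title='Group')

plt.show()
```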

Define Pandas visualisation.

Pandas visualization is the plotting functionality built into the Pandas library in Python. It creates plots and graphs directly from DataFrame and Series data structures through the `.plot()` method, which simplifies data visualization by providing a high-level interface on top of Matplotlib.

Key points about Pandas visualization are:

– Integration with Matplotlib: Pandas visualization is built on top of Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python. This integration allows for extensive customization and flexibility in plotting.

– Ease of Use: By using the `.plot()` method available on Pandas Series and DataFrame objects, it becomes straightforward to generate various types of plots, including line plots, bar plots, histograms, scatter plots, etc.

– Customizable: The `.plot()` method in Pandas accepts numerous arguments for customization, allowing users to specify the type of plot, aesthetics, and other plot details directly.

– Versatile: It supports a wide range of plot types and can be used for various data visualization needs, from basic line charts to histograms and scatter plots.
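The following sketch shows the `.plot()` method on a DataFrame. The column names and marks are hypothetical; `kind`, `marker` and `title` are arguments that `.plot()` accepts:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical average marks for two subjects over four years
df = pd.DataFrame({
    'Year': [2019, 2020, 2021, 2022],
    'Maths': [72, 75, 78, 80],
    'Science': [68, 74, 71, 79],
})

# One call draws both columns against 'Year' and returns the Matplotlib Axes
ax = df.plot(x='Year', y=['Maths', 'Science'], kind='line',
             marker='o', title='Average Marks by Year')
ax.set_xlabel('Year')
ax.set_ylabel('Average Marks')

plt.show()
```

Because `.plot()` returns a Matplotlib Axes object, any further Matplotlib customization can be applied to the result.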

What is open data? Name any two websites from which we can download open data.

Open data refers to data that is freely available to everyone to use and republish as they wish, without restrictions from copyright or other mechanisms of control. The concept of open data is particularly important in the field of research and education, as it promotes transparency, accessibility, and innovation by allowing free access to data for analysis.

Two websites that provide open data are:

Open Government Data (OGD) Platform India (data.gov.in): This platform supports the Open Data initiative of the Government of India. It offers large datasets on various projects and parameters, enabling users to access a wide range of data for different purposes.

Another example of a website offering open data, commonly used globally, is Kaggle (kaggle.com). Kaggle is a well-known platform for data science competitions, and it also provides a vast repository of datasets that are freely available for educational and research purposes.

These websites are valuable resources for students and researchers looking to access a wide range of data sets for analysis, projects, and learning purposes.

Give an example of data comparison where we can use the scatter plot.

Example 1:

A practical example of using a scatter plot for data comparison is seen in the case study of a seller named Prayatna, who deals in designer bags and wallets. Here’s the detailed scenario and how the scatter plot is utilized:

– Scenario: During a sales season, Prayatna offers varying discounts on his products. These discounts range from 10% to 50% over a period of 5 weeks.

– Data Collection: For each discount level, Prayatna records the corresponding sales figures in Rupees.

– Creating the Scatter Plot: To analyze the impact of discounts on sales, Prayatna uses a scatter plot, where:

  – The X-axis represents the discount percentage (ranging from 10% to 50%).

  – The Y-axis indicates the sales in Rupees.

  – Each dot on the scatter plot pairs a specific discount rate with the sales it generated.

– Analysis Through Scatter Plot: This visual representation allows Prayatna to discern patterns or trends in the data. For instance, if the plot shows dots trending upwards as the discount percentage increases, it might indicate that higher discounts lead to higher sales.

This example showcases how scatter plots are powerful tools in visualizing and understanding the relationship between two variables — in this case, the discount rate and sales revenue.
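The first example can be sketched in code as follows. The discount levels match the scenario, but the sales figures are hypothetical values chosen only to illustrate the plot:

```python
import matplotlib.pyplot as plt

# Discount levels from the scenario; the sales values are hypothetical
discount = [10, 20, 30, 40, 50]                 # in percent
sales = [40000, 45000, 48000, 50000, 100000]    # in Rupees

fig, ax = plt.subplots()
points = ax.scatter(discount, sales, marker='o', color='teal')

ax.set_title('Sales vs Discount')
ax.set_xlabel('Discount (%)')
ax.set_ylabel('Sales (Rs)')

plt.show()
```

Each of the five dots corresponds to one week of the sale, so an upward drift of the dots from left to right would suggest that larger discounts go with larger sales.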

Example 2:

Example: Academic Performance Analysis

– Scenario: A school wants to analyze the relationship between the number of hours students spend on study and their academic performance.

– Data Collection:

  – The school collects data on the number of hours each student spends studying per week.

  – It also records their overall academic performance, measured through their grades or GPA (Grade Point Average).

– Using the Scatter Plot:

  – The X-axis of the scatter plot represents the number of study hours per week.

  – The Y-axis shows the students’ GPA.

  – Each point on the scatter plot represents a student, pairing their study hours with their GPA.

– Interpretation and Analysis:

  – By examining the scatter plot, the school can observe if there is a correlation between study time and academic performance.

  – For example, if the plot shows that points tend to be higher on the Y-axis (higher GPA) for students with more study hours (further along the X-axis), it would suggest a positive correlation between study time and academic performance.

This example demonstrates the use of a scatter plot in an educational setting, providing a visual tool to understand and analyze the impact of study habits on academic success. Scatter plots are particularly useful in such scenarios as they can reveal patterns and correlations in data that might not be immediately obvious.

Name the plot which displays the statistical summary.

Note: Give appropriate title, set xlabel and ylabel while attempting the following questions.

The type of plot that displays the statistical summary of a given dataset is known as the Box Plot.

Details about Box Plot:

– Statistical Summary: A box plot visually represents the statistical summary of a dataset, including key measures like the minimum value, the first quartile (Q1), median (Q2), the third quartile (Q3), and the maximum value.

– Whiskers: It includes “whiskers” which extend from the box to the highest and lowest values in the dataset, excluding outliers. This helps in depicting the range of the data.

– Outliers: The box plot also aids in identifying outliers – observations that are significantly different from the rest of the data.

Creating a Box Plot:

– Title: When creating a box plot, it’s important to give it an appropriate title for easy identification of what the plot represents. For instance, “Performance Analysis” could be a suitable title if the plot is about student performance.

– X-axis Label (xlabel): This should indicate what each box represents, for example, “Subjects” if the boxes represent different academic subjects.

– Y-axis Label (ylabel): This should denote the range of values, such as “Marks” if the plot is displaying the range of marks obtained by students.

The box plot is a powerful tool for summarizing data distributions and is especially useful in comparing distributions between several groups or datasets. It provides a concise visual summary of the central tendency, variability, and skewness of the data, along with potential outliers.
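To make the statistical summary concrete, the sketch below computes the quartiles with NumPy and then draws the corresponding box plot with a title and axis labels. The marks and the subject name are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical marks of ten students in one subject
marks = [35, 48, 52, 55, 60, 62, 65, 70, 78, 95]

# Quartiles that the box summarises
q1, median, q3 = np.percentile(marks, [25, 50, 75])
print('Min:', min(marks), 'Q1:', q1, 'Median:', median,
      'Q3:', q3, 'Max:', max(marks))

fig, ax = plt.subplots()
ax.boxplot(marks)

# Title and axis labels as suggested above
ax.set_title('Performance Analysis')
ax.set_xlabel('Subjects')
ax.set_ylabel('Marks')
ax.set_xticklabels(['English'])

plt.show()
```

By default Matplotlib limits each whisker to 1.5 times the interquartile range from the box, so a value further out than that is drawn as a separate outlier point instead of being reached by the whisker.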

10. Plot the following data using a line plot:

Day: 1, 2, 3, 4, 5, 6, 7

Tickets sold: 2000, 2800, 3000, 2500, 2300, 2500, 1000

• Before displaying the plot display “Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday” in place of Day 1, 2, 3, 4, 5, 6, 7

• Change the color of the line to ‘Magenta’.

Based on the given data, a line plot can be created with the following specifications:

– X-axis (Days): The days of the week are displayed as “Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday” instead of Day 1, 2, 3, 4, 5, 6, 7.

– Y-axis (Tickets Sold): The number of tickets sold each day is plotted along the Y-axis, showing the variation in ticket sales throughout the week.

– Line Color: The color of the line in the plot is set to ‘Magenta’.

This line plot visually represents the ticket sales data over a week, providing a clear view of the sales trend for each day. Such plots are instrumental in analyzing patterns and making informed decisions based on the observed trends.

The following is the code:

import matplotlib.pyplot as plt

# Data
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
tickets_sold = [2000, 2800, 3000, 2500, 2300, 2500, 1000]

# Creating the line plot
plt.plot(days, tickets_sold, color='magenta')

# Adding title and labels
plt.title('Weekly Tickets Sales')
plt.xlabel('Day of the Week')
plt.ylabel('Tickets Sold')

# Displaying the plot
plt.show()

Weekly Tickets Sales

11. Collect data about colleges in Delhi University or any other university of your choice and number of courses they run for Science, Commerce and Humanities, store it in a CSV file and present it using a bar plot.

Step 1: Store Data in a CSV File

import pandas as pd
        
# Data about colleges and courses
data = {
    "College": ["College A", "College B", "College C", "College D", "College E",
        "College F", "College G", "College H", "College I", "College J"],
    "Science": [12, 15, 9, 13, 10, 14, 11, 8, 16, 10],
    "Commerce": [8, 10, 7, 9, 6, 11, 13, 12, 5, 14],
    "Humanities": [14, 12, 16, 11, 15, 10, 9, 14, 13, 8]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('college_courses.csv', index=False)

Step 2: Create a Bar Plot

import pandas as pd
import matplotlib.pyplot as plt

# Read the data back from the CSV file
df = pd.read_csv('college_courses.csv')

# Create a bar plot
df.plot(x='College', kind='bar', figsize=(10, 6))

# Adding titles and labels
plt.title('Number of Courses Offered by Colleges')
plt.xlabel('College')
plt.ylabel('Number of Courses')
plt.xticks(rotation=45)
plt.legend(title='Streams')

# Display the plot
plt.show()

University College Courses

Explanation

– The first part of the code creates a DataFrame with the provided data and saves it to a CSV file named ‘college_courses.csv’.

– The second part reads the data from the CSV file and creates a grouped bar plot using the DataFrame’s `plot` method, which draws with Matplotlib. Each college gets a group of three side-by-side bars showing the number of courses offered in the Science, Commerce, and Humanities streams.

– The plot is then customized with a title, axis labels, and a legend.
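For comparison, pandas draws side-by-side bars by default and stacks the streams on top of each other only when `stacked=True` is passed. A minimal sketch using the first three colleges from the data above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# First three colleges from the dataset above
df = pd.DataFrame({
    'College': ['College A', 'College B', 'College C'],
    'Science': [12, 15, 9],
    'Commerce': [8, 10, 7],
    'Humanities': [14, 12, 16],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Default behaviour: one bar per stream, placed side by side
df.plot(x='College', kind='bar', ax=ax1, title='Grouped (default)')

# stacked=True piles the streams into a single bar per college
df.plot(x='College', kind='bar', stacked=True, ax=ax2,
        title='Stacked (stacked=True)')

for ax in (ax1, ax2):
    ax.set_xlabel('College')
    ax.set_ylabel('Number of Courses')
    ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
```

The grouped form makes it easier to compare streams within a college, while the stacked form shows the total number of courses per college at a glance.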

12. Collect and store data related to the screen time of students in your class separately for boys and girls and present it using a boxplot.

The following shows the screen-time data for my classmates, and how to store this data and present it as a boxplot in Python.

Data Collection and Storage:

– Data on screen time (in hours per day) for boys and girls in my class is collected.

– For simplicity, data is collected for 10 boys and 10 girls in my class.

Data:

– Boys’ screen time (in hours): [3, 4.5, 2, 5, 6, 3.5, 4, 5.5, 2.5, 4]

– Girls’ screen time (in hours): [2, 3.5, 4, 4.5, 3, 2.5, 5, 3.5, 4, 4.5]

Python Code to Store Data and Create Boxplot:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data
boys_screen_time = [3, 4.5, 2, 5, 6, 3.5, 4, 5.5, 2.5, 4]
girls_screen_time = [2, 3.5, 4, 4.5, 3, 2.5, 5, 3.5, 4, 4.5]

# Creating DataFrame
data = {'Boys': boys_screen_time, 'Girls': girls_screen_time}
df = pd.DataFrame(data)

# Creating a boxplot
plt.figure(figsize=(8, 6))
df.boxplot(column=['Boys', 'Girls'])
plt.title('Screen Time of Students in Class (Boys vs Girls)')
plt.ylabel('Hours per Day')
plt.grid(False)
plt.show()

Screen Time of Students in Class (Boys vs Girls)

Explanation:

– The code first creates a DataFrame with the screen time data for boys and girls.

– It then calls the DataFrame’s `boxplot` method, which draws with Matplotlib, to create a boxplot that visually represents this data. The boxplot shows the distribution of screen time, including the median, quartiles, and potential outliers.
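The numbers behind the boxes can also be printed directly. A minimal sketch, assuming the same two screen-time lists as above, using the pandas `describe()` method:

```python
import pandas as pd

# Same screen-time data as in the boxplot code above
df = pd.DataFrame({
    'Boys': [3, 4.5, 2, 5, 6, 3.5, 4, 5.5, 2.5, 4],
    'Girls': [2, 3.5, 4, 4.5, 3, 2.5, 5, 3.5, 4, 4.5],
})

# describe() returns count, mean, std, min, quartiles and max per column
summary = df.describe()

# Keep only the five values that a boxplot is built from
print(summary.loc[['min', '25%', '50%', '75%', 'max']])
```

The `50%` row is the median, which the boxplot draws as the line inside each box.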

13. Explain the findings of the boxplot of Figure 4.18 by filling the following blanks:

The median for the five subjects is _____ , ______, _______, ______, ______

The highest value for the five subjects is : _____ , ______, _______, ______, ______

The lowest value for the five subjects is : _____ , ______, _______, ______, ______

______________ subject has two outliers with the value ________ and ________

______________ subject shows minimum variation

⚠Note: It is actually Figure 4.17 in the text book (there is a typo in the given question and it is given as Figure 4.18)

Figure 4.17 A boxplot of “Marks.csv”

From the boxplot above, we see that

The median for the five subjects is 80, 50, 55, 56, 76.

The highest value for the five subjects is : 95, 95, 90, 94, 95

The lowest value for the five subjects is : 60, 33, 39, 48, 54

Social_Studies subject has two outliers with the value 54 and 95

Social_Studies subject shows minimum variation

The following is the box plot diagram displayed in the Figure 4.17

14. Collect the minimum and maximum temperature of your city for a month and present it using a histogram plot.

The minimum and maximum temperature data for my city is collected for one month and presented as a histogram plot in Python.

Temperature Data for My City in January:

– Minimum Temperatures (°C): [7, 8, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7]

– Maximum Temperatures (°C): [19, 20, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19]

Python Code to Create a Histogram Plot:

import pandas as pd
import matplotlib.pyplot as plt

# Temperature data for My City in January
min_temps = [7, 8, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7]
max_temps = [19, 20, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19]

# Creating a DataFrame
df = pd.DataFrame({'Min Temperature': min_temps, 'Max Temperature': max_temps})

# Creating a histogram plot
df.plot(kind='hist', bins=10, alpha=0.7)
plt.title('Temperature Distribution in My City for January')
plt.xlabel('Temperature (°C)')
plt.ylabel('Frequency')
plt.show()

Temperature Distribution in My City in a Month

Explanation of the Code:

– The script first creates a DataFrame with the hypothetical temperature data.

– Then, it calls the DataFrame’s `plot` method, which draws with Matplotlib, to create a histogram plot that visually represents the distribution of minimum and maximum temperatures for the month of January in My City.

– The `bins` parameter in the `plot` method determines the number of bins (intervals) in the histogram. The `alpha` parameter controls the transparency of the histogram bars.
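The effect of `bins` is easier to see on a tiny dataset. The sketch below uses six hypothetical temperature readings and Matplotlib's `hist()` function, which returns the count in each bin and the bin edges:

```python
import matplotlib.pyplot as plt

# Six hypothetical temperature readings (°C)
temps = [6, 7, 7, 8, 8, 8]

fig, ax = plt.subplots()

# bins=3 splits the range 6 to 8 into three equal intervals;
# alpha=0.7 makes the bars slightly transparent
counts, edges, _ = ax.hist(temps, bins=3, alpha=0.7, edgecolor='black')

ax.set_title('Effect of the bins Parameter')
ax.set_xlabel('Temperature (°C)')
ax.set_ylabel('Frequency')

print('Counts per bin:', counts)
print('Bin edges:', edges)

plt.show()
```

With three bins there are four edges, and the counts add up to the number of readings.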

15. Conduct a class census by preparing a questionnaire. The questionnaire should contain a minimum of five questions. Questions should relate to students, their family members, their class performance, their health etc. Each student is required to fill up the questionnaire. Compile the information in numerical terms (in terms of percentage). Present the information through a bar, scatter–diagram. (NCERT Geography class IX, Page 60)

1. Introduction

I am a student of class XII, section Z.

I conducted a class census in my class to collect information about my classmates, their families, their health and their academic performance.

• Total number of students in my class (including me): 40

• All the 40 students filled up the questionnaire.

2. Questionnaire Used

Class Census Questionnaire – Class 12 Z

Name:

Gender:

• (a) Boy

• (b) Girl

Family members at home:

• (a) 3 or Less

• (b) 4-5

• (c) 6 or more

Number of Siblings:

• (a) 0

• (b) 1

Breakfast daily?

• (a) Yes

• (b) No

Health status (self-reported):

• (a) Generally healthy

• (b) Seasonal Allergy-Asthma

• (c) Other

Study hours per day (outside school): ____ hours (numeric)

Latest test score: ____ % (numeric)

3. Data Compilation and Analysis

After collecting the filled questionnaires from all the students, I compiled the data in numerical terms (in terms of percentage) for each question.

(a) Gender of Students

Boys: {\dfrac{22}{40} × 100 = 55\%}

Girls: {\dfrac{18}{40} × 100 = 45\%}

Total: 100%

Bar Chart depicting the percentage of boys and girls in the class

Observation: In my class the number of boys (22) is slightly more than the number of girls (18).
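The same steps (convert counts to percentages, then draw a bar chart) apply to every question in the census. A minimal sketch for the gender table, using the counts 22 and 18 out of 40 given above; the bar colours are an arbitrary choice:

```python
import matplotlib.pyplot as plt

total_students = 40
counts = {'Boys': 22, 'Girls': 18}   # counts from the gender table

# Percentage = count / total * 100
percentages = {group: count * 100 / total_students
               for group, count in counts.items()}
print(percentages)

fig, ax = plt.subplots()
ax.bar(list(percentages.keys()), list(percentages.values()),
       color=['steelblue', 'salmon'])

ax.set_title('Gender of Students in Class XII Z')
ax.set_xlabel('Gender')
ax.set_ylabel('Percentage of Students (%)')
ax.set_ylim(0, 100)

plt.show()
```

Replacing the `counts` dictionary with the counts for another question (family size, breakfast, health status and so on) produces the corresponding chart.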

(b) Family members at home

3 or Less: {\dfrac{8}{40} × 100 = 20\%}

4-5: {\dfrac{24}{40} × 100 = 60\%}

6 or more members: {\dfrac{8}{40} × 100 = 20\%}

Total: 100%

Bar Chart depicting the percentage of students by number of family members at home.

Observation: Most of my classmates come from medium-sized families with 4-5 members.

{\dfrac{6}{40} × 100 = 15\%}

{\dfrac{18}{40} × 100 = 45\%}

{\dfrac{16}{40} × 100 = 40\%}

Total

100%

Bar Chart depicting the percentage of students by number of siblings.

Observation: The majority of the students have at least 1 sibling.

(e) Morning Breakfast before arriving at school.

Yes: {\dfrac{30}{40} × 100 = 75\%}

No: {\dfrac{10}{40} × 100 = 25\%}

Total: 100%

Bar Chart depicting the percentage of students coming to school with/without breakfast.

Observation: A large majority of my classmates (75%) finish their breakfast before coming to school.

(f) Daily physical activity (≥30 min)

Yes: {\dfrac{26}{40} × 100 = 65\%}

No: {\dfrac{14}{40} × 100 = 35\%}

Total: 100%

Bar Chart depicting the percentage of students engaged in daily physical activity.

Observation: A majority of my classmates (65%) have a daily physical activity of 30 minutes or more.

(g) Health Status (self-reported)

Good: {\dfrac{32}{40} × 100 = 80\%}

Seasonal: {\dfrac{6}{40} × 100 = 15\%}

Other: {\dfrac{2}{40} × 100 = 5\%}

Total: 100%

Bar Chart depicting the percentage of students by health status.

Observation: The majority of the students (80%) have good health status.

(h) Study Hours vs Test Score:

The following is the data collected related to the study hours and latest test scores.

Student, Study hours per day

S01, 2.4
S02, 1.2
S03, 2.8
S04, 2.9
S05, 0.3
S06, 0.9
S07, 2.2
S08, 1.8
S09, 2.1
S10, 1.3
S11, 2.8
S12, 2.7
S13, 2.2
S14, 3.0
S15, 2.6
S16, 1.9
S17, 2.0
S18, 1.3
S19, 2.7
S20, 1.9
S21, 1.3
S22, 2.3
S23, 1.8
S24, 1.6
S25, 3.3
S26, 2.1
S27, 2.1
S28, 3.2
S29, 2.5
S30, 2.9
S31, 2.1
S32, 2.6
S33, 2.3
S34, 0.8
S35, 1.8
S36, 2.1
S37, 2.1
S38, 1.7
S39, 2.2
S40, 2.0

Scatter plot depicting the study hours vs test score.

16. Visit data.gov.in, search for the following in “catalogs” option of the website:

• Final population Totals, India and states

• State Wise literacy rate

Download them and create a CSV file containing population data and literacy rate of the respective state. Also add a column Region to the CSV file that should contain the values East, West, North and South. Plot a scatter plot for each region where X axis should be population and Y axis should be Literacy rate. Change the marker to a diamond and size as the square root of the literacy rate.

Group the data on the column region and display a bar chart depicting average literacy rate for each region.

What I did:

1. Visited the data.gov.in site.

2. Clicked on Catalog.

3. In the search box, entered “Final population Totals, India and states”.

4. Clicked on the “Download” link available under the section “Primary Census Abstract 2011 – India”.

5. This downloaded the Excel file “PCA0000_2011_MDDS.xls”. (The file name is visible only after downloading. On the site, there is just a “Download” link/button.)

Note: The downloaded Excel file contains both population and literacy data, so it was also used to compute the state-wise literacy rate (no separate file was downloaded for literacy).

6. Kept only the STATE level rows with TRU = Total (so we get one total row per state/UT).

7. Took the following columns:

  • Population = Total Population Person

  • 0-6 Population = Population in the age group 0-6 Person

  • Literates = Literates Population Person

8. Computed the literacy rate (standard Census style):

{\text{Literacy Rate} = \dfrac{\text{Literates (7+)}}{\text{Population (7+)}} × 100}

Where Population (7+) = Total Population – (0-6 Population)

9. Added a “Region” column with values East/West/North/South (simple school-level grouping).

10. Created the combined CSV file with the columns State, Population, Literacy Rate, Region.

11. Plotted the scatter plots and bar chart as per the question.
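The literacy-rate formula above can be checked on small numbers before it is applied to the Census file. In this sketch the function name `literacy_rate` and the figures passed to it are hypothetical, chosen only to make the arithmetic easy to follow:

```python
def literacy_rate(literates, total_population, population_0_6):
    """Return the literacy rate (%) for the population aged 7 and above."""
    population_7_plus = total_population - population_0_6
    return literates * 100 / population_7_plus


# Hypothetical figures, not taken from the Census file:
# 1000 people in total, 125 of them aged 0-6, and 700 literates
rate = literacy_rate(literates=700, total_population=1000, population_0_6=125)
print(round(rate, 2))
```

Here the population aged 7 and above is 1000 − 125 = 875, so the rate is 700 / 875 × 100. Dividing by the total population instead would understate the rate, because children aged 0-6 are not counted as literates.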

The following is the code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# -----------------------------
# 1) Read the Census PCA file
# -----------------------------
# Put your file path here
path = "PCA0000_2011_MDDS.xls"

# The sheet name in this file is typically "PCA"
df = pd.read_excel(path, sheet_name="PCA")

# -----------------------------------------------
# 2) Keep only STATE level totals (TRU = Total)
# -----------------------------------------------
state_df = df[(df["Level"] == "STATE") & (df["TRU"] == "Total")].copy()
    
# ----------------------------------------------------------
# 3) Build population + compute literacy rate (Census logic)
#    Literacy Rate (%) = Literates (7+) / Population (7+) * 100
# ----------------------------------------------------------
state_df["Population"] = state_df["Total Population Person"]
state_df["Pop_0_6"] = state_df["Population in the age group 0-6 Person"]
state_df["Pop_7_plus"] = state_df["Population"] - state_df["Pop_0_6"]

state_df["Literates_7_plus"] = state_df["Literates Population Person"]
state_df["LiteracyRate"] = (state_df["Literates_7_plus"] / state_df["Pop_7_plus"]) * 100

# -------------------------------------
# 4) Add Region column (E/W/N/S mapping)
# -------------------------------------
region_map = {
    # North
    "JAMMU & KASHMIR": "North",
    "HIMACHAL PRADESH": "North",
    "PUNJAB": "North",
    "CHANDIGARH": "North",
    "UTTARAKHAND": "North",
    "HARYANA": "North",
    "NCT OF DELHI": "North",
    "RAJASTHAN": "North",
    "UTTAR PRADESH": "North",

    # East (includes North-East + eastern India)
    "BIHAR": "East",
    "SIKKIM": "East",
    "ARUNACHAL PRADESH": "East",
    "NAGALAND": "East",
    "MANIPUR": "East",
    "MIZORAM": "East",
    "TRIPURA": "East",
    "MEGHALAYA": "East",
    "ASSAM": "East",
    "WEST BENGAL": "East",
    "JHARKHAND": "East",
    "ODISHA": "East",
    "CHHATTISGARH": "East",
    "ANDAMAN & NICOBAR ISLANDS": "East",

    # West
    "MADHYA PRADESH": "West",
    "GUJARAT": "West",
    "DAMAN & DIU": "West",
    "DADRA & NAGAR HAVELI": "West",
    "MAHARASHTRA": "West",
    "GOA": "West",

    # South
    "ANDHRA PRADESH": "South",
    "KARNATAKA": "South",
    "LAKSHADWEEP": "South",
    "KERALA": "South",
    "TAMIL NADU": "South",
    "PUDUCHERRY": "South",
}

# Any state/UT name missing from region_map falls back to "North"
state_df["Region"] = state_df["Name"].map(region_map).fillna("North")

# ------------------------------------
# 5) Final table + export CSV
# ------------------------------------
final_df = state_df[["Name", "Population", "LiteracyRate", "Region"]].copy()
final_df.rename(columns={"Name": "State"}, inplace=True)

final_df["Population"] = final_df["Population"].astype("int64")
final_df["LiteracyRate"] = final_df["LiteracyRate"].round(2)

final_df.to_csv("state_population_literacy_region.csv", index=False)
print("Saved: state_population_literacy_region.csv")

# ------------------------------------
# 6) Scatter plots (one per region)
# ------------------------------------
regions = ["North", "South", "East", "West"]

for r in regions:
    sub = final_df[final_df["Region"] == r].copy()

    plt.figure(figsize=(8, 5))
    plt.scatter(
        sub["Population"],
        sub["LiteracyRate"],
        marker="D",                     # diamond marker
        s=np.sqrt(sub["LiteracyRate"])  # size = sqrt(literacy rate)
    )

    plt.title(f"Population vs Literacy Rate ({r} Region)")
    plt.xlabel("Population (Total Persons)")
    plt.ylabel("Literacy Rate (%)")

    # Optional: label each point with the state name
    for _, row in sub.iterrows():
        plt.annotate(
            row["State"],
            (row["Population"], row["LiteracyRate"]),
            fontsize=7,
            xytext=(3, 3),
            textcoords="offset points"
        )

    plt.tight_layout()
    plt.show()

# ------------------------------------
# 7) Bar chart: average literacy by region
# ------------------------------------
avg_lit = final_df.groupby("Region", as_index=False)["LiteracyRate"].mean()

plt.figure(figsize=(7, 4.5))
plt.bar(avg_lit["Region"], avg_lit["LiteracyRate"])

plt.title("Average Literacy Rate by Region (Census 2011 PCA)")
plt.xlabel("Region")
plt.ylabel("Average Literacy Rate (%)")


plt.tight_layout()
plt.show()

Scatter Plot: Population vs Literacy Rate (North Region)

Scatter Plot: Population vs Literacy Rate (South Region)

Scatter Plot: Population vs Literacy Rate (East Region)

Scatter Plot: Population vs Literacy Rate (West Region)

Bar Chart depicting the Average Literacy Rate by Region (Census 2011 PCA)

The output csv will be as follows:

State,Population,LiteracyRate,Region
JAMMU & KASHMIR,12541302,67.16,North
HIMACHAL PRADESH,6864602,82.8,North
PUNJAB,27743338,75.84,North
CHANDIGARH,1055450,86.05,North
UTTARAKHAND,10086292,78.82,North
HARYANA,25351462,75.55,North
NCT OF DELHI,16787941,86.21,North
RAJASTHAN,68548437,66.11,North
UTTAR PRADESH,199812341,67.68,North
BIHAR,104099452,61.8,East
SIKKIM,610577,81.42,East
ARUNACHAL PRADESH,1383727,65.38,East
NAGALAND,1978502,79.55,East
MANIPUR,2570390,79.21,East
MIZORAM,1097206,91.33,East
TRIPURA,3673917,87.22,East
MEGHALAYA,2966889,74.43,East
ASSAM,31205576,72.19,East
WEST BENGAL,91276115,76.26,East
JHARKHAND,32988134,66.41,East
ODISHA,41974218,72.87,East
CHHATTISGARH,25545198,70.28,East
MADHYA PRADESH,72626809,69.32,West
GUJARAT,60439692,78.03,West
DAMAN & DIU,243247,87.1,West
DADRA & NAGAR HAVELI,343709,76.24,West
MAHARASHTRA,112374333,82.34,West
ANDHRA PRADESH,84580777,67.02,South
KARNATAKA,61095297,75.36,South
GOA,1458545,88.7,West
LAKSHADWEEP,64473,91.85,South
KERALA,33406061,94.0,South
TAMIL NADU,72147030,80.09,South
PUDUCHERRY,1247953,85.85,South
ANDAMAN & NICOBAR ISLANDS,380581,86.63,East

Note: The reference file is from the 2011 Census, so states and union territories that were formed after 2011 do not appear in it.

The mapping runs fine, but two points are worth noting:

• Chhattisgarh is usually treated as Central (not East). Since the question only allows East/West/North/South, keeping it in East is acceptable, but it is not the standard grouping.

• Madhya Pradesh is also usually treated as Central, not West. The same logic applies: the choice is acceptable for a 4-region simplification, but it is not strictly “correct” geographically.

A more common simplification would be:

• Put Madhya Pradesh in North (or keep it in West, as done here)

• Put Chhattisgarh in East (or North)