Plotting Data using Matplotlib

This page contains the solutions to Chapter 4, Plotting Data using Matplotlib, of the NCERT Class 12 Informatics Practices textbook. It covers the Short Answer Questions, Long Answer Questions and Projects/Assignments Questions of the chapter.
EXERCISE
1.
What is the purpose of the Matplotlib library?
The purpose of the Matplotlib library is to enable data visualization in Python. It’s a powerful tool used for creating static, interactive, and animated visualizations in Python. Matplotlib allows users to generate a wide range of plots and graphs that are highly customizable. It helps in visually interpreting and presenting data in a more understandable and appealing format.
The key points to understand about Matplotlib are:
Matplotlib for Visualization: It is used to create 2D graphs and plots by using python scripts.
Interactivity: The library provides an interactive environment across platforms.
Customization: Users can customize every aspect of a plot.
Types of Plots: Matplotlib can produce a variety of plots, including line, bar, scatter, and histogram.
2.
What are some of the major components of any graphs or plot?
The major components of any graphs or plots are as follows:
1.
Title: It provides a brief, clear indication of what the graph or plot is about. It’s essential for quickly identifying the purpose of the visualization.
2.
Axis Labels: These labels on the x-axis and y-axis clarify what each axis represents, ensuring the data can be accurately interpreted.
3.
Data Points: In various forms like dots in scatter plots or bars in bar charts, these represent the actual data values.
4.
Legend: This component is crucial when multiple datasets are plotted on the same graph. It helps in distinguishing between the different data series.
5.
Plotting Area: The space where the data is plotted, showing how data values vary along the axes.
6.
Whiskers (in Box Plots): These extend from the box to the highest and lowest values in the dataset, excluding outliers. They provide a visual representation of the range of the data.
7.
Outliers: These are data points that are significantly different from the rest of the data. They are visually represented and are important for understanding the data’s variability.
3.
Name the function which is used to save the plot.
The function used to save a plot in Matplotlib is `savefig()`. This function is part of the Matplotlib library, which is extensively used in Python for creating a variety of plots and graphs. The `savefig()` function allows you to save the current figure created by your plot commands to a file. This is extremely useful for keeping a record of your visual data analysis or for preparing figures for reports or presentations.
The typical usage of the `savefig()` function involves specifying the filename and, optionally, parameters such as the file format (e.g., PNG, JPEG, PDF) and the resolution (`dpi`). The dimensions of the saved image come from the figure itself, for example from the `figsize` given when the figure is created. For instance, to save a plot as a PNG file, you might use the command `plt.savefig('filename.png')`.
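As an illustration of the options described above, here is a minimal sketch. The data, the title and the file name `sample_plot.png` are made up for this example, and the `Agg` backend is selected only so that the script also runs on a machine without a display.

```python
import os
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use your default backend
import matplotlib.pyplot as plt

# Made-up data, for illustration only
x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

# figsize (in inches) sets the dimensions of the figure
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, y)
ax.set_title("Sample plot")

# The file format is inferred from the extension; dpi sets the resolution
out_file = "sample_plot.png"  # hypothetical file name
fig.savefig(out_file, dpi=150)
plt.close(fig)

print("Saved:", os.path.exists(out_file))
```

Changing the extension to `.pdf` or `.jpg` should save the same figure in that format, provided the installed Matplotlib supports it.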
4.
Write short notes on different customisation options available with any plot.
There are various customization options available in Matplotlib for enhancing plots and graphs. Here are some key customization features:
1.
Changing Line Width and Style: The linewidth property can be adjusted to change the width of lines in a line chart. Similarly, the linestyle property allows choosing different styles like solid, dotted, dashed, or dashdot for the lines.
2.
Customizing Histograms: The edgecolor, linestyle, and linewidth of histograms can be altered. Also, histograms can be customized with properties like fill (to fill bars with color) and hatch (to fill bars with patterns like ‘-’, ‘+’, ‘x’, etc.).
3.
Customizing Bar Charts: For bar charts, the edgecolor of bars, linestyle, linewidth, and color can be customized.
4.
Color Customization: A variety of color options are available to enhance the visual appeal of plots. This includes standard color codes and names.
5.
Markers: In line charts and scatter plots, different markers can be used to represent data points. These include symbols like circles, stars, triangles, etc.
6.
Adding Titles and Labels: Titles can be added to graphs for better understanding. Axis labels for both x-axis and y-axis can also be customized to make the data representation clearer.
7.
Legend Placement: When a plot contains multiple datasets, legends can be used for distinction. The placement and style of legends can be adjusted as per the requirement.
These customization options play a crucial role in making the plots more informative, visually appealing, and easier to understand. The flexibility to customize various aspects of a plot is one of the key strengths of Matplotlib, making it a widely used library for data visualization in Python.
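Several of these options can be tried together in a few lines. The following is only a sketch with made-up sales figures; the colours, styles and the output file name are arbitrary choices, not values from the textbook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Made-up data, for illustration only
days = [1, 2, 3, 4, 5]
sales = [20, 35, 30, 45, 40]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line chart: colour, line width, line style and marker
(line,) = ax1.plot(days, sales, color="green", linewidth=2,
                   linestyle="--", marker="*", label="Sales")
ax1.set_title("Line chart")
ax1.set_xlabel("Day")
ax1.set_ylabel("Sales")
ax1.legend(loc="upper left")

# Bar chart: fill colour, edge colour, edge width and hatch pattern
bars = ax2.bar(days, sales, color="lightblue", edgecolor="black",
               linewidth=1.5, hatch="x")
ax2.set_title("Bar chart")
ax2.set_xlabel("Day")
ax2.set_ylabel("Sales")

fig.savefig("customised_plots.png")  # hypothetical file name
plt.close(fig)
```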
5.
What is the purpose of a legend?
The purpose of a legend in a plot or graph is to provide a clear and understandable explanation of the various data represented on the graph. A legend is particularly crucial when a plot contains more than one dataset or various types of data. It helps in distinguishing between these different datasets by associating each one with a specific color, shape, or style represented in the legend.
Key points about the purpose of a legend in a plot are:
Identification: It helps in identifying what each color, pattern, or shape in the plot represents.
Clarity: Enhances clarity, especially in complex graphs with multiple datasets or variables.
Ease of Understanding: Makes it easier for the viewer to understand the data being presented, thereby improving the communication of information.
The legend is an essential component for effective data visualization, ensuring that the viewer can accurately interpret the information presented in the graph.
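A short sketch of a legend at work, assuming two made-up datasets (weekly sales of bags and wallets) plotted on the same axes. Each call passes a `label`, and `legend()` collects those labels into the legend box.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Made-up data, for illustration only
weeks = [1, 2, 3, 4]
bags = [30, 42, 38, 50]
wallets = [22, 25, 31, 29]

fig, ax = plt.subplots()
ax.plot(weeks, bags, marker="o", label="Bags")
ax.plot(weeks, wallets, marker="s", linestyle=":", label="Wallets")
ax.set_title("Weekly Sales")
ax.set_xlabel("Week")
ax.set_ylabel("Units sold")

# The legend maps each line style and marker to its dataset
legend = ax.legend(title="Product", loc="upper left")
labels = [text.get_text() for text in legend.get_texts()]
print(labels)

fig.savefig("legend_example.png")  # hypothetical file name
plt.close(fig)
```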
6.
Define Pandas visualisation.
Pandas visualization is the plotting functionality built into the Pandas library in Python. It is used to create a wide range of plots and graphs directly from DataFrame and Series data structures, and it simplifies data visualization by providing a high-level interface over Matplotlib.
Key points about Pandas visualization are:
Integration with Matplotlib: Pandas visualization is built on top of Matplotlib, a comprehensive library for creating static, animated, and interactive visualizations in Python. This integration allows for extensive customization and flexibility in plotting.
Ease of Use: By using the `.plot()` method available on Pandas Series and DataFrame objects, it becomes straightforward to generate various types of plots, including line plots, bar plots, histograms, scatter plots, etc.
Customizable: The `.plot()` method in Pandas accepts numerous arguments for customization, allowing users to specify the type of plot, aesthetics, and other plot details directly.
Versatile: It supports a wide range of plot types and can be used for various data visualization needs, from basic line charts to complex histograms and scatter plots.
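The points above can be seen in a small sketch. The DataFrame below holds made-up average scores; the column names and the output file name are assumptions for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "Week": [1, 2, 3, 4],
    "Maths": [62, 68, 71, 75],
    "Science": [58, 64, 66, 73],
})

# One call to .plot() draws both columns; kind selects the plot type
ax = df.plot(x="Week", y=["Maths", "Science"], kind="line",
             marker="o", title="Average Test Scores by Week")

# .plot() returns a Matplotlib Axes, so Matplotlib customisation still applies
ax.set_xlabel("Week")
ax.set_ylabel("Average score")

ax.figure.savefig("pandas_plot.png")  # hypothetical file name
plt.close(ax.figure)
```

Changing `kind="line"` to `kind="bar"` would draw the same data as a bar chart.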
7.
What is open data? Name any two websites from which we can download open data.
Open data refers to data that is freely available to everyone to use and republish as they wish, without restrictions from copyright or other mechanisms of control. The concept of open data is particularly important in the field of research and education, as it promotes transparency, accessibility, and innovation by allowing free access to data for analysis.
Two websites that provide open data are:
1.
Open Government Data (OGD) Platform India (data.gov.in): This platform supports the Open Data initiative of the Government of India. It offers large datasets on various projects and parameters, enabling users to access a wide range of data for different purposes.
2.
Another example of a website offering open data, commonly used globally, is Kaggle (kaggle.com). Kaggle is a well-known platform for data science competitions, and it also provides a vast repository of datasets that are freely available for educational and research purposes.
These websites are valuable resources for students and researchers looking to access a wide range of data sets for analysis, projects, and learning purposes.
8.
Give an example of data comparison where we can use the scatter plot.
Example 1:
A practical example of using a scatter plot for data comparison is seen in the case study of a seller named Prayatna, who deals in designer bags and wallets. Here’s the detailed scenario and how the scatter plot is utilized:
Scenario: During a sales season, Prayatna offers varying discounts on his products. These discounts range from 10% to 50% over a period of 5 weeks.
Data Collection: For each discount level, Prayatna meticulously records the corresponding sales figures in Rupees.
Creating the Scatter Plot: To analyze the impact of discounts on sales, Prayatna uses a scatter plot, where:
The X-axis represents the discount percentage (ranging from 10% to 50%).
The Y-axis indicates the sales in Rupees.
Each dot on the scatter plot correlates a specific discount rate with the sales it generated.

Analysis Through Scatter Plot: This visual representation allows Prayatna to effectively discern patterns or trends in the data. For instance, if the plot shows dots trending upwards as the discount percentage increases, it might indicate that higher discounts lead to higher sales.
This example showcases how scatter plots are powerful tools in visualizing and understanding the relationship between two variables — in this case, the discount rate and sales revenue.
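The scenario can be sketched in code as follows. The sales figures here are made up for illustration and are not Prayatna's actual data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Made-up figures, for illustration only
discount = [10, 20, 30, 40, 50]               # discount in percent
sales = [42000, 46000, 51000, 58000, 72000]   # sales in Rupees

fig, ax = plt.subplots()
points = ax.scatter(discount, sales, color="red")
ax.set_title("Sales vs Discount")
ax.set_xlabel("Discount (%)")
ax.set_ylabel("Sales (in Rupees)")

fig.savefig("discount_vs_sales.png")  # hypothetical file name
plt.close(fig)
```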
Example 2:
Example: Academic Performance Analysis
Scenario: A school wants to analyze the relationship between the number of hours students spend on study and their academic performance.
Data Collection:
The school collects data on the number of hours each student spends studying per week.
It also records their overall academic performance, measured through their grades or GPA (Grade Point Average).

Using the Scatter Plot:
The X-axis of the scatter plot represents the number of study hours per week.
The Y-axis shows the students’ GPA.
Each point on the scatter plot represents a student, correlating their study hours with their GPA.

Interpretation and Analysis:
By examining the scatter plot, the school can observe if there is a correlation between study time and academic performance.
For example, if the plot shows that points tend to be higher on the Y-axis (higher GPA) for students with more study hours (further along the X-axis), it would suggest a positive correlation between study time and academic performance.

This example demonstrates the use of a scatter plot in an educational setting, providing a visual tool to understand and analyze the impact of study habits on academic success. Scatter plots are particularly useful in such scenarios as they can reveal patterns and correlations in data that might not be immediately obvious.
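What the eye reads off a scatter plot can also be expressed as a number. The sketch below uses made-up study hours and GPA values and computes the Pearson correlation coefficient with NumPy; a value close to +1 corresponds to points that rise together along a line.

```python
import numpy as np

# Made-up data for seven students, for illustration only
study_hours = [2, 4, 5, 7, 8, 10, 12]       # hours per week
gpa = [2.1, 2.5, 2.8, 3.0, 3.2, 3.6, 3.8]

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two variables
r = np.corrcoef(study_hours, gpa)[0, 1]
print(f"Correlation between study hours and GPA: {r:.2f}")
```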
9.
Name the plot which displays the statistical summary.
Note:
Give appropriate title, set xlabel and ylabel while attempting the following questions.
The type of plot that displays the statistical summary of a given dataset is known as the Box Plot.
Details about Box Plot:
Statistical Summary: A box plot visually represents the statistical summary of a dataset, including key measures like the minimum value, the first quartile (Q1), median (Q2), the third quartile (Q3), and the maximum value.
Whiskers: It includes “whiskers” which extend from the box to the highest and lowest values in the dataset, excluding outliers. This helps in depicting the range of the data.
Outliers: The box plot also aids in identifying outliers – observations that are significantly different from the rest of the data.
Creating a Box Plot:
Title: When creating a box plot, it’s important to give it an appropriate title for easy identification of what the plot represents. For instance, “Performance Analysis” could be a suitable title if the plot is about student performance.
X-axis Label (xlabel): This should indicate what each box represents, for example, “Subjects” if the boxes represent different academic subjects.
Y-axis Label (ylabel): This should denote the range of values, such as “Marks” if the plot is displaying the range of marks obtained by students.
The box plot is a powerful tool for summarizing data distributions and is especially useful in comparing distributions between several groups or datasets. It provides a concise visual summary of the central tendency, variability, and skewness of the data, along with potential outliers.
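The numbers that a box plot draws can be computed directly. This is a sketch with made-up marks; it assumes the common convention (also Matplotlib's default) that points farther than 1.5 times the interquartile range from the box are treated as outliers.

```python
import numpy as np

# Made-up marks, for illustration only
marks = np.array([35, 42, 48, 50, 55, 58, 60, 62, 65, 70, 98])

# Quartiles and interquartile range (the height of the box)
q1, median, q3 = np.percentile(marks, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = marks[(marks < lower_fence) | (marks > upper_fence)]

# Whiskers end at the most extreme values that are not outliers
inside = marks[(marks >= lower_fence) & (marks <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print("Q1, median, Q3:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```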
10.
Plot the following data using a line plot:
| Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tickets sold | 2000 | 2800 | 3000 | 2500 | 2300 | 2500 | 1000 |

Before displaying the plot display “Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday” in place of Day 1, 2, 3, 4, 5, 6, 7
Change the color of the line to ‘Magenta’.

Based on the given data, a line plot can be created with the following specifications:
X-axis (Days): The days of the week are displayed as “Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday” instead of Day 1, 2, 3, 4, 5, 6, 7.
Y-axis (Tickets Sold): The number of tickets sold each day is plotted along the Y-axis, showing the variation in ticket sales throughout the week.
Line Color: The color of the line in the plot is set to ‘Magenta’.
This line plot visually represents the ticket sales data over a week, providing a clear view of the sales trend for each day. Such plots are instrumental in analyzing patterns and making informed decisions based on the observed trends.
The following is the code:
import matplotlib.pyplot as plt

# Data
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
tickets_sold = [2000, 2800, 3000, 2500, 2300, 2500, 1000]

# Creating the line plot
plt.plot(days, tickets_sold, color='magenta')

# Adding title and labels
plt.title('Weekly Tickets Sales')
plt.xlabel('Day of the Week')
plt.ylabel('Tickets Sold')

# Displaying the plot
plt.show()
[Figure: Weekly Tickets Sales. Line plot; x-axis: Day of the Week (Monday to Sunday); y-axis: Tickets Sold.]
11.
Collect data about colleges in Delhi University or any other university of your choice and number of courses they run for Science, Commerce and Humanities, store it in a CSV file and present it using a bar plot.
Step 1: Store Data in a CSV File
import pandas as pd
        
# Data about colleges and courses
data = {
    "College": ["College A", "College B", "College C", "College D", "College E",
        "College F", "College G", "College H", "College I", "College J"],
    "Science": [12, 15, 9, 13, 10, 14, 11, 8, 16, 10],
    "Commerce": [8, 10, 7, 9, 6, 11, 13, 12, 5, 14],
    "Humanities": [14, 12, 16, 11, 15, 10, 9, 14, 13, 8]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('college_courses.csv', index=False)
Step 2: Create a Bar Plot
import pandas as pd
import matplotlib.pyplot as plt
        
# Read the data back from the CSV file
df = pd.read_csv('college_courses.csv')

# Create a bar plot
df.plot(x='College', kind='bar', figsize=(10, 6))

# Adding titles and labels
plt.title('Number of Courses Offered by Colleges')
plt.xlabel('College')
plt.ylabel('Number of Courses')
plt.xticks(rotation=45)
plt.legend(title='Streams')

# Display the plot
plt.show()
[Figure: University College Courses. Bar chart titled “Number of Courses Offered by Colleges”; x-axis: College (College A to College J); y-axis: Number of Courses.]
Explanation
The first part of the code creates a DataFrame with the provided data and saves it to a CSV file named ‘college_courses.csv’.
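The second part reads the data from the CSV file and creates a grouped bar plot using the Pandas `plot()` method, which draws with Matplotlib. Each college gets a group of three bars, one each for the number of courses offered in the Science, Commerce, and Humanities streams. (The bars would be stacked only if `stacked=True` were passed to `plot()`.)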
The plot is then customized with a title, axis labels, and a legend.
12.
Collect and store data related to the screen time of students in your class separately for boys and girls and present it using a boxplot.
The following is the data for my classmates’ screen time and demonstration of how to store this data and present it using a boxplot in Python.
Data Collection and Storage:
Data on screen time (in hours per day) for boys and girls in my class is collected.
For simplicity, data is collected for 10 boys and 10 girls in my class.
Data:
Boys’ screen time (in hours): [3, 4.5, 2, 5, 6, 3.5, 4, 5.5, 2.5, 4]
Girls’ screen time (in hours): [2, 3.5, 4, 4.5, 3, 2.5, 5, 3.5, 4, 4.5]
Python Code to Store Data and Create Boxplot:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data
boys_screen_time = [3, 4.5, 2, 5, 6, 3.5, 4, 5.5, 2.5, 4]
girls_screen_time = [2, 3.5, 4, 4.5, 3, 2.5, 5, 3.5, 4, 4.5]

# Creating DataFrame
data = {'Boys': boys_screen_time, 'Girls': girls_screen_time}
df = pd.DataFrame(data)

# Creating a boxplot
plt.figure(figsize=(8, 6))
df.boxplot(column=['Boys', 'Girls'])
plt.title('Screen Time of Students in Class (Boys vs Girls)')
plt.ylabel('Hours per Day')
plt.grid(False)
plt.show()
[Figure: Screen Time of Students in Class (Boys vs Girls). Box plot; x-axis: Boys, Girls; y-axis: Hours per Day.]
Explanation:

The code first creates a DataFrame with the screen time data for boys and girls.

It then uses Matplotlib to create a boxplot that visually represents this data. The boxplot shows the distribution of screen time, including the median, quartiles, and potential outliers.

13.
Explain the findings of the boxplot of Figure 4.17 (printed as Figure 4.18 in the textbook) by filling in the following blanks:
a)
The median for the five subjects is _____ , ______, _______, ______, ______
b)
The highest value for the five subjects is : _____ , ______, _______, ______, ______
c)
The lowest value for the five subjects is : _____ , ______, _______, ______, ______
d)
______________ subject has two outliers with the value ________ and ________
e)
______________ subject shows minimum variation

Note: It is actually Figure 4.17 in the text book (there is a typo in the given question and it is given as Figure 4.18)
[Figure: Box plot titled “Performance Analysis”; x-axis: Subjects (English, Maths, Hindi, Science, Social_Studies); y-axis: Marks.]
Figure 4.17 A boxplot of “Marks.csv”
From the boxplot above, we see that
a)
The median for the five subjects is 80, 50, 55, 56, 76.
b)
The highest value for the five subjects is : 95, 95, 90, 94, 95
c)
The lowest value for the five subjects is : 60, 33, 39, 48, 54
d)
Social_Studies subject has two outliers with the value 54 and 95
e)
Social_Studies subject shows minimum variation
The following is the box plot diagram displayed in the Figure 4.17
14.
Collect the minimum and maximum temperature of your city for a month and present it using a histogram plot.
The minimum and maximum temperature data for my city for a month is collected. The data is demonstrated and presented using a histogram plot in Python.
Temperature Data for My City in January:
Minimum Temperatures (°C): [7, 8, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7]
Maximum Temperatures (°C): [19, 20, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19]
Python Code to Create a Histogram Plot:
import pandas as pd
import matplotlib.pyplot as plt

# Temperature data for My City in January
min_temps = [7, 8, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7, 6, 7, 8, 7]
max_temps = [19, 20, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19, 18, 19, 20, 19]

# Creating a DataFrame
df = pd.DataFrame({'Min Temperature': min_temps, 'Max Temperature': max_temps})

# Creating a histogram plot
df.plot(kind='hist', bins=10, alpha=0.7)
plt.title('Temperature Distribution in My City for January')
plt.xlabel('Temperature (°C)')
plt.ylabel('Frequency')
plt.show()
[Figure: Temperature Distribution in My City in a Month. Histogram; x-axis: Temperature (°C); y-axis: Frequency; series: Min. Temperature, Max. Temperature.]
Explanation of the Code:
The script first creates a DataFrame with the hypothetical temperature data.
Then, it uses Matplotlib to create a histogram plot that visually represents the distribution of minimum and maximum temperatures for the month of January in My City.
The `bins` parameter in the `plot` method determines the number of bins (intervals) in the histogram. The `alpha` parameter controls the transparency of the histogram bars.
15.
Conduct a class census by preparing a questionnaire. The questionnaire should contain a minimum of five questions. Questions should relate to students, their family members, their class performance, their health etc. Each student is required to fill up the questionnaire. Compile the information in numerical terms (in terms of percentage). Present the information through a bar, scatter–diagram. (NCERT Geography class IX, Page 60)
1. Introduction
I am a student of class XII, Z-section.
I conducted a class census to collect information about my classmates, their families, their health and their academic performance.
Total number of students in my class (including me): 40
All the 40 students filled up the questionnaire.
2. Questionnaire Used
Class Census Questionnaire – Class 12 Z
1.
Name:
2.
Gender:
(a) Boy
(b) Girl

3.
Family Members at home
(a) 3 or Less
(b) 4-5
(c) 6 or more members

4.
Number of Siblings:
(a) 0
(b) 1
(c) 2+

5.
Breakfast daily?
(a) Yes
(b) No

6.
Health status (self-reported)
(a) Generally healthy
(b) Seasonal Allergy-Asthma
(c) Other

7.
Study hours per day (outside school): ____ hours (numeric)
8.
Latest test score: ____ % (numeric)

3. Data Compilation and Analysis
After collecting the filled questionnaires from all the students, I compiled the data in numerical terms (in terms of percentage) for each question.
(a) Gender of Students
| Gender | No. of Students | Percentage |
| --- | --- | --- |
| Boys | 22 | {\dfrac{22}{40} \times 100 = 55\%} |
| Girls | 18 | {\dfrac{18}{40} \times 100 = 45\%} |
| Total | 40 | 100% |
[Figure: Bar chart depicting the percentage of boys and girls in the class. x-axis: Group (Boys, Girls); y-axis: Percentage. Values: Boys 55%, Girls 45%.]
Observation: In my class the number of boys (22) is slightly more than the number of girls (18).
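The percentage column and the bar chart for a question like this can be produced with a few lines of Python. The following is a minimal sketch using the gender counts from the table above; the colours and the output file name are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Counts taken from the gender table above
counts = {"Boys": 22, "Girls": 18}
total = sum(counts.values())

# Percentage = (number of students / total students) x 100
percentages = {group: n * 100 / total for group, n in counts.items()}
print(percentages)

fig, ax = plt.subplots()
bars = ax.bar(list(percentages.keys()), list(percentages.values()),
              color=["steelblue", "orange"])
ax.set_ylim(0, 100)
ax.set_title("Number of Boys and Girls")
ax.set_xlabel("Group")
ax.set_ylabel("Percentage")

# Write the percentage above each bar
for bar, value in zip(bars, percentages.values()):
    ax.text(bar.get_x() + bar.get_width() / 2, value + 1,
            f"{value:.0f}%", ha="center")

fig.savefig("gender_percentage.png")  # hypothetical file name
plt.close(fig)
```

The same pattern can be reused for the other questions by replacing the `counts` dictionary and the labels.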
(b) Family members at home
| Family size (members) | No. of students | Percentage |
| --- | --- | --- |
| 3 or Less | 8 | {\dfrac{8}{40} \times 100 = 20\%} |
| 4-5 | 24 | {\dfrac{24}{40} \times 100 = 60\%} |
| 6 or more members | 8 | {\dfrac{8}{40} \times 100 = 20\%} |
| Total | 40 | 100% |
[Figure: Bar chart depicting the percentage of students by number of family members at home. x-axis: Family Size (≤ 3, 4-5, ≥ 6); y-axis: Percentage. Values: 20%, 60%, 20%.]
Observation: Most of my classmates come from medium-sized families with 4-5 members.
(c) Siblings
| Siblings | No. of students | Percentage |
| --- | --- | --- |
| 0 | 6 | {\dfrac{6}{40} \times 100 = 15\%} |
| 1 | 18 | {\dfrac{18}{40} \times 100 = 45\%} |
| 2+ | 16 | {\dfrac{16}{40} \times 100 = 40\%} |
| Total | 40 | 100% |
[Figure: Bar chart depicting the percentage of students by number of siblings. x-axis: Siblings (0, 1, ≥ 2); y-axis: Percentage. Values: 15%, 45%, 40%.]
Observation: The majority of the students (85%) have at least one sibling.
(e) Morning Breakfast before arriving at school.
| Response | No. of students | Percentage |
| --- | --- | --- |
| Yes | 30 | {\dfrac{30}{40} \times 100 = 75\%} |
| No | 10 | {\dfrac{10}{40} \times 100 = 25\%} |
| Total | 40 | 100% |
[Figure: Bar chart depicting the percentage of students coming to school with/without breakfast. x-axis: Breakfast (Yes, No); y-axis: Percentage. Values: Yes 75%, No 25%.]
Observation: A large majority of my classmates (75%) finish their breakfast before coming to school.
(f) Daily physical activity (≥30 min)
| Response | No. of students | Percentage |
| --- | --- | --- |
| Yes | 26 | {\dfrac{26}{40} \times 100 = 65\%} |
| No | 14 | {\dfrac{14}{40} \times 100 = 35\%} |
| Total | 40 | 100% |
[Figure: Bar chart depicting the percentage of students engaged in daily physical activity (≥30 min). x-axis: Daily Physical Activity (Yes, No); y-axis: Percentage. Values: Yes 65%, No 35%.]
Observation: A large majority of my classmates (65%) have a daily physical activity of 30 minutes or more.
(g) Health Status (self-reported)
| Health | No. of students | Percentage |
| --- | --- | --- |
| Good | 32 | {\dfrac{32}{40} \times 100 = 80\%} |
| Seasonal | 6 | {\dfrac{6}{40} \times 100 = 15\%} |
| Other | 2 | {\dfrac{2}{40} \times 100 = 5\%} |
| Total | 40 | 100% |
[Figure: Bar chart depicting the percentage of students as per their health status. x-axis: Health Status (Good, Seasonal, Other); y-axis: Percentage. Values: 80%, 15%, 5%.]
Observation: The majority of the students (80%) report good health.
(h) Study Hours vs Test Score:
The following is the data collected related to the study hours and latest test scores.
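| Student | Study Hours (per day) | Latest Test Score (%) |
| --- | --- | --- |
| S01 | 2.4 | 75 |
| S02 | 1.2 | 61 |
| S03 | 2.8 | 70 |
| S04 | 2.9 | 77 |
| S05 | 0.3 | 49 |
| S06 | 0.9 | 56 |
| S07 | 2.2 | 74 |
| S08 | 1.8 | 66 |
| S09 | 2.1 | 71 |
| S10 | 1.3 | 59 |
| S11 | 2.8 | 78 |
| S12 | 2.7 | 76 |
| S13 | 2.2 | 67 |
| S14 | 3.0 | 82 |
| S15 | 2.6 | 75 |
| S16 | 1.9 | 70 |
| S17 | 2.0 | 66 |
| S18 | 1.3 | 63 |
| S19 | 2.7 | 79 |
| S20 | 1.9 | 67 |
| S21 | 1.3 | 61 |
| S22 | 2.3 | 73 |
| S23 | 1.8 | 68 |
| S24 | 1.6 | 63 |
| S25 | 3.3 | 85 |
| S26 | 2.1 | 70 |
| S27 | 2.1 | 71 |
| S28 | 3.2 | 88 |
| S29 | 2.5 | 77 |
| S30 | 2.9 | 83 |
| S31 | 2.1 | 68 |
| S32 | 2.6 | 74 |
| S33 | 2.3 | 77 |
| S34 | 0.8 | 52 |
| S35 | 1.8 | 64 |
| S36 | 2.1 | 69 |
| S37 | 2.1 | 72 |
| S38 | 1.7 | 66 |
| S39 | 2.2 | 73 |
| S40 | 2.0 | 70 |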
[Figure: Scatter plot depicting the study hours vs test score. Title: “Scatter Diagram: Study Hours vs Test Score (n = 40)”; x-axis: Study Hours per day (outside school); y-axis: Last test score (%).]
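A scatter diagram of this kind can be drawn as sketched below. The two lists repeat the values of students S01 to S40 from the data collected above, in order; the marker colour and the output file name are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Study hours per day for S01..S40, in order
study_hours = [2.4, 1.2, 2.8, 2.9, 0.3, 0.9, 2.2, 1.8, 2.1, 1.3,
               2.8, 2.7, 2.2, 3.0, 2.6, 1.9, 2.0, 1.3, 2.7, 1.9,
               1.3, 2.3, 1.8, 1.6, 3.3, 2.1, 2.1, 3.2, 2.5, 2.9,
               2.1, 2.6, 2.3, 0.8, 1.8, 2.1, 2.1, 1.7, 2.2, 2.0]

# Latest test score (%) for S01..S40, in the same order
test_scores = [75, 61, 70, 77, 49, 56, 74, 66, 71, 59,
               78, 76, 67, 82, 75, 70, 66, 63, 79, 67,
               61, 73, 68, 63, 85, 70, 71, 88, 77, 83,
               68, 74, 77, 52, 64, 69, 72, 66, 73, 70]

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(study_hours, test_scores, color="teal")
ax.set_title("Scatter Diagram: Study Hours vs Test Score (n = 40)")
ax.set_xlabel("Study Hours per day (outside school)")
ax.set_ylabel("Last test score (%)")

fig.savefig("study_hours_vs_score.png")  # hypothetical file name
plt.close(fig)
```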
16.
Visit data.gov.in , search for the following in “catalogs” option of the website:
Final population Totals, India and states
State Wise literacy rate

Download them and create a CSV file containing population data and literacy rate of the respective state. Also add a column Region to the CSV file that should contain the values East, West, North and South. Plot a scatter plot for each region where X axis should be population and Y axis should be Literacy rate. Change the marker to a diamond and size as the square root of the literacy rate.
Group the data on the column region and display a bar chart depicting average literacy rate for each region.

What I did:
1.
Visited the data.gov.in site.
2.
Clicked on Catalog.
3.
In the search box, entered “Final population Totals, India and states”.
4.
Clicked on the “Download” link available under the section “Primary Census Abstract 2011 – India”.
5.
This downloaded the Excel file “PCA0000_2011_MDDS.xls” (the file name is visible only after downloading; on the site, you just see a “Download” link/button).
Note: The downloaded Excel file also contains the number of literates, so I used it to compute the state-wise literacy rate as well (I did not download a separate file for literacy).

6.
Kept only STATE level rows and TRU = TOTAL (so we get one total row per state/UT)
7.
Took the following columns:
Population = Total Population Person
0-6 Population = Population in the age group 0-6 Person
Literates = Literates Population Person

8.
Computed literacy rate (standard Census style):
{\text{Literacy Rate} = \dfrac{\text{Literates (7+)}}{\text{Population (7+)}} \times 100}
Where Population (7+) = Total Population – (0-6 Population)

9.
Added a “Region” column with values East/West/North/South (simple school-level grouping).
10.
Created the combined CSV file with the columns:
State, Population, LiteracyRate, Region

11.
Plotted the scatter plots and bar chart as per the question and as specified below.
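The literacy-rate calculation in step 8 can be checked on small numbers before it is applied to the Census file. This is a sketch only: the function name `literacy_rate` and the three input values are made up for illustration.

```python
def literacy_rate(total_population, population_0_6, literates):
    """Literacy rate (%) = literates / population aged 7 and above * 100."""
    population_7_plus = total_population - population_0_6
    return literates / population_7_plus * 100


# Made-up numbers: 1000 people, 200 of them aged 0-6, 600 literates
rate = literacy_rate(1000, 200, 600)
print(rate)  # 600 / 800 * 100 = 75.0
```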
The following is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# -----------------------------
# 1) Read the Census PCA file
# -----------------------------
# Put your file path here
path = "PCA0000_2011_MDDS.xls"

# The sheet name in this file is typically "PCA"
df = pd.read_excel(path, sheet_name="PCA")

# -----------------------------------------------
# 2) Keep only STATE level totals (TRU = Total)
# -----------------------------------------------
state_df = df[(df["Level"] == "STATE") & (df["TRU"] == "Total")].copy()
    
# ----------------------------------------------------------
# 3) Build population + compute literacy rate (Census logic)
#    Literacy Rate (%) = Literates (7+) / Population (7+) * 100
# ----------------------------------------------------------
state_df["Population"] = state_df["Total Population Person"]
state_df["Pop_0_6"] = state_df["Population in the age group 0-6 Person"]
state_df["Pop_7_plus"] = state_df["Population"] - state_df["Pop_0_6"]

state_df["Literates_7_plus"] = state_df["Literates Population Person"]
state_df["LiteracyRate"] = (state_df["Literates_7_plus"] / state_df["Pop_7_plus"]) * 100

# -------------------------------------
# 4) Add Region column (E/W/N/S mapping)
# -------------------------------------
region_map = {
    # North
    "JAMMU & KASHMIR": "North",
    "HIMACHAL PRADESH": "North",
    "PUNJAB": "North",
    "CHANDIGARH": "North",
    "UTTARAKHAND": "North",
    "HARYANA": "North",
    "NCT OF DELHI": "North",
    "RAJASTHAN": "North",
    "UTTAR PRADESH": "North",

    # East (includes North-East + eastern India)
    "BIHAR": "East",
    "SIKKIM": "East",
    "ARUNACHAL PRADESH": "East",
    "NAGALAND": "East",
    "MANIPUR": "East",
    "MIZORAM": "East",
    "TRIPURA": "East",
    "MEGHALAYA": "East",
    "ASSAM": "East",
    "WEST BENGAL": "East",
    "JHARKHAND": "East",
    "ODISHA": "East",
    "CHHATTISGARH": "East",
    "ANDAMAN & NICOBAR ISLANDS": "East",

    # West
    "MADHYA PRADESH": "West",
    "GUJARAT": "West",
    "DAMAN & DIU": "West",
    "DADRA & NAGAR HAVELI": "West",
    "MAHARASHTRA": "West",
    "GOA": "West",

    # South
    "ANDHRA PRADESH": "South",
    "KARNATAKA": "South",
    "LAKSHADWEEP": "South",
    "KERALA": "South",
    "TAMIL NADU": "South",
    "PUDUCHERRY": "South",
}

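# All 35 states/UTs in the 2011 file are listed in region_map above.
# fillna("North") is only a fallback: any name missing from the mapping
# would be silently labelled "North", so keep the mapping complete.
state_df["Region"] = state_df["Name"].map(region_map).fillna("North")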

# ------------------------------------
# 5) Final table + export CSV
# ------------------------------------
final_df = state_df[["Name", "Population", "LiteracyRate", "Region"]].copy()
final_df.rename(columns={"Name": "State"}, inplace=True)

final_df["Population"] = final_df["Population"].astype("int64")
final_df["LiteracyRate"] = final_df["LiteracyRate"].round(2)

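# Write the header as "Literacy Rate" so the CSV matches the column list in step 10
final_df.rename(columns={"LiteracyRate": "Literacy Rate"}).to_csv(
    "state_population_literacy_region.csv", index=False
)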
print("Saved: state_population_literacy_region.csv")

# ------------------------------------
# 6) Scatter plots (one per region)
# ------------------------------------
regions = ["North", "South", "East", "West"]

for r in regions:
    sub = final_df[final_df["Region"] == r].copy()

    plt.figure(figsize=(8, 5))
    plt.scatter(
        sub["Population"] / 1e7,        # persons -> crores (1 crore = 10 million)
        sub["LiteracyRate"],
        marker="D",                     # diamond marker
        s=np.sqrt(sub["LiteracyRate"])  # size = sqrt(literacy rate)
    )

    plt.title(f"Population vs Literacy Rate ({r} Region)")
    plt.xlabel("Population (Total Persons) (in crores)")
    plt.ylabel("Literacy Rate (%)")

    # Optional: label each point with the state name
    for _, row in sub.iterrows():
        plt.annotate(
            row["State"],
            (row["Population"] / 1e7, row["LiteracyRate"]),  # same units as the x-axis
            fontsize=7,
            xytext=(3, 3),
            textcoords="offset points"
        )

    plt.tight_layout()
    plt.show()

# ------------------------------------
# 7) Bar chart: average literacy by region
# ------------------------------------
avg_lit = final_df.groupby("Region", as_index=False)["LiteracyRate"].mean()

plt.figure(figsize=(7, 4.5))
plt.bar(avg_lit["Region"], avg_lit["LiteracyRate"])

plt.title("Average Literacy Rate by Region (Census 2011 PCA)")
plt.xlabel("Region")
plt.ylabel("Average Literacy Rate (%)")


plt.tight_layout()
plt.show()
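The literacy-rate formula from step 8 can be checked by hand on a small example. The following is a minimal sketch, not part of the original solution: the function name `literacy_rate` and the input figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def literacy_rate(literates, total_population, population_0_6):
    """Return the literacy rate (%) of the population aged 7 and above."""
    # Children in the 0-6 age group are excluded from the denominator
    population_7_plus = total_population - population_0_6
    if population_7_plus <= 0:
        raise ValueError("Population (7+) must be positive")
    return literates / population_7_plus * 100


# Hypothetical figures (not Census data): 1000 persons, 200 of them aged 0-6,
# and 600 literates among the remaining 800 persons
rate = literacy_rate(literates=600, total_population=1000, population_0_6=200)
print(round(rate, 2))  # 600 / (1000 - 200) * 100 = 75.0
```

Dividing by the total population instead would give 60%, which is why the 0-6 population is subtracted first.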
Scatter Plot: Population vs Literacy Rate (North Region)
[Figure: scatter plot for the North region. X-axis: Population (Total Persons) (in crores); Y-axis: Literacy Rate (%). One diamond marker per state/UT, labelled with its name: Jammu & Kashmir, Himachal Pradesh, Punjab, Chandigarh, Uttarakhand, Haryana, NCT of Delhi, Rajasthan and Uttar Pradesh. The plotted values are listed in the output CSV below.]
Scatter Plot: Population vs Literacy Rate (South Region)
[Figure: scatter plot for the South region. X-axis: Population (Total Persons) (in crores); Y-axis: Literacy Rate (%). One diamond marker per state/UT, labelled with its name: Andhra Pradesh, Karnataka, Lakshadweep, Kerala, Tamil Nadu and Puducherry. The plotted values are listed in the output CSV below.]
Scatter Plot: Population vs Literacy Rate (East Region)
[Figure: scatter plot for the East region. X-axis: Population (Total Persons) (in crores); Y-axis: Literacy Rate (%). One diamond marker per state/UT, labelled with its name: Bihar, Sikkim, Arunachal Pradesh, Nagaland, Manipur, Mizoram, Tripura, Meghalaya, Assam, West Bengal, Jharkhand, Odisha, Chhattisgarh and Andaman & Nicobar Islands. The plotted values are listed in the output CSV below.]
Scatter Plot: Population vs Literacy Rate (West Region)
[Figure: scatter plot for the West region. X-axis: Population (Total Persons) (in crores); Y-axis: Literacy Rate (%). One diamond marker per state/UT, labelled with its name: Madhya Pradesh, Gujarat, Daman & Diu, Dadra & Nagar Haveli, Maharashtra and Goa. The plotted values are listed in the output CSV below.]
Bar Chart: Average Literacy Rate by Region (Census 2011 PCA)
[Figure: bar chart of the average literacy rate by region. X-axis: Region; Y-axis: Average Literacy Rate (%). Bar heights: East = 76.07%, North = 76.25%, South = 82.36%, West = 80.29%.]
The output CSV will be as follows:
State,Population,Literacy Rate,Region
JAMMU & KASHMIR,12541302,67.16,North
HIMACHAL PRADESH,6864602,82.8,North
PUNJAB,27743338,75.84,North
CHANDIGARH,1055450,86.05,North
UTTARAKHAND,10086292,78.82,North
HARYANA,25351462,75.55,North
NCT OF DELHI,16787941,86.21,North
RAJASTHAN,68548437,66.11,North
UTTAR PRADESH,199812341,67.68,North
BIHAR,104099452,61.8,East
SIKKIM,610577,81.42,East
ARUNACHAL PRADESH,1383727,65.38,East
NAGALAND,1978502,79.55,East
MANIPUR,2570390,79.21,East
MIZORAM,1097206,91.33,East
TRIPURA,3673917,87.22,East
MEGHALAYA,2966889,74.43,East
ASSAM,31205576,72.19,East
WEST BENGAL,91276115,76.26,East
JHARKHAND,32988134,66.41,East
ODISHA,41974218,72.87,East
CHHATTISGARH,25545198,70.28,East
MADHYA PRADESH,72626809,69.32,West
GUJARAT,60439692,78.03,West
DAMAN & DIU,243247,87.1,West
DADRA & NAGAR HAVELI,343709,76.24,West
MAHARASHTRA,112374333,82.34,West
ANDHRA PRADESH,84580777,67.02,South
KARNATAKA,61095297,75.36,South
GOA,1458545,88.7,West
LAKSHADWEEP,64473,91.85,South
KERALA,33406061,94.0,South
TAMIL NADU,72147030,80.09,South
PUDUCHERRY,1247953,85.85,South
ANDAMAN & NICOBAR ISLANDS,380581,86.63,East
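The regional bar heights can be cross-checked against the table above. The following is a minimal sketch for the South region only, using the six literacy rates listed in the table; the variable names are illustrative and not part of the original solution.

```python
# Literacy rates (%) of the South-region states/UTs, copied from the table above
south = {
    "ANDHRA PRADESH": 67.02,
    "KARNATAKA": 75.36,
    "LAKSHADWEEP": 91.85,
    "KERALA": 94.0,
    "TAMIL NADU": 80.09,
    "PUDUCHERRY": 85.85,
}

# Simple (unweighted) mean, which is what groupby(...).mean() computes per region
south_average = sum(south.values()) / len(south)
print(round(south_average, 2))  # 82.36
```

This is an unweighted mean of the state-level rates, so Lakshadweep counts as much as Andhra Pradesh. It is not the literacy rate of the region's combined population, which would have to be computed from the summed literates and summed 7+ population.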
Note: The reference file is from Census 2011, so any state or union territory formed after 2011 does not appear in it.
The mapping runs fine, but two points are worth noting:
Chhattisgarh is usually treated as Central (not East). Since the question only allows East/West/North/South, keeping it in East is acceptable, but it is not the standard grouping.
Madhya Pradesh is also usually treated as Central, not West. The same logic applies: placing it in West is acceptable for a 4-region simplification, but it is not strictly “correct” geographically.
A more common simplification would be:
Put Madhya Pradesh in North (or keep it in West, as done here)
Put Chhattisgarh in East (as done here) or in North
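A regrouping like this can be applied by overriding individual entries of the mapping before it is used. The sketch below is an illustration under stated assumptions, not part of the original solution: the dictionary is a trimmed copy of `region_map`, and the input names are hypothetical (Telangana was formed in 2014, so it stands in for a name that the mapping does not cover). It also shows one way to list unmapped names instead of letting them fall back to a default region.

```python
import pandas as pd

# Trimmed, illustrative copy of the mapping; the full dictionary is in the code above
region_map = {
    "MADHYA PRADESH": "West",
    "CHHATTISGARH": "East",
    "KERALA": "South",
}

# Apply the alternative grouping by overriding a single entry
region_map.update({"MADHYA PRADESH": "North"})

# Hypothetical input: TELANGANA is not a key of the mapping
names = pd.Series(["MADHYA PRADESH", "CHHATTISGARH", "KERALA", "TELANGANA"])

regions = names.map(region_map)            # names missing from the dict become NaN
unmapped = names[regions.isna()].tolist()  # collect them instead of filling a default

print(regions.tolist()[:3])  # ['North', 'East', 'South']
print(unmapped)              # ['TELANGANA']
```

Checking `unmapped` before calling `fillna` makes a missing or misspelled state name visible, rather than quietly counting it in the wrong region's average.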