Data Handling Using Pandas – II

This page contains the solutions for Chapter 3, Data Handling Using Pandas – II, from the NCERT Class 12 Informatics Practices book. You can find the solutions for the Short Answer Questions, Long Answer Questions and Projects/Assignments Questions of this chapter here, as well as general questions and answers on the topic Data Handling Using Pandas – II. If you’re looking for case study solutions, you can find them at Solved Case Study Based on Open Data Sets.
EXERCISE
1. Write the statement to install the python connector to connect MySQL i.e. pymysql.
To install the Python connector for MySQL, specifically the `pymysql` package, the following command should be used in the Python environment:
pip install pymysql
2. Explain the difference between pivot() and pivot_table() function?
The following are the differences between the pivot() and pivot_table() functions.
| Feature | `pivot()` Function | `pivot_table()` Function |
| --- | --- | --- |
| Purpose | Used for reshaping and creating a new DataFrame from the original one. | Also reshapes data but aggregates values from rows with duplicate entries for the specified columns. |
| Handling Duplicates | Cannot handle duplicate values for the specified columns. | Can aggregate duplicate entries using functions like min, max, mean, etc. |
| Flexibility | Less flexible; best used when you have unique values to form an index. | More flexible, especially useful in the presence of duplicate entries. |
| Default Aggregation | Not applicable, as it does not aggregate. | Default aggregate function is mean, but it can be changed to others like sum, max, min. |
| Use Case Example | `df.pivot(index='Item', columns='Color', values='Units_in_stock')` | `df.pivot_table(index='Item', columns='Color', values=['Price(Rs)', 'Units_in_stock'], aggfunc={'Price(Rs)': len, 'Units_in_stock': np.mean})` |
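As an illustration, here is a minimal sketch (using hypothetical data containing duplicate Item–Color pairs) showing why pivot_table() is needed when duplicates are present:
import numpy as np
import pandas as pd

# Hypothetical data with duplicate (Item, Color) combinations
df = pd.DataFrame({
    'Item': ['Pen', 'Pen', 'Pencil', 'Pencil'],
    'Color': ['Blue', 'Blue', 'Red', 'Red'],
    'Price(Rs)': [10, 12, 5, 6],
    'Units_in_stock': [100, 150, 200, 250]
})

# pivot_table() aggregates the duplicates: count of Price(Rs), mean of Units_in_stock
print(df.pivot_table(index='Item', columns='Color',
                     values=['Price(Rs)', 'Units_in_stock'],
                     aggfunc={'Price(Rs)': len, 'Units_in_stock': np.mean}))

# pivot() raises a ValueError here because the (Item, Color) pairs are not unique
# df.pivot(index='Item', columns='Color', values='Units_in_stock')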
3. What is sqlalchemy?
SQLAlchemy is a Python library used to connect to and interact with SQL databases such as MySQL by providing the necessary credentials. In this chapter it is used, along with a database driver, for connecting a Python program with a MySQL database, so that data can be read from and written to the database directly using Python code.
To use SQLAlchemy in a Python environment, you first need to install it using the command:
pip install sqlalchemy
Once installed, SQLAlchemy provides a function called `create_engine()`. This function is used to establish a connection with the MySQL database. The `create_engine()` function requires a connection string that includes multiple parameters like the database driver, username, password, host, port number, and the name of the database. The function returns an engine object based on this connection string.
For example, to connect to a MySQL database, you might use:
engine = create_engine('mysql+pymysql://username:password@host:port/database_name')
In this example:
`mysql+pymysql` is the database driver.
`username` and `password` are your MySQL credentials.
`host` and `port` define where your MySQL server is running (usually localhost and port 3306).
`database_name` is the name of the database you want to connect to.
4. Can you sort a DataFrame with respect to multiple columns?
Yes, a DataFrame can be sorted with respect to multiple columns.
The method used for sorting by multiple columns is `sort_values()`, whose `by` parameter accepts a list of column names. For example, to sort the data in a DataFrame based on the marks obtained in Science and then by the marks in Hindi for Unit Test 3, we use the following code:
dfUT3 = df[df.UT == 3]
print(dfUT3.sort_values(by=['Science', 'Hindi']))
In this example:
`dfUT3` is a DataFrame filtered to include only Unit Test 3 data.
`sort_values()` is used to sort this data.
The DataFrame is first sorted by the marks in Science.
If there are any ties in Science marks, it then sorts by Hindi marks.
This approach to sorting by multiple columns allows for more refined and specific organization of data, especially in cases where sorting by a single column does not provide enough clarity or distinction between records.
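For illustration, here is a small self-contained sketch (the marks below are hypothetical) of sorting on two columns:
import pandas as pd

# Hypothetical Unit Test 3 marks
df = pd.DataFrame({
    'Name': ['Raman', 'Zuhaire', 'Ashravy', 'Mishti'],
    'UT': [3, 3, 3, 3],
    'Science': [30, 28, 30, 25],
    'Hindi': [20, 22, 18, 25]
})

dfUT3 = df[df.UT == 3]
# Sort by Science first; ties in Science are broken using Hindi
print(dfUT3.sort_values(by=['Science', 'Hindi']))
# Passing ascending=[False, True] would sort Science in descending order instead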
5. What are missing values? What are the strategies to handle them?
Missing values in a DataFrame are instances where a value corresponding to a column is not present. These are typically denoted by `NaN` (Not a Number) in Python’s Pandas library. Missing values can occur for various reasons, such as improper data collection or certain attributes being irrelevant for some records.
There are two main strategies to handle missing values in a DataFrame:
1. Dropping Missing Values:
This involves removing entire rows (objects) that have missing value(s).
The `dropna()` function in Pandas is used for this purpose.
Dropping is suitable when there are only a few objects with missing values, as it reduces the size of the dataset used for analysis.
2. Estimating Missing Values:
Missing values can be filled using estimations or approximations.
Common methods include using a value just before or after the missing value, or using the average/minimum/maximum of the values for that attribute.
The `fillna()` function in Pandas can replace missing values with a specified value, such as 0 or 1.
This method changes the dataset, but it provides a good approximation for analysis when actual data is not available.
These methods are essential for data cleaning and preparation before any analysis or processing is performed on the data.
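A minimal sketch of both strategies, using a small hypothetical dataset:
import numpy as np
import pandas as pd

# Hypothetical marks with one missing value
df = pd.DataFrame({'Name': ['Raman', 'Zuhaire', 'Mishti'],
                   'Maths': [25, np.nan, 22]})

print(df.dropna())                      # strategy 1: drop the row with the missing value
print(df.fillna(0))                     # strategy 2: replace the missing value with 0
print(df.fillna(df['Maths'].mean()))    # or estimate it, e.g. with the column mean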
6. Define the following terms: Median, Standard Deviation and Variance.
1. Median:
The median is the middle value in a dataset when it is arranged in ascending order.
It is a measure of central tendency that divides the data into two equal halves.
If the number of observations is odd, the median is the middle number. If even, it is the average of the two middle numbers.
For example, in a DataFrame, the median of each column can be found using `DataFrame.median()`.
2. Standard Deviation:
Standard deviation measures the amount of variation or dispersion of a set of values.
A low standard deviation means that the values tend to be close to the mean (also called the expected value), whereas a high standard deviation means that the values are spread out over a wider range.
It is calculated as the square root of the variance.
In Pandas, `DataFrame.std()` returns the standard deviation of the values for each column.
3. Variance:
Variance is the average of the squared differences from the Mean.
It provides a measure of how much each number in the set differs from the mean and thus from every other number in the set.
It’s used to see the spread of a dataset.
In Pandas, `DataFrame.var()` is used to calculate the variance of each column.
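A small sketch (with hypothetical marks) showing how these three measures are computed in Pandas:
import pandas as pd

# Hypothetical Maths marks
marks = pd.DataFrame({'Maths': [20, 22, 25, 28, 30]})

print(marks.median())   # middle value of the sorted data: 25.0
print(marks.var())      # variance of the values
print(marks.std())      # standard deviation (square root of the variance)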
7. What do you understand by the term MODE? Name the function which is used to calculate it.
The term “MODE” in the context of statistical analysis refers to the value that appears most frequently in a data set. It represents the most common or recurring value in a collection of data. The mode is particularly useful in identifying the most typical or frequent value in a dataset.
The function used to calculate the mode in Pandas, a Python library for data manipulation and analysis, is `DataFrame.mode()`. This function displays the mode for each column or row of a DataFrame, and it works for both numeric and non-numeric (categorical) data.
For example, to get the mode of the ‘Hindi’ marks from a DataFrame, the following statement can be used:
df['Hindi'].mode()
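For example, a minimal sketch with hypothetical Hindi marks in which 20 occurs most often:
import pandas as pd

df = pd.DataFrame({'Hindi': [20, 22, 20, 24, 20, 22]})
print(df['Hindi'].mode())   # the most frequent value, 20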
8. Write the purpose of Data aggregation.
Data aggregation is a process in which an aggregate function is applied to a group of data. In the context of handling data with Pandas in Python, this process typically involves the following steps:
1. Splitting the Data into Groups:
The data is divided into groups based on some criteria. This is often done using the `groupby()` function in Pandas. For example, a DataFrame can be grouped by a specific column like ‘Name’ or ‘UT’ (Unit Test).
2. Applying Aggregate Functions:
Once the data is grouped, various aggregate functions can be applied to these groups. These functions compute a single aggregated statistical value corresponding to each group.
Common aggregate functions include `sum()`, `mean()`, `max()`, `min()`, `std()`, and `var()`. These functions can be applied to one or more columns of the grouped data.
For instance, you can calculate the average marks scored by all students in each subject for each unit test.
3. Combining the Results:
The results of the aggregate functions are then combined to form a new data structure, often a new DataFrame, which contains the aggregated data.
The purpose of data aggregation is to transform the dataset and produce single numeric values from an array or group of data. It’s particularly useful in descriptive statistics and data analysis, as it helps in simplifying complex data sets into meaningful summaries or statistical representations.
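For illustration, a minimal sketch (with hypothetical unit-test marks) of the split–apply–combine steps described above:
import pandas as pd

# Hypothetical marks for two students across two unit tests
df = pd.DataFrame({
    'Name': ['Raman', 'Raman', 'Mishti', 'Mishti'],
    'UT': [1, 2, 1, 2],
    'Maths': [22, 24, 25, 23],
    'Science': [25, 23, 22, 24]
})

# Split by UT, apply mean() to each group, and combine the results
print(df.groupby('UT')[['Maths', 'Science']].mean())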
9. Explain the concept of GROUP BY with the help of an example.
The concept of GROUP BY in Pandas, a Python library for data analysis, is implemented by the `groupby()` function and is used to split data into groups based on some criteria. It is pivotal in performing segment-specific analysis on a dataset. Using `groupby()` typically follows a three-step procedure:
1. Splitting the Data into Groups:
The data in a DataFrame is divided into groups based on one or more keys. This is done by creating a groupby object from the original DataFrame.
For example, if we have a DataFrame of student marks, we can group the data by the name of the student or the unit test (UT) number.
2. Applying Functions:
After splitting the data into groups, we can apply various functions to each group independently. These functions could include statistical or aggregate functions like `sum()`, `mean()`, `max()`, `min()`, etc.
For instance, if we want to calculate the average marks scored in each subject for each unit test, we can group the data by ‘UT’ and then apply the `mean()` function.
3. Combining the Results:
Finally, the results obtained from the applied functions on each group are combined to form a new DataFrame. This new DataFrame contains the aggregated data, providing insights based on the groups.
Example:
Let’s consider a DataFrame `df` containing marks of students in different subjects. To group this data by student name and then display the first entry from each group, we can use the following code:
g1 = df.groupby('Name')
print(g1.first())
In this example, `g1` is the groupby object created from the DataFrame `df`, and `g1.first()` displays the first entry for each student grouped by their name.
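A self-contained version of this example (with hypothetical marks data) that also shows an aggregate applied to each group:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Raman', 'Raman', 'Zuhaire', 'Zuhaire'],
    'UT': [1, 2, 1, 2],
    'Maths': [22, 24, 20, 25],
    'Science': [25, 23, 22, 24]
})

g1 = df.groupby('Name')
print(g1.first())                                      # first entry from each group
print(df.groupby('UT')[['Maths', 'Science']].mean())   # average marks in each unit test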
10. Write the steps required to read data from a MySQL database to a DataFrame.
Reading data from a MySQL database into a Pandas DataFrame involves several steps. These steps are crucial for importing data for analysis and manipulation in a Python environment using Pandas, and are given below:
1. Install Required Libraries:
Ensure that the necessary Python libraries are installed. Primarily, you need `pymysql` and `sqlalchemy`.
Install these libraries using pip:
pip install pymysql
pip install sqlalchemy
                
2. Create a Database Engine:
Use `sqlalchemy` to create a database engine. This engine is responsible for managing the connection to the database.
The engine requires a connection string which includes the database driver, username, password, host, port, and database name.
For example:
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://username:password@host:port/database_name')
3. Use Pandas to Read Data:
Use Pandas `read_sql_query` function to execute a SQL query and load the data into a DataFrame.
Provide the SQL query and the engine (connection) as arguments to the function.
For example, to read data from a table named ‘INVENTORY’:
import pandas as pd
df = pd.read_sql_query('SELECT * FROM INVENTORY', engine)
4. DataFrame Creation:
The result of the `read_sql_query` function is a DataFrame containing the data fetched from the specified SQL query.
This process efficiently transfers data from a MySQL database into a Pandas DataFrame, enabling further data manipulation and analysis using Python.
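Putting these steps together, a minimal end-to-end sketch (the credentials, host, port and table name are placeholders to be replaced with your own):
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials and database name
engine = create_engine('mysql+pymysql://username:password@localhost:3306/database_name')

# Execute the query and load the result into a DataFrame
df = pd.read_sql_query('SELECT * FROM INVENTORY', engine)
print(df.head())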
11. Explain the importance of reshaping of data with an example.
Reshaping data is an important aspect of data analysis and manipulation, particularly in Pandas, a Python library. The shape of a dataset, which refers to how the data is arranged into rows and columns, can significantly impact the analysis and interpretation of that data. Reshaping is the process of changing this arrangement to make the data more suitable for specific analyses.
Importance of Reshaping Data:
1. Makes Data More Readable and Organized: Reshaping helps in organizing data in a more readable and interpretable format. It can transform complex and cluttered data into a structured and comprehensible form.
2. Facilitates Better Analysis: Certain types of data analysis require data to be in a specific shape. Reshaping ensures that the data meets these requirements.
3. Enhances Flexibility in Data Manipulation: Reshaping allows for more flexible operations on the dataset, such as pivoting, merging, and grouping.
Example of Reshaping Data:
Consider a dataset containing sales and profit data of four stores (S1, S2, S3, S4) for the years 2016, 2017, and 2018. The data is initially in a long format where each row represents sales and profit for a store in a specific year. To analyze the total sales and profits across different years for each store, reshaping the data into a wide format might be more useful.
Using Pandas’ `pivot` function, the data can be reshaped to have stores as rows and years as columns, with sales and profit figures filled in accordingly. This reshaped DataFrame provides a clearer view of each store’s performance over the years, allowing for more straightforward comparative analysis.
Here is the data before reshaping:
| Store | Year | Total_sales | Total_profit |
| --- | --- | --- | --- |
| S1 | 2016 | 12000 | 1100 |
| S4 | 2016 | 330000 | 5500 |
| S3 | 2016 | 420000 | 21000 |
| S1 | 2017 | 20000 | 32000 |
| S2 | 2017 | 10000 | 9000 |
| S3 | 2017 | 450000 | 45000 |
| S1 | 2018 | 30000 | 3000 |
| S2 | 2018 | 11000 | 1900 |
| S3 | 2018 | 89000 | 23000 |
The code to create this data as a Pandas DataFrame is as follows:
import pandas as pd
data = {
    'Store': ['S1', 'S4', 'S3', 'S1', 'S2', 'S3', 'S1', 'S2', 'S3'],
    'Year': [2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018],
    'Total_sales(₹)': [12000, 330000, 420000, 20000, 10000, 450000, 30000, 11000, 89000],
    'Total_profit(₹)': [1100, 5500, 21000, 32000, 9000, 45000, 3000, 1900, 23000]
}
df = pd.DataFrame(data)
The corresponding code for reshaping the data is as follows:
pivot_df = df.pivot(index='Store', columns='Year', values=['Total_sales(₹)', 'Total_profit(₹)'])
print(pivot_df)
The reshaped DataFrame would look like this:
| Store | Total_sales(₹) 2016 | Total_sales(₹) 2017 | Total_sales(₹) 2018 | Total_profit(₹) 2016 | Total_profit(₹) 2017 | Total_profit(₹) 2018 |
| --- | --- | --- | --- | --- | --- | --- |
| S1 | 12000 | 20000 | 30000 | 1100 | 32000 | 3000 |
| S2 | NaN | 10000 | 11000 | NaN | 9000 | 1900 |
| S3 | 420000 | 450000 | 89000 | 21000 | 45000 | 23000 |
| S4 | 330000 | NaN | NaN | 5500 | NaN | NaN |
In this reshaped DataFrame:
Each row represents a store (S1, S2, S3, S4).
Columns are segregated into total sales and total profits for each year (2016, 2017, 2018).
The values in the table show the respective sales and profits for each store in each year.
NaN entries indicate missing data for the respective store and year.
This representation makes it easier to analyze trends and performances of each store across different years, demonstrating the utility of reshaping data for clearer and more effective analysis.
12. Why is estimation an important concept in data analysis?
Estimation is a crucial concept in data analysis, especially when dealing with missing values in datasets. Missing values, denoted by NaN (Not a Number) in Pandas, can occur due to various reasons such as improper data collection or the irrelevance of certain attributes for some records. Estimation becomes important in such cases to handle these missing values effectively.
Importance of Estimation in Data Analysis:
1. Handling Missing Values: Estimation allows for filling in missing values using logical or statistical methods. This includes using values just before or after the missing value or using the average, minimum, or maximum values of that attribute.
2. Maintaining Dataset Integrity: By estimating missing values, the integrity of the dataset can be maintained. It prevents the loss of data rows due to missing values, which might be crucial for the analysis.
3. Improving Data Accuracy: Estimation helps in improving the accuracy of the analysis. Instead of discarding missing data, estimated values provide a close approximation, which can be vital in statistical analysis and decision-making processes.
4. Flexibility in Data Analysis: Estimation offers flexibility in handling real-world datasets where missing values are common. This allows for a broader and more comprehensive analysis, as the data can be adjusted to fit various analytical models and techniques.
Example of Estimation:
Consider a scenario where a student’s marks in one of the unit tests are missing. An estimation method could be to calculate their average marks from the other tests and use it as an estimate for the missing value. This approach allows for the computation of overall performance without discarding the entire record due to one missing value.
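A minimal sketch of this estimation, assuming hypothetical marks for one student across three unit tests:
import numpy as np
import pandas as pd

# The Unit Test 2 mark is missing
marks = pd.Series([22, np.nan, 25], index=['UT1', 'UT2', 'UT3'])

# Estimate the missing mark using the average of the available tests
print(marks.fillna(marks.mean()))   # UT2 is filled with 23.5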
13. Assuming the given table: Product. Write the python code for the following:
| Item | Company | Rupees | USD |
| --- | --- | --- | --- |
| TV | LG | 12000 | 700 |
| TV | VIDEOCON | 10000 | 650 |
| TV | LG | 15000 | 800 |
| AC | SONY | 14000 | 750 |
a) To create the data frame for the above table.
b) To add the new rows in the data frame.
c) To display the maximum price of LG TV.
d) To display the Sum of all products.
e) To display the median of the USD of Sony products.
f) To sort the data according to the Rupees and transfer the data to MySQL.
g) To transfer the new dataframe into the MySQL with new values.
a) To create the data frame for the above table:
import pandas as pd
from sqlalchemy import create_engine
    
# MySQL connection setup (replace with your actual credentials)
username = 'your_username'
password = 'your_password'
host = 'localhost'
database_name = 'your_database_name'
table_name = 'Product'

# Creating a connection to your MySQL database
engine = create_engine(f'mysql+pymysql://{username}:{password}@{host}/{database_name}')

# a) Creating the DataFrame by reading from the MySQL table
df = pd.read_sql_table(table_name, engine)
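Note that this solution assumes the Product table already exists in MySQL. If it does not, the DataFrame can instead be created directly from the values given in the table, for example:
# Alternative: create the DataFrame directly from the given values
df = pd.DataFrame({
    'Item': ['TV', 'TV', 'TV', 'AC'],
    'Company': ['LG', 'VIDEOCON', 'LG', 'SONY'],
    'Rupees': [12000, 10000, 15000, 14000],
    'USD': [700, 650, 800, 750]
})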
b) To add the new rows in the data frame:
# b) Adding new rows to the DataFrame
# Example new rows
new_rows = pd.DataFrame({
    'Item': ['Fridge', 'Washing Machine'],
    'Company': ['Samsung', 'IFB'],
    'Rupees': [18000, 20000],
    'USD': [850, 900]
})
df = pd.concat([df, new_rows], ignore_index=True)
c) To display the maximum price of LG TV:
# c) Displaying the maximum price of LG TV
lg_tv_max_price = df[(df['Item'] == 'TV') & (df['Company'] == 'LG')]['Rupees'].max()
# Displaying the calculated value
print("Maximum price of LG TV:", lg_tv_max_price)
d) To display the Sum of all products.
# d) Displaying the Sum of all products
total_sum = df['Rupees'].sum()
# Displaying the calculated value
print("Sum of all products:", total_sum)
e) To display the median of the USD of Sony products.
# e) Displaying the median of the USD of Sony products
sony_usd_median = df[df['Company'] == 'SONY']['USD'].median()
# Displaying the calculated value
print("Median of the USD of Sony products:", sony_usd_median)
f) To sort the data according to the Rupees and transfer the data to MySQL.
# f) Sorting the data according to the Rupees
df_sorted = df.sort_values(by='Rupees')
g) To transfer the new dataframe into the MySQL with new values.
# g) Transferring the sorted DataFrame into MySQL with new values
# Remember that we've declared a variable table_name to hold the table name 'Product'
df_sorted.to_sql(table_name, engine, if_exists='replace', index=False)
14. Write the python statement for the following question on the basis of given dataset:

| | Name | Degree | Score |
| --- | --- | --- | --- |
| 0 | Aparna | MBA | 90.0 |
| 1 | Pankaj | BCA | NaN |
| 2 | Ram | M.Tech | 80.0 |
| 3 | Ramesh | MBA | 98.0 |
| 4 | Naveen | NaN | 97.0 |
| 5 | Krrishnav | BCA | 78.0 |
| 6 | Bhawna | MBA | 89.0 |

a) To create the above DataFrame.
b) To print the Degree and maximum marks in each stream.
c) To fill the NaN with 76.
d) To set the index to Name.
e) To display the name and degree wise average marks of each student.
f) To count the number of students in MBA.
g) To print the mode marks BCA.
The following is the Python code for the tasks based on the given dataset and the results for each task:
a) To create the above DataFrame:
import numpy as np
import pandas as pd

data = {
    'Name': ['Aparna', 'Pankaj', 'Ram', 'Ramesh', 'Naveen', 'Krrishnav', 'Bhawna'],
    'Degree': ['MBA', 'BCA', 'M.Tech', 'MBA', np.nan, 'BCA', 'MBA'],   # Naveen's degree is missing (NaN)
    'Score': [90.0, np.nan, 80.0, 98.0, 97.0, 78.0, 89.0]              # Pankaj's score is missing (NaN)
}
df = pd.DataFrame(data)
b) To print the Degree and maximum marks in each stream:
max_marks_per_degree = df.groupby('Degree')['Score'].max()
Maximum Marks per Degree (rows where Degree is NaN are excluded by `groupby()` by default):
BCA: 78.0
M.Tech: 80.0
MBA: 98.0
c) To fill the NaN with 76:
df_filled = df.fillna(76)
d) To set the index to Name:
df_indexed = df.set_index('Name')
e) To display the name and degree wise average marks of each student:
avg_marks_per_student = df.groupby(['Name', 'Degree'])['Score'].mean()
f) To count the number of students in MBA:
mba_student_count = df[df['Degree'] == 'MBA']['Name'].count()
Number of MBA Students: 3
g) To print the mode marks of BCA:
bca_mode_marks = df[df['Degree'] == 'BCA']['Score'].mode()
Mode Marks for BCA: 78.0
These operations create a DataFrame from the provided data, perform various manipulations, and calculate statistical measures based on the dataset’s content.