JitCoder

Data Cleaning in Pandas (Real Examples) – Complete Beginner to Advanced Guide

Data cleaning in pandas is one of the most important steps in data analysis and machine learning. In real-world projects, raw data is often messy, incomplete, inconsistent, and full of errors. Before analyzing data or building models, we must clean the data properly.

In Python, Pandas is the most popular library for data cleaning and manipulation. It provides powerful tools to handle missing values, duplicates, incorrect formats, unwanted spaces, invalid entries, and much more.

In this complete guide, you will learn:

  • What data cleaning is
  • Why data cleaning matters
  • Common data issues
  • Real-world Pandas data cleaning examples
  • Best practices for clean datasets
  • Frequently used Pandas functions

By the end of this tutorial, you will be able to clean real datasets confidently using Pandas.


What is Data Cleaning in Pandas?

Data cleaning is the process of detecting and fixing incorrect, incomplete, duplicate, or inconsistent data.

Raw data collected from:

  • websites
  • forms
  • surveys
  • APIs
  • Excel files
  • databases

usually contains errors.

For example:

NameAgeSalary
John2550000
AliceNaN60000
John2550000
Bobtwenty45000

Problems here:

  • Missing values
  • Duplicate rows
  • Incorrect datatype
  • Invalid values

Data cleaning solves these problems.

Data Cleaning in Pandas is one of the most important skills in data analysis.

This tutorial on Data Cleaning in Pandas covers real-world examples for beginners.

With Data Cleaning in Pandas, you can handle missing values and duplicates efficiently.


Why Data Cleaning is Important

Data cleaning is essential because bad data leads to:

  • Wrong analysis
  • Incorrect business decisions
  • Poor machine learning performance
  • Data inconsistency
  • Visualization errors

Clean data improves:

  • Accuracy
  • Reliability
  • Performance
  • Decision-making

Data scientists spend almost 70% of their time cleaning data.

Data Cleaning in Pandas is one of the most important skills in data analysis.


Installing Pandas

If Pandas is not installed, use:

pip install pandas

Import Pandas:

import pandas as pd

Creating Sample Dataset

We will use this dataset throughout the tutorial.

import pandas as pddata = {    "Name": ["John", "Alice", "Bob", "John", None],    "Age": [25, None, "twenty", 25, 30],    "Salary": [50000, 60000, 45000, 50000, None],    "City": ["New York", "London", "Paris ", "New York", "Delhi"]}df = pd.DataFrame(data)print(df)

Output:

    Name     Age   Salary       City0   John      25  50000.0   New York1  Alice    None  60000.0     London2    Bob  twenty  45000.0     Paris3   John      25  50000.0   New York4   None      30      NaN      Delhi

This dataset contains several data quality issues.


1. Checking Dataset Information

Before cleaning data, always inspect the dataset.

View First Rows

print(df.head())

Check Shape

print(df.shape)

Output:

(5, 4)

Check Data Types

print(df.dtypes)

Output:

Name       objectAge        objectSalary    float64City       object

Notice:

  • Age should be numeric but is object type.

Summary Statistics

print(df.describe())

2. Handling Missing Values

Missing values are very common in real datasets.

Detect Missing Values

print(df.isnull())

Count Missing Values

print(df.isnull().sum())

Output:

Name      1Age       1Salary    1City      0

Removing Missing Values

Drop Rows with Missing Values

df_clean = df.dropna()print(df_clean)

Rows containing missing values will be removed.


Drop Specific Columns with Missing Values

df.dropna(axis=1)
  • axis=0 → rows
  • axis=1 → columns

Filling Missing Values

Instead of deleting data, we can fill missing values.

Fill with Fixed Value

df["Salary"].fillna(0, inplace=True)

Fill with Mean

df["Salary"].fillna(df["Salary"].mean(), inplace=True)

Fill with Median

df["Age"].fillna(df["Age"].median(), inplace=True)

Fill with Mode

df["City"].fillna(df["City"].mode()[0], inplace=True)

Real Example: Student Marks Dataset

data = {    "Student": ["A", "B", "C", "D"],    "Marks": [80, None, 75, 90]}df = pd.DataFrame(data)df["Marks"].fillna(df["Marks"].mean(), inplace=True)print(df)

Output:

  Student  Marks0       A   80.01       B   81.62       C   75.03       D   90.0

3. Removing Duplicate Data

Duplicate rows can distort analysis.

Detect Duplicate Rows

print(df.duplicated())

Count Duplicates

print(df.duplicated().sum())

Remove Duplicate Rows

df = df.drop_duplicates()

Real Example

Before cleaning:

NameAge
John25
John25

After removing duplicates:

NameAge
John25

4. Fixing Incorrect Data Types

Wrong data types are common in CSV or Excel files.

Check Data Types

print(df.dtypes)

Suppose Age contains text values like "twenty".


Convert Column to Numeric

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

Explanation:

  • Invalid values become NaN

Output:

0    25.01     NaN2     NaN3    25.04    30.0

Convert String to Datetime

df["Date"] = pd.to_datetime(df["Date"])

Example:

data = {    "Date": ["2025-01-01", "2025-02-15"]}df = pd.DataFrame(data)df["Date"] = pd.to_datetime(df["Date"])

5. Removing Extra Spaces

Extra spaces create inconsistency.

Example:

  • "Paris"
  • "Paris "

These are treated differently.

Remove Spaces

df["City"] = df["City"].str.strip()

Convert Text to Lowercase

df["City"] = df["City"].str.lower()

Output:

new yorklondonparis

Convert Text to Uppercase

df["City"] = df["City"].str.upper()

6. Renaming Columns

Messy column names make analysis difficult.

Rename Columns

df.rename(columns={"Salary": "Employee_Salary"}, inplace=True)

Convert All Column Names to Lowercase

df.columns = df.columns.str.lower()

Replace Spaces in Column Names

df.columns = df.columns.str.replace(" ", "_")

7. Replacing Incorrect Values

Sometimes datasets contain invalid entries.

Example:

  • Gender column contains:
    • Male
    • M
    • male

We need consistency.

Replace Values

df["Gender"] = df["Gender"].replace({    "M": "Male",    "male": "Male"})

Real Example

Before:

Gender
M
male
Male

After:

Gender
Male
Male
Male

8. Handling Outliers

Outliers are abnormal values.

Example:

  • Salary = 99999999

These can affect analysis.


Detect Outliers Using IQR Method

Q1 = df["Salary"].quantile(0.25)Q3 = df["Salary"].quantile(0.75)IQR = Q3 - Q1lower = Q1 - 1.5 * IQRupper = Q3 + 1.5 * IQRoutliers = df[(df["Salary"] < lower) | (df["Salary"] > upper)]print(outliers)

Remove Outliers

df = df[(df["Salary"] >= lower) & (df["Salary"] <= upper)]

9. Handling Invalid Data

Sometimes values are impossible.

Example:

  • Age = -5
  • Marks = 150

Filter Invalid Values

df = df[df["Age"] > 0]

Real Example

df = df[df["Marks"] <= 100]

10. Working with Dates

Date columns often require cleaning.

Convert to Datetime

df["Joining_Date"] = pd.to_datetime(df["Joining_Date"])

Extract Year

df["Year"] = df["Joining_Date"].dt.year

Extract Month

df["Month"] = df["Joining_Date"].dt.month

Extract Day

df["Day"] = df["Joining_Date"].dt.day

11. Handling Categorical Data

Example:

  • Gender
  • City
  • Department

Check Unique Values

print(df["City"].unique())

Count Categories

print(df["City"].value_counts())

12. Removing Special Characters

Example:

"$5000""₹7000"

Need numeric values only.

Remove Symbols

df["Salary"] = df["Salary"].replace(r"[\$,₹]", "", regex=True)

Convert to Numeric

df["Salary"] = pd.to_numeric(df["Salary"])

13. Real-World Employee Dataset Cleaning

Raw Dataset

data = {    "Name": ["John ", "Alice", "BOB", None],    "Age": ["25", "30", "twenty", "40"],    "Salary": ["50000", "$60000", "45000", None],    "City": ["New York", "London ", "PARIS", "Delhi"]}df = pd.DataFrame(data)

Step 1: Remove Missing Values

df.dropna(inplace=True)

Step 2: Clean Text Data

df["Name"] = df["Name"].str.strip().str.title()df["City"] = df["City"].str.strip().str.title()

Step 3: Clean Salary Column

df["Salary"] = df["Salary"].replace(r"[\$]", "", regex=True)df["Salary"] = pd.to_numeric(df["Salary"])

Step 4: Clean Age Column

df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

Step 5: Remove Invalid Rows

df.dropna(inplace=True)

Final Clean Dataset

print(df)

Output:

    Name   Age  Salary      City0   John  25.0   50000  New York1  Alice  30.0   60000    London

Important Pandas Functions for Data Cleaning

FunctionPurpose
isnull()Detect missing values
dropna()Remove missing values
fillna()Fill missing values
duplicated()Find duplicates
drop_duplicates()Remove duplicates
replace()Replace values
astype()Change datatype
to_numeric()Convert to numeric
to_datetime()Convert to date
str.strip()Remove spaces
value_counts()Count unique values

Best Practices for Data Cleaning

1. Always Backup Raw Data

Never modify the original dataset directly.

df_copy = df.copy()

2. Inspect Data Before Cleaning

Always use:

  • head()
  • info()
  • describe()

3. Handle Missing Values Carefully

Removing too much data can reduce dataset quality.


4. Standardize Text Data

Use:

  • lowercase
  • uppercase
  • title case

for consistency.


5. Validate Data Types

Always verify:

  • numeric columns
  • date columns
  • categorical columns

Common Mistakes in Data Cleaning

Ignoring Missing Values

Missing data can produce incorrect results.


Removing Too Much Data

Excessive row deletion may reduce dataset usefulness.


Not Checking Duplicates

Duplicates distort analysis.


Inconsistent Text Formatting

Example:

  • Delhi
  • delhi
  • DELHI

These should be standardized.


Real-World Use Cases of Data Cleaning

Data cleaning is used in:

  • Data Science
  • Machine Learning
  • Business Analytics
  • Finance
  • Healthcare
  • E-commerce
  • Banking
  • Research

Example: E-commerce Dataset Cleaning

Problems:

  • Missing prices
  • Duplicate orders
  • Wrong product names
  • Extra spaces
  • Invalid quantities

Using Pandas, companies clean millions of records daily.


Advantages of Using Pandas for Data Cleaning

Why developers love Pandas:

  • Easy syntax
  • Fast processing
  • Powerful functions
  • Excellent CSV/Excel support
  • Works well with NumPy and Matplotlib

Conclusion

Data cleaning is one of the most critical skills in data analysis and machine learning. Real-world datasets are rarely perfect, so knowing how to clean data properly is essential.

With Pandas, you can efficiently:

  • handle missing values
  • remove duplicates
  • fix incorrect data types
  • standardize text
  • clean dates
  • detect outliers
  • validate data

Mastering these techniques will help you build accurate analytics and better machine learning models.

If you are learning Python for data science, Pandas data cleaning should be one of your top priorities.


FAQs

What is data cleaning in Pandas?

Data cleaning in Pandas means fixing missing, duplicate, inconsistent, or incorrect data using Pandas functions.


Why is data cleaning important?

Clean data improves analysis accuracy and machine learning performance.


Which Pandas function removes duplicates?

drop_duplicates()

How do you fill missing values in Pandas?

fillna()

How to convert strings to numbers in Pandas?

pd.to_numeric()

Is Pandas good for big datasets?

Yes, Pandas is excellent for medium to large datasets and widely used in industry.

This tutorial on Data Cleaning in Pandas covers real-world examples for beginners.

Learn more from the official Pandas documentation:

Read also:
– NumPy Basics Explained with Examples
– Python Exception Handling Tutorial
– Python File Handling Explained

Leave a Comment

Your email address will not be published. Required fields are marked *