Python for Data Science #2: Data Analysis with pandas – Mastering the DataFrame

Your journey into Python data analysis starts with the single most important tool: the **DataFrame**.

**pandas** is the core library that makes data analysis in Python feel intuitive. It converts the familiar structure of a spreadsheet into a **powerful, flexible data structure** called the DataFrame. Here is the fundamental workflow for using it.

Step 1: Importing and Loading Data

Every analysis starts with importing the library and loading your data, typically from a CSV file:

```python
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('data/sales_data.csv')
```

We use the common alias `pd` for pandas.
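If you don't have a CSV on disk yet, here is a minimal sketch that loads the same kind of data from an in-memory string. The column names and values are invented for illustration and are not the contents of `sales_data.csv`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for 'data/sales_data.csv';
# the columns and values are made up for this example.
csv_text = """Region,Item,Quantity,Price
North,Widget,10,2.50
South,Gadget,,4.00
North,Gadget,7,4.00
East,Widget,3,2.50
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The empty `Quantity` field in the second row is read as a missing value, which is the kind of gap you will want to spot during inspection.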

Step 2: Data Inspection (The First Look)

Before any heavy lifting, you need to understand the data's shape, content, and quality. You can quickly view the first or last few rows using `.head()` and `.tail()`.

**The Data Profile:** Use **`.info()`** to get a summary of your data, including the column names, the **non-null count** for each column, the data types (`dtypes`), and the memory usage.
**The Shape:** Use **`.shape`** to quickly see the total number of rows and columns (e.g., `(1000, 5)` means 1000 rows and 5 columns).
**Statistical Summary:** Use **`.describe()`** to generate descriptive statistics (like count, mean, min, and max). By default, `.describe()` summarizes only numerical columns; pass `include='all'` to cover text columns as well.
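Here is a rough sketch of that first look, run on a tiny hand-made DataFrame. The data is hypothetical; swap in your own `df` once you have loaded a file.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Region": ["North", "South", "North", "East"],
    "Item": ["Widget", "Gadget", "Gadget", "Widget"],
    "Quantity": [10.0, None, 7.0, 3.0],
})

print(df.head(2))  # first two rows
print(df.tail(1))  # last row
print(df.shape)    # (rows, columns)
df.info()          # prints its summary directly and returns None

# Numeric columns only by default
numeric_summary = df.describe()
# include="all" also summarizes text columns (unique, top, freq)
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Note that `.shape` is an attribute, not a method, so it takes no parentheses.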

Step 3: Data Cleaning (Handling Missing Values)

Missing data, which pandas usually displays as `NaN` (Not a Number), must be addressed before you analyze anything.

**Dropping Rows:** Use `df.dropna()` to remove rows that contain any missing values. It returns a new DataFrame and leaves `df` unchanged, so assign the result to a variable.
**Filling Values:** Use `df.fillna(value)` to replace missing values (`NaN`) with a chosen value, such as the column's mean, its mode, or a specific label like 'New York'.
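Both approaches can be sketched on a small example. The DataFrame and the fill values ('Unknown' and the column mean) are assumptions chosen for illustration; the right strategy depends on your data.

```python
import pandas as pd

# Hypothetical data with gaps in both columns
df = pd.DataFrame({
    "Region": ["North", None, "North", "East"],
    "Quantity": [10.0, None, 7.0, 3.0],
})

# Count missing values per column before deciding what to do
print(df.isna().sum())

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill each column with its own replacement value
filled = df.fillna({
    "Region": "Unknown",                # placeholder label
    "Quantity": df["Quantity"].mean(),  # mean of the non-missing values
})

print(dropped)
print(filled)
```

Passing a dictionary to `fillna` lets each column get a replacement that suits its type, instead of one value for the whole table.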

Step 4: Manipulation: Grouping and Aggregation

One of the most powerful analysis techniques is grouping data to find summaries per category.

```python
# Group sales data by 'Region' and calculate the total Quantity
df.groupby('Region')['Quantity'].sum()
```

This allows you to take raw data and aggregate metrics by distinct categories like Region, Item, or Date.
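To see what grouping produces, here is a small sketch on made-up sales rows. The column names follow the example above, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical sales rows for illustration
df = pd.DataFrame({
    "Region": ["North", "South", "North", "East", "South"],
    "Item": ["Widget", "Gadget", "Gadget", "Widget", "Widget"],
    "Quantity": [10, 4, 7, 3, 6],
})

# One total per region; the result is a Series indexed by Region
totals = df.groupby("Region")["Quantity"].sum()
print(totals)

# Several aggregations at once; the result is a DataFrame
summary = df.groupby("Region")["Quantity"].agg(["sum", "mean", "count"])
print(summary)

# Group by more than one category by passing a list of columns
by_region_item = df.groupby(["Region", "Item"])["Quantity"].sum()
print(by_region_item)
```

The group keys become the index of the result. If you would rather keep them as ordinary columns, calling `.reset_index()` on the result is one way to do that.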

Your Next Steps: Data Challenge

To solidify these fundamentals, you should immediately practice these three steps:

Find a real-world CSV file to download.
Run **`.info()`** and **`.describe()`** to inspect it.
Practice grouping the data by one category using **`.groupby()`**.
