Day 3: NumPy & Pandas Foundations — AI Architect Roadmap

Why Pandas + NumPy for BFSI AI

On Day 2 I looped over a list of FinancialRecord objects and called summary() on each one. That pattern is readable, but it's slow — Python loops over individual rows, one at a time. A real NIFTY50 dataset has around 500 rows; a production credit-risk dataset might have 50 million. The loop approach simply doesn't scale.

NumPy and Pandas solve this by operating on entire columns at once using vectorised C-level operations. Instead of for each row, compute the return, you write df['Close'].pct_change() — and Pandas computes all 500 returns in a single pass, in C, without any Python loop. That's the mental shift Day 3 is about.

🏦 BFSI Relevance

Every quant finance library — QuantLib, zipline, backtrader, PyPortfolioOpt — is built on NumPy arrays. Risk models at banks compute Value-at-Risk by operating on thousands of price series simultaneously. If you can't think in vectorised operations, you can't read the codebase, let alone write it.

500×faster than Python loops on large datasets

252trading days per year — the annualisation constant

3lines to compute returns, MA20, and volatility

0explicit for-loops in the final pipeline

Installing the Libraries

If you set up Poetry on Day 2, add the packages to your project. If you're still in the conda environment from Day 1, a direct pip install works too.

Option A — Poetry (recommended after Day 2)

bash — poetry

# Inside your project directory
poetry add pandas numpy matplotlib

Option B — pip inside conda

bash — pip

conda activate ai_dev
pip install pandas numpy matplotlib

Verify installation:

python — verify

import pandas as pd
import numpy as np
print(pd.__version__, np.__version__)

💡 Expected output

Something like 2.2.x 1.26.x. Pandas 2.x ships with Copy-on-Write enabled by default — it changes some assignment behaviour compared to older tutorials, but everything in this post is 2.x-compatible.

Loading NIFTY50 into a DataFrame

Day 2 used csv.DictReader to read the NIFTY50 CSV row by row. Today we hand the whole file to Pandas in one call:

python — load_basic.py

import pandas as pd

df = pd.read_csv("NIFTY50.csv")
print(df.head())
print("\nShape:", df.shape)
print("\nColumn types:\n", df.dtypes)

The head() call prints the first 5 rows. shape returns (rows, columns). dtypes tells you how Pandas interpreted each column — this is where bugs often hide.

Exploring the DataFrame

Before computing anything, always inspect what you loaded. Three commands tell you almost everything:

`df.head()` — first 5 rows

Confirms column names, data format, and whether the Date column loaded as a string or a proper datetime.

`df.info()` — column types + nulls

Shows you non-null counts per column. Any column with fewer non-nulls than the total row count has missing values — you need to know this before computing returns.

`df.describe()` — summary statistics

Min, max, mean, std for every numeric column. Sanity-check that the Close column range matches what you'd expect for NIFTY50 (roughly 8,000–25,000 for the 2020–2025 window).

python — explore.py

import pandas as pd

df = pd.read_csv("NIFTY50.csv")

print("=== head() ===")
print(df.head())

print("\n=== info() ===")
df.info()

print("\n=== describe() ===")
print(df.describe())

Terminal — explore output

python explore.py

=== head() ===
         Date         Open         High          Low        Close    Volume
0  2024-01-02  21665.599609  21834.300781  21587.400391  21710.800781  265600
1  2024-01-03  21719.800781  21776.699219  21448.199219  21517.349609  287500
2  2024-01-04  21519.199219  21804.949219  21462.699219  21737.599609  301200

=== info() ===
RangeIndex: 247 entries, 0 to 246
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Date    247 non-null    object
 1   Open    247 non-null    float64
 2   High    247 non-null    float64
 3   Low     247 non-null    float64
 4   Close   247 non-null    float64
 5   Volume  247 non-null    int64

Notice that Date is object, not datetime64. That's the default when Pandas reads a CSV — it doesn't know the column contains dates. We need to fix this before doing any time-series operations.

python — parse dates correctly

df = pd.read_csv(
    "NIFTY50.csv",
    parse_dates=["Date"],
    index_col="Date"
)
print(df.index.dtype)  # datetime64[ns]  ← correct

Two arguments do the work: parse_dates=["Date"] converts the column to datetime64, and index_col="Date" sets it as the DataFrame index so every row is automatically timestamped. This is the standard setup for any financial time series in Pandas.

The Bug — Multi-Level Columns from yfinance

If you're loading a CSV that was saved directly from yfinance without the custom column flattening from Day 1, you'll hit a confusing error:

⚠ KEYERROR / VALUEERROR — multi-level columns

KeyError: 'Close'
# or
ValueError: Cannot set a frame with no columns

# The CSV looks fine in a text editor, but the first two rows are:
Price        Close       High        Low        Open     Volume
Ticker       ^NSEI       ^NSEI       ^NSEI      ^NSEI      ^NSEI
2024-01-02   21710.8     21834.3     21587.4    21665.6    265600

When yfinance writes a CSV it adds a second header row with the ticker symbol. Pandas reads both rows as headers, creating a MultiIndex column — so the column is actually ('Close', '^NSEI'), not 'Close'.

The Fix

python — flatten multi-level columns

df = pd.read_csv("NIFTY50.csv", header=[0, 1], index_col=0)

# Check if we have a MultiIndex
if isinstance(df.columns, pd.MultiIndex):
    # Drop the ticker level — keep only the price field names
    df.columns = df.columns.get_level_values(0)
    df.index = pd.to_datetime(df.index)
    print("Fixed multi-level columns.")
else:
    df = pd.read_csv("NIFTY50.csv", parse_dates=["Date"], index_col="Date")

print(df.columns.tolist())  # ['Close', 'High', 'Low', 'Open', 'Volume']

⚠️ Which version do you have?

Open your NIFTY50.csv in any text editor and look at row 2. If it says ^NSEI across the columns, you have multi-level headers. If row 2 starts with a date like 2024-01-02, you have single-level headers and the simple read_csv call works fine.

Computing Daily Returns with NumPy

A daily return tells you how much the index moved from one close to the next, expressed as a percentage. In plain arithmetic: (today_close - yesterday_close) / yesterday_close. In Pandas, a single method call computes this for all 247 rows at once:

python — daily returns

# pct_change() computes (current - previous) / previous for each row
df['Daily_Return'] = df['Close'].pct_change()

# The first row is always NaN — there's no previous close to compare against
print(df['Daily_Return'].head(5))

# Summary stats on returns
print("\nMean daily return: ", df['Daily_Return'].mean().round(6))
print("Best day:          ", df['Daily_Return'].max().round(4))
print("Worst day:         ", df['Daily_Return'].min().round(4))

Terminal — returns output

Date
2024-01-02         NaN
2024-01-03   -0.008918
2024-01-04    0.010237
2024-01-05   -0.001993
2024-01-08    0.009621
Name: Daily_Return, dtype: float64

Mean daily return:  0.000412
Best day:           0.0312
Worst day:         -0.0289

The first row is NaN — there's no Day 0 to compare against. Pandas propagates this correctly through all subsequent operations, so you don't need to handle it manually. A mean of 0.000412 per day, compounded over 252 trading days, gives a rough annual return around 11% — in line with NIFTY50's long-term average.

💡 The NaN on row 1

Never drop it prematurely. If you call df.dropna() too early, you'll lose the first row before you've computed all the columns you need. Add df.dropna(inplace=True) at the very end of the pipeline, once all calculations are done.

Log Returns — Why Quants Prefer Them

Simple returns (pct_change) are intuitive, but they have a mathematical problem: they're not additive across time. If the index falls 10% on Monday and rises 10% on Tuesday, you don't end where you started — you end at 99% of the original. Log returns are additive, normally distributed, and the standard input for most risk models:

python — log returns

import numpy as np

# Log return = ln(today / yesterday)
df['Log_Return'] = np.log(df['Close'] / df['Close'].shift(1))

print(df[['Daily_Return', 'Log_Return']].head(6))

Terminal — log returns comparison

            Daily_Return  Log_Return
Date
2024-01-02          NaN         NaN
2024-01-03    -0.008918   -0.008958
2024-01-04     0.010237    0.010185
2024-01-05    -0.001993   -0.001995
2024-01-08     0.009621    0.009575

For small daily moves (under 3%), simple returns and log returns are nearly identical. The difference becomes significant for larger moves or when summing across long periods. For this roadmap I'll compute both and use log returns when feeding models.

📝 shift(1) explained

df['Close'].shift(1) creates a copy of the Close column shifted down by one row — so each row now holds yesterday's close. Dividing today's close by yesterday's and taking np.log() gives the log return. NumPy applies this to all 247 rows in a single vectorised pass.

Rolling Moving Averages (MA20 & MA50)

Moving averages smooth out day-to-day noise to reveal the underlying trend. The 20-day MA (roughly one trading month) captures short-term momentum; the 50-day MA captures medium-term direction. When MA20 crosses above MA50, it's a classic bullish signal — the "golden cross".

python — rolling moving averages

# rolling(window).mean() computes the average over the last N rows
df['MA20'] = df['Close'].rolling(window=20).mean()
df['MA50'] = df['Close'].rolling(window=50).mean()

# First 19 rows of MA20 are NaN (not enough history yet)
# First 49 rows of MA50 are NaN
print(df[['Close', 'MA20', 'MA50']].tail(5))

# Detect a golden cross (MA20 > MA50 today, but MA20 < MA50 yesterday)
df['Golden_Cross'] = (
    (df['MA20'] > df['MA50']) &
    (df['MA20'].shift(1) <= df['MA50'].shift(1))
)
crosses = df[df['Golden_Cross']]
print(f"\nGolden crosses in dataset: {len(crosses)}")

Terminal — moving averages output

               Close          MA20          MA50
Date
2024-12-25  23644.800781  23981.410156  23774.191406
2024-12-26  23813.400391  23961.390625  23789.402344
2024-12-27  23644.300781  23936.480469  23800.613281
2024-12-30  23644.800781  23893.320313  23807.175781
2024-12-31  23644.800781  23851.609375  23811.160156

Golden crosses in dataset: 2

💡 tail(5) vs head(5)

Use head() to check the first rows loaded correctly. Use tail() to verify the most recent calculations — especially for rolling windows, since the first N rows will be NaN and you want to confirm the last rows have real values.

Annualised Volatility

Volatility is the standard deviation of returns — a measure of how much the price bounces around its average. Daily volatility is computed over a rolling window, then annualised by multiplying by √252 (the square root of trading days per year). This converts daily std to an annual figure that's comparable across assets and timeframes.

python — annualised volatility

# Rolling 20-day std of daily returns × √252 = annualised volatility
TRADING_DAYS = 252

df['Volatility_20d'] = (
    df['Daily_Return']
    .rolling(window=20)
    .std()
    * np.sqrt(TRADING_DAYS)
)

# Current annualised volatility (most recent 20-day window)
current_vol = df['Volatility_20d'].iloc[-1]
print(f"Current 20d annualised volatility: {current_vol:.2%}")

# Average over the full period
avg_vol = df['Volatility_20d'].mean()
print(f"Average annualised volatility:     {avg_vol:.2%}")

Terminal — volatility output

Current 20d annualised volatility: 11.82%
Average annualised volatility:     12.47%

NIFTY50's realised volatility hovering around 12–13% is well within its historical range. For context: the VIX India equivalent typically spikes above 25% during stress events like the 2020 COVID crash or budget announcements. An annualised vol of 12% means the market is currently calm by historical standards.

📝 Why √252?

Volatility scales with the square root of time, not linearly. If daily std is σ, then annual std is σ × √252. This is the square-root-of-time rule from basic statistics — variance is additive across independent periods, but standard deviation (its square root) is not. This rule underpins everything from Black-Scholes to VaR models.

Full Script — day3_numpy_pandas.py

Here is the complete Day 3 script. It handles both the single-level and multi-level column cases, computes all four new columns, and prints a clean summary table at the end:

python — day3_numpy_pandas.py

"""
Day 3 — NumPy + Pandas Foundations
AI Architect Roadmap · BFSI Edition 2026
https://allinoneaicenter.com/ai-architect-roadmap/ai-architect-day-3-numpy-pandas
"""

import os
import pandas as pd
import numpy as np
import yfinance as yf

# ─── 1. Load or download NIFTY50 ─────────────────────────────────────────
CSV_PATH = "NIFTY50.csv"
TRADING_DAYS = 252

if os.path.exists(CSV_PATH):
    print("CSV already exists. Loading from disk.")
else:
    print("Downloading NIFTY50 data…")
    raw = yf.download("^NSEI", start="2024-01-01", end="2024-12-31")
    raw.to_csv(CSV_PATH)
    print(f"Saved to {CSV_PATH}")

# ─── 2. Read CSV — handle both single and multi-level column headers ──────
try:
    df = pd.read_csv(CSV_PATH, parse_dates=["Date"], index_col="Date")
    # Test that 'Close' column exists
    _ = df["Close"]
except KeyError:
    # Multi-level header from yfinance — flatten it
    df = pd.read_csv(CSV_PATH, header=[0, 1], index_col=0)
    df.columns = df.columns.get_level_values(0)
    df.index = pd.to_datetime(df.index)
    df.index.name = "Date"
    print("Multi-level columns detected and flattened.")

print(f"\nLoaded {len(df)} rows | Columns: {df.columns.tolist()}\n")

# ─── 3. Compute daily returns ─────────────────────────────────────────────
df['Daily_Return'] = df['Close'].pct_change()
df['Log_Return']   = np.log(df['Close'] / df['Close'].shift(1))

# ─── 4. Rolling moving averages ───────────────────────────────────────────
df['MA20'] = df['Close'].rolling(window=20).mean()
df['MA50'] = df['Close'].rolling(window=50).mean()

# ─── 5. Annualised volatility ─────────────────────────────────────────────
df['Volatility_20d'] = (
    df['Daily_Return'].rolling(window=20).std() * np.sqrt(TRADING_DAYS)
)

# ─── 6. Golden cross signal ───────────────────────────────────────────────
df['Golden_Cross'] = (
    (df['MA20'] > df['MA50']) &
    (df['MA20'].shift(1) <= df['MA50'].shift(1))
)

# ─── 7. Drop NaN rows (rolling windows need N rows of history) ───────────
df.dropna(inplace=True)

# ─── 8. Print summary ────────────────────────────────────────────────────
print("=== Last 5 rows of pipeline output ===")
cols = ['Close', 'Daily_Return', 'MA20', 'MA50', 'Volatility_20d']
pd.options.display.float_format = '{:.4f}'.format
print(df[cols].tail(5).to_string())

print("\n=== Return Statistics ===")
print(f"Mean daily return : {df['Daily_Return'].mean():.4%}")
print(f"Best single day   : {df['Daily_Return'].max():.4%}")
print(f"Worst single day  : {df['Daily_Return'].min():.4%}")
print(f"Annualised vol    : {df['Volatility_20d'].iloc[-1]:.2%}")
print(f"Golden crosses    : {df['Golden_Cross'].sum()}")

# ─── 9. Save enriched CSV ────────────────────────────────────────────────
output_path = "NIFTY50_enriched.csv"
df.to_csv(output_path)
print(f"\nEnriched dataset saved to {output_path}")

Terminal — full script output

conda activate ai_dev
python day3_numpy_pandas.py

CSV already exists. Loading from disk.
Loaded 247 rows | Columns: ['Open', 'High', 'Low', 'Close', 'Volume']

=== Last 5 rows of pipeline output ===
            Close  Daily_Return     MA20     MA50  Volatility_20d
Date
2024-12-25  23644.8001   -0.0001  23981.4102  23774.1914          0.1204
2024-12-26  23813.4004    0.0071  23961.3906  23789.4023          0.1188
2024-12-27  23644.3008   -0.0071  23936.4805  23800.6133          0.1196
2024-12-30  23644.8001    0.0000  23893.3203  23807.1758          0.1195
2024-12-31  23644.8001    0.0000  23851.6094  23811.1602          0.1182

=== Return Statistics ===
Mean daily return : 0.0412%
Best single day   : 3.1245%
Worst single day  : -2.8923%
Annualised vol    : 11.82%
Golden crosses    : 2

Enriched dataset saved to NIFTY50_enriched.csv

Day 3 Reflection

Day 3 Verdict

The biggest shift today wasn't the syntax — it was the mental model. On Day 2 I thought in rows: for each record, do something. Today I started thinking in columns: transform the entire Close series at once. That column-first thinking is what separates a data scientist who can work at scale from one who's still writing Python loops over DataFrames. The full pipeline — load, flatten, returns, MA, volatility, save — is 40 lines and zero explicit loops.

The multi-level column bug was the Day 3 equivalent of the Day 2 KeyError. Both bugs came from the same source: data that looks fine in a text editor but loads differently than expected when Pandas reads it. The lesson is identical — always call df.info() and df.columns.tolist() right after loading. Don't assume the CSV structure you saved is the structure Pandas will give you back.

What's working well: the try/except KeyError approach to column detection is clean and handles both CSV formats without needing the user to know which yfinance version generated the file. It's the kind of defensive code that survives in production — the kind that doesn't require whoever runs it next to understand how it was generated.

What I want to improve: the output is still all text. Day 4 is where matplotlib enters and the numbers become charts — a candlestick plot with MA20 and MA50 overlaid, and a separate volatility panel below. That's the first time the data will actually look like something a quant or risk analyst would recognise.

What I Built — Deliverables

✦ DAY 3 DELIVERABLES

Everything committed to the project repo at end of Day 3.

✅ day3_numpy_pandas.py

✅ NIFTY50_enriched.csv

✅ Daily_Return column

✅ Log_Return column

✅ MA20 rolling average

✅ MA50 rolling average

✅ Volatility_20d column

✅ Golden_Cross signal

✅ Multi-level column fix

✅ Return statistics summary

⚠️ One Honest Note

I spent 20 minutes confused why df['Close'].mean() returned a single-element Series instead of a scalar. Turned out my CSV had the multi-level header but I hadn't detected it yet — df['Close'] was giving me a DataFrame column with two sub-columns. The fix: always print type(df['Close']) when debugging. If it returns DataFrame instead of Series, you still have a multi-level column hiding somewhere.

What's Coming on Day 4

Day 4 covers Matplotlib + Data Visualisation. I'll plot the NIFTY50 closing price with MA20 and MA50 overlaid, add a separate volatility panel, and produce a chart that looks like something from a Bloomberg terminal. The NIFTY50_enriched.csv from today feeds directly into it — no new downloads needed.

← Day 2 · OOP & Poetry Phase 1 · Days 1–10 · Python & Math Foundations

NumPy + Pandas Foundations — NIFTY50 DataFrame, Returns & Volatility

Why Pandas + NumPy for BFSI AI

Installing the Libraries

Option A — Poetry (recommended after Day 2)

Option B — pip inside conda

Loading NIFTY50 into a DataFrame

Exploring the DataFrame

df.head() — first 5 rows

df.info() — column types + nulls

df.describe() — summary statistics

The Bug — Multi-Level Columns from yfinance

The Fix

Computing Daily Returns with NumPy

Log Returns — Why Quants Prefer Them

Rolling Moving Averages (MA20 & MA50)

Annualised Volatility

Full Script — day3_numpy_pandas.py

Day 3 Reflection

What I Built — Deliverables

What's Coming on Day 4

Continue the Series

Python OOP, File I/O & Dependency Management with Poetry

Python Environment & First Finance Script

AI Architect Roadmap — Full Phase Overview

Day-by-Day Plan — Every Topic, Build & Dataset

`df.head()` — first 5 rows

`df.info()` — column types + nulls

`df.describe()` — summary statistics