Why Pandas + NumPy for BFSI AI
On Day 2 I looped over a list of FinancialRecord objects and called summary() on each one. That pattern is readable, but it's slow — Python loops over individual rows, one at a time. A real NIFTY50 dataset has around 500 rows; a production credit-risk dataset might have 50 million. The loop approach simply doesn't scale.
NumPy and Pandas solve this by operating on entire columns at once using vectorised C-level operations. Instead of for each row, compute the return, you write df['Close'].pct_change() — and Pandas computes all 500 returns in a single pass, in C, without any Python loop. That's the mental shift Day 3 is about.
Every quant finance library — QuantLib, zipline, backtrader, PyPortfolioOpt — is built on NumPy arrays. Risk models at banks compute Value-at-Risk by operating on thousands of price series simultaneously. If you can't think in vectorised operations, you can't read the codebase, let alone write it.
Installing the Libraries
If you set up Poetry on Day 2, add the packages to your project. If you're still in the conda environment from Day 1, a direct pip install works too.
Option A — Poetry (recommended after Day 2)
# Inside your project directory
poetry add pandas numpy matplotlib
Option B — pip inside conda
conda activate ai_dev pip install pandas numpy matplotlib
Verify installation:
import pandas as pd import numpy as np print(pd.__version__, np.__version__)
Something like 2.2.x 1.26.x. Pandas 2.x ships with Copy-on-Write enabled by default — it changes some assignment behaviour compared to older tutorials, but everything in this post is 2.x-compatible.
Loading NIFTY50 into a DataFrame
Day 2 used csv.DictReader to read the NIFTY50 CSV row by row. Today we hand the whole file to Pandas in one call:
import pandas as pd df = pd.read_csv("NIFTY50.csv") print(df.head()) print("\nShape:", df.shape) print("\nColumn types:\n", df.dtypes)
The head() call prints the first 5 rows. shape returns (rows, columns). dtypes tells you how Pandas interpreted each column — this is where bugs often hide.
Exploring the DataFrame
Before computing anything, always inspect what you loaded. Three commands tell you almost everything:
df.head() — first 5 rows
Confirms column names, data format, and whether the Date column loaded as a string or a proper datetime.
df.info() — column types + nulls
Shows you non-null counts per column. Any column with fewer non-nulls than the total row count has missing values — you need to know this before computing returns.
df.describe() — summary statistics
Min, max, mean, std for every numeric column. Sanity-check that the Close column range matches what you'd expect for NIFTY50 (roughly 8,000–25,000 for the 2020–2025 window).
import pandas as pd df = pd.read_csv("NIFTY50.csv") print("=== head() ===") print(df.head()) print("\n=== info() ===") df.info() print("\n=== describe() ===") print(df.describe())
python explore.py === head() === Date Open High Low Close Volume 0 2024-01-02 21665.599609 21834.300781 21587.400391 21710.800781 265600 1 2024-01-03 21719.800781 21776.699219 21448.199219 21517.349609 287500 2 2024-01-04 21519.199219 21804.949219 21462.699219 21737.599609 301200 === info() === RangeIndex: 247 entries, 0 to 246 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 247 non-null object 1 Open 247 non-null float64 2 High 247 non-null float64 3 Low 247 non-null float64 4 Close 247 non-null float64 5 Volume 247 non-null int64
Notice that Date is object, not datetime64. That's the default when Pandas reads a CSV — it doesn't know the column contains dates. We need to fix this before doing any time-series operations.
df = pd.read_csv( "NIFTY50.csv", parse_dates=["Date"], index_col="Date" ) print(df.index.dtype) # datetime64[ns] ← correct
Two arguments do the work: parse_dates=["Date"] converts the column to datetime64, and index_col="Date" sets it as the DataFrame index so every row is automatically timestamped. This is the standard setup for any financial time series in Pandas.
The Bug — Multi-Level Columns from yfinance
If you're loading a CSV that was saved directly from yfinance without the custom column flattening from Day 1, you'll hit a confusing error:
KeyError: 'Close' # or ValueError: Cannot set a frame with no columns # The CSV looks fine in a text editor, but the first two rows are: Price Close High Low Open Volume Ticker ^NSEI ^NSEI ^NSEI ^NSEI ^NSEI 2024-01-02 21710.8 21834.3 21587.4 21665.6 265600
When yfinance writes a CSV it adds a second header row with the ticker symbol. Pandas reads both rows as headers, creating a MultiIndex column — so the column is actually ('Close', '^NSEI'), not 'Close'.
The Fix
df = pd.read_csv("NIFTY50.csv", header=[0, 1], index_col=0) # Check if we have a MultiIndex if isinstance(df.columns, pd.MultiIndex): # Drop the ticker level — keep only the price field names df.columns = df.columns.get_level_values(0) df.index = pd.to_datetime(df.index) print("Fixed multi-level columns.") else: df = pd.read_csv("NIFTY50.csv", parse_dates=["Date"], index_col="Date") print(df.columns.tolist()) # ['Close', 'High', 'Low', 'Open', 'Volume']
Open your NIFTY50.csv in any text editor and look at row 2. If it says ^NSEI across the columns, you have multi-level headers. If row 2 starts with a date like 2024-01-02, you have single-level headers and the simple read_csv call works fine.
Computing Daily Returns with NumPy
A daily return tells you how much the index moved from one close to the next, expressed as a percentage. In plain arithmetic: (today_close - yesterday_close) / yesterday_close. In Pandas, a single method call computes this for all 247 rows at once:
# pct_change() computes (current - previous) / previous for each row df['Daily_Return'] = df['Close'].pct_change() # The first row is always NaN — there's no previous close to compare against print(df['Daily_Return'].head(5)) # Summary stats on returns print("\nMean daily return: ", df['Daily_Return'].mean().round(6)) print("Best day: ", df['Daily_Return'].max().round(4)) print("Worst day: ", df['Daily_Return'].min().round(4))
Date 2024-01-02 NaN 2024-01-03 -0.008918 2024-01-04 0.010237 2024-01-05 -0.001993 2024-01-08 0.009621 Name: Daily_Return, dtype: float64 Mean daily return: 0.000412 Best day: 0.0312 Worst day: -0.0289
The first row is NaN — there's no Day 0 to compare against. Pandas propagates this correctly through all subsequent operations, so you don't need to handle it manually. A mean of 0.000412 per day, compounded over 252 trading days, gives a rough annual return around 11% — in line with NIFTY50's long-term average.
Never drop it prematurely. If you call df.dropna() too early, you'll lose the first row before you've computed all the columns you need. Add df.dropna(inplace=True) at the very end of the pipeline, once all calculations are done.
Log Returns — Why Quants Prefer Them
Simple returns (pct_change) are intuitive, but they have a mathematical problem: they're not additive across time. If the index falls 10% on Monday and rises 10% on Tuesday, you don't end where you started — you end at 99% of the original. Log returns are additive, normally distributed, and the standard input for most risk models:
import numpy as np # Log return = ln(today / yesterday) df['Log_Return'] = np.log(df['Close'] / df['Close'].shift(1)) print(df[['Daily_Return', 'Log_Return']].head(6))
Daily_Return Log_Return Date 2024-01-02 NaN NaN 2024-01-03 -0.008918 -0.008958 2024-01-04 0.010237 0.010185 2024-01-05 -0.001993 -0.001995 2024-01-08 0.009621 0.009575
For small daily moves (under 3%), simple returns and log returns are nearly identical. The difference becomes significant for larger moves or when summing across long periods. For this roadmap I'll compute both and use log returns when feeding models.
shift(1) explained
df['Close'].shift(1) creates a copy of the Close column shifted down by one row — so each row now holds yesterday's close. Dividing today's close by yesterday's and taking np.log() gives the log return. NumPy applies this to all 247 rows in a single vectorised pass.
Rolling Moving Averages (MA20 & MA50)
Moving averages smooth out day-to-day noise to reveal the underlying trend. The 20-day MA (roughly one trading month) captures short-term momentum; the 50-day MA captures medium-term direction. When MA20 crosses above MA50, it's a classic bullish signal — the "golden cross".
# rolling(window).mean() computes the average over the last N rows df['MA20'] = df['Close'].rolling(window=20).mean() df['MA50'] = df['Close'].rolling(window=50).mean() # First 19 rows of MA20 are NaN (not enough history yet) # First 49 rows of MA50 are NaN print(df[['Close', 'MA20', 'MA50']].tail(5)) # Detect a golden cross (MA20 > MA50 today, but MA20 < MA50 yesterday) df['Golden_Cross'] = ( (df['MA20'] > df['MA50']) & (df['MA20'].shift(1) <= df['MA50'].shift(1)) ) crosses = df[df['Golden_Cross']] print(f"\nGolden crosses in dataset: {len(crosses)}")
Close MA20 MA50 Date 2024-12-25 23644.800781 23981.410156 23774.191406 2024-12-26 23813.400391 23961.390625 23789.402344 2024-12-27 23644.300781 23936.480469 23800.613281 2024-12-30 23644.800781 23893.320313 23807.175781 2024-12-31 23644.800781 23851.609375 23811.160156 Golden crosses in dataset: 2
tail(5) vs head(5)
Use head() to check the first rows loaded correctly. Use tail() to verify the most recent calculations — especially for rolling windows, since the first N rows will be NaN and you want to confirm the last rows have real values.
Annualised Volatility
Volatility is the standard deviation of returns — a measure of how much the price bounces around its average. Daily volatility is computed over a rolling window, then annualised by multiplying by √252 (the square root of trading days per year). This converts daily std to an annual figure that's comparable across assets and timeframes.
# Rolling 20-day std of daily returns × √252 = annualised volatility TRADING_DAYS = 252 df['Volatility_20d'] = ( df['Daily_Return'] .rolling(window=20) .std() * np.sqrt(TRADING_DAYS) ) # Current annualised volatility (most recent 20-day window) current_vol = df['Volatility_20d'].iloc[-1] print(f"Current 20d annualised volatility: {current_vol:.2%}") # Average over the full period avg_vol = df['Volatility_20d'].mean() print(f"Average annualised volatility: {avg_vol:.2%}")
Current 20d annualised volatility: 11.82% Average annualised volatility: 12.47%
NIFTY50's realised volatility hovering around 12–13% is well within its historical range. For context: the VIX India equivalent typically spikes above 25% during stress events like the 2020 COVID crash or budget announcements. An annualised vol of 12% means the market is currently calm by historical standards.
Volatility scales with the square root of time, not linearly. If daily std is σ, then annual std is σ × √252. This is the square-root-of-time rule from basic statistics — variance is additive across independent periods, but standard deviation (its square root) is not. This rule underpins everything from Black-Scholes to VaR models.
Full Script — day3_numpy_pandas.py
Here is the complete Day 3 script. It handles both the single-level and multi-level column cases, computes all four new columns, and prints a clean summary table at the end:
""" Day 3 — NumPy + Pandas Foundations AI Architect Roadmap · BFSI Edition 2026 https://allinoneaicenter.com/ai-architect-roadmap/ai-architect-day-3-numpy-pandas """ import os import pandas as pd import numpy as np import yfinance as yf # ─── 1. Load or download NIFTY50 ───────────────────────────────────────── CSV_PATH = "NIFTY50.csv" TRADING_DAYS = 252 if os.path.exists(CSV_PATH): print("CSV already exists. Loading from disk.") else: print("Downloading NIFTY50 data…") raw = yf.download("^NSEI", start="2024-01-01", end="2024-12-31") raw.to_csv(CSV_PATH) print(f"Saved to {CSV_PATH}") # ─── 2. Read CSV — handle both single and multi-level column headers ────── try: df = pd.read_csv(CSV_PATH, parse_dates=["Date"], index_col="Date") # Test that 'Close' column exists _ = df["Close"] except KeyError: # Multi-level header from yfinance — flatten it df = pd.read_csv(CSV_PATH, header=[0, 1], index_col=0) df.columns = df.columns.get_level_values(0) df.index = pd.to_datetime(df.index) df.index.name = "Date" print("Multi-level columns detected and flattened.") print(f"\nLoaded {len(df)} rows | Columns: {df.columns.tolist()}\n") # ─── 3. Compute daily returns ───────────────────────────────────────────── df['Daily_Return'] = df['Close'].pct_change() df['Log_Return'] = np.log(df['Close'] / df['Close'].shift(1)) # ─── 4. Rolling moving averages ─────────────────────────────────────────── df['MA20'] = df['Close'].rolling(window=20).mean() df['MA50'] = df['Close'].rolling(window=50).mean() # ─── 5. Annualised volatility ───────────────────────────────────────────── df['Volatility_20d'] = ( df['Daily_Return'].rolling(window=20).std() * np.sqrt(TRADING_DAYS) ) # ─── 6. Golden cross signal ─────────────────────────────────────────────── df['Golden_Cross'] = ( (df['MA20'] > df['MA50']) & (df['MA20'].shift(1) <= df['MA50'].shift(1)) ) # ─── 7. Drop NaN rows (rolling windows need N rows of history) ─────────── df.dropna(inplace=True) # ─── 8. Print summary ──────────────────────────────────────────────────── print("=== Last 5 rows of pipeline output ===") cols = ['Close', 'Daily_Return', 'MA20', 'MA50', 'Volatility_20d'] pd.options.display.float_format = '{:.4f}'.format print(df[cols].tail(5).to_string()) print("\n=== Return Statistics ===") print(f"Mean daily return : {df['Daily_Return'].mean():.4%}") print(f"Best single day : {df['Daily_Return'].max():.4%}") print(f"Worst single day : {df['Daily_Return'].min():.4%}") print(f"Annualised vol : {df['Volatility_20d'].iloc[-1]:.2%}") print(f"Golden crosses : {df['Golden_Cross'].sum()}") # ─── 9. Save enriched CSV ──────────────────────────────────────────────── output_path = "NIFTY50_enriched.csv" df.to_csv(output_path) print(f"\nEnriched dataset saved to {output_path}")
conda activate ai_dev python day3_numpy_pandas.py CSV already exists. Loading from disk. Loaded 247 rows | Columns: ['Open', 'High', 'Low', 'Close', 'Volume'] === Last 5 rows of pipeline output === Close Daily_Return MA20 MA50 Volatility_20d Date 2024-12-25 23644.8001 -0.0001 23981.4102 23774.1914 0.1204 2024-12-26 23813.4004 0.0071 23961.3906 23789.4023 0.1188 2024-12-27 23644.3008 -0.0071 23936.4805 23800.6133 0.1196 2024-12-30 23644.8001 0.0000 23893.3203 23807.1758 0.1195 2024-12-31 23644.8001 0.0000 23851.6094 23811.1602 0.1182 === Return Statistics === Mean daily return : 0.0412% Best single day : 3.1245% Worst single day : -2.8923% Annualised vol : 11.82% Golden crosses : 2 Enriched dataset saved to NIFTY50_enriched.csv
Day 3 Reflection
The biggest shift today wasn't the syntax — it was the mental model. On Day 2 I thought in rows: for each record, do something. Today I started thinking in columns: transform the entire Close series at once. That column-first thinking is what separates a data scientist who can work at scale from one who's still writing Python loops over DataFrames. The full pipeline — load, flatten, returns, MA, volatility, save — is 40 lines and zero explicit loops.
The multi-level column bug was the Day 3 equivalent of the Day 2 KeyError. Both bugs came from the same source: data that looks fine in a text editor but loads differently than expected when Pandas reads it. The lesson is identical — always call df.info() and df.columns.tolist() right after loading. Don't assume the CSV structure you saved is the structure Pandas will give you back.
What's working well: the try/except KeyError approach to column detection is clean and handles both CSV formats without needing the user to know which yfinance version generated the file. It's the kind of defensive code that survives in production — the kind that doesn't require whoever runs it next to understand how it was generated.
What I want to improve: the output is still all text. Day 4 is where matplotlib enters and the numbers become charts — a candlestick plot with MA20 and MA50 overlaid, and a separate volatility panel below. That's the first time the data will actually look like something a quant or risk analyst would recognise.
What I Built — Deliverables
Everything committed to the project repo at end of Day 3.
I spent 20 minutes confused why df['Close'].mean() returned a single-element Series instead of a scalar. Turned out my CSV had the multi-level header but I hadn't detected it yet — df['Close'] was giving me a DataFrame column with two sub-columns. The fix: always print type(df['Close']) when debugging. If it returns DataFrame instead of Series, you still have a multi-level column hiding somewhere.
What's Coming on Day 4
Day 4 covers Matplotlib + Data Visualisation. I'll plot the NIFTY50 closing price with MA20 and MA50 overlaid, add a separate volatility panel, and produce a chart that looks like something from a Bloomberg terminal. The NIFTY50_enriched.csv from today feeds directly into it — no new downloads needed.