ML Feature Engineering for Options Trading
Overview

This document outlines the feature engineering approach for our ML-based options trading model. Our primary objective is to predict short-term (1-5 day) option price movements using a combination of market data, volatility metrics, and derived features.

Data Sources

Primary: Yahoo Finance (yfinance)

We collect the following data through the yfinance API:

- Option Chains: Calls and puts organized by strike price and expiration date
- Greeks: Delta, Gamma, Theta, Vega for each contract (yfinance option chains do not report Greeks, so these have to be computed, e.g. from each contract's implied volatility with a pricing model)
- Implied Volatility: Forward-looking volatility derived from option prices
- Volume & Open Interest: Liquidity and positioning metrics
- Historical Prices: 5+ years of OHLCV data for underlying assets

Secondary: VIX Index

- Market-wide fear gauge
- Used for volatility regime detection and classification

Feature Categories

1. Volatility Metrics

Feature                     Description                                               Timeframes
--------------------------  --------------------------------------------------------  -------------------
Historical Volatility (HV)  Realized volatility from underlying returns               10, 20, 30, 60 days
Implied Volatility (IV)     Forward-looking volatility from option prices             Current
IV Rank                     Position of current IV within its 52-week high-low range  Rolling 252 days
IV Percentile               % of days IV was lower than current                       Rolling 252 days
HV-IV Spread                Difference between realized and implied vol               Current

2. Volatility Surface Features

- Vol Smile/Skew: How IV varies across strikes at a given expiration
- Term Structure Slope: How IV varies across expirations at a given moneyness
- Butterfly Spread IV: Convexity of the volatility smile
- Risk Reversal: Skew measurement (call IV - put IV)

3. Greeks-Based Features

    # Example feature derivations (inputs are per-contract values)
    delta_normalized = contract_delta / atm_delta  # delta relative to the at-the-money contract
    gamma_dollar = gamma * spot_price * 0.01       # change in delta for a 1% move in spot
    theta_ratio = theta / option_price             # time decay as a fraction of the option price
    vega_normalized = vega / implied_vol           # vega relative to the current IV level
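The volatility metrics in category 1 can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not our production code. The function names, the 252-day annualization factor, and the use of close-to-close log returns are assumptions made for the example. IV Rank is taken here as the position of current IV within the high-low range of the lookback, which is what distinguishes it from IV Percentile.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor


def historical_vol(closes, window):
    """Annualized close-to-close volatility over the last `window` returns."""
    if window < 2 or len(closes) < window + 1:
        raise ValueError("need window >= 2 and at least window + 1 prices")
    recent = closes[-(window + 1):]
    log_returns = [math.log(b / a) for a, b in zip(recent, recent[1:])]
    return statistics.stdev(log_returns) * math.sqrt(TRADING_DAYS)


def iv_rank(iv_history, current_iv):
    """Position of current IV within the lookback high-low range, 0 to 100."""
    lo, hi = min(iv_history), max(iv_history)
    if hi == lo:
        return 0.0  # flat history: the rank is undefined, report 0 by convention
    return 100.0 * (current_iv - lo) / (hi - lo)


def iv_percentile(iv_history, current_iv):
    """Percentage of lookback days on which IV was below the current value."""
    below = sum(1 for iv in iv_history if iv < current_iv)
    return 100.0 * below / len(iv_history)


def hv_iv_spread(hv, iv):
    """Realized minus implied volatility, both annualized."""
    return hv - iv
```

For the rolling 252-day variants, `iv_history` would be the trailing 252 daily IV observations, and `historical_vol` would be called once per window length (10, 20, 30, 60).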
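The volatility surface features in category 2 reduce to simple arithmetic once IV has been read off the surface at a few reference points. The sketch below assumes the common convention of quoting the smile at the 25-delta put, the at-the-money strike, and the 25-delta call. That choice, and all function and parameter names, are assumptions for illustration rather than something this document specifies.

```python
def risk_reversal(call_iv, put_iv):
    """Skew: out-of-the-money call IV minus out-of-the-money put IV."""
    return call_iv - put_iv


def butterfly(call_iv, put_iv, atm_iv):
    """Smile convexity: average wing IV minus at-the-money IV."""
    return 0.5 * (call_iv + put_iv) - atm_iv


def term_structure_slope(near_iv, far_iv, near_days, far_days):
    """Change in IV per day to expiry between two expirations at the same moneyness."""
    if far_days == near_days:
        raise ValueError("expirations must differ")
    return (far_iv - near_iv) / (far_days - near_days)


# Example: a smile with expensive puts (negative risk reversal).
rr = risk_reversal(call_iv=0.22, put_iv=0.28)
bf = butterfly(call_iv=0.22, put_iv=0.28, atm_iv=0.20)
slope = term_structure_slope(near_iv=0.20, far_iv=0.26, near_days=30, far_days=90)
```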
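If the Greeks are not supplied by the data source, they can be derived from each contract's implied volatility. The following is a sketch under the Black-Scholes model for a European option on a non-dividend-paying underlying. The model choice, the `bs_greeks` name, the `Greeks` container, and the per-calendar-day theta convention are all assumptions for illustration. American-style equity options and dividends would need a different treatment.

```python
import math
from dataclasses import dataclass


@dataclass
class Greeks:
    delta: float
    gamma: float
    theta: float  # per calendar day
    vega: float   # per 1.00 change in sigma (100 vol points)


def _norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bs_greeks(spot, strike, t_years, sigma, rate=0.0, kind="call"):
    """Black-Scholes Greeks for a European option, no dividends."""
    if min(spot, strike, t_years, sigma) <= 0:
        raise ValueError("spot, strike, t_years and sigma must be positive")
    if kind not in ("call", "put"):
        raise ValueError("kind must be 'call' or 'put'")
    sqrt_t = math.sqrt(t_years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * t_years) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t
    pdf_d1 = _norm_pdf(d1)
    discount = math.exp(-rate * t_years)
    gamma = pdf_d1 / (spot * sigma * sqrt_t)       # same for calls and puts
    vega = spot * pdf_d1 * sqrt_t                  # same for calls and puts
    decay = -spot * pdf_d1 * sigma / (2.0 * sqrt_t)
    if kind == "call":
        delta = _norm_cdf(d1)
        theta = decay - rate * strike * discount * _norm_cdf(d2)
    else:
        delta = _norm_cdf(d1) - 1.0
        theta = decay + rate * strike * discount * _norm_cdf(-d2)
    return Greeks(delta, gamma, theta / 365.0, vega)


call = bs_greeks(spot=100.0, strike=100.0, t_years=1.0, sigma=0.2)
put = bs_greeks(spot=100.0, strike=100.0, t_years=1.0, sigma=0.2, kind="put")
```

The resulting values can be fed directly into the Greeks-based derivations in category 3, for example `call.gamma * spot * 0.01`.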
4. Market Structure Features

- Volume/OI Ratio: Indicates new vs closing positions
- Put-Call Ratio: Sentiment indicator
- Gamma Exposure (GEX): Aggregate dealer gamma positioning
- Delta Exposure (DEX): Aggregate directional exposure

Data Pipeline Architecture

    Raw Data --> Cleaning --> Feature Calc --> Normalization --> Model Input
        |           |               |                |
    Yahoo API   Handle NaN   Rolling Windows   Z-Score/MinMax
                Bad Ticks    Lagged Features

Pipeline Steps

- Data Collection: Fetch option chains and price history
- Data Cleaning: Handle missing values, filter illiquid contracts
- Feature Calculation: Compute derived features with rolling windows
- Normalization: Apply appropriate scaling for each feature type
- Feature Selection: Remove highly correlated and low-importance features

Model Targets

We predict the following targets: ...
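Returning to the market structure features listed earlier, the two simplest ones can be computed directly from an option chain. The sketch below represents each contract as a plain dictionary. The keys (`type`, `volume`, `open_interest`) and the function names are hypothetical and chosen only for this example.

```python
def volume_oi_ratio(volume, open_interest):
    """Volume relative to open interest; None when open interest is zero."""
    if open_interest == 0:
        return None
    return volume / open_interest


def put_call_ratio(contracts, field="volume"):
    """Total put `field` divided by total call `field`; None if calls sum to zero."""
    puts = sum(c[field] for c in contracts if c["type"] == "put")
    calls = sum(c[field] for c in contracts if c["type"] == "call")
    return puts / calls if calls else None


# Hypothetical two-contract chain for illustration.
chain = [
    {"type": "call", "volume": 100, "open_interest": 400},
    {"type": "put", "volume": 150, "open_interest": 300},
]
pcr_volume = put_call_ratio(chain)                 # puts vs calls by volume
pcr_oi = put_call_ratio(chain, "open_interest")    # puts vs calls by open interest
```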
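The cleaning, feature calculation, and normalization steps of the pipeline described above can be sketched with the standard library alone. Everything here is illustrative: the helper names, the `open_interest` key, and the `min_open_interest=10` liquidity threshold are assumptions, not values this document specifies.

```python
import math
import statistics


def clean(rows, min_open_interest=10):
    """Drop rows with missing values, then filter out illiquid contracts."""
    kept = []
    for row in rows:
        missing = any(
            v is None or (isinstance(v, float) and math.isnan(v))
            for v in row.values()
        )
        if missing or row["open_interest"] < min_open_interest:
            continue
        kept.append(row)
    return kept


def rolling_mean(series, window):
    """Trailing mean; None until `window` observations are available."""
    return [
        None if i + 1 < window else statistics.fmean(series[i + 1 - window:i + 1])
        for i in range(len(series))
    ]


def lag(series, periods=1):
    """Shift values forward in time so each position only sees past data."""
    return ([None] * periods + list(series))[:len(series)]


def zscore(series):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0 for _ in series]
    return [(x - mean) / std for x in series]


def minmax(series):
    """Rescale to the [0, 1] interval."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]
```

For time-series data, the mean, standard deviation, minimum, and maximum used for scaling would normally be fitted on the training window only and then reused on later data, so that no future information leaks into the features.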