Overview
This document outlines the feature engineering approach for our ML-based options trading model. Our primary objective is predicting short-term (1-5 day) option price movements using a combination of market data, volatility metrics, and derived features.
Data Sources
Primary: Yahoo Finance (yfinance)
We collect the following data through the yfinance API:
- Option Chains: Calls and puts organized by strike price and expiration date
- Greeks: Delta, Gamma, Theta, Vega for each contract (computed from the chain data; yfinance option chains do not return Greeks)
- Implied Volatility: Forward-looking volatility derived from option prices
- Volume & Open Interest: Liquidity and positioning metrics
- Historical Prices: 5+ years of OHLCV data for underlying assets
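
As a rough illustration of the collection step, the sketch below pulls one expiration's chain and the underlying's price history. The function name `fetch_option_data`, its parameters, and the `CHAIN_COLUMNS` tuple are hypothetical; only `Ticker`, `options`, `option_chain`, and `history` are yfinance calls, and the real collector may be organized differently.

```python
# Chain columns this sketch assumes downstream features would read.
CHAIN_COLUMNS = ("strike", "bid", "ask", "impliedVolatility", "volume", "openInterest")


def fetch_option_data(symbol, expiry_index=0, history_period="5y"):
    """Fetch one expiration's option chain plus OHLCV history for `symbol`.

    Hypothetical helper: requires network access and the yfinance package.
    """
    import yfinance as yf  # imported lazily so this module loads without the package

    ticker = yf.Ticker(symbol)
    expirations = ticker.options  # available expiration dates as 'YYYY-MM-DD' strings
    expiry = expirations[expiry_index]
    chain = ticker.option_chain(expiry)  # exposes .calls and .puts DataFrames
    history = ticker.history(period=history_period)  # OHLCV DataFrame
    return {
        "expiry": expiry,
        "calls": chain.calls[list(CHAIN_COLUMNS)],
        "puts": chain.puts[list(CHAIN_COLUMNS)],
        "history": history,
    }
```

Calling `fetch_option_data("SPY")` would return the nearest expiration; looping over `ticker.options` would be needed to cover the full chain.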
Secondary: VIX Index
- Market-wide fear gauge
- Used for volatility regime detection and classification
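
A minimal sketch of threshold-based regime classification follows. The cutoffs of 15 and 25 and the three labels are illustrative assumptions, not the values used in production.

```python
def classify_vix_regime(vix_close, low=15.0, high=25.0):
    """Map a VIX closing level to a coarse regime label.

    The `low` and `high` thresholds are assumed defaults for illustration.
    """
    if vix_close < low:
        return "low"
    if vix_close < high:
        return "normal"
    return "high"


regimes = [classify_vix_regime(v) for v in (12.4, 18.9, 31.2)]
```

The resulting label could be fed to the model as a categorical feature or used to select regime-specific parameters.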
Feature Categories
1. Volatility Metrics
| Feature | Description | Timeframes |
|---|---|---|
| Historical Volatility (HV) | Realized volatility from underlying returns | 10, 20, 30, 60 days |
| Implied Volatility (IV) | Forward-looking from option prices | Current |
| IV Rank | Position of current IV within its 52-week high-low range: (IV - low) / (high - low) | Rolling 252 days |
| IV Percentile | % of days IV was lower than current | Rolling 252 days |
| HV-IV Spread | Difference between realized and implied vol (HV - IV) | Current |
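
The metrics in this table can be sketched with the standard library alone. The version below assumes close-to-close log returns, 252 trading days per year for annualization, and an IV history held as a plain list of daily values; the function names are illustrative.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor


def historical_volatility(closes, window):
    """Annualized close-to-close realized vol over the last `window` returns."""
    if window < 2 or len(closes) < window + 1:
        raise ValueError("need window >= 2 and at least window + 1 closes")
    recent = closes[-(window + 1):]
    log_returns = [math.log(b / a) for a, b in zip(recent, recent[1:])]
    return statistics.stdev(log_returns) * math.sqrt(TRADING_DAYS)


def iv_rank(iv_history, current_iv):
    """Position of current IV inside the min-max range of the lookback window."""
    lo, hi = min(iv_history), max(iv_history)
    if hi == lo:
        return 0.0  # flat history: rank is undefined, report 0 by convention
    return (current_iv - lo) / (hi - lo)


def iv_percentile(iv_history, current_iv):
    """Fraction of days in the lookback window with IV below the current level."""
    return sum(1 for iv in iv_history if iv < current_iv) / len(iv_history)


closes = [100.0, 101.0, 99.5, 100.5, 102.0, 101.0]
current_iv = 0.25
hv_5 = historical_volatility(closes, window=5)
hv_iv_spread = hv_5 - current_iv  # sign convention assumed from the feature name
```

IV Rank and IV Percentile can diverge sharply: a single volatility spike stretches the high-low range and depresses the rank for months, while the percentile is unaffected by the size of the spike.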
2. Volatility Surface Features
- Vol Smile/Skew: How IV varies across strikes at a given expiration
- Term Structure Slope: How IV varies across expirations at a given moneyness
- Butterfly Spread IV: Convexity of the volatility smile
- Risk Reversal: Skew measurement (OTM call IV - OTM put IV at matching delta, commonly 25-delta)
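
The surface features above reduce to simple arithmetic once IVs at reference points are available. This sketch assumes 25-delta wings and an ATM quote have already been interpolated from the chain, and measures the term slope in vol per day; both choices are assumptions for illustration.

```python
def risk_reversal(call_iv_25d, put_iv_25d):
    """Skew: 25-delta call IV minus 25-delta put IV (negative when puts are bid)."""
    return call_iv_25d - put_iv_25d


def butterfly(call_iv_25d, put_iv_25d, atm_iv):
    """Smile convexity: average wing IV minus ATM IV."""
    return 0.5 * (call_iv_25d + put_iv_25d) - atm_iv


def term_structure_slope(near_iv, far_iv, near_days, far_days):
    """Change in IV per day to expiry between two expirations at the same moneyness."""
    if far_days <= near_days:
        raise ValueError("far_days must exceed near_days")
    return (far_iv - near_iv) / (far_days - near_days)


rr = risk_reversal(0.22, 0.27)
bf = butterfly(0.22, 0.27, atm_iv=0.20)
slope = term_structure_slope(0.20, 0.23, near_days=30, far_days=90)
```

A positive slope corresponds to contango in the vol term structure; a negative slope (near-dated IV above far-dated) often accompanies stressed markets.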
3. Greeks-Based Features
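# Example feature derivations (inputs are per-contract Greeks and quotes)
delta_normalized = contract_delta / atm_delta  # delta relative to the at-the-money contract
gamma_dollar = gamma * spot_price * 0.01       # delta change for a 1% move in spot
theta_ratio = theta / option_price             # time decay as a fraction of premium
vega_normalized = vega / implied_vol           # vega per unit of implied vol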
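
One way to obtain the Greek inputs used above is the Black-Scholes closed form, evaluated at each contract's implied volatility. The sketch below assumes a European option on a non-dividend-paying underlying, which is only an approximation for American-style equity options; `bs_greeks` and its scaling conventions (theta per calendar day, vega per vol point) are illustrative choices.

```python
import math


def _norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_greeks(spot, strike, years, rate, iv, is_call=True):
    """Black-Scholes Greeks (European, no dividends: a simplifying assumption)."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * iv * iv) * years) / (iv * sqrt_t)
    d2 = d1 - iv * sqrt_t
    gamma = _norm_pdf(d1) / (spot * iv * sqrt_t)
    vega = spot * _norm_pdf(d1) * sqrt_t  # per 1.00 change in vol
    decay = -spot * _norm_pdf(d1) * iv / (2.0 * sqrt_t)
    discounted_strike = strike * math.exp(-rate * years)
    if is_call:
        delta = _norm_cdf(d1)
        theta = decay - rate * discounted_strike * _norm_cdf(d2)
    else:
        delta = _norm_cdf(d1) - 1.0
        theta = decay + rate * discounted_strike * _norm_cdf(-d2)
    return {
        "delta": delta,
        "gamma": gamma,
        "theta": theta / 365.0,  # per calendar day
        "vega": vega / 100.0,    # per one vol point (0.01)
    }


call = bs_greeks(spot=100.0, strike=100.0, years=0.25, rate=0.02, iv=0.20)
put = bs_greeks(spot=100.0, strike=100.0, years=0.25, rate=0.02, iv=0.20, is_call=False)
```

Under these assumptions call and put share the same gamma and vega, and their deltas differ by exactly 1, which makes a convenient sanity check on any Greeks pipeline.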
4. Market Structure Features
- Volume/OI Ratio: Indicates new vs closing positions
- Put-Call Ratio: Sentiment indicator
- Gamma Exposure (GEX): Aggregate dealer gamma positioning
- Delta Exposure (DEX): Aggregate directional exposure
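
The market structure features can be sketched as aggregations over the chain. The version below assumes each contract is a dict with hypothetical keys `gamma`, `delta`, `open_interest`, and `is_call`, a contract multiplier of 100, and the common (but not universal) GEX convention that dealers are long calls and short puts.

```python
CONTRACT_MULTIPLIER = 100  # shares per contract, assumed for US equity options


def volume_oi_ratio(volume, open_interest):
    """Volume relative to open interest; None when there is no open interest."""
    return volume / open_interest if open_interest else None


def put_call_ratio(put_volume, call_volume):
    """Put volume divided by call volume; None when no calls traded."""
    return put_volume / call_volume if call_volume else None


def gamma_exposure(contracts, spot):
    """Net dollar gamma per 1% spot move, calls counted positive and puts negative."""
    total = 0.0
    for c in contracts:
        sign = 1.0 if c["is_call"] else -1.0
        total += sign * c["gamma"] * c["open_interest"] * CONTRACT_MULTIPLIER * spot * spot * 0.01
    return total


def delta_exposure(contracts, spot):
    """Net dollar delta across the chain (put deltas are already negative)."""
    return sum(c["delta"] * c["open_interest"] * CONTRACT_MULTIPLIER * spot for c in contracts)


chain = [
    {"gamma": 0.05, "delta": 0.6, "open_interest": 1000, "is_call": True},
    {"gamma": 0.04, "delta": -0.4, "open_interest": 500, "is_call": False},
]
gex = gamma_exposure(chain, spot=100.0)
dex = delta_exposure(chain, spot=100.0)
```

Open interest does not reveal who holds which side, so the dealer-positioning sign in `gamma_exposure` is a modeling assumption rather than an observed quantity.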
Data Pipeline Architecture
Raw Data  --> Cleaning   --> Feature Calc    --> Normalization  --> Model Input
   |             |                |                    |
Yahoo API     Handle NaN     Rolling Windows     Z-Score/MinMax
              Bad Ticks      Lagged Features
Pipeline Steps
- Data Collection: Fetch option chains and price history
- Data Cleaning: Handle missing values, filter illiquid contracts
- Feature Calculation: Compute derived features with rolling windows
- Normalization: Scale each feature with z-score or min-max scaling, chosen per feature type
- Feature Selection: Remove highly correlated and low-importance features
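
The cleaning, lagging, and normalization steps above might look like the following sketch. The row keys, liquidity thresholds, and function names are hypothetical, and fitting the scaler on training data only is an assumption made here to stay consistent with the no-look-ahead requirement.

```python
import math
import statistics


def clean_contracts(rows, min_open_interest=10, min_volume=1):
    """Drop rows with missing fields or too little liquidity (thresholds assumed)."""
    kept = []
    for row in rows:
        values = (row.get("iv"), row.get("volume"), row.get("open_interest"))
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
            continue
        if row["open_interest"] < min_open_interest or row["volume"] < min_volume:
            continue
        kept.append(row)
    return kept


def lagged(series, lag):
    """Shift a series back by `lag` steps, padding the start with None."""
    if not 0 <= lag <= len(series):
        raise ValueError("lag must be between 0 and len(series)")
    return [None] * lag + list(series[:len(series) - lag])


def fit_zscore(train_values):
    """Estimate mean and population std from the training window only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_zscore(values, mean, std):
    """Scale values with previously fitted parameters; constant features map to 0."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


mean, std = fit_zscore([1.0, 2.0, 3.0])
scaled_test = apply_zscore([2.0, 4.0], mean, std)
```

Keeping `fit_zscore` and `apply_zscore` separate makes it harder to accidentally compute scaling statistics over the test period.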
Model Targets
We predict the following targets:
| Target | Definition | Horizon |
|---|---|---|
| Direction | Binary up/down classification | 1-5 days |
| Magnitude | Percentage price change | 1-5 days |
| Vol Change | IV change direction and magnitude | 1-5 days |
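
As an illustration of how the Direction and Magnitude targets could be built, the sketch below labels each observation with the forward change over a fixed horizon. Treating a zero change as "down" and the name `build_targets` are assumptions; the same function applied to an IV series would give a Vol Change target.

```python
def build_targets(prices, horizon):
    """Return (directions, magnitudes) for each price that has a value `horizon` steps ahead.

    directions: 1 if the forward change is positive, else 0 (ties count as down here).
    magnitudes: fractional change from today's price to the price `horizon` steps later.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    directions, magnitudes = [], []
    for today, future in zip(prices, prices[horizon:]):
        change = (future - today) / today
        magnitudes.append(change)
        directions.append(1 if change > 0 else 0)
    return directions, magnitudes


option_prices = [2.0, 2.2, 1.98, 2.2]
directions_1d, magnitudes_1d = build_targets(option_prices, horizon=1)
```

The last `horizon` observations have no label and must be dropped from training, otherwise the target would require prices that are not yet known.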
Backtesting Framework
- Walk-Forward Validation: Train on expanding window, test on next period
- No Look-Ahead Bias: Features computed only with data available at trade time
- Transaction Costs: Include realistic spreads and commissions
- Slippage Model: Account for market impact on execution
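
A minimal sketch of expanding-window walk-forward splits with a simple cost adjustment is shown below. The split sizes, the flat per-side cost model, and the function names are illustrative assumptions rather than the framework's actual configuration.

```python
def walk_forward_splits(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Every test block starts where its training window ends, so no training
    index is ever later than a test index.
    """
    train_end = initial_train
    while train_end + test_size <= n_samples:
        yield range(0, train_end), range(train_end, train_end + test_size)
        train_end += test_size


def net_return(gross_return, half_spread, commission, slippage):
    """Subtract round-trip costs, each given as a per-side fraction of entry price."""
    return gross_return - 2 * (half_spread + commission + slippage)


splits = list(walk_forward_splits(n_samples=10, initial_train=4, test_size=2))
net = net_return(0.05, half_spread=0.01, commission=0.001, slippage=0.002)
```

With 10 samples this produces three folds, training on indices 0-3, 0-5, and 0-7 and testing on the two samples that follow each window. Option spreads are often wide relative to premium, so the cost terms can dominate the gross return at short horizons.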
Current Model Performance
Note: Live model metrics are updated daily in our monitoring dashboard.
- Training Period: 2021-01 to 2025-12
- Validation Period: 2026-01 to present
- Features in production: 47
- Model architecture: Gradient Boosted Trees (XGBoost)
This documentation is maintained by the quant agent and updated as features evolve.