Overview

This document outlines the feature engineering approach for our ML-based options trading model. Our primary objective is predicting short-term (1-5 day) option price movements using a combination of market data, volatility metrics, and derived features.


Data Sources

Primary: Yahoo Finance (yfinance)

We collect the following data through the yfinance API:

  • Option Chains: Calls and puts organized by strike price and expiration date
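  • Greeks: Delta, Gamma, Theta, Vega for each contract (yfinance option chains do not include Greeks, so these have to be derived, for example from implied volatility with a Black-Scholes model)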
  • Implied Volatility: Forward-looking volatility derived from option prices
  • Volume & Open Interest: Liquidity and positioning metrics
  • Historical Prices: 5+ years of OHLCV data for underlying assets

Secondary: VIX Index

  • Market-wide fear gauge
  • Used for volatility regime detection and classification

Feature Categories

1. Volatility Metrics

| Feature | Description | Timeframes |
|---|---|---|
| Historical Volatility (HV) | Realized volatility from underlying returns | 10, 20, 30, 60 days |
| Implied Volatility (IV) | Forward-looking volatility from option prices | Current |
| IV Rank | Position of current IV within its 52-week high-low range | Rolling 252 days |
| IV Percentile | % of days IV was lower than current | Rolling 252 days |
| HV-IV Spread | Difference between realized and implied vol | Current |
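The metrics in this table can be sketched in a few lines of standard-library Python. The example assumes daily closes, a 252-day annualization factor, and IV histories passed as plain lists; the function names and sample numbers are hypothetical.

```python
import math
from statistics import stdev

TRADING_DAYS = 252  # annualization factor (assumption: daily bars)


def historical_vol(closes: list[float], window: int) -> float:
    """Annualized standard deviation of the last `window` daily log returns."""
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closes")
    recent = closes[-(window + 1):]
    log_returns = [math.log(b / a) for a, b in zip(recent, recent[1:])]
    return stdev(log_returns) * math.sqrt(TRADING_DAYS)


def iv_rank(iv_history: list[float], current_iv: float) -> float:
    """Position of current IV inside the lookback min-max range, scaled 0-100."""
    lo, hi = min(iv_history), max(iv_history)
    if hi == lo:
        return 0.0
    return 100.0 * (current_iv - lo) / (hi - lo)


def iv_percentile(iv_history: list[float], current_iv: float) -> float:
    """Share of lookback days whose IV was below the current IV, scaled 0-100."""
    below = sum(1 for iv in iv_history if iv < current_iv)
    return 100.0 * below / len(iv_history)


# Hypothetical inputs: a short IV history with one spike, and six closes.
iv_history = [0.20, 0.22, 0.24, 0.60]
current_iv = 0.30
rank = iv_rank(iv_history, current_iv)
pct = iv_percentile(iv_history, current_iv)

closes = [100.0, 101.0, 99.5, 100.5, 102.0, 101.2]
hv_5 = historical_vol(closes, window=5)
hv_iv_spread = hv_5 - current_iv  # realized minus implied
```

With this sample history the single spike to 0.60 pulls IV Rank down to about 25 while IV Percentile is 75, which is why the two are kept as separate features.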

2. Volatility Surface Features

  • Vol Smile/Skew: How IV varies across strikes at a given expiration
  • Term Structure Slope: How IV varies across expirations at a given moneyness
  • Butterfly Spread IV: Convexity of the volatility smile
  • Risk Reversal: Skew measurement (OTM call IV - OTM put IV at matching delta, e.g. 25-delta)
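A minimal sketch of three of these surface features, assuming the wing IVs have already been read off the surface at matching deltas (25-delta is a common choice, used here as an assumption) and that maturities are given in calendar days. All names and sample values are hypothetical.

```python
def risk_reversal(call_iv: float, put_iv: float) -> float:
    """Skew as call IV minus put IV (same expiration, matching delta)."""
    return call_iv - put_iv


def butterfly(call_iv: float, put_iv: float, atm_iv: float) -> float:
    """Smile convexity: average wing IV minus at-the-money IV."""
    return 0.5 * (call_iv + put_iv) - atm_iv


def term_structure_slope(near_iv: float, far_iv: float,
                         near_days: int, far_days: int) -> float:
    """IV change per year of additional maturity between two expirations."""
    return (far_iv - near_iv) / ((far_days - near_days) / 365.0)


# Hypothetical 25-delta wing and ATM IVs for one expiration.
rr = risk_reversal(call_iv=0.18, put_iv=0.24)
bf = butterfly(call_iv=0.18, put_iv=0.24, atm_iv=0.20)

# Hypothetical ATM IVs at 30 and 90 days to expiration.
slope = term_structure_slope(near_iv=0.22, far_iv=0.20, near_days=30, far_days=90)
```

A negative risk reversal means puts are priced richer than calls, and a negative slope indicates an inverted (backwardated) term structure.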

3. Greeks-Based Features

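# Example feature derivations (inputs are per-contract Greeks, spot price, and IV)
delta_normalized = contract_delta / atm_delta  # delta relative to the at-the-money contract
gamma_dollar = gamma * spot_price * 0.01       # change in delta for a 1% move in spot
theta_ratio = theta / option_price             # time decay as a fraction of premium
vega_normalized = vega / implied_vol           # vega per unit of implied volatility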
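The inputs to these derivations have to come from somewhere. One possible route, shown here as a sketch rather than the pipeline's actual method, is to compute Greeks from implied volatility with the Black-Scholes formulas for a European call on a non-dividend-paying underlying. The helper `bs_call_greeks` and the sample contract are hypothetical.

```python
import math


def norm_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_greeks(spot: float, strike: float, t_years: float,
                   rate: float, vol: float) -> dict[str, float]:
    """Black-Scholes price and Greeks for a European call, no dividends."""
    sqrt_t = math.sqrt(t_years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t_years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * t_years)
    return {
        "price": spot * norm_cdf(d1) - strike * discount * norm_cdf(d2),
        "delta": norm_cdf(d1),
        "gamma": norm_pdf(d1) / (spot * vol * sqrt_t),
        "vega": spot * norm_pdf(d1) * sqrt_t,  # per 1.00 change in vol
        "theta": (-spot * norm_pdf(d1) * vol / (2.0 * sqrt_t)
                  - rate * strike * discount * norm_cdf(d2)),  # per year
    }


# Hypothetical 30-day contract, 5% out of the money, plus its ATM reference.
spot, vol = 100.0, 0.25
contract = bs_call_greeks(spot, strike=105.0, t_years=30 / 365, rate=0.04, vol=vol)
atm = bs_call_greeks(spot, strike=100.0, t_years=30 / 365, rate=0.04, vol=vol)

features = {
    "delta_normalized": contract["delta"] / atm["delta"],
    "gamma_dollar": contract["gamma"] * spot * 0.01,
    "theta_ratio": contract["theta"] / contract["price"],
    "vega_normalized": contract["vega"] / vol,
}
```

American-style equity options and dividend-paying underlyings would need a different pricing model; the feature formulas themselves stay the same.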

4. Market Structure Features

  • Volume/OI Ratio: Indicates new vs closing positions
  • Put-Call Ratio: Sentiment indicator
  • Gamma Exposure (GEX): Aggregate dealer gamma positioning
  • Delta Exposure (DEX): Aggregate directional exposure
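The market structure features above can be sketched as simple aggregations over a chain. Conventions for GEX and DEX vary; this example assumes a 100-share contract multiplier, scales gamma to dollars per 1% move in spot, and signs calls positive and puts negative for GEX (that is, it assumes dealers are long calls and short puts). The `Contract` type and sample chain are hypothetical.

```python
from dataclasses import dataclass

CONTRACT_MULTIPLIER = 100  # shares per contract (assumption: standard US equity options)


@dataclass
class Contract:
    kind: str           # "call" or "put"
    volume: int
    open_interest: int
    gamma: float
    delta: float        # negative for puts


def volume_oi_ratio(c: Contract) -> float:
    """Today's volume relative to existing open interest."""
    return c.volume / c.open_interest if c.open_interest else float("nan")


def put_call_ratio(chain: list[Contract]) -> float:
    """Put volume divided by call volume."""
    put_vol = sum(c.volume for c in chain if c.kind == "put")
    call_vol = sum(c.volume for c in chain if c.kind == "call")
    return put_vol / call_vol if call_vol else float("nan")


def gamma_exposure(chain: list[Contract], spot: float) -> float:
    """Signed dollar gamma per 1% move in spot (calls +, puts -)."""
    total = 0.0
    for c in chain:
        sign = 1.0 if c.kind == "call" else -1.0
        total += sign * c.gamma * c.open_interest * CONTRACT_MULTIPLIER * spot ** 2 * 0.01
    return total


def delta_exposure(chain: list[Contract], spot: float) -> float:
    """Aggregate dollar delta implied by open interest."""
    return sum(c.delta * c.open_interest * CONTRACT_MULTIPLIER * spot for c in chain)


chain = [
    Contract("call", volume=500, open_interest=1000, gamma=0.02, delta=0.5),
    Contract("put", volume=250, open_interest=2000, gamma=0.005, delta=-0.4),
]
pcr = put_call_ratio(chain)
gex = gamma_exposure(chain, spot=100.0)
dex = delta_exposure(chain, spot=100.0)
```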

Data Pipeline Architecture

Raw Data --> Cleaning --> Feature Calc --> Normalization --> Model Input
   |            |             |               |
Yahoo API   Handle NaN   Rolling Windows   Z-Score/MinMax
            Bad Ticks    Lagged Features

Pipeline Steps

  1. Data Collection: Fetch option chains and price history
  2. Data Cleaning: Handle missing values, filter illiquid contracts
  3. Feature Calculation: Compute derived features with rolling windows
  4. Normalization: Apply appropriate scaling for each feature type
  5. Feature Selection: Remove highly correlated and low-importance features
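Steps 4 and 5 can be illustrated with a short sketch: a z-score scaler whose statistics are fitted on training data only, and a greedy filter that drops any feature too correlated with one already kept. The 0.95 threshold, the function names, and the sample feature columns are assumptions for the example, not the production configuration.

```python
import math
from statistics import mean, pstdev


def fit_zscore(train: list[float]) -> tuple[float, float]:
    """Return (mean, std) from training data; std falls back to 1.0 if constant."""
    mu, sigma = mean(train), pstdev(train)
    if sigma == 0:
        sigma = 1.0
    return mu, sigma


def apply_zscore(values: list[float], mu: float, sigma: float) -> list[float]:
    """Scale values with statistics fitted elsewhere (e.g. on the training window)."""
    return [(v - mu) / sigma for v in values]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def drop_correlated(features: dict[str, list[float]], threshold: float = 0.95) -> list[str]:
    """Keep features in order, skipping any that correlate too strongly with a kept one."""
    kept: list[str] = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns: hv_30 is nearly a multiple of hv_20.
features = {
    "hv_20": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hv_30": [2.0, 4.0, 6.0, 8.0, 10.1],
    "put_call_ratio": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = drop_correlated(features)

mu, sigma = fit_zscore(features["hv_20"][:3])             # fit on the training slice only
scaled_test = apply_zscore(features["hv_20"][3:], mu, sigma)
```

Fitting the scaler on the training slice and reusing its statistics on later data keeps the normalization step free of look-ahead.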

Model Targets

We predict the following targets:

| Target | Definition | Horizon |
|---|---|---|
| Direction | Binary up/down classification | 1-5 days |
| Magnitude | Percentage price change | 1-5 days |
| Vol Change | IV change direction and magnitude | 1-5 days |
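One way to build these labels, sketched under the assumption that option prices and IVs are aligned daily series for a single contract. Magnitude is returned as a fraction (multiply by 100 for a percentage), and the function names and sample series are hypothetical.

```python
def forward_return(prices: list[float], t: int, horizon: int) -> float:
    """Fractional price change from day t to day t + horizon."""
    return prices[t + horizon] / prices[t] - 1.0


def make_targets(option_prices: list[float], ivs: list[float], horizon: int) -> list[dict]:
    """Build one label row per day that has a full horizon of future data."""
    rows = []
    for t in range(len(option_prices) - horizon):
        magnitude = forward_return(option_prices, t, horizon)
        rows.append({
            "t": t,
            "direction": int(magnitude > 0),          # 1 = up, 0 = down or flat
            "magnitude": magnitude,
            "vol_change": ivs[t + horizon] - ivs[t],  # signed IV change
        })
    return rows


# Hypothetical four-day series for one contract.
option_prices = [2.00, 2.20, 1.98, 2.50]
ivs = [0.20, 0.22, 0.21, 0.25]
targets_1d = make_targets(option_prices, ivs, horizon=1)
```

The last `horizon` days produce no rows because their future prices are not yet known; those days must also be excluded from training.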

Backtesting Framework

  • Walk-Forward Validation: Train on expanding window, test on next period
  • No Look-Ahead Bias: Features computed only with data available at trade time
  • Transaction Costs: Include realistic spreads and commissions
  • Slippage Model: Account for market impact on execution
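The walk-forward scheme can be sketched as a generator of expanding-window index splits, paired with a simple cost adjustment. The cost model (half the quoted spread plus a commission, paid on both entry and exit, expressed as fractions of price) is an illustrative assumption, and the function names are hypothetical.

```python
from typing import Iterator


def expanding_window_splits(n_samples: int, initial_train: int,
                            test_size: int) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges; train always starts at 0 and grows."""
    train_end = initial_train
    while train_end + test_size <= n_samples:
        yield range(0, train_end), range(train_end, train_end + test_size)
        train_end += test_size


def net_return(gross_return: float, half_spread: float, commission: float) -> float:
    """Subtract round-trip costs, paid once on entry and once on exit."""
    return gross_return - 2.0 * (half_spread + commission)


# Ten samples, four in the first training window, two per test period.
splits = list(expanding_window_splits(n_samples=10, initial_train=4, test_size=2))
net = net_return(gross_return=0.05, half_spread=0.01, commission=0.001)
```

Every test range starts where its training range ends, so no test observation is ever visible during training.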

Current Model Performance

Note: Live model metrics are updated daily in our monitoring dashboard.

  • Training Period: 2021-01 to 2025-12
  • Validation Period: 2026-01 to present
  • Features in production: 47
  • Model architecture: Gradient Boosted Trees (XGBoost)

This documentation is maintained by the quant agent and updated as features evolve.