Supervised ML for Return Prediction
Supervised learning trains a model on labeled historical data (features paired with observed outcomes) to predict future outcomes. For return prediction, the features might include momentum signals, valuation ratios, quality metrics, and alternative data, and the labels are the subsequent 1-month returns. Algorithms range from linear regression (the usual baseline) to random forests (ensembles of decision trees), gradient boosted trees (XGBoost, LightGBM), and deep neural networks. Research often finds that gradient boosted trees match or outperform neural networks on tabular financial data, although the ranking depends on the dataset and on how the models are tuned. A likely explanation is that feature interactions in financial datasets are more structured than those in image or language domains.
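The labeling step described above can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: it assumes a single asset, a single momentum feature, and monthly observations, and the names make_supervised_pairs and fit_linear_baseline as well as the toy numbers are hypothetical. It pairs the feature observed in month t with the return realised in month t+1, then fits the linear-regression baseline by ordinary least squares.

```python
from statistics import mean

def make_supervised_pairs(features, returns):
    """Pair the feature observed in month t with the return realised in month t+1."""
    if len(features) != len(returns):
        raise ValueError("features and returns must have the same length")
    # The final month has no subsequent return yet, so it receives no label.
    return [(features[t], returns[t + 1]) for t in range(len(returns) - 1)]

def fit_linear_baseline(pairs):
    """One-feature ordinary least squares: label = intercept + slope * feature."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    var_x = sum((x - x_bar) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("feature has zero variance")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / var_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical toy data: one momentum signal per month and that month's return.
momentum = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02]
monthly_returns = [0.010, 0.015, -0.005, 0.020, 0.000, 0.005]

pairs = make_supervised_pairs(momentum, monthly_returns)
intercept, slope = fit_linear_baseline(pairs)

# Forecast for the month after the last observation, using the latest feature value.
latest_prediction = intercept + slope * momentum[-1]
```

The one-month shift between feature and label is the part that matters: a tree ensemble or neural network would consume the same (feature, next-month return) pairs, only with many features per row instead of one.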
Cross-validation on financial time series requires specific treatment because the observations are time-ordered and not independent. Standard k-fold cross-validation shuffles the data randomly, so a model can be trained on observations that come after the ones it is tested on. In effect it uses future information to predict the past, which is a form of lookahead bias. Walk-forward cross-validation, which trains only on data preceding each test period, is the correct approach. Its cost is that it requires more data and produces fewer test windows than standard cross-validation, which reduces the statistical power of model evaluation.
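A walk-forward split can be sketched as follows. This is one possible variant under stated assumptions: it uses an expanding training window (every observation before the test period), equally sized test windows placed at the end of the sample, and observations already sorted by time. The function name walk_forward_splits and its parameters are hypothetical.

```python
def walk_forward_splits(n_samples, n_test_windows, test_size, min_train_size=1):
    """Yield (train_indices, test_indices); every train index precedes every test index."""
    first_test_start = n_samples - n_test_windows * test_size
    if first_test_start < min_train_size:
        raise ValueError("not enough data for the requested test windows")
    for k in range(n_test_windows):
        test_start = first_test_start + k * test_size
        # Expanding window: train on all observations before the test period.
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Twelve monthly observations, three test windows of two months each.
splits = list(walk_forward_splits(n_samples=12, n_test_windows=3, test_size=2))

for train_idx, test_idx in splits:
    # No training observation comes after the start of its test window.
    assert max(train_idx) < min(test_idx)
```

With these settings, twelve observations yield only three test windows covering six months in total, which illustrates the loss of statistical power noted above. scikit-learn's TimeSeriesSplit provides a comparable expanding-window splitter for projects that already depend on that library.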