Backtesting Futures Strategies: What the Numbers Actually Tell You and When They Lie
Overview #
Backtesting is the most dangerous tool in a futures trader's arsenal. Not because it doesn't work — it does, under specific conditions — but because it produces convincing numbers that feel like evidence when they're often just artifacts of how the test was constructed.
A backtest evaluates a repeatable trading rule applied to historical market data under a set of execution assumptions. It answers one narrow question: "If I had traded these exact rules, with these exact fills, on this exact data, what would have happened?" That question is useful. The problem is that most traders hear a different question: "Will this strategy make money going forward?" Backtesting cannot answer that.
This article covers what backtesting actually tests, how platform engines work under the hood, the pitfalls that produce false confidence, and the validation methods that separate useful results from noise.
What Backtesting Tests -- and What It Does Not #
A backtest validates three things: your signal logic (do these entry/exit conditions fire when you expect?), your risk management mechanics (do stops, targets, and position sizing behave correctly?), and your execution model (fills, commissions, slippage as you've configured them).
What a backtest does not validate: whether the market will behave similarly in the future, whether your orders will actually get filled at the prices assumed, whether your operational latency will match the test assumptions, and whether your strategy's edge survives real market impact from your own orders.
As Big Mike explained on NexusFi [1], "Out of sample data is critical for a meaningful backtest, yet most traders don't do it. Out of sample means that you would backtest on one period of data and then once you feel you have the strategy the way you want it, you would then test it on OOS data and compare the results. If the results are drastically different, you have curve fitted your strategy to in sample data and the strategy is worthless going forward."
The fundamental issue is that most backtests validate signal logic plus an optimistic execution model. Unless you explicitly model execution reality — slippage distributions, partial fills, queue position, latency — your results tell you what would have happened in a world that doesn't exist.
Bar-Level vs. Tick-Level Engines #
The backtesting engine your platform uses determines how it steps through historical data and resolves order fills. This mechanical difference produces materially different results on the same strategy.
Bar-level engines evaluate trades using OHLCV candles. When your strategy has both a stop and a target that could be hit within the same bar, the engine faces an impossible question: which one triggered first? The bar's open, high, low, and close don't tell you the path price took during that bar. Take a short position with a stop at 5498 and a target at 5496: a 1-minute ES bar showing High=5500, Low=5495 contains both levels, but it doesn't reveal whether price hit the stop before reaching the target or vice versa.
Platforms handle this ambiguity with conventions — some assume the worst case, some assume the best case, some use "stop first then target." These conventions silently bias your results. If you're running a strategy where stops and targets frequently compete within the same bar, the engine's convention is determining your results more than your strategy logic.
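To make the effect of these conventions concrete, here is a minimal sketch of how a bar-level engine might resolve a bar that contains both exit levels. It assumes the stop at 5498 and target at 5496 belong to a short position; the `Bar` class, the `resolve_short_exit` function, and the convention names are hypothetical and not taken from any specific platform.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float


def resolve_short_exit(bar: Bar, stop: float, target: float,
                       convention: str = "worst_case") -> str:
    """Return 'stop', 'target', or 'none' for a short position within one bar."""
    stop_hit = bar.high >= stop      # a short's stop sits above the market
    target_hit = bar.low <= target   # a short's target sits below the market
    if stop_hit and target_hit:
        # OHLC alone cannot order the two events, so the convention decides.
        if convention == "worst_case":
            return "stop"
        if convention == "best_case":
            return "target"
        raise ValueError(f"unknown convention: {convention}")
    if stop_hit:
        return "stop"
    if target_hit:
        return "target"
    return "none"


# Illustrative bar: only High and Low come from the example in the text.
bar = Bar(open=5497.0, high=5500.0, low=5495.0, close=5496.5)
worst = resolve_short_exit(bar, stop=5498.0, target=5496.0, convention="worst_case")
best = resolve_short_exit(bar, stop=5498.0, target=5496.0, convention="best_case")
print(worst, best)
```

The same bar yields a loss under one convention and a win under the other, which is why the convention needs to be known before the results are interpreted.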
Tick-level engines step through each trade or quote update, which eliminates most path ambiguity. However, "tick data" on most retail platforms means last-trade ticks — not the full order book with depth. You get better trigger timing, but you still lack the information needed to model queue position accurately. For limit orders, knowing that price traded at your level doesn't tell you whether your order would have been filled — that depends on where you were in the queue.
The practical difference: bar-level backtests on strategies with tight stops and targets are unreliable. Tick-level backtests are more realistic for trigger timing but still optimistic for limit order fills. Neither engine truly models the execution reality of a live market.
Fill Assumptions: Where Backtests Silently Diverge #
The fill model is where most backtests lie. Not intentionally — the platform simply doesn't know enough about real market microstructure to fill orders accurately.
Market orders are typically filled at the next available price (next tick or next bar open). In reality, market orders during fast moves can slip multiple ticks, and the slippage is not constant — it's a function of volatility, spread, and liquidity at that moment.
Limit orders are filled when price "touches" the limit level in most engines. This is deeply optimistic. In a live market, your limit order sits in a queue behind every order placed before yours at that price. Price touching your level doesn't mean you get filled — it means the orders ahead of you get filled. If the market reverses before reaching your position in the queue, you get nothing.
Stop orders are triggered at the stop price and filled at the next available price. Most engines assume the fill occurs at the trigger price, ignoring the gap between trigger and fill that widens during volatile conditions.
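The limit-order optimism described above can be illustrated by comparing two fill rules. This is a sketch, not any platform's actual logic: `limit_buy_filled` and the rule names are hypothetical, and "trade_through" (requiring price to trade at least one tick beyond the limit) is one common conservative stand-in for queue effects.

```python
def limit_buy_filled(bar_low: float, limit_price: float, tick_size: float,
                     rule: str = "trade_through") -> bool:
    """Decide whether a resting buy limit is treated as filled."""
    if rule == "touch":
        # Optimistic: any print at the limit price counts as a fill.
        return bar_low <= limit_price
    if rule == "trade_through":
        # Conservative: price must trade at least one tick through the limit.
        return bar_low <= limit_price - tick_size
    raise ValueError(f"unknown rule: {rule}")


TICK = 0.25  # ES tick size
lows = [5495.25, 5495.00, 5494.75]  # illustrative lows of three bars
touch_fills = sum(limit_buy_filled(low, 5495.00, TICK, "touch") for low in lows)
through_fills = sum(limit_buy_filled(low, 5495.00, TICK, "trade_through") for low in lows)
print(touch_fills, through_fills)
```

In this toy data the touch rule reports two fills and the trade-through rule reports one. A strategy whose results change materially between the two rules is sensitive to the fill model.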
As RM99 noted on NexusFi [6], "Many people do not trust the results of a backtest for execution reasons. This is why you craft strategies that do not incorporate positive bias in backtests." He added key warning signs for execution unreliability: "If you're squeezing out 2 ticks per trade, then what do you think will happen if you see even a tick of variance going live from slippage, platform lag and inconsistencies — it will totally wipe you out."
The smaller your average profit per trade relative to typical slippage, the more your backtest results depend on fill assumptions rather than strategy logic.
The Four Biases That Produce False Confidence #
Look-Ahead Bias #
Using information that wouldn't have been available at the time of the trading decision. The classic example: computing an indicator from the bar's close price, then entering at the bar's open. In live trading, you don't know the close until the bar is complete — but in a backtest, the engine has access to all data simultaneously.
This is more subtle than it sounds. Custom indicators that reference "current bar" values during signal evaluation, strategies that check end-of-day data for intraday decisions, and any logic that uses data from the same bar as the entry can all introduce look-ahead bias.
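One way to avoid the close-then-open problem is to act on a signal only at the next bar's open. The sketch below assumes a simple "close above moving average" rule purely for illustration; `sma` and `entries_without_lookahead` are hypothetical helper names.

```python
def sma(values, period):
    """Simple moving average; None until enough bars exist."""
    out = [None] * len(values)
    for i in range(period - 1, len(values)):
        out[i] = sum(values[i - period + 1:i + 1]) / period
    return out


def entries_without_lookahead(opens, closes, period):
    """A signal computed from bar i's close is executed at bar i+1's open."""
    avg = sma(closes, period)
    entries = []
    # Stop one bar early: the final bar has no following open to trade on.
    for i in range(period - 1, len(closes) - 1):
        if closes[i] > avg[i]:
            entries.append((i + 1, opens[i + 1]))
    return entries


opens = [100, 101, 102, 101, 103]
closes = [101, 102, 101, 103, 104]
entries = entries_without_lookahead(opens, closes, period=2)
print(entries)
```

A version that entered at `opens[i]` using `closes[i]` would be trading on a price that was not yet known when the order was placed.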
Survivorship Bias #
For futures, this primarily shows up in contract selection. If your continuous contract data only includes the front month and skips illiquid periods around roll dates, you're backtesting on cleaner data than you'd actually trade. CL (crude oil) is especially susceptible — front-month and deferred contracts have materially different volatility and spreads, and roll costs are real.
Curve Fitting #
Adjusting parameters until the backtest looks profitable. As Trembling Hand described [3], "You grab the last year's 1 min data, run a backtest. Results are rubbish but you made a few coding errors so fix them and get a slight gain, now you think you can get more gains if you change a MA to 30 period instead of 20. Before long you go 'hey why not use the wonder of multiple core CPU and my software's optimization feature' so you do a 300 odd run parameter search optimization. And boom you have found a system that's spitting out a 3 next to the profit factor. But you have also just curve fitted your results to that moment in time."
Big Mike provided a quick detection method [2]: "Whatever time frame you are using, slightly change it. For example if using 5 minute bars change it to 3 minute bars and re-run the test. Switch to a highly correlated instrument — for example if trading ES then switch to YM or NQ. In both cases, your final results should be highly correlated with the originals. If they aren't then likely curve fitted to specific data."
Execution Assumption Overfitting #
The least discussed but most insidious bias. Your strategy appears profitable only because the engine assumes fills that wouldn't occur in practice: mid-price fills on limit orders, no queue position effects, best-case intrabar path, and no adverse selection. The "edge" isn't in your signal logic — it's in the gap between the execution model and reality.
Fat Tails identified specific danger areas [4] for NinjaTrader backtests: running on renko bars or other exotic bar types that cannot be backtested reliably, data consistency issues with rollover gaps creating fake profits, and underestimation of slippage when trading breakouts.
The Backtest-to-Live Parity Gap #
When a strategy works in backtesting but fails in live trading, the failure usually falls into one of four categories.
Execution realism gap. Slippage is state-dependent — it's worse during high volatility, wider during news events, and inconsistent across time of day. A constant slippage assumption (e.g., "1 tick per trade") systematically underestimates costs during exactly the periods that matter most. ES slippage during FOMC announcements can be 3-5x normal spreads.
Latency divergence. Your live order flow is slower than the backtest assumes. Strategies that depend on tight timing — "cancel if not filled within 200ms" — are treated as instantaneous in most backtests. In live trading, API jitter, platform thread scheduling, and order routing delays create real gaps.
Risk management divergence. Even if the signal logic is identical, live fills produce different stop prices, different average entries from partial fills, and different position flip timing. If your backtest uses bar-close logic but live uses tick-based stop execution, results diverge from the first trade.
Data differences. The historical feed used for backtesting is typically cleaner than the live feed. Missing ticks, timestamp drift, different session templates, and inconsistent roll methodology can all produce systematic differences between backtest and live performance.
As RM99 put it [6], "When trying to nail down the feasibility of a strategy, you have to distinguish between edge and execution — either can ruin your day."
Walk-Forward Analysis and Out-of-Sample Testing #
The discipline that separates useful backtests from self-deception is structured validation. Walk-forward analysis is the standard approach for futures strategies.
The process: divide your historical data into segments. Optimize (or develop) the strategy on the first segment (in-sample). Test the optimized strategy on the next segment (out-of-sample) without making any changes. Roll forward and repeat. The out-of-sample results across all windows represent your strategy's realistic performance expectation.
As Big Mike emphasized, "Bear in mind, you can only do this once. Once you have tested your strategy on out of sample data, you cannot make changes to your strategy and re-test it on that data. It is no longer out of sample, and any changes you make to it are now curve fitted."
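The segment-rolling process described above can be sketched as plain index arithmetic. This is one possible layout (a fixed-length rolling in-sample window that advances by the out-of-sample length); `walk_forward_windows` is a hypothetical helper, and anchored or expanding windows are equally valid choices.

```python
def walk_forward_windows(n_bars, in_sample, out_of_sample):
    """Return ((is_start, is_end), (oos_start, oos_end)) index pairs, end-exclusive."""
    windows = []
    start = 0
    while start + in_sample + out_of_sample <= n_bars:
        is_range = (start, start + in_sample)
        oos_range = (start + in_sample, start + in_sample + out_of_sample)
        windows.append((is_range, oos_range))
        # Advance by the OOS length so out-of-sample segments never overlap.
        start += out_of_sample
    return windows


windows = walk_forward_windows(n_bars=1000, in_sample=400, out_of_sample=200)
for is_range, oos_range in windows:
    print("optimize on", is_range, "then test untouched on", oos_range)
```

Because each out-of-sample segment is used exactly once, the layout respects the "you can only do this once" rule as long as the strategy is not revised after the segments have been evaluated.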
For futures specifically, out-of-sample testing must cover:
- Regime diversity: Both trending and mean-reverting periods. A strategy that works only in one regime is a ticking clock.
- Contract rolls: Especially for CL and other physically settled contracts where roll behavior affects spread and liquidity.
- Session segments: RTH vs. Globex for ES/NQ. Many strategies that work on RTH data fail spectacularly during overnight sessions (and vice versa).
- Volatility regimes: Include both calm markets and high-volatility events (CPI releases, FOMC decisions, earnings seasons).
The pass criteria should be more demanding than "profitable." You want stable expectancy (consistent average trade), controlled drawdown (no catastrophic periods), and a performance distribution that doesn't depend on a handful of outlier trades.
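The three pass criteria can be computed from a list of per-trade results. The sketch below is illustrative: the function names and the sample P&L values are made up, and the thresholds you apply to each metric are a judgment call the text does not prescribe.

```python
def expectancy(pnls):
    """Average result per trade."""
    return sum(pnls) / len(pnls)


def max_drawdown(pnls):
    """Largest peak-to-trough decline of the cumulative equity curve."""
    equity = peak = worst = 0.0
    for pnl in pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def top_trade_share(pnls, k=1):
    """Fraction of total net profit contributed by the k best trades."""
    total = sum(pnls)
    if total <= 0:
        return float("inf")  # no net profit to attribute
    return sum(sorted(pnls, reverse=True)[:k]) / total


pnls = [100, -50, 200, -150, 50, 900, -100]  # made-up trade results in dollars
print(round(expectancy(pnls), 2), max_drawdown(pnls), round(top_trade_share(pnls), 3))
```

In this sample a single trade accounts for roughly 95% of net profit, which is the kind of outlier dependence the pass criteria are meant to catch even though the expectancy looks healthy.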
Fat Tails offered important nuance [5] on when curve fitting is actually appropriate: "If curve-fitting means translating repetitive behavior into a period of a moving average, then curve-fitting will work as long as humans take a break at noon." The key is fitting to structural market behavior (session rhythms, liquidity patterns) rather than to noise. He identified four core risks: optimizing on too small a sample, conditions changing between in-sample and production, using too many parameters, and using parameters that aren't independent from each other.
Realistic Fill Modeling #
If you're serious about backtesting futures, your fill model needs more sophistication than "1 tick slippage."
Slippage should be conditional, not constant. Model it as a function of current spread, volatility (ATR or realized vol), order aggressiveness (market vs. limit), time-of-day liquidity, and proximity to scheduled events. A market order at 10:30 AM in ES during a quiet session costs less than the same order at 8:30 AM during CPI release. Your backtest should reflect this.
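A conditional slippage model can start very simply. The sketch below is a toy under stated assumptions, not a calibrated model: `estimate_slippage_ticks`, the multiplicative form, and the default `event_multiplier` are all hypothetical, and time-of-day liquidity is assumed to be folded into the `vol_ratio` input (current volatility divided by its typical level).

```python
def estimate_slippage_ticks(spread_ticks, vol_ratio, is_market_order,
                            near_event, event_multiplier=2.0):
    """Hypothetical per-order slippage estimate, in ticks."""
    if not is_market_order:
        # Resting limit orders are modeled with zero slippage here;
        # their cost shows up as missed fills instead.
        return 0.0
    # Start from the current spread and scale up (never down) with volatility.
    slippage = spread_ticks * max(1.0, vol_ratio)
    if near_event:
        slippage *= event_multiplier  # placeholder value, to be calibrated from live fills
    return slippage


quiet = estimate_slippage_ticks(spread_ticks=1.0, vol_ratio=1.0,
                                is_market_order=True, near_event=False)
release = estimate_slippage_ticks(spread_ticks=2.0, vol_ratio=2.0,
                                  is_market_order=True, near_event=True)
print(quiet, release)
```

The numbers themselves matter less than the structure: the same market order is charged differently depending on the state of the market when it is sent.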
Partial fills matter for larger size. Standard engines assume full fill at first opportunity. In reality, your order size relative to available depth determines fill completeness. During fast moves, even in liquid ES/NQ, you can be partially filled with the remainder left unfilled as price runs away. If your strategy assumes full fills and you're trading anything larger than 1-2 contracts, you're likely overestimating performance.
Queue position is the hardest problem. For limit order strategies (mean reversion, spread capture), your backtest results hinge entirely on whether the engine accurately models your position in the queue. Most engines don't — they fill you whenever price touches your level, ignoring that thousands of orders may have been ahead of yours. If your strategy's profitability depends on limit fills, you need either a queue simulation model or very conservative fill probability assumptions.
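A very conservative queue approximation is sketched below. It assumes you know (or guess) how many contracts were resting ahead of you when the order was placed, that nobody ahead cancels, and that you have the sizes of trades printed at your price afterward. `queue_fill` is a hypothetical helper, and real queue models are considerably more involved.

```python
def queue_fill(queue_ahead, traded_at_price, order_size):
    """Contracts filled after the volume ahead in the queue has traded.

    queue_ahead: contracts resting ahead of the order when it was placed
    traded_at_price: sizes of trades printed at the limit price afterward
    order_size: contracts in the resting limit order
    """
    # Only volume beyond the queue ahead can reach this order.
    available = max(0, sum(traded_at_price) - queue_ahead)
    return min(order_size, available)


no_fill = queue_fill(queue_ahead=500, traded_at_price=[100, 150, 200], order_size=2)
full_fill = queue_fill(queue_ahead=500, traded_at_price=[100, 150, 200, 80], order_size=2)
partial = queue_fill(queue_ahead=500, traded_at_price=[100, 150, 200, 80], order_size=50)
print(no_fill, full_fill, partial)
```

A fill-on-touch engine would report a complete fill in all three cases. Under this approximation the first case gets nothing and the third is only partially filled.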
Data Quality: The Invisible Foundation #
Bad data produces convincing backtests with completely fictional results. For futures, minimum data quality requirements include:
Correct instrument mapping. ES and MES have different contract specifications. Don't unknowingly mix data from different contract sizes. Similarly, ensure you're testing on the correct contract months — front month behavior differs from deferred months.
Accurate session templates. RTH (Regular Trading Hours) and Globex sessions have different characteristics. A strategy backtested on 24-hour continuous data may show different results than one tested on RTH-only data. Define your session template explicitly and use it consistently.
Consistent roll methodology. How you construct a continuous contract matters enormously. Ratio-adjusted vs. offset-adjusted produce different price series. Roll timing (by volume, by date, at open, at close) affects the transition. The roll method must match what you'll use in live trading.
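The difference between the two adjustment methods shows up even on a three-price series. The sketch below back-adjusts the expiring contract's history at a single roll; `back_adjust` and the prices are illustrative assumptions, not a vendor's actual procedure.

```python
def back_adjust(old_prices, old_close_at_roll, new_close_at_roll, method="offset"):
    """Adjust the expiring contract's history so it joins the new contract at the roll."""
    if method == "offset":
        # Shift by the price gap: point differences are preserved.
        diff = new_close_at_roll - old_close_at_roll
        return [p + diff for p in old_prices]
    if method == "ratio":
        # Scale by the price ratio: percentage changes are preserved.
        ratio = new_close_at_roll / old_close_at_roll
        return [p * ratio for p in old_prices]
    raise ValueError(f"unknown method: {method}")


old = [70.0, 72.0, 75.0]  # made-up closes for the expiring contract
offset_series = back_adjust(old, 75.0, 76.5, method="offset")
ratio_series = back_adjust(old, 75.0, 76.5, method="ratio")
print(offset_series)
print([round(p, 2) for p in ratio_series])
```

Offset adjustment keeps stop distances in points intact, while ratio adjustment keeps percentage returns intact, so a strategy defined in one unit can behave differently on a series adjusted in the other.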
No timestamp drift or missing ticks. For tick-level backtests, even small timestamp errors create false signals. Missing ticks simulate "teleport" fills where price appears to jump instantaneously, potentially generating phantom profits.
Consistent price fields. Using last-trade prices for some calculations and bid/ask for others creates subtle but systematic errors, especially for limit order fill evaluation.
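The timestamp and missing-tick requirements above lend themselves to an automated pre-check before any tick-level backtest runs. This is a minimal sketch: `find_tick_issues`, its thresholds, and the sample ticks are assumptions for illustration, and sensible thresholds depend on the instrument and session.

```python
def find_tick_issues(ticks, max_gap_seconds, max_jump_ticks, tick_size):
    """Flag out-of-order timestamps, long gaps, and large price jumps.

    ticks: list of (timestamp_in_seconds, price) tuples in file order
    """
    issues = []
    for (t0, p0), (t1, p1) in zip(ticks, ticks[1:]):
        if t1 < t0:
            issues.append(("out_of_order", t1))
        elif t1 - t0 > max_gap_seconds:
            issues.append(("gap", t1))
        # A jump of many ticks between consecutive prints suggests missing data.
        if abs(p1 - p0) / tick_size > max_jump_ticks:
            issues.append(("price_jump", t1))
    return issues


ticks = [(0.0, 5495.00), (0.5, 5495.25), (0.4, 5495.25), (30.0, 5497.50)]
issues = find_tick_issues(ticks, max_gap_seconds=10, max_jump_ticks=4, tick_size=0.25)
print(issues)
```

Flagged spans can then be inspected or excluded, rather than being silently traded through by the engine.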
When Backtesting Is Meaningful #
Backtesting produces reliable signals when:
The strategy's edge comes from information that moves slower than the execution timeframe. A strategy using daily trend filters with intraday entries can be backtested reliably because the signal (trend direction) changes slowly relative to the execution (individual trades). The fill model errors are small relative to the expected profit per trade.
The execution model is conservative. If you assume 2 ticks of slippage on every trade and your strategy is still profitable, you have a margin of safety. If your strategy breaks even with 1 tick of slippage, the fill model is determining your results.
The results survive stress tests. Increase slippage by 50%. Slightly perturb parameters. Exclude the most volatile weeks. Test on correlated instruments. If performance collapses under any of these tests, you're fitting to artifacts.
Walk-forward validation shows stability. Not just profitability across out-of-sample windows, but consistent expectancy, controlled drawdown, and similar trade distributions.
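The slippage stress test mentioned above is straightforward to automate once you have gross per-trade results. The sketch below works in ticks and uses made-up trades; `expectancy_after_costs` and `slippage_stress` are hypothetical helper names.

```python
def expectancy_after_costs(gross_ticks, slippage_ticks):
    """Average per-trade result after charging round-turn slippage to every trade."""
    return sum(g - slippage_ticks for g in gross_ticks) / len(gross_ticks)


def slippage_stress(gross_ticks, base_slippage, bump=0.5):
    """Return (base, stressed) expectancy, with slippage raised by `bump` (0.5 = +50%)."""
    base = expectancy_after_costs(gross_ticks, base_slippage)
    stressed = expectancy_after_costs(gross_ticks, base_slippage * (1 + bump))
    return base, stressed


gross = [6, -4, 8, -4, 5, -4, 7]  # made-up gross results in ticks
base, stressed = slippage_stress(gross, base_slippage=1.0)
print(base, stressed)
```

Here a 50% increase in assumed slippage halves the net expectancy. A strategy whose stressed figure drops to zero or below is leaning on the cost assumption rather than on its signal.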
When Backtesting Is Misleading #
The backtest produces fictional results when:
You're trading microstructure effects (spread capture, queue-dependent strategies) but using an engine that fills limits on touch. The "edge" exists only in the execution model's optimism.
You're using bar-level data for strategies with intrabar stop/target logic. The engine's convention for resolving intrabar ambiguity is driving your results.
You've optimized extensively without walk-forward discipline. Multiple rounds of parameter tuning on the same dataset converge on noise, not signal.
Your edge disappears when you add realistic costs. If the strategy's net expectancy per trade is less than 2x your realistic slippage estimate, the fill model is more important than the signal logic.
The Validation Checklist #
Before trusting any futures backtest, verify these ten points:
- Signal timestamps are correct — no look-ahead, proper bar shifting
- Time resolution matches holding period and execution logic
- Stops and targets use realistic intrabar assumptions (or tick data)
- Fill rules for market, limit, and stop orders are documented and understood
- Slippage is conditional on volatility and time-of-day, not constant
- Partial fills are modeled plausibly for your position size
- Contract rolls are simulated with your actual live roll methodology
- Walk-forward validation covers multiple market regimes
- Performance survives cost and fill stress tests (50%+ slippage increase)
- Results remain stable under small parameter perturbation
If a backtest passes all ten, it's not a guarantee of live profitability — but it's an honest assessment of the strategy's historical behavior under realistic assumptions. That's the most any backtest can offer, and for systematic traders, it's enough to make informed decisions about risking real capital.
For more on choosing a futures trading platform with reliable backtesting capabilities, and understanding the automation infrastructure needed to move from backtest to live deployment, see the linked Academy articles.
Citations
- [1] Big Mike, Does backtesting work? (2011): “Out of sample data is critical for a meaningful backtest, yet most traders don't do it. Out of sample means that you would backtest on one period of data (say Jan 1 2009 - Jan 1 2010) and then once you feel you have the strategy the way you want it,...”
- [2] Big Mike, Benchmarks for a good automated ES trading system (2014): “My first guess would be that you have almost certainly overfit (Curve fit) to the historical data. You can quickly verify this a couple of ways: a) Whatever time frame you are using, slightly change it.”
- [3] Trembling Hand, How quickly do algos go bad? (2021): “I think the fact that you have tested on the latest data and then tested backwards on old data is a huge flag of possible curve fitting. Time series testing is hard. It requires a good amount of honesty.”
- [4] Fat Tails, Good Test results!!! Should I go LIVE? (2020): “When running backtests as you did, there are a few dangers: (1) Running backtests on renko bars or other exotic bar types that cannot be backtested In case that you have backtested on Renko bars or other exotic bars that cannot be backtested the back...”
- [5] Fat Tails, An experiment on curve fitting (2010): “Hi shodson, first of all, thanks for bringing the subject up. I have not traded automated systems and I do not intend to do so during the next year, as this is much more demanding than discretionary trading and requires a larger variety of skills.”
- [6] RM99, Strategy Optimization and trusting the results (2011): “There's more than one issue at work here. The reason you forward test is to gain confidence for both edge and execution. Many people do not trust the results of a backtest for execution reasons.”
