How to evaluate an "intelligent" planning tool: twelve questions that separate engines from wrappers
Every planning tool's homepage now says intelligent, and most of the claims are a thin model call away from a to-do list. Buying in this market means distinguishing two architectures that demo identically: a wrapper, where AI text generation decorates a conventional task tracker, and an engine, where deterministic computation does the load-bearing work and AI assists at the edges. Wrappers demo better. Engines survive contact with month three.
We are a vendor in this market, so calibrate accordingly. But every question below is checkable in a trial, asks for behaviour rather than branding, and several of them will disqualify us for you, which is rather the point.
The twelve questions
1. Does it forecast a distribution or a date? A single finish date hides a quantile choice someone made for you. Ask for P50 and P90, and the probability of each ending. If the tool cannot say "82% success, P90 in May", the intelligence is cosmetic.
2. Can estimates carry uncertainty? "Two days" and "two days, could be five" are different facts. A tool that cannot store a range cannot be honest downstream of one.
3. Is the same question answered the same way twice? Run the forecast twice with nothing changed. If the answers differ, you are reading a language model's mood. Deterministic, seeded computation is what makes a number worth arguing with.
4. Can it explain a date? Ask why a task is scheduled where it is. An engine names the chain: this dependency, that person's available hours, this prior. A wrapper offers a paragraph of plausible prose (why explanations need structure).
5. Where do your availability and your team's capacity live? If the schedule assumes 40 hours a week and you really have fewer, every date is wrong by that ratio (painted weeks, capacity pooling). Bonus: does taking on a new project show its cost to the existing ones?
6. Does it learn from actuals? Completed work is calibration data. A tool that does not recalibrate against your real pace will be exactly as wrong in month six as in week one, and the planning-fallacy research says wrong in a known direction: too optimistic.
7. What happens when the AI is wrong? The decisive architecture question. Is there a deterministic checker that catches invalid structure regardless of who authored it, or does AI output go straight to your plan? Ask the vendor to show you AI output failing validation (the drafts-and-judges split).
8. Can you bring your own model, and leave with your data? Plans-as-text plus a public spec means any LLM can author and any future tool can import; per-seat AI pricing on a proprietary format means the opposite (the BYOAI argument, the pricing economics). Export is the question you ask before you need it.
9. Can it simulate the decision you are actually facing? "What if we cut scope?" should be an edit with a recomputed forecast, not a workshop. If money is your constraint, ask whether spend can trade against odds explicitly.
10. Do loops exist? Real work is "revise until approved". If the tool cannot model a bounded loop with a probabilistic length, your riskiest work is invisible to its forecast.
11. What is the audit trail? When the date moved, can you see what changed? Text-based plans give you diffs; database rows give you a shrug.
12. What does month three cost? Not the licence: the maintenance. Hours per week dragging bars, reconciling views, updating the resourcing sheet. Computed views make that cost approach zero; hand-maintained ones make it a part-time job.
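Questions 1 to 3 can be made concrete in a few lines. The sketch below is illustrative only, not a description of any vendor's engine: the `PLAN` data, the triangular distribution, and the names `simulate_once` and `forecast` are all assumptions chosen for brevity. It shows estimates stored as ranges, a forecast reported as P50, P90 and an on-time probability, and a fixed seed that makes two runs return identical numbers.

```python
import random

# Hypothetical plan: task -> ((low, likely, high) in days, dependencies).
# Tasks are listed in dependency order, so one pass computes finish times.
PLAN = {
    "design": ((2, 3, 6), []),
    "build": ((5, 8, 15), ["design"]),
    "review": ((1, 2, 5), ["design"]),
    "ship": ((1, 1, 3), ["build", "review"]),
}


def simulate_once(rng, plan):
    """Sample every task once and return the day the last task finishes."""
    finish = {}
    for name, ((low, likely, high), deps) in plan.items():
        # A task starts when its slowest dependency has finished.
        start = max((finish[d] for d in deps), default=0.0)
        # Assumption: a triangular distribution stands in for the range.
        finish[name] = start + rng.triangular(low, high, likely)
    return max(finish.values())


def forecast(plan, deadline, runs=5000, seed=42):
    """Return P50, P90 and the share of runs that finish by the deadline."""
    rng = random.Random(seed)  # fixed seed: same inputs, same answer
    totals = sorted(simulate_once(rng, plan) for _ in range(runs))
    return {
        "p50": totals[int(0.50 * runs)],
        "p90": totals[int(0.90 * runs)],
        "on_time": sum(t <= deadline for t in totals) / runs,
    }


if __name__ == "__main__":
    first = forecast(PLAN, deadline=20)
    second = forecast(PLAN, deadline=20)
    print(first)
    print("repeatable:", first == second)
```

The point for a buyer is the shape of the answer, not the statistics: a range goes in, a distribution comes out, and rerunning with nothing changed returns the same result.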
When the answer is not us
A checklist from a vendor is only credible if it can return "buy something else", so, concretely: Topolog is the wrong tool when your work does not decompose into dependent tasks (a reading list needs a list), when you need issue-tracking depth (tickets, triage, SLAs: that is Jira's home turf, not ours), when your plans are short and decoupled enough that a board is genuinely sufficient (a five-task week has no interesting structure to compute), or when nobody on the team will invest the one honest hour that estimates-with-ranges and a painted week require. Dependency-aware, probability-bearing planning pays off where plans are coupled, uncertain, and consequential. Below that threshold, simpler is correct, and we would rather say so here than in your churn survey.
| Buy a board/list when | Buy an engine when |
|---|---|
| Tasks are independent and short-lived | Tasks block each other across weeks or people |
| Being a day late costs nothing | Dates carry commitments, cash, or both |
| One person, one project | A portfolio competing for the same hours |
| Structure would be ceremony | Structure is where your risk actually lives |
One last test that compresses all twelve: ask each vendor what their tool refuses to do. Engines have principled refusals (ours will not run a plan with a dependency cycle, and will not invent a probability it cannot defend). Wrappers refuse nothing, because text generation has no opinions about correctness. In planning tools, as in colleagues, the ones who never say no are the ones you cannot trust with the dates.
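A principled refusal of the kind described above, and the deterministic checker from question 7, can be sketched in Python. This is a minimal illustration under stated assumptions, not Topolog's implementation: the function name `validate_plan` and the plan shape (a mapping from each task to the tasks it depends on) are hypothetical. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+), which raises `CycleError` when the dependencies loop.

```python
from graphlib import CycleError, TopologicalSorter


def validate_plan(deps):
    """Return a valid task order, or raise ValueError explaining the refusal.

    `deps` maps each task name to the list of tasks it depends on.
    The check is the same whether a person or a model wrote the plan.
    """
    known = set(deps)
    for task, needs in deps.items():
        missing = [d for d in needs if d not in known]
        if missing:
            raise ValueError(f"{task} depends on unknown task(s): {missing}")
    try:
        # static_order() yields dependencies before their dependents.
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] holds the tasks that form the cycle.
        cycle = " -> ".join(err.args[1])
        raise ValueError(f"dependency cycle: {cycle}") from None


if __name__ == "__main__":
    print(validate_plan({"design": [], "build": ["design"], "ship": ["build"]}))
    try:
        validate_plan({"draft": ["approve"], "approve": ["draft"]})
    except ValueError as refusal:
        print("refused:", refusal)
```

In a trial, the equivalent check is to hand the tool a plan with a loop in its dependencies and see whether it names the cycle or quietly schedules around it.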