By Donald Farmer, Principal, Treehive Strategy

Data without analysis is a wasted asset. Analytics without action is wasted effort. I write compulsively and advise startups, established software vendors, investors and enterprises on data, analytics and AI strategy.

Who’s Paying For Your Prompt?

More good insight from The Information Difference

  • Market leadership splits by segment. OpenAI commands roughly 70% of consumer traffic, but Anthropic has overtaken it in the enterprise market as of 2025, holding 32% of foundation model revenue against OpenAI’s 25%. The picture of who “leads” in AI depends entirely on which customers you count.
  • Both major vendors are burning cash at staggering rates. OpenAI’s estimated revenue for 2025 sits around $13 billion, yet the company may have lost $12 billion in a single quarter (July-September 2025). Anthropic projects break-even in 2028 with $70 billion revenue; OpenAI aims for 2030, requiring $125 billion in revenue to reach that milestone.
  • Training costs, though eye-catching, are one-time capital expenditures. Training GPT-3 cost perhaps $3 million; GPT-4 around $100 million; GPT-5 may have reached $2 billion. The more persistent economic pressure comes from inference, the ongoing cost of executing user queries, which now accounts for 49% of enterprise compute spending.
  • The marginal cost of a query varies wildly by model complexity. A standard query costs approximately 0.36 cents in GPU processing; a reasoning model like o1 can cost the vendor $0.10 to $0.50 per query, while users pay the same $0.04 regardless of complexity. Two complex questions per day can exhaust the entire value of a $20 monthly subscription, making power users a net drain on vendor finances (see the back-of-envelope arithmetic after this list).
  • Casual users cross-subsidize power users. With only 2% of ChatGPT users paying for subscriptions, and reasoning queries running at a loss, the current pricing structure leaves light users covering part of the cost of heavy use, with the vendor absorbing the rest.
  • Switching costs remain low, undermining the Uber analogy. Unlike ride-sharing, where network effects locked in drivers and customers before price hikes, LLM vendors face minimal barriers to customer migration. Libraries like LangChain abstract provider differences; unless an enterprise fine-tunes its own model, switching is straightforward. The “moat” that protected Uber’s eventual price increases does not exist here.
  • Component costs are rising sharply. DRAM contract prices in Q4 2025 jumped 170% year-over-year; server-grade memory may double again in 2026. Data centre capacity is projected to double by 2030. These costs must ultimately be recovered somewhere in the value chain.
  • Financial stress is becoming visible in the market. Oracle’s $248 billion in off-balance sheet lease commitments for AI data centres rattled investors; credit default swap prices on Oracle spiked in late 2025. Robin AI and Builder.AI (once valued at $1.5 billion) both collapsed, illustrating that not all AI ventures enjoy patient capital.
  • Enterprise AI projects are failing at troubling rates. A September 2025 BCG report flags the high failure rate of corporate AI initiatives, even under current generous pricing. Businesses building ROI cases on present token prices may find themselves exposed if vendors eventually raise prices to reach profitability.
  • The article frames current pricing as an investor-subsidized land grab rather than a sustainable equilibrium. Enterprises treating LLM costs as stable inputs are, in the author’s framing, building on shifting sands; the bills for this capital-intensive expansion will come due, and the adjustment could be painful for vendors and customers alike.
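
The back-of-envelope arithmetic referenced above is easy to reproduce. The per-query costs are the article’s figures; the daily usage pattern and the $0.35 midpoint for a reasoning query are my own illustrative assumptions:

```python
# Rough check of the subscription economics using the figures quoted above.
# Per-query costs come from the article; the usage pattern is an assumption.
standard_query_cost = 0.0036        # ~0.36 cents of GPU processing per ordinary query
reasoning_query_cost = 0.35         # reasoning queries: roughly $0.10-$0.50, midpoint assumed
subscription_price = 20.00          # $20-per-month plan

standard_queries_per_day = 20       # assumption: a typical day of ordinary chat
reasoning_queries_per_day = 2       # the article's "two complex questions per day"
days_per_month = 30

monthly_cost = days_per_month * (
    standard_queries_per_day * standard_query_cost
    + reasoning_queries_per_day * reasoning_query_cost
)
print(f"Vendor-side inference cost per month: ${monthly_cost:.2f}")   # ~$23
print(f"Subscription revenue:                 ${subscription_price:.2f}")
```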

It’s not too long, and it’s worth reading

This article presents a realistic analysis of current AI pricing, challenging the notion that it accurately reflects long-term economic trends. It’s a valuable resource for anyone planning enterprise AI projects or seeking clarity on why the industry’s unit economics seem to lag behind growth.

Made me think

If switching costs between LLM providers remain low, what strategies could vendors pursue to create meaningful lock-in, and would those strategies ultimately benefit or harm customers?

The article notes that reasoning models cost vendors far more per query than users pay. How will this be resolved: through rationing, tiered pricing, efficiency gains, or something else entirely?

The Uber analogy depends on eventual market dominance that justifies earlier losses. Given the proliferation of open-source models from China and elsewhere, is there any plausible path to the kind of concentration that would make current AI losses recoverable?


Best practices for coding with agents

From the Cursor blog – worth following!

  • The agent harness orchestrates three components: instructions (system prompts and rules), tools (file editing, search, terminal execution), and user messages. Cursor tunes these components differently for each frontier model based on internal evaluations and external benchmarks. A model trained on shell-oriented workflows might prefer grep over a dedicated search tool; another might need explicit instructions to run linters after edits. The harness abstracts these differences (a purely illustrative sketch of this composition follows the list).
  • Planning before coding marks the single most consequential change to workflow. A University of Chicago study found that experienced developers are more likely to plan before generating code. Cursor’s Plan Mode (Shift+Tab) has the agent research the codebase, ask clarifying questions, and produce a detailed implementation plan with file paths and code references before touching any files. Plans can be saved as Markdown to .cursor/plans/ for documentation and resumption.
  • When implementation goes wrong, revert to the plan rather than patching through follow-up prompts. Reverting changes, refining the plan, and running it again often produces cleaner results faster than trying to correct an agent mid-flight.
  • Context management becomes the human’s primary job as agents take over code generation. Cursor’s agent can find files through grep and semantic search on demand; you need not tag every relevant file. Including irrelevant files can confuse the agent about what matters. Features like @Branch orient the agent to your current work; @Past Chats let you reference previous conversations selectively rather than duplicating entire histories.
  • Long conversations degrade agent performance. After many turns and summarizations, context accumulates noise; the agent may lose focus or drift to unrelated tasks. Starting fresh when moving to a new feature or when the agent seems confused restores effectiveness.
  • Rules and Skills extend agent behavior in complementary ways. Rules are static Markdown files in .cursor/rules/ that apply to every conversation: commands to run, patterns to follow, pointers to canonical examples. Skills are dynamic capabilities loaded only when relevant, including custom commands triggered with /, hooks that run before or after agent actions, and domain knowledge for specific tasks. The distinction keeps the context window clean while preserving access to specialized capabilities.
  • Hooks create powerful looping patterns for goal-oriented work. A stop hook can check whether tests pass or a scratchpad contains “DONE”; if not, it sends a followup_message to continue the loop. This pattern works for running tests until they pass, iterating on UI until it matches a design mockup, or any task where success is verifiable (a minimal sketch of the loop appears after this list).
  • Parallel execution via git worktrees lets multiple agents work simultaneously without interference. Each agent runs in its own worktree with isolated files; when finished, you apply changes back to the working branch. Running the same prompt across multiple models and comparing results side by side improves output quality, particularly for harder problems.
  • Cloud agents handle tasks you would otherwise add to a to-do list. Bug fixes that arose while working on something else, refactors of recent code, test generation, documentation updates. Cloud agents clone the repo, create a branch, work autonomously, and open a pull request when finished. You can trigger them from Slack, the web, or your phone, and check results later.
  • Debug Mode offers a hypothesis-driven approach to tricky bugs. Rather than guessing at fixes, it generates multiple hypotheses, instruments code with logging, asks you to reproduce the bug while collecting runtime data, then analyzes actual behavior to pinpoint root causes. It works best for reproducible bugs, race conditions, performance problems, and regressions.
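
To make the harness idea concrete, here is a purely illustrative sketch of how instructions, tools and user messages might be composed differently per model. The names, the tuning table and the request shape are invented for the example; this is not Cursor’s implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    system_prompt: str                       # instructions: rules and behavioural guidance
    tools: list[str]                         # tool surface exposed to the model
    extra_instructions: list[str] = field(default_factory=list)

# Hypothetical per-model tuning, mirroring the differences described above.
HARNESSES = {
    "shell-oriented-model": Harness(
        system_prompt="Prefer shell tools (grep, sed) for searching and editing code.",
        tools=["terminal", "edit_file"],
    ),
    "tool-oriented-model": Harness(
        system_prompt="Use the dedicated search tool for code navigation.",
        tools=["semantic_search", "edit_file", "lint"],
        extra_instructions=["Run the linter after every edit."],
    ),
}

def build_request(model: str, user_message: str) -> dict:
    """Combine instructions, tools and the user message into a single request."""
    h = HARNESSES[model]
    return {
        "model": model,
        "system": "\n".join([h.system_prompt, *h.extra_instructions]),
        "tools": h.tools,
        "messages": [{"role": "user", "content": user_message}],
    }
```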
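
The stop-hook loop reduces to a small piece of control flow. A minimal sketch, assuming a hypothetical run_agent() stand-in rather than Cursor’s actual hook API, with a verifiable success check (tests passing):

```python
import subprocess

def run_agent(prompt: str) -> None:
    """Placeholder for one agent turn; wire this to your agent runner of choice."""
    raise NotImplementedError

def tests_pass() -> bool:
    """Verifiable success criterion: the test suite exits cleanly."""
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0

def agent_loop(task_prompt: str, max_rounds: int = 5) -> bool:
    """Re-prompt the agent until the check passes or the round budget runs out.

    This mirrors the stop-hook pattern: after each turn a check runs, and a
    follow-up message keeps the loop going if the goal is not yet met.
    """
    prompt = task_prompt
    for _ in range(max_rounds):
        run_agent(prompt)
        if tests_pass():
            return True
        prompt = "Tests are still failing. Read the failures and fix them."
    return False
```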

It’s not too long, and it’s worth reading

This guide is both a cheat sheet and a roadmap for developing software with AI agents. Rather than simply telling you to “use AI better,” it shows specific patterns and techniques that work well.

Made me think

The guide recommends reverting and refining plans rather than patching through follow-up prompts. What does this tell us about the limitations of conversational correction with current models? Will those limitations change as context handling improves? (Let’s hope)

Rules are described as static and always-on, while Skills load dynamically when relevant. So, how does the agent decide when a Skill is relevant, and what happens when that judgment is wrong?

The stop hook pattern for iterative loops depends on verifiable success criteria (tests passing, a scratchpad marked “DONE”). For tasks without clear verification signals, how should developers think about when to stop the loop?

Running the same prompt across multiple models and comparing results is presented as a “technique” for hard problems. But what criteria could a developer use to decide which solution is “best” when different models take structurally different approaches?

The guide emphasizes that AI-generated code can “look right while being subtly wrong.” OK, but what kinds of subtle wrongness are most common? How should we review code to catch errors that are not syntactically or even logically obvious?


Hallucination Detection and Mitigation in Large Language Models

A research paper from CIBC, Toronto, by Ahmad Pesaranghader and Erin Lee

  • The paper proposes a continuous improvement cycle for hallucination management, organized around root cause awareness rather than generic fixes. The authors argue that applying a blanket solution (such as retraining) to a hallucination caused by outdated data is wasteful when a targeted intervention (such as retrieval-augmented generation) would suffice. The framework links detection signals to probable failure categories and selects mitigation strategies accordingly.
  • Hallucination sources are categorized into three domains: model, data, and context. Model-level issues include architectural constraints (finite context windows, autoregressive next-token prediction that prioritizes fluency over factual consistency), training objectives that do not explicitly reward accuracy, and decoding dynamics that introduce variability. Data-level issues include knowledge gaps, outdated information, noisy or biased training corpora, and spurious correlations learned from web-crawled text. Context-level issues arise from ambiguous prompts, contradictory retrieved documents, and domain or task shifts at inference time.
  • Detection methods span multiple levels of model observability. Probabilistic entropy measures uncertainty in the token probability distribution; semantic entropy measures uncertainty in embedding space by clustering multiple sampled responses and calculating the distribution across clusters. Expected Calibration Error (ECE) quantifies the gap between predicted confidence and empirical accuracy, catching high-confidence hallucinations that entropy-based methods miss. Self-declared uncertainty has the model verbalize its own confidence. Internal state monitoring probes hidden layers and attention weights. The RACE framework (Reasoning and Answer Consistency Evaluation) penalizes outputs that arrive at correct answers through flawed reasoning. (A small sketch of token entropy and ECE follows this list.)
  • The distinction between open-weight and closed-weight models determines which detection and mitigation techniques are available. Open-weight models permit Monte Carlo dropout, ensemble variance, training of custom classifiers on latent representations, and full control over decoding parameters. Closed-weight models accessed via API restrict observability to final text outputs, limiting detection to token-level entropy estimated through sampling and output-level comparison with retrieved documents. Hybrid approaches wrap closed-weight generators with open-weight detectors to estimate uncertainty without access to internal states.
  • Mitigation strategies fall into five classes: external knowledge grounding, confidence calibration, prompt engineering, decoding control, and fine-tuning. External knowledge grounding through RAG addresses outdated or incomplete training data. Confidence calibration (temperature scaling, isotonic regression, Bayesian post-hoc methods, multi-pass self-evaluation) aligns predicted confidence with empirical accuracy. Prompt engineering techniques include chain-of-thought prompting, few-shot exemplars, instruction layering, self-consistency prompting, and role-aligned prompting. Decoding control manipulates sampling parameters and applies constrained decoding to force outputs within a verified vocabulary. Fine-tuning through adversarial training, contrastive learning, and RLHF provides deeper corrections but consumes more resources.
  • Temperature scaling receives detailed treatment as a lightweight calibration method. A single scalar parameter T is applied to logits before softmax; T > 1 flattens the distribution to reduce overconfidence, T < 1 sharpens it. The optimal T is learned on a validation set by minimizing negative log-likelihood. The paper shows how temperature scaling interacts with probabilistic entropy, semantic entropy, and Monte Carlo dropout, adjusting within-pass confidence and often increasing inter-pass variance for overconfident models. (A sketch of fitting T also follows this list.)
  • The paper acknowledges that hallucinations cannot be eliminated entirely. The statistical nature of language modeling means these systems predict plausible sequences rather than verify ground truth. The framework aims to identify, contain, and learn from failures rather than pursue perfect accuracy. This framing positions the work as risk management rather than problem-solving.
  • A case study applies the framework to data extraction in financial documents. The authors describe a tiered architecture where the Model Tier handles intrinsic reliability through calibration and ensemble filtering; the Context Tier addresses prompt and instructional misalignment through prompt optimization and context summarization; and the Data Tier governs external verification through cross-source fact-checking and retrieval-augmented grounding. Each tier feeds residual errors back into the others, creating cross-tier feedback integration.
  • The paper positions itself for practitioners in regulated industries. For AI practitioners, it offers guidance on selecting detection metrics based on operational constraints. For risk officers, it provides assurance that hallucination risks are actively managed through a verifiable, systematic process with transparency and structured controls suitable for regulated deployment.
  • Recent research caveats temper optimism about RLHF. The authors note that the effectiveness of reinforcement learning from human feedback depends on reward design and annotation quality, and may not consistently penalize subtle factual errors. This acknowledgment of limitations distinguishes the paper from more promotional treatments of alignment techniques.
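
Two of the detection signals summarized above are straightforward to compute once you can sample token probabilities and have a labeled validation set. A minimal sketch of token-level entropy and Expected Calibration Error, written as my own illustration rather than the authors’ code:

```python
import numpy as np

def token_entropy(probs) -> float:
    """Probabilistic entropy of one next-token distribution (higher = more uncertain)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: weighted mean gap between stated confidence and observed accuracy.

    confidences: the model's probability for each answer (0..1)
    correct:     1 if the answer was actually right, else 0 (requires labeled data)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by the fraction of samples in the bin
    return ece
```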
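
Temperature scaling itself is only a few lines. A sketch of the standard procedure the paper describes, fitting a single scalar T on a validation set by minimizing negative log-likelihood (my own NumPy/SciPy illustration; the variable names are not the authors’):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(T, logits, labels) -> float:
    """Negative log-likelihood of the true labels after dividing logits by T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                        # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels) -> float:
    """Learn the scalar T described above; T > 1 softens an overconfident model."""
    result = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0),
                             args=(logits, labels), method="bounded")
    return result.x

# At inference, divide new logits by the fitted T before softmax. The argmax (and
# therefore accuracy) is unchanged, but calibration metrics such as ECE improve.
```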

It’s not too long, and it’s worth reading

This paper presents a practical taxonomy of detection and mitigation techniques, categorized by model access and suspected failure type. For professionals developing production systems in finance or other regulated fields, the tiered architecture and continuous improvement approach provide a tangible roadmap, not just a research agenda.

Made me think

The framework assumes root cause awareness can guide mitigation selection, but detection signals are probabilistic indicators rather than definitive attributions. How should practitioners handle cases where multiple root causes are plausible?

How does the framework prevent confirmation bias in diagnosis?

Expected Calibration Error is a way to catch high-confidence hallucinations that entropy-based methods miss. But ECE needs labeled data to work. So, how can data scientists build and maintain validation sets for domains where ground truth is expensive to obtain or rapidly changing?

The authors acknowledge that hallucinations cannot be eliminated, only managed. How should organizations communicate residual hallucination risk to downstream users? What liability frameworks are appropriate when managed risk still produces costly failures?
