LLMs: A Hackers Guide

Hrishi · 2026-05-22 · via captions
TL;DR

A practitioner's guide to working with LLMs, drawn from real shipping-industry AI deployments. Covers the iterative workflow for building with LLMs, input/output best practices, debugging frameworks, and honest assessments of embeddings and long-context models.

Key Concepts

CPLN loop
Chat → Playground → Loop (add data/test cases) → Nest (break into subtasks) — the core iterative workflow for LLM development
Modalities
Text, vision, audio/speech, and code — all available and underused
Structured output
Using type specs (e.g., Zod, TypeScript) to constrain model output and reduce hallucinations
Nested prompts
Breaking large prompts into smaller, composable sub-tasks — analogous to refusing 700-line code files
Error taxonomy
Three root causes — app-level orchestration issues, factuality issues, instruction-following issues
Embeddings as fuzzy search
Best used at the end of a pipeline on an already-reduced search space, not as the primary retrieval mechanism

Notes

§The Iterative Loop (CPLN)

  • Chat (~50%+ of time): Explore freely and try radically different approaches; don't commit to a single prompt early
    • Most teams stop at ~40% working and go to production too soon
    • Unlike code, prompts should be rewritten many times before settling
  • Playground (~20%): Use any provider's playground; the key feature needed is retroactive editing of prompts and conversation history ("surfing the latent space")
  • Loop: Add more data and test cases; stress-test your hypothesis
  • Nest: Break prompts into smaller and smaller subtasks
    • If you wouldn't accept a 700-line code file, don't accept a 100-line prompt
    • Every prompt can be decomposed further
    • Subtasks then re-enter the same loop when new problems or customers arrive
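
The Loop step can be sketched as a tiny test harness. This is a minimal illustration, not the speaker's tooling: `run_cases`, `fake_model`, the prompt template, and the sample messages are all hypothetical, and `fake_model` stands in for a real provider call so the sketch runs offline.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: takes a prompt string, returns the model's text.
Model = Callable[[str], str]


def run_cases(prompt_template: str, cases: List[Tuple[str, str]], model: Model) -> Dict[str, object]:
    """Run one prompt template over (input, expected substring) pairs and report the pass rate."""
    if not cases:
        raise ValueError("need at least one test case")
    failures = []
    for text, expected in cases:
        output = model(prompt_template.format(input=text))
        # A case passes when the expected label appears anywhere in the output.
        if expected.lower() not in output.lower():
            failures.append((text, output))
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call (assumption): a crude keyword rule."""
    return "urgent" if "asap" in prompt.lower() else "routine"


cases = [
    ("Please send the manifest ASAP", "urgent"),
    ("Weekly status update attached", "routine"),
    ("Container is stuck at port, need help now", "urgent"),
]
report = run_cases("Classify the priority of this message: {input}", cases, fake_model)
print(f"pass rate: {report['pass_rate']:.2f}")
for text, output in report["failures"]:
    print("FAILED:", text, "->", output)
```

Each failure listed here is a candidate for a new subtask (Nest) or a reason to go back to Chat with a different approach.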

§Input Modality Best Practices

  • Speech: Users give ~200 words when asked to speak vs. ~5 words in a text box — use it to gather richer context
  • Vision: Captures relationships that text cannot; useful as expensive-but-dense OCR; visual diagrams are often more token-efficient than their text equivalents
  • Code/structured input: Almost always prefer structured input and structured output
  • Tools: TypeScript + Zod for type specs; SQL to express search logic (even if never executed)
  • Structured output reduces hallucinations by constraining the token probability space
  • Lean on the model's training: Express problems as supersets of known languages (Python, TypeScript, English) rather than inventing custom DSLs
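
A type spec for model output might look like the sketch below. The notes name TypeScript + Zod; this version uses Python's standard library as a stand-in, and the `ShipmentQuery` fields and `parse_model_output` helper are hypothetical. It only validates output after the fact; constraining the token space during generation depends on the provider's structured-output feature, which is not shown.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ShipmentQuery:
    """Hypothetical output spec the model is asked to fill in."""
    vessel_name: str
    port: str
    container_count: int


def parse_model_output(raw: str) -> Optional[ShipmentQuery]:
    """Return a ShipmentQuery if `raw` is JSON matching the spec exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    expected = {f.name: f.type for f in fields(ShipmentQuery)}
    # Reject missing or extra keys rather than silently ignoring them.
    if set(data) != set(expected):
        return None
    # Exact type match, so JSON `true` is not accepted where an int is expected.
    for name, typ in expected.items():
        if type(data[name]) is not typ:
            return None
    return ShipmentQuery(**data)


good = '{"vessel_name": "MV Example", "port": "Rotterdam", "container_count": 120}'
bad = '{"vessel_name": "MV Example", "port": "Rotterdam", "container_count": "many"}'
print(parse_model_output(good))
print(parse_model_output(bad))
```

A `None` result pinpoints that the failure is in the output shape, which is the debugging benefit the notes attribute to structured output.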

§General Dos and Don'ts

  • Do: use all available modalities
  • Do: try multiple models; they have diverged significantly and have different "personalities"
  • Do: keep input/output size ratios roughly proportional; 5 words in → 20 paragraphs out = poor output
  • Do: add structure at both input and output stages
  • Don't: add heavy abstractions (frameworks, libraries) between yourself and the model early on; you learn less and get trapped
  • Don't: stick to one model provider
  • Don't: expect massive output from minimal input

§What LLMs Enable (4 Capability Classes)

§Debugging Framework

  • Drop to prompt level; remove abstractions and work back up
  • Try a smarter model: if that fixes it, the problem is prompt/input
  • Try a dumber model: it reveals where reasoning is actually needed
  • Transform the input (verbosity, structure, chunking)
  • Add more structure to the output to expose where failures occur
  • Find what separates failing cases from working cases; that difference points to a prompt fix
  • Add more validation
  • If the system is doing more than one of the four capability classes, separate them into distinct prompts/models
  • Classify errors into: app-level orchestration, factuality, or instruction-following
  • You are almost always too verbose: cut prompts down, then cut again
  • Lower task complexity per prompt = easier to debug, easier to swap out broken pieces
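
One way to make the three-way error classification concrete is a small triage function. The three yes/no checks and the order in which they are applied are assumptions made for illustration; the notes name the categories but do not say how to decide between them.

```python
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    APP_LEVEL = "app-level orchestration"
    INSTRUCTION_FOLLOWING = "instruction-following"
    FACTUALITY = "factuality"


def classify_failure(
    pipeline_ran_as_intended: bool,
    output_matches_spec: bool,
    output_matches_source: bool,
) -> Optional[ErrorClass]:
    """Map three checks onto the three root causes; None means no failure was detected.

    Assumed heuristic:
      - the wrong prompt or data reached the model     -> app-level orchestration
      - the output ignores the requested format/rules  -> instruction-following
      - the output is well-formed but states something
        the source material does not support           -> factuality
    """
    if not pipeline_ran_as_intended:
        return ErrorClass.APP_LEVEL
    if not output_matches_spec:
        return ErrorClass.INSTRUCTION_FOLLOWING
    if not output_matches_source:
        return ErrorClass.FACTUALITY
    return None


result = classify_failure(
    pipeline_ran_as_intended=True,
    output_matches_spec=True,
    output_matches_source=False,
)
print(result.value if result else "no failure detected")
```

Checking orchestration first reflects the first debugging step above: drop to prompt level and confirm what the model actually received before blaming the model.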

§Project Example: Transcript → Docs Tool

§Cost & Performance Trajectory

  • Expect 10–50x cost reduction and speed increase in the near term
  • Drivers: hardware-level optimization (Nvidia), memory optimizations, quantization; all incremental engineering, not research breakthroughs
  • Design systems for where costs and latency will be in 6 months, not today

§Long-Context Windows (Q&A)

  • Full self-attention scales quadratically with context length: doubling the context requires roughly 4x the memory/compute
  • Practical long-context implementations "cheat": a pre-pass selects which tokens to attend to, so you don't truly get full-context reasoning
  • Still an open research problem with no known clean solution
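
The quadratic claim is easy to check with arithmetic. The sketch below counts pairwise attention scores for one head in one layer of naive full self-attention; the function names and context sizes are illustrative only.

```python
def attention_score_entries(context_tokens: int) -> int:
    """Pairwise scores in one full self-attention matrix (one head, one layer)."""
    return context_tokens * context_tokens


def growth_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times more scores the longer context needs."""
    return attention_score_entries(new_tokens) / attention_score_entries(old_tokens)


for n in (8_000, 16_000, 32_000):
    print(f"{n:>6} tokens -> {attention_score_entries(n):,} scores")

print(growth_factor(8_000, 16_000))  # prints 4.0
print(growth_factor(8_000, 32_000))  # prints 16.0
```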

§Input Transformation (Q&A)

  • Use AI for complex transformations; use deterministic/structural methods where possible
  • Simple structural transformations (split by sentence, extract title/sections from markdown) are already valuable
  • Titles are the most compressed information in a document, and the first section is usually the intro; both are free signals
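
The structural transformations mentioned above need no model at all. A minimal sketch, assuming ATX-style markdown headings (`#`, `##`) and a deliberately naive sentence splitter; `extract_structure`, `split_sentences`, and the sample document are hypothetical.

```python
import re
from typing import Dict, List


def extract_structure(markdown: str) -> Dict[str, object]:
    """Pull out the title and the sections using only string rules.

    Simplifications (assumptions): the first `#` heading is the title, every other
    heading starts a section, text before the first section is dropped, and
    `#` lines inside code fences are not handled.
    """
    title = ""
    sections: List[Dict[str, str]] = []
    for line in markdown.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.+)$", line)
        if heading and len(heading.group(1)) == 1 and not title:
            title = heading.group(2).strip()
        elif heading:
            sections.append({"heading": heading.group(2).strip(), "body": ""})
        elif line.strip() and sections:
            body = sections[-1]["body"]
            sections[-1]["body"] = (body + " " + line.strip()).strip()
    return {"title": title, "sections": sections}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


doc = """# Port Delay Report

## Introduction
Two vessels were delayed this week. Weather was the main cause!

## Details
Berth 4 was closed for repairs.
"""

structure = extract_structure(doc)
intro = structure["sections"][0]
print(structure["title"])
print(intro["heading"], "->", split_sentences(intro["body"]))
```

The title and the first section can then be passed to the model as labeled fields instead of one undifferentiated blob of text.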

§Embeddings & Vector Search (Q&A)

  • Embedding models are small, so they have limited understanding of the underlying text; long-context embeddings in particular are unreliable
  • Once text is embedded, cosine similarity is the only usable signal, and the model internals are opaque
  • Correct pattern: reduce search space first using structured search (BM25, keyword filters, metadata, LLM-based pre-filtering), then apply embeddings on the reduced set
  • Embeddings should be the last step, not the first
  • For ranking/highlighting within retrieved results (e.g., finding the most relevant sentence within a page), embeddings are appropriate
  • Increasingly viable alternative: use an LLM directly for retrieval tasks where embeddings were previously used
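
The "reduce first, embed last" pattern can be sketched in a few lines. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real embedding model, the metadata filter is a plain equality check on a hypothetical `port` field, and the documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, docs: List[Dict[str, str]], port: str, top_k: int = 2) -> List[str]:
    # Step 1: structured reduction. An exact metadata filter shrinks the search space.
    reduced = [d for d in docs if d["port"] == port]
    # Step 2: fuzzy ranking. Similarity is computed only on the survivors.
    q = toy_embed(query)
    ranked = sorted(reduced, key=lambda d: cosine(q, toy_embed(d["text"])), reverse=True)
    return [d["id"] for d in ranked[:top_k]]


docs = [
    {"id": "a", "port": "Rotterdam", "text": "crane outage delayed container unloading"},
    {"id": "b", "port": "Rotterdam", "text": "customs paperwork approved for all cargo"},
    {"id": "c", "port": "Singapore", "text": "crane outage delayed container unloading again"},
]
print(search("crane outage delayed unloading", docs, port="Rotterdam", top_k=1))  # prints ['a']
```

Document `c` is textually close to the query but never reaches the similarity step, because the structured filter removed it first.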

Actionable Takeaways

1. Spend the majority of your time in the Chat phase trying genuinely different approaches, not incrementally tweaking one prompt
2. Nest every prompt: if a task can be broken into subtasks, it should be
3. Always use structured output (type specs, JSON schemas): it reduces hallucinations and makes debugging easier
4. Transform your input before passing to the model: extract structure, separate by sentence, label sections
5. Use multiple models for different subtasks; don't assume one model is the ceiling
6. When debugging, classify the error (app-level, factuality, instruction-following) before changing anything
7. Design for 10–50x cheaper/faster compute: build what becomes viable in 6 months, not just what's practical today
8. For RAG/vector search: reduce search space structurally first, then apply embeddings only on the remainder

Quotes Worth Keeping

If you're not going to accept a 700-line code file as good, you shouldn't accept a 100-line prompt as good.

You might try something and genuinely be the first person in the world to have thought of that particular way of solving a problem with a model.

If you present users with a text box they'll give you five words. If you ask them to press a button and talk, they'll give you 200 words.

Embeddings should always be the last step in your pipeline. You should never be searching your full search space with embeddings.