LLMs: A Hacker's Guide
A practitioner's guide to working with LLMs, drawn from real shipping-industry AI deployments. Covers the iterative workflow for building with LLMs, input/output best practices, debugging frameworks, and honest assessments of embeddings and long-context models.
Key Concepts
Notes
§The Iterative Loop (CPLN)
- Chat (~50%+ of time): Explore freely, try radically different approaches — don't commit to a single prompt early
- Most teams stop once a prompt works ~40% of the time and go to production too soon
- Unlike code, prompts should be rewritten many times before settling
- Playground (~20%): Use any provider's playground; key feature needed is retroactive editing of prompts and conversation history ("surfing the latent space")
- Loop: Add more data and test cases; stress-test your hypothesis
- Nest: Break prompts into smaller and smaller subtasks
- If you wouldn't accept a 700-line code file, don't accept a 100-line prompt
- Every prompt can be decomposed further
- Subtasks then re-enter the same loop when new problems or customers arrive
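The Loop step above can be sketched as a small test harness: run one prompt over a growing set of cases and track the pass rate. This is a minimal sketch, not the method from the talk. `TEST_CASES`, `fake_model`, and `pass_rate` are hypothetical names, and `fake_model` is a stand-in for a real LLM call so the example runs offline.

```python
from typing import Callable

# Hypothetical test cases: (input text, substring the output must contain).
TEST_CASES = [
    ("Invoice total: 40 USD", "40"),
    ("Invoice total: 12 EUR", "12"),
    ("No total listed", "unknown"),
]

def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call: returns the first all-digit token, else 'unknown'."""
    for token in prompt.split():
        if token.isdigit():
            return token
    return "unknown"

def pass_rate(prompt_template: str, model: Callable[[str], str]) -> float:
    """Run every test case through the prompt and return the fraction that pass."""
    passed = 0
    for text, expected in TEST_CASES:
        output = model(prompt_template.format(text=text))
        if expected in output:
            passed += 1
    return passed / len(TEST_CASES)

rate = pass_rate("Extract the total from: {text}", fake_model)
print(f"pass rate: {rate:.0%}")
```

Swapping `fake_model` for a real client and adding cases whenever a new failure appears gives a number to compare across prompt rewrites.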
§Input Modality Best Practices
- Speech: Users give ~200 words when asked to speak vs. ~5 words in a text box — use it to gather richer context
- Vision: Captures relationships that text cannot; useful as expensive-but-dense OCR; visual diagrams are often more token-efficient than their text equivalents
- Code/structured input: Almost always prefer structured input and structured output
- Tools: TypeScript + Zod for type specs; SQL to express search logic (even if never executed)
- Structured output reduces hallucinations by constraining the token probability space
- Lean on the model's training: Express problems as supersets of known languages (Python, TypeScript, English) rather than inventing custom DSLs
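One way to act on the structured-output advice is to declare the expected shape as a type and reject any model output that does not match it. The notes suggest TypeScript + Zod; the sketch below shows the same idea using only the Python standard library. The `Ticket` spec, `ALLOWED_CATEGORIES`, and `parse_ticket` are hypothetical. This is validation after generation, not constrained decoding, so it catches malformed output rather than preventing it.

```python
import json
from dataclasses import dataclass

@dataclass
class Ticket:
    """Hypothetical output spec: the fields the model is asked to return."""
    category: str
    priority: int

ALLOWED_CATEGORIES = {"billing", "bug", "other"}

def parse_ticket(raw: str) -> Ticket:
    """Parse model output as JSON and reject anything outside the spec."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict) or set(data) != {"category", "priority"}:
        raise ValueError(f"unexpected shape: {data!r}")
    category, priority = data["category"], data["priority"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int) or not 1 <= priority <= 3:
        raise ValueError(f"priority must be an integer from 1 to 3: {priority!r}")
    return Ticket(category=category, priority=priority)

ticket = parse_ticket('{"category": "bug", "priority": 2}')
print(ticket)
```

A rejected output can be logged with the field that failed, which narrows down where the prompt needs work.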
§General Dos and Don'ts
- Do: Use all available modalities
- Do: Try multiple models; they have diverged significantly and have different "personalities"
- Do: Keep input/output size ratios roughly proportional; 5 words in → 20 paragraphs out = poor output
- Do: Add structure at both input and output stages
- Don't: Add heavy abstractions (frameworks, libraries) between yourself and the model early on; you learn less and get trapped
- Don't: Stick to one model provider
- Don't: Expect massive output from minimal input
§What LLMs Enable (4 Capability Classes)
§Debugging Framework
- Drop to prompt level; remove abstractions and work back up
- Try a smarter model — if that fixes it, the problem is prompt/input
- Try a dumber model — reveals where reasoning is actually needed
- Transform the input (verbosity, structure, chunking)
- Add more structure to the output to expose where failures occur
- Find what separates failing cases from working cases — that difference points to a prompt fix
- Add more validation
- If the system is doing more than one of the four capability classes, separate them into distinct prompts/models
- Classify errors into: app-level orchestration, factuality, or instruction-following
- You are almost always too verbose — cut prompts down, then cut again
- Lower task complexity per prompt = easier to debug, easier to swap out broken pieces
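The three error classes above can be tracked with a simple tally, so the dominant failure type is known before any prompt is changed. This is a minimal sketch: the case ids and labels in `FAILURES` are hypothetical, and the labelling itself is assumed to be done by hand.

```python
from collections import Counter
from enum import Enum

class ErrorClass(Enum):
    ORCHESTRATION = "app-level orchestration"
    FACTUALITY = "factuality"
    INSTRUCTION_FOLLOWING = "instruction-following"

# Hypothetical, hand-labelled failures: (case id, error class).
FAILURES = [
    ("case-01", ErrorClass.INSTRUCTION_FOLLOWING),
    ("case-02", ErrorClass.FACTUALITY),
    ("case-03", ErrorClass.INSTRUCTION_FOLLOWING),
    ("case-04", ErrorClass.ORCHESTRATION),
    ("case-05", ErrorClass.INSTRUCTION_FOLLOWING),
]

def tally(failures):
    """Count failures per error class."""
    return Counter(label for _, label in failures)

counts = tally(FAILURES)
dominant, dominant_count = counts.most_common(1)[0]
print(f"{dominant.value}: {dominant_count} of {len(FAILURES)} failures")
```

With this sample data the dominant class is instruction-following, which would point toward prompt wording or output structure rather than orchestration code.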
§Project Example: Transcript → Docs Tool
§Cost & Performance Trajectory
- Expect 10–50x cost reduction and speed increase in the near term
- Drivers: hardware-level optimization (Nvidia), memory optimizations, quantization — all incremental engineering, not research breakthroughs
- Design systems for where costs and latency will be in 6 months, not today
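The "design for 6 months out" point is simple arithmetic, sketched here under made-up numbers. The per-document cost and budget are hypothetical; only the 10x and 50x factors come from the notes.

```python
def projected_cost(cost_today: float, reduction_factor: float) -> float:
    """Cost per unit of work after an assumed reduction factor."""
    return cost_today / reduction_factor

def viable(cost_today: float, budget: float, reduction_factor: float) -> bool:
    """True if the projected cost fits within the budget."""
    return projected_cost(cost_today, reduction_factor) <= budget

# Hypothetical pipeline: 2.00 per document today, budget of 0.10 per document.
COST_TODAY = 2.00
BUDGET = 0.10

for factor in (1, 10, 50):
    cost = projected_cost(COST_TODAY, factor)
    print(f"{factor:>2}x cheaper: {cost:.2f} per document, viable={viable(COST_TODAY, BUDGET, factor)}")
```

Under these assumed numbers the pipeline is not viable today or at 10x, and becomes viable at 50x, which is the kind of design the notes suggest building toward.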
§Long-Context Windows (Q&A)
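- Standard self-attention is quadratic in context length: doubling the context takes roughly 4x the attention compute, and 4x the memory when the full score matrix is stored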
- Practical long-context implementations "cheat": a pre-pass selects which tokens to attend to, so you don't truly get full-context reasoning
- Still an open research problem with no known clean solution
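The quadratic claim can be checked with a short calculation. The sketch assumes naive full attention, where every token attends to every other token and the score matrix for one head in one layer has n × n entries. `attention_score_entries` is a hypothetical helper, and real implementations may avoid storing the whole matrix even though the compute stays quadratic.

```python
def attention_score_entries(context_length: int) -> int:
    """Entries in the n x n attention score matrix for one head in one layer."""
    return context_length * context_length

for n in (4_096, 8_192, 16_384):
    print(f"context {n:>6}: {attention_score_entries(n):>12,} score entries")

# Doubling the context length quadruples the number of entries.
ratio = attention_score_entries(8_192) / attention_score_entries(4_096)
print(f"ratio after doubling: {ratio}")
```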
§Input Transformation (Q&A)
- Use AI for complex transformations; use deterministic/structural methods where possible
- Simple structural transformations (split by sentence, extract title/sections from markdown) are already valuable
- Titles = highest compressed information in a document; first section = intro — these are free signals
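The structural transformations above need no model at all. Below is a minimal sketch that pulls the title, section headings, and intro out of a markdown string and splits text into sentences. `extract_structure` and `split_sentences` are hypothetical helpers, the sample document is made up, and the sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g.").

```python
import re
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def extract_structure(markdown: str) -> dict:
    """Return the first H1 as title, all other headings, and the text before the first section."""
    title = None
    sections: List[str] = []
    intro_lines: List[str] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, text = len(match.group(1)), match.group(2)
            if level == 1 and title is None:
                title = text
            else:
                sections.append(text)
        elif title is not None and not sections and line.strip():
            # Text after the title and before the first section heading is the intro.
            intro_lines.append(line.strip())
    return {"title": title, "sections": sections, "intro": " ".join(intro_lines)}

def split_sentences(text: str) -> List[str]:
    """Naive split on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

DOC = """# Deploy Guide

This guide covers deployment. It assumes Docker.

## Setup

Install the CLI.
"""

structure = extract_structure(DOC)
print(structure["title"], structure["sections"])
print(split_sentences(structure["intro"]))
```

The title and intro can then be passed to the model as labelled fields instead of leaving it to find them in raw text.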
§Embeddings & Vector Search (Q&A)
- Embedding models are small — limited understanding of underlying text; long-context embeddings in particular are unreliable
- Cosine similarity is the only usable signal once embedded, and the model internals are opaque
- Correct pattern: reduce search space first using structured search (BM25, keyword filters, metadata, LLM-based pre-filtering), then apply embeddings on the reduced set
- Embeddings should be the last step, not the first
- For ranking/highlighting within retrieved results (e.g., finding the most relevant sentence within a page), embeddings are appropriate
- Increasingly viable alternative: use an LLM directly for retrieval tasks where embeddings were previously used
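The "embeddings last" pattern above can be sketched as a two-step search: narrow the candidates with metadata and keyword filters, then rank only the survivors by cosine similarity. Everything here is illustrative. The corpus, the three-dimensional vectors, and the `search` function are hypothetical, the exact-word match is a simple stand-in for BM25, and real embeddings would come from an embedding model.

```python
import math

# Hypothetical corpus: text, metadata, and a precomputed (toy) embedding per document.
DOCS = [
    {"id": "a", "text": "reset your password from the login page", "team": "support", "vec": [0.9, 0.1, 0.0]},
    {"id": "b", "text": "password hashing uses a slow key derivation function", "team": "security", "vec": [0.7, 0.7, 0.0]},
    {"id": "c", "text": "quarterly revenue report", "team": "finance", "vec": [0.0, 0.1, 0.9]},
    {"id": "d", "text": "rotate api keys every ninety days", "team": "security", "vec": [0.1, 0.9, 0.2]},
]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_terms, query_vec, team=None, top_k=2):
    """Filter by metadata and keywords first, then rank the remainder by embedding similarity."""
    # Step 1: structured narrowing (metadata filter, then exact keyword match).
    candidates = [d for d in DOCS if team is None or d["team"] == team]
    candidates = [d for d in candidates if any(t in d["text"].split() for t in query_terms)]
    # Step 2: embeddings only rank the reduced set.
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:top_k]]

print(search({"password"}, [1.0, 0.0, 0.0]))
print(search({"password"}, [1.0, 0.0, 0.0], team="security"))
```

Documents "c" and "d" never reach the cosine step for the "password" query, which is the point: the embedding comparison runs over two candidates instead of the full corpus.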
Actionable Takeaways
- Spend the majority of your time in the Chat phase trying genuinely different approaches — not incrementally tweaking one prompt
- Nest every prompt: if a task can be broken into subtasks, it should be
- Always use structured output (type specs, JSON schemas) — it reduces hallucinations and makes debugging easier
- Transform your input before passing to the model: extract structure, separate by sentence, label sections
- Use multiple models for different subtasks; don't assume one model is the ceiling
- When debugging, classify the error (app-level, factuality, instruction-following) before changing anything
- Design for 10–50x cheaper/faster compute — build what becomes viable in 6 months, not just what's practical today
- For RAG/vector search: reduce search space structurally first, then apply embeddings only on the remainder
Quotes Worth Keeping
If you're not going to accept a 700-line code file as good, you shouldn't accept a 100-line prompt as good.
You might try something and genuinely be the first person in the world to have thought of that particular way of solving a problem with a model.
If you present users with a text box they'll give you five words. If you ask them to press a button and talk, they'll give you 200 words.
Embeddings should always be the last step in your pipeline. You should never be searching your full search space with embeddings.