Filesystem Traversal and Text-Processing Tools for Agents

Agent harnesses benefit from treating filesystem traversal and text processing as typed, bounded operations. fd and ripgrep are strong discovery/search primitives; sed and awk are useful reference tools but should be treated as text-processing languages with mutation and execution edges, not as harmless read-only helpers.

  • Prefer purpose-built wrappers over raw shell: Agent-facing tools should expose typed inputs like pattern, path, glob, max_results, and context instead of accepting arbitrary command strings.
  • Use fd for path discovery: fd's defaults are repo-friendly. It skips hidden files and respects ignore rules, and it offers filters for type, extension, depth, size, path matching, and result limits.
  • Use rg for content search: ripgrep respects ignore rules, skips hidden and binary files by default, supports globs and file types, and can emit JSON Lines for machine parsing.
  • Keep sed output-only when it is used by a harness: sed is valuable for line-range printing and substitution previews, but the -i flag, the w command, and GNU sed's e command can mutate files or execute commands. Prefer Rust-native line slicing for common reads.
  • Keep awk templated or human-approved: awk is excellent for field extraction and summaries, but programs can redirect output, open pipes, and call system(). Do not run model-authored arbitrary awk by default.
  • Even read-only tooling needs caps: Traversal depth, max results, max bytes, stdout/stderr caps, timeout, and project-root containment should be enforced by the tool boundary, not left to model behavior.
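
As a rough illustration of the wrapper idea in the list above, the sketch below maps typed fields to a fixed rg argument vector. The struct name, function names, and ceiling values are assumptions made for this example, not part of any existing harness; the flags used (--json, --context, --glob, -e) are standard ripgrep options.

```rust
/// Typed input for a hypothetical `search` tool. Field names follow the
/// list above and are illustrative, not a fixed schema.
#[derive(Debug, Clone)]
pub struct SearchInput {
    pub pattern: String,
    pub path: String,
    pub glob: Option<String>,
    pub max_results: usize,
    pub context: usize,
}

/// Hard ceilings enforced by the tool boundary (values are assumptions).
const MAX_RESULTS_CEILING: usize = 500;
const MAX_CONTEXT_CEILING: usize = 10;

/// Build a fixed `rg` argument vector. The model never supplies a command
/// string; it only fills in typed fields, which are clamped here.
pub fn build_rg_args(input: &SearchInput) -> Result<Vec<String>, String> {
    if input.pattern.is_empty() {
        return Err("pattern must not be empty".to_string());
    }
    let context = input.context.min(MAX_CONTEXT_CEILING);
    let mut args = vec![
        "--json".to_string(),
        "--context".to_string(),
        context.to_string(),
    ];
    if let Some(glob) = &input.glob {
        args.push("--glob".to_string());
        args.push(glob.clone());
    }
    // `-e` keeps a pattern that starts with `-` from being read as a flag.
    args.push("-e".to_string());
    args.push(input.pattern.clone());
    // `--` ends option parsing so the path cannot be read as a flag either.
    args.push("--".to_string());
    args.push(input.path.clone());
    Ok(args)
}

/// Total result cap that the wrapper applies while reading rg's output.
pub fn effective_max_results(input: &SearchInput) -> usize {
    input.max_results.clamp(1, MAX_RESULTS_CEILING)
}
```

The total result cap is applied by the wrapper while it reads output rather than passed to rg, because rg's own --max-count limit applies per file.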
| Claim | Support | Caveat / Confidence |
| --- | --- | --- |
| rg is the right default content-search primitive. | ripgrep’s guide documents recursive search, automatic filtering, .gitignore/.ignore/.rgignore support, hidden-file skipping, binary-file skipping, globs, file types, and --files. Local rg --help confirms JSON Lines output via --json. | High. The wrapper must treat exit code 1 as “no matches,” not as a tool failure. |
| fd is the right default path-discovery primitive. | fd’s README describes it as a filesystem entry finder with regex/glob patterns, parallel traversal, hidden/ignored defaults, type/extension filters, and command execution options. Local fd --help confirms --max-depth, --max-results, --type, --extension, --print0, and --one-file-system. | High. Do not expose -x/-X execution to the model. |
| sed should not be a write tool. | GNU sed and BSD/macOS sed support in-place editing, and sed scripts can write files. The macOS man page warns about the risk of corruption or partial content when editing in place without backups. | High. Portability also differs between GNU and BSD sed. |
| awk should be constrained because it is a programming language. | The gawk manual describes pattern-action programs, fields, built-ins, redirection, pipes, and system(). Local man awk confirms fields, print/printf, redirection, pipes, and system(). | High. Useful for summaries, risky as arbitrary model code. |
| Tool output should be structured for the UI/model. | Typed agent events and transcript entries need stable fields; rg --json gives structured search events, and fd output can be parsed line-by-line or NUL-delimited. | High. fd does not provide JSON, so use NUL or newline plus path normalization. |
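
Two of the caveats above can be sketched as small helpers: one that treats rg's exit status 1 as an ordinary "no matches" result, and one that splits fd --print0 output on NUL bytes. The type and function names are hypothetical; the exit-status meanings (0 for a match, 1 for no match, another status on error) follow ripgrep's documentation.

```rust
use std::path::PathBuf;

/// Outcome of an `rg` invocation as the harness might report it.
#[derive(Debug, PartialEq, Eq)]
pub enum SearchOutcome {
    Matches,
    NoMatches,
    Failed(i32),
}

/// Map an exit code (as returned by `ExitStatus::code()`) to an outcome.
/// Only a status other than 0 or 1 counts as a tool failure.
pub fn classify_rg_exit(code: Option<i32>) -> SearchOutcome {
    match code {
        Some(0) => SearchOutcome::Matches,
        Some(1) => SearchOutcome::NoMatches,
        Some(other) => SearchOutcome::Failed(other),
        // No exit code usually means the process was killed by a signal.
        None => SearchOutcome::Failed(-1),
    }
}

/// Split `fd --print0` output into paths, keeping at most `max_results`.
/// Non-UTF-8 bytes are replaced lossily here, which is a simplification.
pub fn parse_nul_paths(stdout: &[u8], max_results: usize) -> Vec<PathBuf> {
    stdout
        .split(|byte| *byte == 0)
        .filter(|chunk| !chunk.is_empty())
        .take(max_results)
        .map(|chunk| PathBuf::from(String::from_utf8_lossy(chunk).into_owned()))
        .collect()
}
```

NUL-delimited output avoids ambiguity when a file name contains a newline, which line-by-line parsing cannot distinguish from a record boundary.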
| Term | Meaning |
| --- | --- |
| fd | Fast filesystem entry finder; best used for path discovery and filtered file listing. |
| rg / ripgrep | Fast recursive content search; best used for grep-like search and candidate file discovery. |
| sed | Stream editor; useful for line-oriented output transforms, but capable of in-place edits and file writes. |
| awk | Pattern-action language for scanning records and fields; useful for summaries, but capable of command execution and output redirection. |
| Tool wrapper | Rust function that maps typed input to a fixed command invocation and structured output. |
| Project root containment | Rejecting paths that escape the selected workspace root after canonicalization. |
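
Project root containment, as defined above, could look roughly like the following. This is a minimal sketch using only the standard library; the function name is an assumption, and because std::fs canonicalization requires the path to exist, a missing path is rejected as well.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Resolve `candidate` against `root` and return the canonical path only if
/// it stays inside the canonical root. Symlinks and `..` components are
/// resolved before the prefix check.
pub fn contain_in_root(root: &Path, candidate: &Path) -> io::Result<PathBuf> {
    let root = root.canonicalize()?;
    // `join` replaces the base when `candidate` is absolute, so absolute
    // inputs go through the same prefix test below.
    let resolved = root.join(candidate).canonicalize()?;
    if resolved.starts_with(&root) {
        Ok(resolved)
    } else {
        Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            format!("{} escapes the workspace root", candidate.display()),
        ))
    }
}
```

Path::starts_with compares whole components, so a sibling directory such as workspace-old is not mistaken for a child of workspace.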

sed use cases:

  • Print a line range for quick context.
  • Preview simple substitutions on stdout.
  • Show transformed text without touching the file.

sed recommendations:

  • Implement read_file_range in Rust instead of invoking sed.
  • If sed is used, only permit -n plus generated address/print scripts.
  • Never pass model-authored sed scripts directly.
  • Never let a text-inspection helper become an implicit edit path.
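
A Rust-native read_file_range, as recommended above, might be sketched like this. It covers roughly the same ground as a sed -n 'START,ENDp' call without spawning a process. The signature and the line ceiling are assumptions for illustration; the reader is generic so the caller can pass a BufReader over a file that has already passed containment checks.

```rust
use std::io::{self, BufRead};

/// Upper bound on lines returned per call (an assumed default).
const MAX_RANGE_LINES: usize = 400;

/// Return lines `start..=end` (1-indexed, inclusive) paired with their line
/// numbers. The range is silently clamped to `MAX_RANGE_LINES`.
pub fn read_file_range<R: BufRead>(
    reader: R,
    start: usize,
    end: usize,
) -> io::Result<Vec<(usize, String)>> {
    if start == 0 || end < start {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "expected 1 <= start <= end",
        ));
    }
    let count = (end - start + 1).min(MAX_RANGE_LINES);
    reader
        .lines()
        .enumerate()
        .skip(start - 1)
        .take(count)
        // Convert the 0-based index back to a 1-based line number.
        .map(|(index, line)| line.map(|text| (index + 1, text)))
        .collect()
}
```

Because the function only reads, there is no script surface for -i, w, or e to appear in, which is the main reason to prefer it over a sed invocation.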

Portability note:

  • Local sed is BSD/macOS style and does not support --help.
  • GNU sed and BSD sed differ in extensions and -i behavior. Keep any required sed usage POSIX-ish or avoid it.

awk use cases:

  • Count rows/fields in logs or delimited data.
  • Extract a column from a bounded text stream.
  • Build simple summaries from command output.

awk recommendations:

  • Prefer canned templates over arbitrary awk programs.
  • Feed bounded input through stdin, not project-wide files directly.
  • If raw awk ever becomes a human-approved advanced mode, strip or reject programs that contain redirection, pipes, or system() calls.
  • Default to Rust-native parsing when the format is known.
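
The canned-template recommendation above could be sketched as an enum whose variants each expand to a fixed awk program, so the model chooses a template and parameters but never writes awk source. The variant names, field ceiling, and separator allow-list are assumptions for this example.

```rust
/// Canned summaries the harness is willing to run (illustrative set).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AwkTemplate {
    /// Print one 1-indexed field from every record.
    PrintField { field: u32 },
    /// Print the number of records read.
    CountRecords,
}

/// Largest field index accepted (an assumed limit).
const MAX_FIELD: u32 = 256;

/// Build the argument vector for `awk`. Input is expected on stdin, so no
/// file operands are ever appended.
pub fn build_awk_args(
    template: &AwkTemplate,
    separator: Option<char>,
) -> Result<Vec<String>, String> {
    let program = match template {
        AwkTemplate::PrintField { field } if (1..=MAX_FIELD).contains(field) => {
            format!("{{ print ${} }}", field)
        }
        AwkTemplate::PrintField { field } => {
            return Err(format!("field {} is out of range", field));
        }
        AwkTemplate::CountRecords => "END { print NR }".to_string(),
    };
    let mut args = Vec::new();
    if let Some(sep) = separator {
        // Accept only a few literal single-character separators.
        if !matches!(sep, ',' | ':' | ';' | '\t') {
            return Err(format!("separator {:?} is not allowed", sep));
        }
        args.push("-F".to_string());
        args.push(sep.to_string());
    }
    args.push(program);
    Ok(args)
}
```

Since every program string is generated from a closed set of variants, redirection, pipes, and system() cannot appear in it, and no filtering of model-authored text is needed.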

Questions:

  • When should a harness parse rg --json instead of simpler line output?
    • Parse rg --json once matches need stable paths, line numbers, truncation metadata, or machine-readable UI rendering.
  • When should path discovery prefer fd over rg --files?
    • Prefer fd when the query is about filesystem entries and filters, and use rg --files when search tooling already owns ignore semantics.
  • Do we want an advanced human-approved “run awk” mode after v1, or should summaries stay template-only?
    • Keep summaries template-only unless users repeatedly need arbitrary tabular transformations that cannot be expressed safely.
  • What is the default output cap per tool call: lines, bytes, or both?
    • Cap both bytes and lines so huge lines and huge result counts are bounded independently.
  • Related ideas: Pi’s explicit small tool set; Herdr’s semantic tool/process states; deterministic snapshots for tool-result rendering; narrow tool boundaries before broad process execution.
  • Related sources: pi, herdr, release.
  • Contradictions or tensions: developer muscle memory favors raw shell commands, but an agent-facing harness needs typed, bounded, auditable operations.
  • Conceptual uses: reliable repo search, file discovery, context gathering, output capping, path containment, and safe transcript rendering.
  • When should broad text-processing or shell-like power be exposed to the model?
    • Recommendation: Prefer typed, capped search and file tools first, and expose broad text-processing or shell-like power only behind explicit product need.
  • Whether to vendor Rust crates for search/traversal later (ignore, grep, walkdir) instead of spawning fd/rg.
    • Recommendation: Keep spawning mature CLI tools until portability, startup cost, or structured-output needs make Rust-native crates clearly better.
  • How to handle Windows environments where sed/awk may be absent.
    • Recommendation: Avoid depending on sed or awk for core behavior and implement required inspection paths in Rust.
  • Whether user config should allow hidden/ignored files globally or only per tool call.
    • Recommendation: Make hidden and ignored traversal explicit per call, with any global default staying conservative.
  • How much stderr should be shown in the transcript versus hidden in diagnostics.
    • Recommendation: Show concise stderr summaries in the transcript and keep full diagnostic detail behind capped logs or verbose output.
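
The answer above about capping both bytes and lines can be illustrated with a small helper that applies the two limits independently and reports whether anything was cut. The names are hypothetical, and the same helper could plausibly be applied to stdout and stderr with different limits.

```rust
/// Output after capping, plus a flag the transcript can surface.
#[derive(Debug, PartialEq, Eq)]
pub struct CappedOutput {
    pub text: String,
    pub truncated: bool,
}

/// Keep at most `max_lines` lines and at most `max_bytes` bytes.
/// A single huge line is cut at the last UTF-8 boundary that fits.
pub fn cap_output(raw: &str, max_lines: usize, max_bytes: usize) -> CappedOutput {
    let mut text = String::new();
    let mut truncated = false;
    for (index, line) in raw.split_inclusive('\n').enumerate() {
        if index >= max_lines {
            truncated = true;
            break;
        }
        // `text.len()` never exceeds `max_bytes`, so this cannot underflow.
        let remaining = max_bytes - text.len();
        if line.len() > remaining {
            let mut cut = remaining;
            while !line.is_char_boundary(cut) {
                cut -= 1;
            }
            text.push_str(&line[..cut]);
            truncated = true;
            break;
        }
        text.push_str(line);
    }
    CappedOutput { text, truncated }
}
```

Keeping the truncated flag separate from the text lets the UI render a notice without the model mistaking it for tool output.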

“ripgrep will never modify your files.”