Changelog
All notable changes to ChunkHound will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
5.1.0 - 2026-05-20
Breaking Changes
- MCP search response format changed to markdown — The search tool now returns lean markdown strings instead of JSON objects, with syntax-highlighted code fences, similarity percentages (semantic search), and a pagination footer. MCP clients that parse raw search output as JSON must migrate to the new format.
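Clients that must talk to both pre-5.1.0 and 5.1.0+ servers during the migration could accept either shape. The following is a minimal sketch, not ChunkHound code: `read_search_output` and the `format`/`payload` keys are hypothetical names, and the only facts it relies on are that older output was a JSON object and newer output is a markdown string.

```python
import json
from typing import Any


def read_search_output(raw: str) -> dict[str, Any]:
    """Classify raw search tool output as legacy JSON or markdown.

    Hypothetical helper: the returned keys ("format", "payload") are
    illustrative and are not defined by ChunkHound.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON: treat it as the markdown response and pass it through.
        return {"format": "markdown", "payload": raw}
    if isinstance(parsed, dict):
        # Legacy servers returned a JSON object.
        return {"format": "json", "payload": parsed}
    # Valid JSON but not an object (e.g. a bare number): keep the raw text.
    return {"format": "markdown", "payload": raw}
```

A caller can branch on `format` and drop the JSON path once every server it talks to runs 5.1.0 or later.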
Added
- --index-unknown-files flag — Files with unrecognized extensions are now indexable as plain text (binary files are still skipped). Enabled via --index-unknown-files CLI flag, indexing.index_unknown_files config key, or CHUNKHOUND_INDEXING__INDEX_UNKNOWN_FILES env var.
- Proto, GraphQL, XML, config, and Dockerfile support — .proto, .graphql, .gql, .xml, .ini, .properties, .conf, .cfg, and extensionless Dockerfile/Jenkinsfile files are now indexed by default. .env files are explicitly excluded to prevent secret leakage.
- chunkhound.ai onboarding — Interactive CLI setup is replaced with guided onboarding at chunkhound.ai; local backend is now configured explicitly rather than through prompts.
Fixed
- MCP startup HNSW crash — MCP server no longer fails with a CreateDeltaIndex assertion on startup against databases missing the unique (chunk_id, provider, model) index from v5.0.0. HNSW recreation now runs outside the transaction (issue #280).
- WAL validation HNSW crash — WAL pre-flight validation now uses in-memory + ATTACH, preventing a C++ abort when the WAL contained HNSW operations from a prior session (issue #273).
- --db nested directory bug — Passing an explicit file path (e.g. --db /path/to/chunks.db) no longer creates a chunks.db/chunks.db nested directory; known DB extensions (.db, .duckdb) are now correctly identified as file paths (issue #215).
- Parser install hints — C# error messages now show the correct PyPI package tree-sitter-c-sharp (was tree-sitter-csharp); Makefile shows tree-sitter-make; SCSS points to tree-sitter-language-pack (issue #267).
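The --db fix above comes down to telling a file path from a directory by its extension. The sketch below shows one way to express that rule; `resolve_db_file`, the `KNOWN_DB_EXTENSIONS` constant, and the `chunks.db` default file name are assumptions for illustration, not ChunkHound's actual implementation.

```python
from pathlib import Path

# Extensions the changelog entry names as known database file extensions.
KNOWN_DB_EXTENSIONS = {".db", ".duckdb"}


def resolve_db_file(db_arg: str, default_name: str = "chunks.db") -> Path:
    """Return the database file path for a --db argument.

    Hypothetical helper: a path ending in a known DB extension is used
    as-is; anything else is treated as a directory that holds the file.
    """
    path = Path(db_arg)
    if path.suffix.lower() in KNOWN_DB_EXTENSIONS:
        return path
    return path / default_name
```

With this rule, `/path/to/chunks.db` stays a file path instead of becoming `/path/to/chunks.db/chunks.db`.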
5.0.0 - 2026-05-05
Breaking Changes
- Config precedence reordered — Local .chunkhound.json now takes precedence over environment variables. If you relied on env vars overriding project-level settings, use CLI arguments instead.
- --config now overrides local .chunkhound.json — Previously, a project-local .chunkhound.json took precedence over an explicit --config path. Now --config wins. If you relied on local .chunkhound.json shadowing a shared config file, move that override into CLI arguments.
- Missing --config/CHUNKHOUND_CONFIG_FILE path now raises — A non-existent config file path used to be silently ignored; it now raises ValueError with an actionable message.
- DEFAULT_LLM_TIMEOUT doubled — Default LLM request timeout increased from 60 s to 120 s for all providers (was already 120 s for Gemini; now uniform).
- HTTP MCP server removed — ChunkHound now supports stdio transport only for MCP connections
  - chunkhound mcp http command removed
  - --http, --port, --host CLI flags removed
  - FastMCP dependency removed
  - Migration: Use chunkhound mcp (stdio) instead. All major MCP clients (Claude Code, Claude Desktop, VS Code) support stdio transport.
  - Rationale: Simplified codebase, reduced dependencies, focused on primary use case (stdio is the standard for MCP)
- Unsupported file types no longer indexed as plain text — Files with unrecognized extensions are now skipped instead of being force-parsed as plain text. Files with known text extensions (.txt, .log, .cfg, .conf, .ini) are unaffected.
- Claude Code CLI default model changed from claude-sonnet-4-5-20250929 to claude-haiku-4-5-20251001. Users who relied on the default model will see different cost/quality characteristics. Set llm.model, llm.utility_model, or llm.synthesis_model explicitly to retain previous behavior.
- Anthropic provider upgraded to Claude Opus 4.7/4.6 and Sonnet 4.6
  - anthropic dependency minimum bumped to >=0.96.0,<1.0.0
  - Default Anthropic utility and synthesis models changed to ChunkHound’s claude-haiku sentinel. This is intentional: current Claude Haiku is capable enough for synthesis, is Anthropic’s cheapest available Claude model, and Anthropic does not currently offer a true low-cost utility tier. Users who prefer maximum synthesis quality can override synthesis_model.
  - Default Claude Code CLI model changed to the same claude-haiku sentinel. ChunkHound still honors its Claude env overrides first; otherwise it preserves the sentinel so Claude Code can resolve the latest matching alias itself.
  - Removed module symbols BETA_EFFORT and EFFORT_SUPPORTED_MODELS. Callers should use the supports_effort(model) / supports_effort_level(model, level) predicates instead.
  - thinking_enabled=True with thinking_mode="auto" resolves to adaptive only for adaptive-capable models such as Opus 4.6/4.7, Sonnet 4.6, and Mythos. The pinned Haiku fallback remains manual-mode thinking.
  - anthropic_prompt_caching defaults to false because ChunkHound requests rarely reuse prompt prefixes enough to offset Anthropic cache-write costs. To opt in, set CHUNKHOUND_LLM_ANTHROPIC_PROMPT_CACHING=true or pass --llm-anthropic-prompt-caching.
  - Invalid thinking_mode values and sub-20000 task_budget_tokens now raise ValueError instead of warning-and-coercing.
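Read together, the precedence entries above suggest the order CLI arguments, then the --config file, then local .chunkhound.json, then environment variables. The sketch below models that order with a `ChainMap`. It is an illustration under that inferred ordering: `resolve_settings` and the flat key names are hypothetical, and real settings are nested rather than flat.

```python
from collections import ChainMap
from typing import Any


def resolve_settings(
    cli_args: dict[str, Any],
    config_file: dict[str, Any],
    local_json: dict[str, Any],
    env_vars: dict[str, Any],
) -> dict[str, Any]:
    """Merge setting sources; earlier arguments win over later ones."""
    # ChainMap looks keys up left to right, so the first mapping that
    # defines a key supplies its value.
    return dict(ChainMap(cli_args, config_file, local_json, env_vars))
```

For example, a `timeout` set in both the --config file and the local file resolves to the --config value, while a key set only in the environment still comes through.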
Added
- Elixir language support — Full Elixir parsing (32nd language) via tree-sitter-elixir: modules, functions, macros, protocols, structs, specs, and import/alias/require statements.
- TwinCAT/Structured Text parser — IEC 61131-3 Structured Text (.TcPOU) files for PLC development are now fully searchable.
- HTML, CSS, SCSS, and Jinja parsers — Full tree-sitter parsing for web languages: HTML (.html, .htm, .xhtml), CSS (.css), SCSS/Sass (.scss, .sass), and Jinja templates (.jinja, .j2, .njk, .erb, .ejs). SCSS preprocessing handles #{...} interpolations for correct AST byte offsets. Import resolution is supported for all four languages.
- Grok (xAI) LLM provider — xAI Grok models are now supported for deep code research via the code_research tool.
- Matryoshka embeddings — OpenAI and VoyageAI providers now support Matryoshka truncation for flexible vector dimensions; default OpenAI model upgraded to text-embedding-3-large.
- openai_compatible embedding provider — Connect any OpenAI-compatible embedding endpoint with configurable SSL verification, auth, and dimension support.
- Azure OpenAI embeddings — Native Azure OpenAI embedding support with azure_endpoint, api_version, and azure_deployment configuration options.
- VoyageAI ranking support — VoyageAI provider now supports reranking for improved search result quality.
- Claude Opus 4.7 / Opus 4.6 / Sonnet 4.6 support — Adaptive thinking mode (auto / off / manual / adaptive selector), expanded effort levels (low, medium, high, xhigh (Opus 4.7 only), max (4.6+)), opt-in prompt caching with configurable TTL (5m/1h), and the task-budgets beta (Opus 4.7 only, advisory cap for agentic loops, min 20000 tokens).
- New LLMConfig fields — anthropic_thinking_mode, anthropic_thinking_display, anthropic_prompt_caching, anthropic_cache_ttl, anthropic_task_budget_tokens (and matching CHUNKHOUND_LLM_ANTHROPIC_* env vars and --llm-anthropic-* CLI flags). The pre-existing anthropic_thinking_enabled, anthropic_thinking_budget_tokens, anthropic_interleaved_thinking, anthropic_effort, anthropic_context_management_enabled, and anthropic_clear_* fields are now also readable from env and CLI.
- Embedded SQL detection — SQL embedded in string literals is detected and indexed by default across Python, Java, JavaScript, TypeScript, C#, Go, Rust, and PHP. Disable with --no-detect-embedded-sql or CHUNKHOUND_INDEXING__DETECT_EMBEDDED_SQL=false.
- OpenAI Responses API — Deep code research now supports reasoning models (gpt-5.1, gpt-5.1-codex, o-series, gpt-5-pro) via the Responses API, with automatic routing based on model compatibility across 30+ models.
- Reasoning effort control — Configurable LLM reasoning effort (none/minimal/low/medium/high) for deep research via CHUNKHOUND_LLM_CODEX_REASONING_EFFORT with per-role overrides.
- Structured JSON output — Responses API maintains schema validation consistency across both Chat Completions and Responses endpoints.
- Multi-client MCP daemon — Multiple MCP clients can share a single DuckDB connection via a background daemon, eliminating lock conflicts in multi-session workflows.
- --perf-diagnostics mode — chunkhound index --perf-diagnostics collects per-batch timing metrics and detects performance regressions via linear regression and z-score analysis, outputting a JSON diagnostics file.
- --path-filter for research — chunkhound research --path-filter <dir> scopes deep code research to a subdirectory.
- PHP config-literal parsing — PHP files with top-level return [...] arrays are now searchable.
- Universal config-literal parsing — Exported configuration objects and arrays in Python, JavaScript, TypeScript, and JSX/TSX are now discoverable through semantic search.
- Watchman live-indexing operator docs — Documents the private .chunkhound/watchman/ sidecar, fail-fast startup/no-implicit-fallback behavior, daemon_status health interpretation, and the rollout/default-switch gate for making Watchman the primary backend.
- Dart language support — .dart files are now fully searchable via tree-sitter parsing: classes, functions, methods, constructors, and import/export statements (33rd language).
- Lua language support — .lua files are now parsed and indexed via tree-sitter, covering functions, tables, and module patterns.
- T-SQL (SQL Server) parser — SQL Server T-SQL (.sql) files are now fully parsed and searchable via tree-sitter.
- chunkhound autodoc command — Generates a static Astro documentation site from codebase research, with provenance citations linked to source references and byte-stable output across platforms.
- chunkhound codemap command — Maps areas of interest (POIs) in a codebase through deep code research; the -j flag enables parallel POI processing with automatic backoff to serial on failure.
- Configurable disk storage limit — database.max_disk_usage_mb config option (--max-disk-usage-gb CLI flag, CHUNKHOUND_DATABASE__MAX_DISK_USAGE_GB env var) caps database growth and raises a clear error instead of filling the disk.
- Anthropic native structured outputs — Anthropic provider now uses the structured-outputs-2025-11-13 beta API for guaranteed schema-compliant JSON via constrained decoding, with type-safe Pydantic model responses and extended thinking compatibility.
- Global gitignore support — ChunkHound now reads the user’s global gitignore file (via git config --global core.excludesFile) when building the exclusion list during indexing.
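For readers unfamiliar with the Matryoshka truncation mentioned in the embeddings entry above: in general terms it means keeping the leading components of an embedding and rescaling the result to unit length. The sketch below shows that general technique only. `truncate_embedding` is a hypothetical name, and it does not describe how ChunkHound or the providers perform truncation (providers may do it server-side).

```python
import math


def truncate_embedding(vector: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components and rescale to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]
```

Renormalizing matters because cosine and dot-product similarity over truncated vectors otherwise depend on how much of the original norm the discarded components carried.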
Changed
- Watchman default backend — Watchman is now the default realtime backend on supported native-runtime platforms; watchdog and polling remain explicit fallback backends.
Enhanced
- MCP tool routing — code_research and search tool descriptions rewritten for improved LLM routing; cross-references between tools are shown or hidden dynamically based on whether an LLM provider is configured.
- Daemon overlap guard — A user-scoped daemon registry is now validated against each project’s daemon.lock before startup, preventing live parent/child root overlaps (e.g., running daemons for /workspace and /workspace/project simultaneously). Exact-root reuse across restarts is preserved; sibling roots are allowed.
- ChunkType.IMPORT — Import statements across all languages now use a dedicated chunk type instead of falling through to UNKNOWN, improving search precision.
- Chunk size enforcement — All parsers now enforce a central size guard before DB persistence; oversized chunks are split automatically, preventing embedding API failures.
- Windows compatibility — Cross-platform temp directory handling for Claude Code CLI provider; shutil.which replaces Unix-only which for git binary detection.
- Version management — Supports PEP 440 pre-release formats (alpha, beta, RC) with safety checks to prevent accidental releases from uncommitted work.
- Multi-client MCP daemon — index lock conflict handling — chunkhound index now detects a running daemon’s lock file on DuckDB conflict: a healthy daemon prints an informational message and exits cleanly; an unresponsive daemon prompts the user to kill it and retry.
- Python import resolution — Import statements are now resolved more accurately in Python code research, improving cross-file symbol discovery.
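The daemon overlap guard described above rejects parent/child roots while allowing exact-root reuse and sibling roots. A minimal sketch of that rule follows. `roots_conflict` is a hypothetical name, and the sketch compares paths lexically without resolving symlinks or consulting any registry or lock file.

```python
from pathlib import PurePosixPath


def roots_conflict(new_root: str, running_root: str) -> bool:
    """Return True when one root is a strict ancestor of the other."""
    new, running = PurePosixPath(new_root), PurePosixPath(running_root)
    if new == running:
        return False  # Exact-root reuse is allowed.
    # is_relative_to (Python 3.9+) compares whole path components, so
    # /workspace2 is not treated as a child of /workspace.
    return new.is_relative_to(running) or running.is_relative_to(new)
```

Under this rule `/workspace` and `/workspace/project` conflict in either order, while `/workspace/a` and `/workspace/b` do not.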
Performance
- LanceDB dimension detection — Table creation now detects embedding dimensions upfront from the configured provider, eliminating the O(n) table recreation penalty during first embedding insertion for large codebases (e.g. 16,000+ chunks no longer require full table migration).
Fixed
- Cross-repo data loss — Re-indexing a subdirectory in a shared workspace no longer deletes other repositories’ data from the database (fixes #87).
- Global gitignore false exclusions — ~/.gitignore was incorrectly used as a global excludes fallback, causing all files to be excluded when a dotfiles repo contained broad patterns like * (fixes #216).
- MCP startup error visibility — DuckDB lock conflicts and config validation errors now surface as JSON-RPC error responses instead of silently exiting, with a specific hint to kill stale processes on lock conflicts.
- Gemini LLM timeout — code_research calls no longer fail immediately; the 120 s timeout was being passed to the google-genai SDK as 120 ms.
- Gemini LLM initialization — Gemini provider no longer fails to register when base_url is present in config, restoring code_research availability.
- VoyageAI api_base → base_url — voyageai ≥0.3.7 renamed the parameter; ChunkHound now detects the correct key at runtime, preventing Azure ML endpoint rejections.
- tree-sitter-language-pack 1.0.0 incompatibility — Pinned to <1.0.0 to prevent fresh installs from pulling the breaking release that made YAML, MATLAB, Swift, and other language-pack parsers fail at startup.
- Global chunk deduplication — YAML and Universal parsers now participate in chunk deduplication, preventing duplicate chunk IDs that caused indexing failures on repeated config values.
- hdbscan startup crash under numpy 2.x — Replaced hdbscan package (which uses the numpy 1.x ABI) with sklearn.cluster.HDBSCAN (already a dependency), eliminating MCP daemon startup failures on systems running numpy 2.x.
- Windows MCP unicode safety — MCP server stdout on Windows is now reconfigured with errors='backslashreplace' to prevent crashes when source files contain non-UTF-8 bytes; applied to both main() and main_sync() entry points (fixes #225).
- HDBSCAN outlier cluster assignment — Outliers in Phase 2 cluster merging were mapped to incorrect final cluster indices, causing code research results to be grouped with unrelated code. Fixed by threading the cluster-id-to-final-index mapping through the outlier merge step.
- Symlink path preservation — Worktree and repository symlink paths are now stored as their symlink paths during indexing instead of being silently resolved to their targets (fixes #102).
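The Gemini timeout fix listed above is a unit mismatch: a value meant as seconds was read as milliseconds. One way to make such a conversion explicit at the boundary is sketched below. The helper and constant names are hypothetical, and the sketch assumes, as the entry implies, that the receiving client expects milliseconds.

```python
DEFAULT_LLM_TIMEOUT_SECONDS = 120  # Value quoted in the 5.0.0 entries.


def timeout_for_millisecond_api(timeout_seconds: float) -> int:
    """Convert a timeout in seconds to whole milliseconds."""
    if timeout_seconds <= 0:
        raise ValueError("timeout must be positive")
    return int(round(timeout_seconds * 1000))
```

Carrying the unit in the function and constant names makes it harder to pass a seconds value where milliseconds are expected.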
Removed
- CHUNKHOUND_EMBEDDING_OPTIMIZATION_BATCH_FREQUENCY — Database optimization now runs once at indexing end; the per-batch frequency config option is removed.
4.0.1 - 2025-11-12
Fixed
- Package build configuration now excludes test fixtures from distribution, reducing package size and removing unnecessary test data from published releases
4.0.0 - 2025-11-12
Added
- Map-reduce synthesis for dramatically improved research accuracy - clusters related files and synthesizes them separately before combining insights
- Compact numbered citation system [1][2][3] replacing verbose file.py:123 references for better readability
- Automatic query expansion with intelligent deduplication to find more relevant results
- Structured JSON output support for LLM providers enabling programmatic research workflows
- Tree progress display with event system for visual research feedback
- chunkhound research <query> command for direct code research without starting MCP server
- chunkhound index --simulate [--json] - Dry-run mode showing which files would be indexed without making changes
- chunkhound diagnose [--json] - Troubleshooting command comparing ChunkHound’s decisions vs git’s ignore rules
- chunkhound calibrate - Automatic batch size performance tuning for Qwen3 reranker
- --show-sizes flag for file size reporting during indexing
- Swift language support with tree-sitter parsing for classes, protocols, functions, and properties (.swift, .swiftinterface)
- Objective-C support with content detection to disambiguate from MATLAB (.m files)
- Zig language support with comprehensive tree-sitter parsing
- Haskell language support for functions, types, classes, and modules (.hs, .lhs, .hs-boot, .hsig, .hsc)
- HCL (HashiCorp Configuration Language) support for Terraform with nested object parsing (.hcl, .tf, .tfvars)
- Vue.js Single File Component (SFC) support with specialized parsing for template, script, and style sections
- Svelte Single File Component support with specialized parsing for template, script, and style sections (.svelte)
- Vue cross-reference tracking between template elements and script definitions for enhanced semantic understanding
- PHP language support with comprehensive parsing for classes, interfaces, traits, functions, methods, namespaces, and PHPDoc comments
- RapidYAML parser using native bindings (10-100x faster than tree-sitter for large YAML files)
- Helm template sanitizer for Go template syntax in Kubernetes manifests
- Automatic fallback to tree-sitter parser when RapidYAML encounters issues
- Benchmark harness comparing PyYAML, universal, and RapidYAML performance (scripts/bench_yaml.py)
- Repo-aware ignore engine respecting repository boundaries and preventing rule leakage between sibling repos
- Workspace overlay mode collecting .gitignore rules from root and nested files with correct anchoring
- Combined exclusion modes: indexing.exclude_mode supports "combined", "config_only", or "gitignore_only"
- Wildcard directory segment matching for patterns like **/.venv*/ and **/*.phar/
- Git pathspec capping with fallback to prevent pathspec explosion (default: 128, env: CHUNKHOUND_INDEXING__GIT_PATHSPEC_CAP)
- Real-time telemetry for git pathspec usage and exclusion sources
- TEI (Text Embeddings Inference) reranking format support alongside Cohere format
- Automatic reranker format detection from response field names (Cohere vs TEI)
- Thread-safe format caching for consistent reranker behavior across requests
- Authorization header support for TEI endpoints with --api-key flag
- Qwen3 reranker with automatic batch size calibration for optimal performance
- Async regex search methods for concurrent search operations
- Claude Code CLI provider with direct integration (claude-code-cli)
- Codex CLI provider for synthesis workflows
- AWS Anthropic Bedrock provider using official Anthropic SDK
- Provider-specific synthesis concurrency limits: OpenAI (3), Bedrock (5), Claude CLI (1)
- Smart change detection using checksums for verification when mtime/size differ
- Content hash support in both DuckDB and LanceDB providers
- DuckDB schema migration with files.content_hash column (idempotent via ALTER TABLE IF NOT EXISTS)
- LanceDB execute_query adapter for lightweight batch SELECT operations
- In-memory database mode for simulate on fresh workspaces (no .chunkhound/ directory created)
- Checkpointing and recovery for more robust indexing coordinator
- Per-file timeout controls: indexing.per_file_timeout_seconds, indexing.per_file_timeout_min_size_kb
- Configurable host parameter for HTTP MCP server (--host for binding to specific interfaces)
- Size-based filtering threshold for structured config files (JSON/YAML/TOML)
- Environment variable override for DB executor timeout: CHUNKHOUND_DB_EXECUTE_TIMEOUT
- Comprehensive test suites for Swift, Objective-C, Zig, Java, C#, Python, PHP, Vue, HCL
- Test fixtures for refactored research modules with fake providers and better mocks
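The smart change detection entry above describes a two-step check: trust matching mtime and size, and fall back to a checksum only when they differ. The sketch below illustrates that flow. `FileRecord`, `content_hash`, and `needs_reindex` are hypothetical names, and `hashlib.blake2b` is a stand-in for xxHash3-64, which the release uses but which is not in the standard library.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileRecord:
    """Hypothetical stored metadata for an indexed file."""

    size_bytes: int
    mtime: float
    content_hash: str


def content_hash(path: Path) -> str:
    # Stand-in hash: substitute xxHash3-64 where that library is available.
    return hashlib.blake2b(path.read_bytes(), digest_size=8).hexdigest()


def needs_reindex(path: Path, record: FileRecord) -> bool:
    """Skip on matching mtime/size; otherwise let the checksum decide."""
    stat = path.stat()
    if stat.st_size == record.size_bytes and stat.st_mtime == record.mtime:
        return False  # Cheap metadata check passed; no file read needed.
    return content_hash(path) != record.content_hash
```

The checksum step keeps a file from being re-indexed when only its timestamp changed, for example after a fresh checkout.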
Enhanced
- Native git bindings for gitignore exclusions replacing Python-based pattern matching (10-100x faster indexing)
- Parallel directory discovery with auto-scaling for enterprise monorepos
- Concurrent file parsing using ProcessPoolExecutor across CPU cores
- Lazy parser instantiation reducing startup time
- Single-file fast path using in-process handling (no ProcessPool overhead)
- Single-read checksum verification eliminating redundant file I/O
- Provider-aware embedding concurrency: OpenAI (8 concurrent batches), VoyageAI (40 concurrent batches)
- Automatic retry logic for VoyageAI embedding provider
- Real-time embedding pass: dedicated “embed” phase after quick parse/store for new chunks
- Removed redundant reranking passes from deep research pipeline
- xxHash3-64 replacing SHA-256 for faster file change detection
- Git pathspec capping preventing pathspec explosion (configurable via env)
- In-memory DuckDB for simulate mode on fresh workspaces
- Automatic parser worker auto-scaling to CPU count when timeouts enabled (capped at 32)
- Split progress reporting: “Parsing files” vs “Handling files” with live cumulative info
- Better error messages and truncation detection for LLM responses
- Non-TTY progress fallback properly working in CI environments
- Improved diagnostics for parse/store errors with clearer failure messages
- Post-run prompt to add timed-out files to indexing.exclude when interactive
- Skipped file counts broken out into “Unchanged” and “Filtered” buckets
- Raw markdown output from code_research tool for better formatting in Claude
- Lazy imports for MCP-safe stdio operation
- Proper JSON-RPC handshake reliability
- Test-mode patches for Codex CLI integration (env-gated, no production impact)
- Increased startup wait time for Mac CI stability (3s → 5s)
- TEI reranking format comprehensive guide in CLAUDE.md
- Test coverage documentation with refactoring progress
- README improvements with startup profile CAP notes and exclusions section updates
- Benchmark instructions for YAML parser performance testing
- MCP setup improvements with multi-client support and --show-setup flag
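The provider-aware embedding concurrency listed above (8 concurrent batches for OpenAI, 40 for VoyageAI) can be pictured as a per-provider semaphore around each batch call. This is a sketch of that pattern, not ChunkHound's code: `embed_batches`, the `embed_one` callback, and the fallback limit of 1 for unlisted providers are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Limits quoted in the changelog entry; other providers are not specified.
EMBEDDING_CONCURRENCY = {"openai": 8, "voyageai": 40}


async def embed_batches(
    provider: str,
    batches: Sequence[list],
    embed_one: Callable[[list], Awaitable[object]],
) -> list:
    """Run embed_one over all batches with a per-provider concurrency cap."""
    # Assumed fallback: unknown providers run one batch at a time.
    limit = asyncio.Semaphore(EMBEDDING_CONCURRENCY.get(provider, 1))

    async def guarded(batch: list) -> object:
        async with limit:
            return await embed_one(batch)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(b) for b in batches))
```

The semaphore bounds in-flight requests without fixing the total number of batches, so large indexing runs stay under the provider's rate limits.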
Changed
- BREAKING: Removed depth parameter from code_research MCP tool - system now auto-scales synthesis budgets based on repository size
- BREAKING: Checksum algorithm switched from SHA-256 to xxHash3-64 for faster file change detection - all files will be reindexed on first run after upgrade
- BREAKING: Default exclusion behavior changed - providing indexing.exclude list no longer disables .gitignore (use exclude_mode: "config_only" for legacy behavior)
- BREAKING: RapidYAML is now the default YAML parser (set CHUNKHOUND_YAML_ENGINE=tree to revert to tree-sitter)
- BREAKING: LanceDB provider now requires content_hash column in files schema
- Default per-file timeout enabled: indexing.per_file_timeout_seconds=3.0 (previously 0, disabled)
- Parser workers auto-scale to CPU count when timeouts enabled (capped at 32)
- Combined exclusion mode is now default: overlays gitignore + config excludes instead of replacing
- Model defaults updated to Haiku 4.5 for claude-code-cli and bedrock providers
- Deep research service refactored into specialized modules: question_generator, synthesis_engine, budget_calculator, citation_manager, quality_validator
- Search service refactored into strategies: context_retriever, single_hop_strategy, multi_hop_strategy, result_enhancer
- Extracted research pipeline modules: unified_search, query_expander, file_reader, context_manager
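The exclusion-mode change above means config excludes now overlay .gitignore rules instead of replacing them. The sketch below models the three `exclude_mode` values on flat pattern lists. `effective_excludes` is a hypothetical name, and the model is deliberately simplified: real gitignore handling involves negation, anchoring, and per-directory files that a flat list does not capture.

```python
def effective_excludes(
    mode: str,
    config_patterns: list[str],
    gitignore_patterns: list[str],
) -> list[str]:
    """Return the exclusion patterns that apply for a given exclude_mode."""
    if mode == "combined":
        # Overlay both sources, dropping duplicates while keeping order.
        return list(dict.fromkeys([*gitignore_patterns, *config_patterns]))
    if mode == "config_only":
        # Legacy behavior: a config exclude list replaces gitignore rules.
        return list(config_patterns)
    if mode == "gitignore_only":
        return list(gitignore_patterns)
    raise ValueError(f"unknown exclude_mode: {mode!r}")
```

With the `combined` default, adding one pattern to indexing.exclude no longer causes files ignored by git to be indexed.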
Fixed
- Fixed double "**/" prefix preventing root file matches in default excludes
- Fixed real-time indexing for newly added languages
- Fixed file diversity collapse in deep research using proper reranking
- Fixed TOML parser to extract only matched node content instead of entire file
- Fixed tree-sitter language names for C# and Makefile parsers
- Fixed .gitignore pattern handling and error logging
- Fixed symbol validation inconsistency in Chunk.from_dict()
- Fixed Config.__init__ to respect target_dir kwarg in tests
- Fixed DuckDB get_file_by_path(as_model=True) to return correct mtime and size_bytes for accurate skip checks
- Fixed registry provider instance handling (was storing lambda instead of provider)
- Fixed orphaned embeddings cleanup with proper per-call db_path configuration
- Fixed LanceDB optimize() API usage for 0.21.0+ (cleanup_older_than parameter)
- Fixed single-file indexing to use in-process path and call on_batch for immediate storage
- Fixed missing sources in synthesis by using correct chunk.content field (was chunk.code)
- Fixed flaky multi-hop semantic chain test
- Fixed reranker single-batch top_k filtering for consistency across backends
- Fixed concurrent rerank calls using aiohttp (replaced custom socket-based HTTP)
- Fixed MCP stdio flow for code_research end-to-end reliability
- Fixed non-TTY progress manager regression (added minimal Progress shim for CI)
- Fixed exception classes to allow traceback assignment (removed frozen dataclass)
- Fixed Windows path separator issues in gitignore pattern generation and matching
- Fixed ProcessPoolExecutor segfault on Linux by forcing spawn multiprocessing
- Fixed flaky QA test with file processing completion polling
- Fixed real-time indexing flakiness with proper timeout handling and task cleanup
Removed
- Removed AWS Bedrock provider (consolidated to Anthropic SDK-based Bedrock provider)
- Removed research tools setup section from CONTRIBUTING.md (obsolete)
- Removed obsolete tests incompatible with refactored modular architecture
Security
- Removed embedded API key from .chunkhound.json - use environment variables instead (e.g., CHUNKHOUND_EMBEDDING__API_KEY)
3.3.1 - 2025-09-25
Enhanced
- Dependency updates to latest stable versions for improved stability and performance
- Test infrastructure reliability with better provider detection and error handling
Fixed
- Tree-sitter 0.25.x API compatibility ensuring parsing works with latest language parsers
- Code formatting and import organization for cleaner, more maintainable codebase
3.3.0 - 2025-09-21
Added
- Official Windows support with full CI testing across Windows, macOS, and Ubuntu
- Command-line search functionality (chunkhound search) for semantic and regex queries without starting MCP
- CONTRIBUTING.md guidelines
- Setup wizard when .chunkhound.json isn’t found in the directory
Fixed
- File exclude patterns (/tmp/) on Linux systems
- Regex search path resolution across platforms
3.2.0 - 2025-08-24
Enhanced
- Semantic search upgraded from two-hop to dynamic multi-hop expansion with intelligent stopping criteria, delivering more comprehensive and contextually relevant results while avoiding search explosion
3.1.0 - 2025-08-21
Added
- PDF document parsing and indexing with full text extraction using PyMuPDF integration
Enhanced
- Language support expanded to 29 languages with comprehensive documentation breakdown
Fixed
- JSON file parsing now extracts specific node content instead of entire file content, improving search precision and reducing noise
3.0.1 - 2025-08-21
Enhanced
- Documentation site improved with cross-linking between pages and hero image for better navigation
- OpenAI-compatible endpoint flexibility increased by making API keys optional for local deployments
- Test infrastructure reliability improved with comprehensive CI fixes and timeout handling
Fixed
- JSON file parsing now handles empty chunks correctly, eliminating indexing failures on common JSON patterns
- Test suite stability enhanced with proper background task cleanup and configuration isolation
- GitHub Actions workflow simplified and made more reliable by removing redundant processes
3.0.0 - 2025-08-20
Added
- VoyageAI embedding provider with advanced two-hop semantic search and reranking capabilities
- GitHub Pages documentation site with interactive examples and improved navigation
- Intelligent file exclusion system with .gitignore support and JSON size filtering
- Advanced makefile parsing with dependency analysis for better code comprehension
- Comprehensive test suite for database consistency and integration testing
- Real-time filesystem indexing with MCP integration for live code monitoring
Enhanced
- Parsing system completely rebuilt with cAST (Code AST) algorithm for universal language support
- Configuration system dramatically simplified with fewer user-facing options for easier setup
- OpenAI provider unified to handle both standard and custom OpenAI-compatible endpoints
- MCP server reliability improved with proper initialization sequencing and watchdog coordination
- Test infrastructure enhanced with Ollama compatibility and extended timeouts
- Directory indexing consolidated between CLI and MCP with shared service architecture
Fixed
- MCP server initialization blocking resolved - no more startup deadlocks during directory scanning
- Custom OpenAI endpoint configuration now properly recognized and applied
- Real-time indexing now generates missing embeddings for unchanged code chunks
- SSL verification disabled for custom OpenAI-compatible endpoints to support local deployments
- Watchdog filesystem monitoring no longer blocks MCP server startup process
- MCP server properly respects target directory path arguments across all operations
Removed
- TEI (Text Embeddings Inference) provider support - simplified provider ecosystem
- BGE provider support - consolidated to core providers for better maintenance
- Legacy parsing system replaced with modern cAST algorithm
- Obsolete configuration documentation and setup files cleaned up
2.8.1 - 2025-07-20
Enhanced
- Architecture documentation significantly improved for better LLM comprehension and AI-assisted development workflows
Fixed
- Type annotation syntax errors that could cause import failures in Python 3.10+ environments
- Enhanced smoke tests now detect forward reference type annotation issues early
2.8.0 - 2025-07-20
Added
- MCP HTTP transport support alongside stdio transport for flexible deployment options
Enhanced
- Configuration system unified across CLI and MCP components for consistent behavior
- File change processing reliability improved in MCP servers with better debouncing and coordination
- Database portability enhanced with relative path storage
Fixed
- MCP server initialization deadlocks and startup crashes resolved with proper async coordination
- File deletion handling improved using IndexingCoordinator for better reliability
- MCP server tool discovery enhanced with fallback logic for better error recovery
- File path resolution improved in DuckDB provider for cross-platform consistency
2.7.0 - 2025-07-12
Fixed
- MCP server now uses configured embedding model instead of hardcoded text-embedding-3-small default, ensuring semantic search works with any configured model
- MCP test environment improvements with comprehensive test data and configuration files
2.6.3 - 2025-07-10
Fixed
- Configuration merge precedence now correctly preserves environment variables over JSON config values
- MCP server semantic search now works properly when running from different directories
Removed
- Removed obsolete Ubuntu 20 Dockerfile as issue was resolved in configuration system
2.6.2 - 2025-07-10
Fixed
- MCP server now properly loads embedding provider configuration from target directory
2.6.1 - 2025-07-10
Fixed
- MCP server now properly respects CLI-provided project root directory for configuration loading
- Configuration files (.chunkhound.json) are now correctly loaded when running MCP server from different directories
2.6.0 - 2025-07-10
Fixed
- MCP server crashes on Ubuntu and Linux systems when running from different directories by fixing database path resolution and process coordination
- Enhanced TaskGroup error reporting to show underlying causes instead of generic wrapper errors
- Configuration file loading in MCP server now properly respects .chunkhound.json files in target directories
- Database lock conflicts between multiple MCP instances resolved with proper process detection
Enhanced
- Docker test infrastructure for MCP server validation to prevent future regressions
- Improved error messages for debugging MCP server issues with detailed analysis
2.5.4 - 2025-07-10
Fixed
- MCP server reliability on Ubuntu and other Linux distributions when running from different directories
- Database path resolution consistency across all MCP server components
2.5.3 - 2025-07-10
Fixed
- MCP server communication reliability improved by removing debug logging that interfered with JSON-RPC protocol
2.5.2 - 2025-07-10
Added
- Automatic database optimization during embedding generation to maintain performance with large datasets (every 1000 batches, configurable via CHUNKHOUND_EMBEDDING_OPTIMIZATION_BATCH_FREQUENCY)
Fixed
- MCP server compatibility on Ubuntu and other strict platforms by preserving virtual environment context in subprocesses
- OpenAI embedding provider crash on Ubuntu due to async resource creation outside event loop context
2.5.1 - 2025-01-09
Fixed
- Project detection now properly respects CHUNKHOUND_PROJECT_ROOT environment variable, ensuring MCP command works correctly when launched from any directory
- Removed duplicate MCP parser function that could cause confusion
2.5.0 - 2025-01-09
Enhanced
- MCP positional path argument now controls complete project scope - database location, config file search, and watch paths are all set to the specified directory instead of just watch paths
Fixed
- MCP launcher import path resolution when running from different directories, eliminating TaskGroup errors on Ubuntu and other strict platforms
2.4.4 - 2025-01-09
Fixed
- Ubuntu TaskGroup crash fixed by removing problematic directory change in MCP launcher
2.4.3 - 2025-01-09
Fixed
- MCP server now works correctly when launched from any directory, not just the project root
- Fixed path resolution inconsistencies that caused TaskGroup errors on Ubuntu deployments
2.4.2 - 2025-01-09
Added
- MCP command now accepts optional path argument to specify directory for indexing and watching (defaults to current directory)
Fixed
- Parser architecture inconsistencies resolved across C, Bash, and Makefile parsers for consistent search functionality
- MCP server database duplication eliminated through proper async task isolation
- LanceDB storage growth controlled with automatic optimization during quiet periods
- MCP server reliability improved with corrected import structure and dependency resolution
- Python parser behavior now consistent between CLI and MCP modes
- Search operation freezes after file deletion resolved with proper thread safety
2.4.1 - 2025-01-09
Fixed
- Package structure consolidated under chunkhound/ directory for improved import reliability and Python packaging best practices
2.4.0 - 2025-01-09
Fixed
- LanceDB storage growth issue resolved with automatic database optimization during quiet periods
- Configuration system project root detection for .chunkhound.json files improved
Changed
- Enhanced database provider architecture with capability detection and activity tracking
- Modernized configuration system by removing legacy registry config building
2.3.1 - 2025-07-09
Fixed
- MCP server communication reliability improved by preventing stderr output from corrupting JSON-RPC messages
- Enhanced configuration documentation with automatic .chunkhound.json detection examples
2.3.0 - 2025-07-08
Changed
- BREAKING: Configuration system completely refactored with centralized management and clear precedence hierarchy
- BREAKING: Automatic configuration file loading removed - config files now only load with explicit --config flag
- BREAKING: Environment variables standardized to CHUNKHOUND_* prefix with __ delimiters (e.g., CHUNKHOUND_EMBEDDING__API_KEY)
- BREAKING: Legacy OPENAI_API_KEY and OPENAI_BASE_URL environment variables no longer supported
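As a rough illustration of the CHUNKHOUND_* naming scheme, the sketch below maps prefixed variables onto nested configuration keys. The helper name `env_to_config` and the lowercasing of key names are assumptions made for this example; ChunkHound's actual loader may behave differently.

```python
def env_to_config(environ, prefix="CHUNKHOUND_", delimiter="__"):
    """Map PREFIX_SECTION__KEY variables onto a nested dict (hypothetical helper)."""
    config = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unprefixed variables such as OPENAI_API_KEY are ignored
        path = name[len(prefix):].lower().split(delimiter)
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})  # descend, creating sections as needed
        node[path[-1]] = value
    return config


example = {
    "CHUNKHOUND_EMBEDDING__API_KEY": "sk-example",
    "OPENAI_API_KEY": "ignored",
}
print(env_to_config(example))
```

In real use the mapping would be built from `os.environ`; a plain dict is used here so the example stays deterministic.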
Added
- Complete CLI argument coverage for all configuration options
- Centralized configuration precedence: CLI args → Config file → Environment variables → Defaults
- Comprehensive migration guide for updating existing configurations
- Database file gitignore pattern for Lance database files
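The precedence order listed for this release can be sketched with a layered lookup. This is a minimal illustration assuming flat key/value settings; the function name `resolve_settings` and the sample keys are hypothetical and not taken from ChunkHound's code.

```python
from collections import ChainMap


def resolve_settings(cli_args, config_file, env_vars, defaults):
    """Merge flat settings; earlier sources win, in the order listed for 2.3.0."""
    # Drop unset CLI options so a None placeholder cannot mask lower layers.
    cli_set = {k: v for k, v in cli_args.items() if v is not None}
    # ChainMap looks keys up in the first mapping that contains them.
    return dict(ChainMap(cli_set, config_file, env_vars, defaults))


settings = resolve_settings(
    cli_args={"model": None, "db": "cli.db"},
    config_file={"model": "file-model"},
    env_vars={"model": "env-model", "batch_size": 50},
    defaults={"model": "default-model", "batch_size": 100, "db": "default.db"},
)
print(settings)
```

Filtering out unset CLI values is what keeps argument defaults from shadowing the config file, and reordering the arguments to `ChainMap` is all it takes to model a different precedence.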
Fixed
- MCP server database duplication caused by shared transaction state across async tasks
- Parser architecture inconsistencies for C, Bash, and Makefile language parsers
- Configuration auto-detection issues that caused deployment complexity
2.2.0 - 2025-01-07
Fixed
- Database freezing during concurrent file operations through proper async/sync boundary handling
- Thread safety issues in DuckDB provider with synchronized WAL cleanup and operation timeouts
- LanceDB duplicate file entries through atomic merge operations and path normalization
- File deletion operations now properly handle async contexts without blocking the event loop
Changed
- Aligned LanceDB provider with serial executor pattern for consistency with DuckDB
- Improved path normalization to handle symlinks and different path representations
- Enhanced database operation reliability with proper thread isolation
Added
- Support for complete configuration storage including API keys in .chunkhound.json files
- Consolidated embedding provider creation system for consistent behavior across CLI and config files
2.1.4 - 2025-07-03
Fixed
- CLI argument defaults no longer override config file values
- Updated dependencies via uv.lock
2.1.3 - 2025-07-03
Changed
- Consolidated embedding provider creation to use single factory pattern for consistency
- Reduced embedding provider log verbosity for cleaner output
2.1.2 - 2025-07-03
Fixed
- API key configuration loading from .chunkhound.json files
- Configuration precedence documentation to match actual behavior
Added
- Complete configuration examples with API key and security guidance
2.1.1 - 2025-07-03
Added
- Centralized version management system for consistent versioning across all components
Changed
- Simplified version updates through automated scripts
- Enhanced installation and development documentation
- Code formatting improvements and linting cleanup
Fixed
- Version consistency across CLI, MCP server, and package initialization
- Import statement in package __init__.py for better module exposure
2.1.0 - 2025-07-02
Fixed
- Database duplication in MCP server by implementing single-threaded executor pattern
- WAL corruption handling during DuckDB catalog replay
- Parser architecture inconsistencies for C, Bash, and Makefile parsers
- DuckDB foreign key constraint transaction limitations
- Python parser CLI/MCP divergence through unified factory pattern
- Connection management architectural violations
Changed
- Consolidated database operations through DuckDBProvider executor pattern
- Simplified ConnectionManager to handle only connection lifecycle
- Updated file discovery patterns to include all 16 supported languages
- Removed deprecated connection methods and schema fields
- Enhanced transaction handling with contextvars for task isolation
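For readers unfamiliar with contextvars, the sketch below shows why it suits per-task isolation: each asyncio task runs in its own copy of the context, so state set in one task is invisible to the others. The variable name `in_transaction` is a hypothetical stand-in, not ChunkHound's actual transaction state.

```python
import asyncio
import contextvars

# Hypothetical task-local flag; the real state object may differ.
in_transaction = contextvars.ContextVar("in_transaction", default=False)


async def worker(name, use_transaction, seen):
    if use_transaction:
        in_transaction.set(True)  # visible only inside this task's context
    await asyncio.sleep(0)  # yield so the two tasks interleave
    seen[name] = in_transaction.get()


async def main():
    seen = {}
    # gather() wraps each coroutine in a task with its own context copy.
    await asyncio.gather(
        worker("writer", True, seen),
        worker("reader", False, seen),
    )
    return seen, in_transaction.get()


seen, outer = asyncio.run(main())
print(seen, outer)
```

A module-level flag shared by both tasks would leak the writer's state into the reader, which is the kind of cross-task interference that task-local state avoids.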
Added
- Automatic database migration system for schema updates
- Enhanced parser functionality for C pointer functions and Bash function bodies
- Task-local transaction state management
- Comprehensive executor methods for database operations
2.0.0 - 2025-06-26
Added
- 10 new language parsers: Rust, Go, C++, C, Kotlin, Groovy, Bash, TOML, Makefile, Matlab
- Search pagination with response size limits
- Registry-based parser architecture
- MCP search task coordinator
- Test coverage for file modification tracking
- Comment and docstring indexing for all language parsers
- Background periodic indexing for better performance
- Path filtering support for targeted searches
- HNSW index WAL recovery with enhanced checkpoints
- Embedding cache optimization with CRC32-based content tracking
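The idea behind checksum-based content tracking can be sketched as follows: store a CRC32 of each chunk's text next to its embedding and recompute the embedding only when the checksum changes. The class `EmbeddingCache` and the stand-in `fake_embed` function are assumptions for illustration, not ChunkHound's implementation.

```python
import zlib


class EmbeddingCache:
    """Toy cache keyed by chunk id; re-embeds only when the CRC32 changes."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._entries = {}  # chunk_id -> (crc32, embedding)

    def get(self, chunk_id, content):
        checksum = zlib.crc32(content.encode("utf-8"))
        cached = self._entries.get(chunk_id)
        if cached is not None and cached[0] == checksum:
            return cached[1]  # content unchanged: reuse the stored embedding
        embedding = self._embed_fn(content)
        self._entries[chunk_id] = (checksum, embedding)
        return embedding


calls = []


def fake_embed(text):
    """Stand-in for a real embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("a", "def f(): pass")
cache.get("a", "def f(): pass")      # unchanged content: served from cache
cache.get("a", "def f(): return 1")  # changed content: embedded again
print(len(calls))
```

CRC32 is cheap to compute but not collision resistant, so it fits change detection better than any security-sensitive use.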
Changed
- BREAKING: ‘run’ command renamed to ‘index’ with current directory default
- BREAKING: Parser system refactored to registry pattern
- Centralized language support in Language enum
- Optimized embedding performance with token-aware batching
- Enhanced PyInstaller compatibility
- Improved cross-platform build support (Windows, Ubuntu Docker)
- Enhanced MCP server JSON-RPC communication with logging suppression
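Token-aware batching, in general terms, means grouping texts so that each embedding request stays under a token budget. The sketch below uses a greedy grouping with a crude four-characters-per-token estimate; both the estimator and the function names are assumptions, and a real implementation would likely use the provider's tokenizer.

```python
def estimate_tokens(text):
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def batch_by_tokens(texts, max_tokens):
    """Greedily group texts so each batch stays within max_tokens."""
    batches, current, used = [], [], 0
    for text in texts:
        cost = estimate_tokens(text)
        if current and used + cost > max_tokens:
            batches.append(current)  # budget exceeded: start a new batch
            current, used = [], 0
        current.append(text)  # an oversized text still gets its own batch
        used += cost
    if current:
        batches.append(current)
    return batches


chunks = ["a" * 40, "b" * 40, "c" * 80, "d" * 8]
print([len(batch) for batch in batch_by_tokens(chunks, max_tokens=25)])
```

A text larger than the budget is placed alone in a batch rather than dropped, so splitting or truncating it would have to happen in a separate step.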
Fixed
- Parser error handling and registry integration
- OpenAI token limit handling
- PyInstaller module path resolution
- Database WAL corruption issues on server exit
- File watcher cancellation responsiveness
- Signal handler safety by removing unsafe database operations
- Windows PyInstaller and MATLAB dependency issues
- Build workflow reliability across platforms
1.2.3 - 2025-06-23
Changed
- Default database location changed to current directory for better persistence
Fixed
- OpenAI token limit exceeded error with dynamic batching for large embedding requests
- Empty chunk filtering to reduce noise in search results
- Python parser validation for empty symbol names
- Windows build support with comprehensive GitHub Actions workflow
- macOS Intel build issues with UV package manager installation
- Cross-platform build workflow reliability
Added
- Windows build support with automated testing
- Enhanced debugging for build processes across platforms
1.2.2 - 2024-12-15
Added
- File watching CLI for real-time code monitoring
Changed
- Unified JavaScript and TypeScript parsers
- Default database location changed to current directory
Fixed
- Empty symbol validation in Python parser
1.2.1 - 2024-11-28
Added
- Ubuntu 20.04 build support
- Token limit management for MCP search
Fixed
- Duplicate chunks after file edits
- File modification detection race conditions
1.2.0 - 2024-11-15
Added
- C# language support
- JSON, YAML, and plain text file support
- File watching with real-time indexing
Fixed
- File deletion handling
- Database connection issues
1.1.0 - 2025-06-12
Added
- Multi-language support: TypeScript, JavaScript, C#, Java, and Markdown
- Comprehensive CLI interface
- Binary distribution with faster startup
Changed
- Improved CLI startup performance (90% faster)
- Binary startup performance (16x faster)
Fixed
- Version display consistency
- Cross-platform build issues
1.0.1 - 2025-06-11
Added
- Python 3.10+ compatibility
- PyPI publishing
- Standalone executable support
- MCP server integration
Fixed
- Dependency conflicts
- OpenAI model parameter handling
- Binary compilation issues
1.0.0 - 2025-06-10
Added
- Initial release of ChunkHound
- Python parsing with tree-sitter
- DuckDB backend for storage and search
- OpenAI embeddings for semantic search
- CLI interface for indexing and searching
- MCP server for AI assistant integration
- File watching for real-time indexing
- Regex search capabilities
For more information, visit: https://github.com/chunkhound/chunkhound