This update provides additional performance improvements specifically targeted at large image libraries (e.g. 300k+ images):
1. Optimized the exact match SQL query for images:
- Added filtering for zero/empty fingerprints to avoid massive false-positive groups.
- Added a LIMIT of 1000 duplicate groups to prevent excessive memory consumption and serialization overhead.
- Simplified the join structure to ensure better use of the database index.
2. Parallelized the Go comparison loop in pkg/utils/phash.go:
- Utilizes all available CPU cores to perform Hamming distance calculations.
- Uses a lock-free design to minimize synchronization overhead.
- This makes non-zero distance searches significantly faster on multi-core systems.
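The parallel comparison loop described above could be sketched roughly as follows. This is an illustration only, not the code in pkg/utils/phash.go: `pair` and `findNearParallel` are hypothetical names, and hashes are assumed to be plain 64-bit values. Each worker appends to its own slice, so the hot loop needs no mutex; the only synchronization is the final wait.

```go
package main

import (
	"math/bits"
	"runtime"
	"sync"
)

// pair holds the indices of two hashes whose Hamming distance is within the threshold.
// Hypothetical type, for illustration only.
type pair struct{ i, j int }

// findNearParallel compares every unordered pair of hashes once and returns
// those whose Hamming distance is <= maxDist. Hypothetical helper.
func findNearParallel(hashes []uint64, maxDist int) []pair {
	workers := runtime.NumCPU()
	// One result slot per worker: workers never write to shared state.
	results := make([][]pair, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			var local []pair
			// Stride over rows so the triangular workload is spread evenly.
			for i := w; i < len(hashes); i += workers {
				// Start at i+1: distance is symmetric, so each pair is checked once.
				for j := i + 1; j < len(hashes); j++ {
					if bits.OnesCount64(hashes[i]^hashes[j]) <= maxDist {
						local = append(local, pair{i, j})
					}
				}
			}
			results[w] = local
		}(w)
	}
	wg.Wait()

	// Merge the per-worker results after all goroutines have finished.
	var out []pair
	for _, r := range results {
		out = append(out, r...)
	}
	return out
}
```

The order of the returned pairs depends on the number of workers, so callers that need a stable order would have to sort the result.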
This update provides significant performance improvements for both image and scene duplicate searching:
1. Optimized the core Hamming distance algorithm in pkg/utils/phash.go:
- Uses native CPU popcount instructions (math/bits) for bit counting.
- Pre-calculates hash values to eliminate object allocations in the hot loop.
- Halves the number of comparisons by leveraging the symmetry of the Hamming distance.
- The loop is now several orders of magnitude faster and allocation-free.
2. Solved the N+1 database query bottleneck:
- Replaced individual database lookups for each duplicate group with a single batched query for all duplicate IDs.
- This optimization was applied to both Image and Scene repositories.
3. Simplified the SQL fast path for exact image matches by removing redundant table joins.
- Removed unused `strconv` import from `pkg/sqlite/image.go`.
- Added missing `github.com/stashapp/stash/pkg/utils` import to resolve the undefined `utils` reference.
- Fixed pagination prop in ImageDuplicateChecker component.
- Formatted the modified Go files using gofmt.
- Ran Prettier over the UI codebase to resolve the formatting check CI failure.
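The N+1 fix mentioned above (one batched query for all duplicate IDs instead of one lookup per group) can be illustrated with a minimal sketch. The helper names `uniqueIDs` and `buildBatchQuery`, and the idea of passing the table name in, are assumptions for illustration; the actual repository code may build its query differently.

```go
package main

import "strings"

// uniqueIDs flattens duplicate groups into a single de-duplicated ID list,
// preserving first-seen order. Hypothetical helper.
func uniqueIDs(groups [][]int) []int {
	seen := make(map[int]struct{})
	var ids []int
	for _, g := range groups {
		for _, id := range g {
			if _, ok := seen[id]; !ok {
				seen[id] = struct{}{}
				ids = append(ids, id)
			}
		}
	}
	return ids
}

// buildBatchQuery returns one parameterized SELECT covering all ids, plus its
// arguments. The table name must be a trusted constant, never user input.
// Hypothetical helper; the table and column names are assumptions.
func buildBatchQuery(table string, ids []int) (string, []interface{}) {
	if len(ids) == 0 {
		return "", nil
	}
	// One "?" placeholder per ID, comma separated.
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(ids)), ",")
	args := make([]interface{}, len(ids))
	for i, id := range ids {
		args[i] = id
	}
	return "SELECT * FROM " + table + " WHERE id IN (" + placeholders + ")", args
}
```

After the single query returns, the rows can be indexed by ID in a map and the original groups rebuilt from that map, which keeps the number of round trips constant regardless of how many groups were found.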
This change unifies the duplicate detection logic by leveraging the shared phash utility. It also enhances the UI with:
- Pagination for large result sets.
- Sorting duplicate groups by total file size.
- A more detailed table view with image thumbnails, paths, and dimensions.
- Consistency with the existing Scene Duplicate Checker tool.
* Limit duplicate matching to files that have approximately the same duration
* Add UI for duration diff
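As a rough illustration of the duration check, a candidate pair could be filtered as below. The function name `durationsMatch`, the use of seconds as the unit, and the convention that a negative limit disables the check are all assumptions made for this sketch.

```go
package main

import "math"

// durationsMatch reports whether two durations (in seconds) differ by at most
// maxDiff seconds. A negative maxDiff disables the check (assumed convention).
// Hypothetical helper, for illustration only.
func durationsMatch(a, b, maxDiff float64) bool {
	if maxDiff < 0 {
		return true
	}
	return math.Abs(a-b) <= maxDiff
}
```

Applying this check before the hash comparison skips the comparison for pairs that could never be reported as duplicates.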
---------
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
* Restructure data layer part 2 (#2599)
* Refactor and separate image model
* Refactor image query builder
* Handle relationships in image query builder
* Remove relationship management methods
* Refactor gallery model/query builder
* Add scenes to gallery model
* Convert scene model
* Refactor scene models
* Remove unused methods
* Add unit tests for gallery
* Add image tests
* Add scene tests
* Convert unnecessary scene value pointers to values
* Convert unnecessary pointer values to values
* Refactor scene partial
* Add scene partial tests
* Refactor ImagePartial
* Add image partial tests
* Refactor gallery partial update
* Add partial gallery update tests
* Use zero/null package for null values
* Add files and scan system
* Add sqlite implementation for files/folders
* Add unit tests for files/folders
* Image refactors
* Update image data layer
* Refactor gallery model and creation
* Refactor scene model
* Refactor scenes
* Don't set title from filename
* Allow galleries to freely add/remove images
* Add multiple scene file support to graphql and UI
* Add multiple file support for images in graphql/UI
* Add multiple file support for galleries in graphql/UI
* Remove use of some deprecated fields
* Remove scene path usage
* Remove gallery path usage
* Remove path from image
* Move funscript to video file
* Refactor caption detection
* Migrate existing data
* Add post commit/rollback hook system
* Lint. Comment out import/export tests
* Add WithDatabase read only wrapper
* Prepend tasks to list
* Add pre-migration for schema version 32
* Add warnings in release and migration notes