This update resolves major performance regressions when processing large libraries:
1. Optimized FindMany in both Image and Scene stores to use map-based ID lookups. Previously, this function used slices.Index in a loop, resulting in O(N^2) complexity. On a library with 300k items, this was causing the server to hang indefinitely.
2. Refined the exact image duplicate SQL query to match the scene checker's level of optimization. It now joins the files table and orders results by total duplicate file size, ensuring that the most impactful duplicates are shown first.
3. Removed the temporary LIMIT 1000 from the image duplicate query now that the algorithmic bottlenecks have been resolved.
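The map-based lookup described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the actual store code: `Item` and `orderByIDs` are hypothetical names, and the sketch skips missing IDs, which the real FindMany may handle differently.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for an Image or Scene row.
type Item struct {
	ID   int
	Path string
}

// orderByIDs returns the items in the order given by ids. It builds a map
// from ID to item once, so each lookup is O(1) instead of a linear scan
// over the slice for every ID. IDs with no matching item are skipped.
func orderByIDs(ids []int, unsorted []Item) []Item {
	byID := make(map[int]Item, len(unsorted))
	for _, it := range unsorted {
		byID[it.ID] = it
	}

	ret := make([]Item, 0, len(ids))
	for _, id := range ids {
		if it, ok := byID[id]; ok {
			ret = append(ret, it)
		}
	}
	return ret
}

func main() {
	rows := []Item{{ID: 3, Path: "c.jpg"}, {ID: 1, Path: "a.jpg"}, {ID: 2, Path: "b.jpg"}}
	for _, it := range orderByIDs([]int{1, 2, 3}, rows) {
		fmt.Println(it.ID, it.Path)
	}
}
```

Building the map costs one pass over the rows, so the total work is linear in the number of IDs plus rows rather than their product.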
This update provides additional performance improvements specifically targeted at large image libraries (e.g. 300k+ images):
1. Optimized the exact match SQL query for images:
- Added filtering for zero/empty fingerprints to avoid massive false-positive groups.
- Added a LIMIT of 1000 duplicate groups to prevent excessive memory consumption and serialization overhead.
- Simplified the join structure to ensure better use of the database index.
2. Parallelized the Go comparison loop in pkg/utils/phash.go:
- Utilizes all available CPU cores to perform Hamming distance calculations.
- Uses a lock-free design to minimize synchronization overhead.
- This makes non-zero distance searches significantly faster on multi-core systems.
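One way the parallel comparison in item 2 could be structured is sketched below. It assumes hashes are available as plain `uint64` values; `pair` and `findPairs` are hypothetical names and the actual code in pkg/utils/phash.go may differ. Each worker collects matches in its own slice, so no mutex is needed inside the loop.

```go
package main

import (
	"fmt"
	"math/bits"
	"runtime"
	"sync"
)

// pair records the indexes of two hashes within the distance threshold.
type pair struct{ i, j int }

// findPairs compares every hash against every later hash, splitting the
// outer loop across one goroutine per CPU. Workers never share a result
// slice; the per-worker results are merged after all workers finish.
func findPairs(hashes []uint64, maxDist int) []pair {
	workers := runtime.NumCPU()
	perWorker := make([][]pair, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			var local []pair
			// Strided assignment spreads the shrinking inner loops evenly.
			for i := w; i < len(hashes); i += workers {
				for j := i + 1; j < len(hashes); j++ {
					if bits.OnesCount64(hashes[i]^hashes[j]) <= maxDist {
						local = append(local, pair{i, j})
					}
				}
			}
			perWorker[w] = local
		}(w)
	}
	wg.Wait()

	var out []pair
	for _, p := range perWorker {
		out = append(out, p...)
	}
	return out
}

func main() {
	matches := findPairs([]uint64{0b0000, 0b0001, 0b1111}, 1)
	fmt.Println(len(matches))
}
```

The order of the merged result depends on the number of workers, so callers that need a stable order would have to sort it.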
This update provides significant performance improvements for both image and scene duplicate searching:
1. Optimized the core Hamming distance algorithm in pkg/utils/phash.go:
- Uses native CPU popcount instructions (math/bits) for bit counting.
- Pre-calculates hash values to eliminate object allocations in the hot loop.
- Halves the number of comparisons by leveraging the symmetry of the Hamming distance.
- The loop is now several orders of magnitude faster and allocation-free.
2. Solved the N+1 database query bottleneck:
- Replaced individual database lookups for each duplicate group with a single batched query for all duplicate IDs.
- This optimization was applied to both Image and Scene repositories.
3. Simplified the SQL fast path for exact image matches to remove redundant table joins.
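The Hamming distance changes in item 1 can be illustrated with the following sketch. The `Phash` struct here is a simplified, hypothetical version with only an ID and an int64 hash, and `neighbours` is an illustrative name rather than the function in pkg/utils/phash.go.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Phash is a simplified, hypothetical record: an item ID plus its 64-bit hash.
type Phash struct {
	ID   int
	Hash int64
}

// hamming returns the number of differing bits between two hashes.
// bits.OnesCount64 compiles to a popcount instruction where available.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// neighbours returns, for each index, the indexes of hashes within maxDist.
func neighbours(hashes []Phash, maxDist int) [][]int {
	// Extract the raw values once so the hot loop only reads a []uint64.
	raw := make([]uint64, len(hashes))
	for i, h := range hashes {
		raw[i] = uint64(h.Hash)
	}

	out := make([][]int, len(hashes))
	for i := range raw {
		// distance(i, j) == distance(j, i), so start at i+1 and record
		// the match on both sides. The only allocations are for matches.
		for j := i + 1; j < len(raw); j++ {
			if hamming(raw[i], raw[j]) <= maxDist {
				out[i] = append(out[i], j)
				out[j] = append(out[j], i)
			}
		}
	}
	return out
}

func main() {
	hashes := []Phash{{ID: 1, Hash: 0}, {ID: 2, Hash: 1}, {ID: 3, Hash: -1}}
	fmt.Println(neighbours(hashes, 1))
}
```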
This change adds a specialized SQL query to find exact image duplicate matches (distance 0) directly in the database.
Previously, the image duplicate checker always used an O(N^2) Go-based comparison loop, which caused indefinite loading and timeouts on libraries with a large number of images. The new SQL fast path reduces the time to find exact duplicates from minutes/hours to milliseconds.
This fixes a bug where identical image duplicates were not being detected.
The implementation was incorrectly scanning the phash BLOB into a string and then attempting to parse it as a hex string. Since phashes are stored as 64-bit integers, they were being converted to decimal strings. For phashes with the MSB set (negative when treated as int64), the resulting decimal string started with a '-', which caused the hex parser to fail and skip the image entirely.
Additionally, even for non-negative phashes, parsing a decimal string as hex yielded incorrect hash values.
Scanning directly into the utils.Phash struct (which uses int64) matches how Scene phashes are handled and ensures the hash values are correct.
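The failure mode can be reproduced in isolation. The sketch below assumes the faulty path behaved like a `strconv.ParseUint` call with base 16 on the decimal string; `parseAsHex` is a hypothetical helper written only to demonstrate the two cases described above.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseAsHex mimics the faulty path: the stored integer is rendered as a
// decimal string and then parsed as if it were hexadecimal.
func parseAsHex(stored int64) (uint64, error) {
	decimal := strconv.FormatInt(stored, 10)
	return strconv.ParseUint(decimal, 16, 64)
}

func main() {
	// MSB set: the decimal string starts with '-', so the hex parse fails.
	_, err := parseAsHex(-1)
	fmt.Println(err != nil)

	// Non-negative: the parse succeeds but yields a different value,
	// because "255" read as hex is 0x255.
	v, _ := parseAsHex(255)
	fmt.Println(v)
}
```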
- Wrap the FindDuplicateImages query in r.withReadTxn() so that a database transaction is present in the context.
- Use queryFunc instead of queryStruct for fetching multiple hashes, preventing runtime errors.
- Fix N+1 query issue in duplicate grouping by using qb.FindMany() instead of qb.Find() for each duplicate image.
- Revert the searchColumns array to exclude "images.details", which came from another PR, and remove the related failing test.
- Removed unused `strconv` import from `pkg/sqlite/image.go`.
- Added missing `github.com/stashapp/stash/pkg/utils` import to resolve the undefined `utils` reference.
- Fixed pagination prop in ImageDuplicateChecker component.
- Formatted the modified Go files using gofmt.
- Ran prettier over the UI codebase to resolve the formatting check CI failure.
This change unifies the duplicate detection logic by leveraging the shared phash utility. It also enhances the UI with:
- Pagination for large result sets.
- Sorting duplicate groups by total file size.
- A more detailed table view with image thumbnails, paths, and dimensions.
- Consistency with the existing Scene Duplicate Checker tool.
This change introduces a new tool to identify duplicate images based on their perceptual hash (phash). It includes:
- Backend implementation for phash distance comparison and grouping.
- GraphQL schema updates and API resolvers.
- Frontend UI for the Image Duplicate Checker tool.
- Unit tests for the image search and duplicate detection logic.
* fix: support string-based fingerprints in hashes filter
* Fix tests and add phash test
File fingerprints weren't using correct types. Filter test wasn't using correct types. Add phash to general files.
---------
Co-authored-by: hyper440 <hyper440@users.noreply.github.com>
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
* Add useDebouncedState hook
* Add basename to folder sort whitelist
* Add parent_folder criterion to gallery
* Add selection on enter if single result
* Add basename field to folder
* Add parent_folders field to folder
* Add basename column to folder table
* Add basename filter field
* Create missing folder hierarchies during migration
* Treat files/folders in zips where path can't be made relative as not found
Addresses an issue during clean where corrupt folder entries in zip files could not be removed due to an error during the call to Rel.
* Add group filter criteria to tag and studio
* Add sidebar to groups list
* Refactor ListOperations to accept buttons
* Move create new button back to navbar
Having the create new button with a plus icon conflicted with the add sub-group button in the sub-groups view.
* Simplify group-sub-groups view
* Fix custom field import/export for studio
* Update studio unit tests
* Add tag create and update unit tests
* Add custom fields to tag filter graphql
* Add unit tests for tag filtering
* Add filter unit tests for studio
* Implement stash_ids_endpoint for the SceneFilterType
* Reduce code duplication by calling the stashIDsCriterionHandler from the stashIDCriterionHandler
* Mark stash_id_endpoint in SceneFilterType, StudioFilterType, and PerformerFilterType as deprecated
* Implement merging of performers
* Make the tag merge UI consistent with other types of merges
* Add merge action in scene menu
---------
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
* Remove month/year only formats from ParseDateStringAsTime
* Add precision field to Date and handle parsing year/month-only dates
* Add date precision columns for date columns
* Adjust UI to account for fuzzy dates
* Change queryStruct to use tx.Get instead of queryFunc
Using queryFunc meant that the performance logging was inaccurate due to the query actually being executed during the call to Scan.
* Only add join args if join was added
* Omit joins that are only used for sorting when skipping sorting
Should provide some marginal improvement on systems with a lot of items.
* Make all calls to the database pass context.
This means that long queries can be cancelled by navigating to another page. Previously the query would continue to run, slowing down subsequent queries.
* Find existing files with case insensitivity if filesystem is case insensitive
* Handle case change in folders
* Optimise to only test file system case sensitivity if the first query found nothing
This limits the overhead to new paths, and adds an extra query for new paths on Windows installs.
* Filter out empty alias strings in studio modal create
* Reject empty alias strings in backend
* Remove invalid ValidateAliases call from UpdatePartial
The call used the supplied values, which are not necessarily the final values.
---------
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
* Backend support for studio URLs
* Frontend support for studio URLs
* Support URLs in BulkStudioUpdate
* Update tagger modal for URLs
---------
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>