Additionally, improve HTML pre-processing:
* Ensure a new line between blocks of lyrics text from letras.mus.br.
* Parse a missing last block of lyrics text from lacocinelle.net.
* Parse a missing last block of lyrics text from paroles.net.
* Fix encoding issues with AZLyrics by setting response encoding to
None, allowing `requests` to handle it.
* Type the response data that Google Custom Search API return.
* Exclude some 'letras.mus.br' pages that do not contain lyric.
* Exclude results from Musixmatch as we cannot access their pages.
* Improve parsing of the URL title:
- Handle long URL titles that get truncated (end with ellipsis) for
long searches
- Remove domains starting with 'www'
- Parse the title AND the artist. Previously this would only parse the
title, and fetch lyrics even when the artist did not match.
* Remove now redundant credits cleanup and checks for valid lyrics.
Tidy up 'Google.is_page_candidate' method and remove 'Google.sluggify'
method which was a duplicate of 'slug'.
Since 'GeniusFetchTest' only tested whether the artist name is cleaned
up (the rest of the functionality is patched), remove it and move its
test cases to the 'test_slug' test.
Having removed it I fuond that only the Genius lyrics changed: it had en
extra new line. Thus I defined a function 'collapse_newlines' which now
gets called for the Genius lyrics.
This commit introduces a distance threshold mechanism for the Genius and
Google backends.
- Create a new `SearchBackend` base class with a method `check_match`
that performs checking.
- Start using undocumented `dist_thresh` configuration option for good,
and mention it in the docs. This controls the maximum allowable
distance for matching artist and title names.
These changes aim to improve the accuracy of lyrics matching, especially
when there are slight variations in artist or title names, see #4791.
- Rename method _dedup_genre, since it's only used for
finalizing/polishing existing genres.
- Return separator-delimited string already.
- Decide on not passing "separator" to methods, it's a config
setting available throughout the plugin. Assign to variable where
useful for readability though.
- In the force branch, remove re-assigning keep_genres to empty list.
- Fix a test. Existing genres are "polished" now, which means:
configured title_case is applied.
- Fix/add type hints on all touched and new methods
- Adapt tests to _resolve_genres returning a list with not yet formatted genres.
- Rename and adapt test_count -> test_to_delimited_string. Note that the
new function does not apply whitelist, prefer anything. It just cuts
to count and formats!
- No idea where a missing separator (which is default) could
happen...just set it explicitely.
- Since we now refactored fetch_genre to returning a list we can add
mock multiple fetched gernes easier.
I found that the `/get` endpoint often returns incorrect or unsynced
lyrics, while results returned by the `/search` more accurate options.
Thus I reversed the change in the previous commit to prioritize
searching first.
Adjust the base URL to perform a '/search' instead of attempting to
'/get' specific lyrics where we're unlikely to find lyrics for the
specific combination of album, artist, track names and the duration (see
https://lrclib.net/docs).
Since we receive an array of matching lyrics candidates, rank them by
their duration similarity to the item's duration, and whether they
contain synced lyrics.
Add explicit checks for lyrics texts fetched from the tested sources.
- Introduced `LyricsPage` class to represent lyrics pages for integrated
tests.
- Configured expected lyrics for each of the URLs that are being
fetched.
- Consolidated integrated tests in a new `TestLyricsSources` class.
- Mocked Google Search API to return the lyrics page under test.
Since at least one Backend requires album` and `duration` arguments
(`LRCLib`), the caller (`LyricsPlugin.fetch_item_lyrics`) must always
provide them.
Since they need to provided, we need to enforce this by defining them as
positional arguments.
Why is this important? I found that integrated `LRCLib` tests have been
passing, but they called `LRCLib.fetch` with values for `artist` and
`title` fields only, while the actual functionality *always* provides
values for `album` and `duration` fields too.
When I adjusted the test to provide values for the missing fields,
I found that it failed. This makes sense: Lib `album` and `duration`
filters are strict on LRCLib, so I was not surprised the lyrics could
not be found.
Thus I adjusted `LRCLib` backend implementation to only filter by each
of these fields when their values are truthy.
Modified `search_pairs` function in `lyrics.py` to:
* Firstly strip each of `artist`, `artist_sort` and `title` fields
* Only generate alternatives if both `artist` and `title` are not empty
* Ensure that `artist_sort` is not empty and not equal to artist (ignoring
case) before appending it to the artists
Extended tests to cover the changes.
- Consolidated multiple test cases into parameterized tests for better
readability and maintainability.
- Simplified assertions by comparing lists of actual and expected
artists/titles.
- Added `unexpected_empty_artist` marker to handle cases which
unexpectedly return an empty artist. This seems to be happen when
`artist_sort` field is empty.
- Replaced unittest.mock with pytest fixtures for better test isolation and readability.
- Simplified test cases by using parameterized tests.
- Added `requests-mock` dependency to `pyproject.toml` and `poetry.lock`.
- Removed redundant helper functions and classes.