The return type of the stage decorator should in theory be `T|None`
but the return of task types is not consistent in its usage. Would need
some bigger changes for which I'm not ready at the moment.
Refactor `beetsplug` to use native namespace packages by removing
`__init__.py`. Update documentation and `setup.cfg` to support namespace
packages.
### Motivation
Adopt PEP 420 native namespace packages to simplify plugin management
and eliminate the need for `__init__.py`.
See https://realpython.com/python-namespace-package.
This setup is backwards-compatible, so plugins using the old
pkgutil-based setup will continue working fine.
The advantage with this setup is that external plugins will now be able
to import modules from 'beetsplug' package for typing purposes.
Previously, mypy could not resolve these modules due to presence of
`__init__.py`.
URL-encode additional item `fields` within generated EXTM3U playlists instead of JSON-encoding them.
This is because JSON-encoding additional fields/attributes made it difficult to parse the `EXTINF` line but using URL-encoding for these values makes parsing easy (because URL-encoded values cannot contain commas, quotation marks and spaces).
I introduced the generation of additional EXTM3U item fields earlier this year and I want to correct that now.
**Design/definition background:**
Unfortunately, I didn't find a clear definition of how additional playlist item attributes should be encoded - apparently there is none.
Given that item URIs within an M3U playlist can be URL-encoded already, defining the values of additional attributes to be URL-encoded is consistent design.
I didn't find examples of additional EXTM3U item attributes in the web where the attribute value contains a comma, space or quotation mark but examples that specified numeric IDs and URLs as attribute values.
Because the URL attribute examples I found didn't contain URL-encoded characters and because it is more readable and unproblematic for parsing, I've let the attribute URL encoding treat `:` and `/` as safe characters.
**Breaking change:**
While this is a breaking change in theory, in practice it is not since afaik all integrations of the smartplaylist plugin's additional EXTM3U item attribute generation feature (beets-webm3u) work with simple attribute values such as the item ID (numeric) whose formatting/encoding is not affected when changing from JSON to URL-encoding.
In other words the change is backward-compatible with the beets-webm3u plugin (which I'll adjust correspondingly after this beets PR was merged).
- genres are now called tags
- tags needs to be in "mb fetch includes"
- release-group has them
- release has them
- and recording as well but we don't use them
- not sure what this outdated check was doing
- see musicbrainz.VALID_INCLUDES for reference
See https://realpython.com/python-namespace-package.
This setup is backwards-compatible, so plugins using the old
pkgutil-based setup will continue working fine.
This setup has an advantage where external plugins will now be able to
import modules from 'beetsplug' package for typing purposes. Previously,
mypy could not resolve these modules due to presence of `__init__.py`.
### Bug Fixes
- Fixed#4791: Resolved an issue with the Genius backend where it
couldn't match lyrics if there was a slight variation in the artist's
name.
### Plugin Enhancements
* **Session Management**: Introduced a `TimeoutSession` to enable
connection pooling and maintain consistent configuration across
requests.
* **Error Handling**: Centralized error handling logic in a new
`RequestsHandler` class, which includes methods for retrieving either
HTML text or JSON data.
* **Logging**: Added methods to ensure the backend name is included in
log messages.
### Configuration Changes
* Added a new `dist_thresh` field to the configuration, allowing users
to control the maximum tolerable mismatch between the artist and title
of the lyrics search result and their item. Interestingly, this field
was previously available (though undocumented) and used in the
`Tekstowo` backend. Now, this threshold has also been applied to
**Genius** and **Google** search logic.
### Backend Updates
* All backends that perform searches now validate each result against
the configured `dist_thresh`.
#### Genius
* Removed the need to scrape HTML tags for lyrics; instead, lyrics are
now parsed from the JSON data embedded in the HTML. This change should
reduce our vulnerability to Genius' frequent alterations in their HTML
structure.
* Documented the structure of their search JSON data.
#### Google
* Typed the response data returned by the Google Custom Search API.
* Excluded certain pages under **https://letras.mus.br** that do not
contain lyrics.
* Excluded all results from MusiXmatch, as we cannot access their pages.
* Improved parsing of URL titles (used for matching item/lyrics
artist/title):
- Handled results from long search queries where URL titles are
truncated with an ellipsis.
- Enhanced URL title cleanup logic.
- Added functionality to determine (or rather, guess) not only the track
title but also the artist from the URL title.
* Similar to #5406, search results are now compared to the original item
and sorted by distance. Results exceeding the configured `dist_thresh`
value are discarded. The previous functionality simply selected the
first result containing the track's title in its URL, which often led to
returning lyrics for the wrong artist, particularly for short track
titles.
* Since we now fetch lyrics confidently, redundant checks for valid
lyrics and credits cleanup have been removed.
### HTML Cleanup
* Organized regex patterns into a new `Html` class.
* Adjusted patterns to ensure new lines between blocks of lyrics text
scraped from `letras.mus.br` and `musica.com`.
* Modified patterns to scrape missing lyrics text on `paroles.net` and
`lacoccinelle.net`. See the diff in `test/plugins/lyrics_page.py`.
If we get caught by Cloudfare, it forwards our request somewhere else
and returns some validation text response. To make sure that this text
does not get assumed for lyrics, we can disable redirects for the Google
backend, check the response code and raise if there's a redirect
attempt. This source will then be skipped and the backend continues with
the next one.
Additionally, improve HTML pre-processing:
* Ensure a new line between blocks of lyrics text from letras.mus.br.
* Parse a missing last block of lyrics text from lacocinelle.net.
* Parse a missing last block of lyrics text from paroles.net.
* Fix encoding issues with AZLyrics by setting response encoding to
None, allowing `requests` to handle it.
* Type the response data that Google Custom Search API return.
* Exclude some 'letras.mus.br' pages that do not contain lyric.
* Exclude results from Musixmatch as we cannot access their pages.
* Improve parsing of the URL title:
- Handle long URL titles that get truncated (end with ellipsis) for
long searches
- Remove domains starting with 'www'
- Parse the title AND the artist. Previously this would only parse the
title, and fetch lyrics even when the artist did not match.
* Remove now redundant credits cleanup and checks for valid lyrics.
Tidy up 'Google.is_page_candidate' method and remove 'Google.sluggify'
method which was a duplicate of 'slug'.
Since 'GeniusFetchTest' only tested whether the artist name is cleaned
up (the rest of the functionality is patched), remove it and move its
test cases to the 'test_slug' test.
Having removed it I fuond that only the Genius lyrics changed: it had en
extra new line. Thus I defined a function 'collapse_newlines' which now
gets called for the Genius lyrics.
Improve requests performance with requests.Session which uses connection
pooling for repeated requests to the same host.
Additionally, this centralizes request configuration, making sure that
we use the same timeout and provide beets user agent for all requests.
This commit introduces a distance threshold mechanism for the Genius and
Google backends.
- Create a new `SearchBackend` base class with a method `check_match`
that performs checking.
- Start using undocumented `dist_thresh` configuration option for good,
and mention it in the docs. This controls the maximum allowable
distance for matching artist and title names.
These changes aim to improve the accuracy of lyrics matching, especially
when there are slight variations in artist or title names, see #4791.
## Description
Added a quick checkpoint to ensure the config file is set up correctly
prior to users importing their music library. This was something I
discovered later after running into an issue with my config file and
hope it helps new users avoid the issues I had.
## New config option `keep_existing` (#4982)
- Fix the behavior of the`force` option. Previously disabling the option
had "incomplete" behaviour:
- If content was found, a whitelist check was issued and if valid the
plugin exited early and logged ("keep").
- This whitelist check was not aware of multiple genres (separated
typically by a string like `, `), thus it failed erased all existing
genres and overwrote with new ones.
_**This didn't feel like a typical behaviour of a `force` option, which
this PR tries to improve as follows...**_
- String-separated multi-genres are now compiled into a list and
depending on the `whitelist` option are kept and enriched with freshly
fetched last.fm genres.
- If force is off, pre-populated tags are not touched.
- A lot of refactoring was done, some absolutely required, some as a
preparation for future work on the plugin.
- The main processing function `_get_genre` was massively overhauled and
got a new `pytest.mark.parametrize` test which includes much more test
cases.