beets/beetsplug
Francesco Grillo a79a86d5d6
Fix lyrics Unicode corruption and escaped quotes in Genius plugin
## Problem
The lyrics plugin has two bugs that corrupt fetched lyrics:

1. **Unicode corruption**: Characters like `ò`, `è`, `à` are corrupted to `√≤`, `√®`, etc.
2. **Escaped quotes**: Quotes appear as `\"` instead of `"` in lyrics

## Root Causes

### Issue 1: MacRoman encoding misdetection
- **Location**: `RequestHandler.fetch_text()` line 220
- **Cause**: Setting `r.encoding = None` forces requests to use `apparent_encoding`
- **Problem**: For Genius.com (and other sites), the charset detection behind `apparent_encoding` guesses MacRoman instead of UTF-8
- **Result**: UTF-8 bytes `c3 b2` (ò) decoded as MacRoman produces "√≤" (U+221A U+2264)
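
The byte-level effect described above can be reproduced with the standard library alone. This is a minimal sketch: `mojibake` and `repair` are hypothetical helper names used only for illustration, not functions from `lyrics.py`.

```python
def mojibake(text: str, wrong_codec: str = "mac_roman") -> str:
    """Encode as UTF-8, then decode with the wrong codec, as the bug does."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str = "mac_roman") -> str:
    """Reverse the mistake: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode(wrong_codec).decode("utf-8")


corrupted = mojibake("parò")
print(corrupted)                                      # the garbled form
print([f"U+{ord(ch):04X}" for ch in mojibake("ò")])   # code points of the two garbage characters
print(repair(corrupted))                              # round-trips back to the original
```

The round trip works here because every byte involved has a MacRoman mapping; the fix in this commit avoids the wrong decode in the first place rather than repairing it afterwards.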

### Issue 2: Incomplete JSON unescape
- **Location**: `Genius.scrape()` line 576
- **Cause**: The `remove_backslash` regex doesn't handle all escape patterns in JSON
- **Problem**: Genius embeds lyrics in JSON with patterns like `\\"` and `\\\\"`
- **Result**: After BeautifulSoup processing, escaped quotes remain in final text

## Solution

### Fix 1: Trust server encoding, fallback to UTF-8
```python
# OLD: r.encoding = None
# NEW:
if not r.encoding:
    r.encoding = 'utf-8'
```
- Respects server's declared encoding (UTF-8 for Genius)
- Falls back to UTF-8 if no encoding is specified (safer than `apparent_encoding`)
- Preserves original intent of handling misconfigured servers
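
The fallback rule can be exercised without network access. The sketch below uses a hypothetical stand-in object with `content` and `encoding` attributes instead of a real `requests.Response`, and `decode_body` is an illustrative name, not part of the plugin.

```python
from types import SimpleNamespace


def decode_body(response) -> str:
    """Trust a declared encoding; fall back to UTF-8 when none is set."""
    encoding = response.encoding or "utf-8"
    # errors="replace" keeps the sketch from raising on malformed bytes.
    return response.content.decode(encoding, errors="replace")


# Stand-ins for three server behaviours (assumed for illustration).
declared_utf8 = SimpleNamespace(content="parò".encode("utf-8"), encoding="utf-8")
undeclared = SimpleNamespace(content="parò".encode("utf-8"), encoding=None)
declared_latin1 = SimpleNamespace(content="parò".encode("latin-1"), encoding="ISO-8859-1")

for resp in (declared_utf8, undeclared, declared_latin1):
    print(decode_body(resp))
```

All three cases decode to the same text, which is the point of the change: a declared encoding is honoured, and only a missing one triggers the UTF-8 default.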

### Fix 2: Iteratively clean escaped quotes
```python
while '\\"' in lyrics:
    lyrics = lyrics.replace('\\"', '"')
```
- Handles variable escape levels (`\"`, `\\\"`, `\\\\\"`)
- Minimal change - keeps original `remove_backslash` regex
- Applied after BeautifulSoup to avoid interfering with HTML parsing
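
To check that the loop copes with different escape depths, it can be wrapped in a function and run against a few inputs. This is a sketch under assumptions: `strip_escaped_quotes` is a hypothetical name and the sample strings are made up for illustration.

```python
def strip_escaped_quotes(lyrics: str) -> str:
    """Remove every backslash that directly precedes a double quote."""
    # Each pass removes one backslash per escaped quote, so the string
    # shrinks on every iteration and the loop always terminates.
    while '\\"' in lyrics:
        lyrics = lyrics.replace('\\"', '"')
    return lyrics


samples = [
    r'\"I got big moves\"',          # one backslash per quote
    r'\\\"I got big moves\\\"',      # three backslashes per quote
    r'\\\\\"I got big moves\\\\\"',  # five backslashes per quote
]

for sample in samples:
    print(strip_escaped_quotes(sample))
```

Backslashes that are not followed by a quote are left untouched, so the loop only affects the pattern named in the bug report.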

## Testing

Tested with:
- Caparezza - "Argenti Vive" (Italian, many accented characters)
- WestsideGunn - "Heel Cena" (escaped quotes in lyrics)

Before:
```
mi si par√≤ davanti
\\"I got big moves\\"
```

After:
```
mi si parò davanti
"I got big moves"
```

## Impact
- Fixes lyrics for all languages with non-ASCII characters
- Fixes Genius lyrics with quotes
- No breaking changes - maintains backward compatibility
- Minimal code changes (14 lines total)
2025-12-23 22:31:21 +02:00
_utils Add retries for connection errors 2025-12-21 01:03:20 +00:00
bpd Catch ValueError when setting gst required version 2025-11-19 14:43:30 +03:00
lastgenre remove changes for lastgenre as there was an existing PR for that work 2025-12-17 15:57:23 -08:00
metasync Replace logging f-strings with arguments 2025-08-30 23:10:21 +01:00
web Web plugin: add type hint for g.lib 2025-11-15 21:02:43 +01:00
_typing.py Resurrect translation functionality 2025-02-20 03:47:04 +00:00
absubmit.py Do not use backslashes to deal with long strings 2025-08-30 23:10:20 +01:00
acousticbrainz.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
advancedrewrite.py refactor: convert _types from class attributes to cached properties 2025-07-16 14:45:25 +01:00
albumtypes.py Move musicbrainz to beetsplug directory 2025-05-16 19:56:50 +01:00
aura.py Replace string concatenation (' + ') 2025-08-30 23:10:15 +01:00
autobpm.py Fix plugin types 2025-07-16 14:06:34 +01:00
badfiles.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
bareasc.py Do not assign args to query 2025-07-08 11:37:34 +01:00
beatport.py pyupgrade Python 3.10 2025-11-08 12:09:52 +00:00
bench.py Move vfs.py to beetsplug._utils package to avoid polluting core namespace (#6017) 2025-10-01 12:28:18 +02:00
bpm.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
bpsync.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
bucket.py Replace string concatenation (' + ') 2025-08-30 23:10:15 +01:00
chroma.py pyupgrade Python 3.10 2025-11-08 12:09:52 +00:00
convert.py Fix convert --format with never_convert_lossy_files (#6171) 2025-12-03 22:48:41 +01:00
deezer.py pyupgrade Python 3.10 2025-11-08 12:09:52 +00:00
discogs.py pyupgrade Python 3.10 2025-11-08 12:09:52 +00:00
duplicates.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
edit.py Fix verbose comments and add e,c test 2025-12-09 12:14:03 -05:00
embedart.py New import location for art.py 2025-09-21 08:01:48 -07:00
embyupdate.py Replace logging f-strings with arguments 2025-08-30 23:10:21 +01:00
export.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
fetchart.py pyupgrade Python 3.10 2025-11-08 12:09:52 +00:00
filefilter.py Reformat the codebase 2024-09-21 11:57:48 +01:00
fish.py Replace string concatenation (' + ') 2025-08-30 23:10:15 +01:00
freedesktop.py Reformat the codebase 2024-09-21 11:57:48 +01:00
fromfilename.py Improve regexp and module docstring 2025-09-30 15:46:26 +02:00
ftintitle.py Add album template value in ftintitle plugin 2025-11-21 18:31:59 +01:00
fuzzy.py Reformat the codebase 2024-09-21 11:57:48 +01:00
hook.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
ihate.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
importadded.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
importfeeds.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
importsource.py importsource: fix potential prevent_suggest_removal crash 2025-12-21 13:07:02 +01:00
info.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
inline.py Fix recursion in inline plugin when item_fields shadow DB fields (#6115) 2025-11-20 15:57:22 -05:00
ipfs.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
keyfinder.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
kodiupdate.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
lastimport.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
limit.py Do not assign args to query 2025-07-08 11:37:34 +01:00
listenbrainz.py Removed data source as listenbrainz is not an metadata source plugin. 2025-09-04 17:41:12 +02:00
loadext.py Use only plugins/disabled_plugins config in plugin loading 2025-08-09 15:11:58 +01:00
lyrics.py Fix lyrics Unicode corruption and escaped quotes in Genius plugin 2025-12-23 22:31:21 +02:00
mbcollection.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
mbpseudo.py musicbrainz: remove error handling 2025-12-20 01:35:52 +00:00
mbsubmit.py Move PromptChoice to beets.util module 2025-12-02 01:51:14 +00:00
mbsync.py Renamed import in mbsync and missing plugins. 2025-07-15 15:03:14 +02:00
missing.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
mpdstats.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
mpdupdate.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
musicbrainz.py Ensure that inc are joined with a plus 2025-12-21 01:03:20 +00:00
parentwork.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
permissions.py Apply formatting 2024-03-01 15:21:25 +10:00
play.py Move PromptChoice to beets.util module 2025-12-02 01:51:14 +00:00
playlist.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
plexupdate.py Replace string concatenation (' + ') 2025-08-30 23:10:15 +01:00
random.py Do not assign args to query 2025-07-08 11:37:34 +01:00
replace.py Feat: Add replace plugin (#5644) 2025-05-27 00:17:52 +02:00
replaygain.py pyupgrade Python 3.10 2025-11-08 12:09:52 +00:00
rewrite.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
scrub.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
smartplaylist.py Fix URL-encoding path conversion 2025-12-02 09:27:24 -05:00
sonosupdate.py Apply formatting tools to all files 2023-10-22 09:53:18 +10:00
spotify.py expand tests to include check for track artists 2025-12-18 16:23:58 -08:00
subsonicplaylist.py Replace string concatenation (' + ') 2025-08-30 23:10:15 +01:00
subsonicupdate.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
substitute.py Apply substitute rules in sequence 2024-10-16 16:36:36 +02:00
the.py Do not use explicit indices for logging args when not needed 2025-08-30 23:10:21 +01:00
thumbnails.py Delegate attribute access to logging 2025-08-30 23:10:21 +01:00
titlecase.py Titlecase Plugin Improvements: Add preserving all lowercase and all upper case strings; Fix spelling of 'separator' in config, docs and code; Move most of the logging for the plugin to debug to keep log cleaner. 2025-12-16 18:56:39 -08:00
types.py Replace format calls with f-strings 2025-08-30 18:42:26 +01:00
unimported.py Replace string concatenation (' + ') 2025-08-30 23:10:15 +01:00
zero.py Remove tests. Update docs. Remove unnecessary return 2025-10-14 03:17:34 +01:00