Address failing google sources tests

Two Google sources failed to return the expected output. I looked into
why parsing failed in each case:

- lyrics on musica.com contain Google Ads wrapped in <aside> tags
- each lyrics line on lacoccinelle.net is wrapped within alternating
  <em> and <strong> tags

Thus remove these tags as part of the HTML cleanup logic.
Šarūnas Nejus 2024-10-02 01:35:01 +01:00
parent e99d457c9d
commit 3b73a26002
No known key found for this signature in database
GPG key ID: DD28F6704DBE3435

@@ -536,6 +536,8 @@ def _scrape_strip_cruft(html, plain_text_out=False):
     html = BREAK_RE.sub("\n", html)  # <br> eats up surrounding '\n'.
     html = re.sub(r"(?s)<(script).*?</\1>", "", html)  # Strip script tags.
     html = re.sub("\u2005", " ", html)  # replace unicode with regular space
+    html = re.sub("<aside .+?</aside>", "", html)  # remove Google Ads tags
+    html = re.sub(r"</?(em|strong)[^>]*>", "", html)  # remove italics / bold
     if plain_text_out:  # Strip remaining HTML tags
         html = COMMENT_RE.sub("", html)
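
The effect of the two added substitutions can be sketched in isolation. The helper name `strip_ads_and_emphasis` and the sample strings below are hypothetical and only illustrate the patterns from the diff; they are not taken from the real pages or from the project's test suite.

```python
import re


def strip_ads_and_emphasis(html: str) -> str:
    """Hypothetical helper that applies only the two patterns added above."""
    # Drop <aside ...>...</aside> blocks (the Google Ads seen on musica.com).
    html = re.sub("<aside .+?</aside>", "", html)
    # Drop opening and closing <em>/<strong> tags but keep the text inside.
    html = re.sub(r"</?(em|strong)[^>]*>", "", html)
    return html


# Made-up inputs imitating the two failing sources.
musica = 'First line\n<aside class="ad">Google Ad</aside>\nSecond line'
lacoccinelle = "<em>Line one</em>\n<strong>Line two</strong>"

print(repr(strip_ads_and_emphasis(musica)))        # 'First line\n\nSecond line'
print(repr(strip_ads_and_emphasis(lacoccinelle)))  # 'Line one\nLine two'
```

Two properties of these patterns are worth keeping in mind. The aside pattern has no `(?s)` flag, so `.` does not match newlines and an `<aside>` block spanning several lines would be left in place. The emphasis pattern has no word boundary after the tag name, so it would also strip any other tag whose name starts with `em` or `strong`.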