mirror of https://github.com/kemayo/leech synced 2025-12-06 08:22:56 +01:00
Commit graph

34 commits

Author SHA1 Message Date
David Lynch
5cb887f767 Move image processing into sites
The epub-builder still downloads the image, but all the html-mangling
is done in the extraction process now.

Turns footnotes into a chapter-object, for easier processing later on.
2025-03-22 19:39:16 -05:00
Kevin Pedro
de6913a9af simplify algorithm 2025-03-08 09:48:32 -06:00
Kevin Pedro
d4e1214be3 return to loop-based algorithm 2025-03-08 09:40:42 -06:00
Kevin Pedro
b2f15eb76c satisfy linter 2025-03-05 21:03:35 -06:00
Kevin Pedro
280b242a27 stop loop once a new link is found 2025-03-05 20:56:47 -06:00
Kevin Pedro
0066a148bb process all next_link items 2025-03-05 20:56:47 -06:00
David Lynch
ffb8e54e91 Better error for an Arbitrary story that fetches no content 2024-11-23 23:07:16 -06:00
David Lynch
a39e1e9f89 Use the newer syntax for attrs 2024-11-23 19:42:35 -06:00
David Lynch
9510a22cb0 Remove arbitrary's special-case image loading, since the default works 2024-11-23 16:33:01 -06:00
David Lynch
21834bb5ed _clean takes a base argument and reformats image srcs into absolute urls 2024-11-23 15:30:57 -06:00
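
The commit above describes rewriting image `src` values into absolute URLs against a base. A minimal sketch of that idea, using only the standard library, might look like the following. The function name `absolutize_src` and the example URLs are hypothetical and are not taken from the repository; the actual `_clean` method may work differently.

```python
from urllib.parse import urljoin


def absolutize_src(base: str, src: str) -> str:
    """Resolve an image src against the page's base URL.

    Hypothetical helper for illustration; not the repository's code.
    """
    # urljoin leaves already-absolute URLs untouched and resolves
    # relative, root-relative, and scheme-relative ones against base.
    return urljoin(base, src)


if __name__ == "__main__":
    base = "https://example.com/story/chapter-1/"
    for src in ("images/banner.png", "/img/a.png", "//cdn.example.com/a.png"):
        print(absolutize_src(base, src))
```

Resolving at extraction time means later stages can fetch each image without needing to know which page it came from.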
David Lynch
a0a057c48c _soup always returns a base URL 2024-11-23 15:15:29 -06:00
Idan Dor
31f663c6e0 Added image embedding support for epub
Specifically, added image_selector for arbitrary sites that allows
selecting img tags from chapters, downloading them
and embedding them within the resulting epub.

In the case of Pale, this means that the character banners and
extra materials do not require an internet connection to view.

Also made the two pale.json's more consistent (pale.json now correctly
includes the title of the chapters).
2024-11-23 13:22:53 -06:00
David Lynch
f25befc237 Decode cloudflare email address protection
Makes a generic _clean function on Site that can be called. Will
probably want to migrate some other generic bits into there after
analysis of what's *really* generic.
2021-03-27 10:46:39 -05:00
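
For context on the commit above: Cloudflare's email protection is commonly described as storing a hex string in which the first byte is an XOR key and each remaining byte is one character of the address XORed with that key. The sketch below decodes such a string under that assumption. The names `decode_cfemail` and `encode_cfemail` are hypothetical, and this is not the repository's `_clean` implementation.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a hex string in the assumed Cloudflare email format.

    Assumption: byte 0 is the XOR key, bytes 1.. are the obfuscated address.
    """
    data = bytes.fromhex(encoded)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def encode_cfemail(email: str, key: int = 0x42) -> str:
    """Inverse of decode_cfemail, included only to exercise the sketch."""
    return bytes([key] + [b ^ key for b in email.encode("utf-8")]).hex()


if __name__ == "__main__":
    token = encode_cfemail("user@example.com")
    print(token, "->", decode_cfemail(token))
```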
claasjg
d4f3986515 Detect URL loop with next selector 2021-03-19 14:49:38 +01:00
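
One common way to detect a loop when following "next" links is to remember every URL already visited and stop when one repeats. The sketch below illustrates that approach; `follow_next_links` and the `get_next` callback are hypothetical names, and the commit itself may detect loops differently.

```python
from typing import Callable, Iterator, Optional


def follow_next_links(
    start_url: str, get_next: Callable[[str], Optional[str]]
) -> Iterator[str]:
    """Yield each page URL once, stopping at a dead end or a repeated URL.

    get_next is a stand-in for "fetch the page and apply the next selector".
    """
    seen = set()
    url: Optional[str] = start_url
    while url and url not in seen:
        seen.add(url)
        yield url
        url = get_next(url)


if __name__ == "__main__":
    # A tiny fake site in which page "c" links back to page "a".
    links = {"a": "b", "b": "c", "c": "a"}
    print(list(follow_next_links("a", links.get)))
```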
David Lynch
28cc1fbcc7 Arbitrary should store contents as a string, not a bs4 Tag
It coincidentally works by being string-like for previous uses, but it's
not string-like enough for the new unicode stuff.

Fixes #54
2021-02-05 19:58:47 -06:00
IdanDor
6d7b5ffcf0 Removed trailing whitespace. 2021-01-23 13:30:03 +02:00
IdanDor
1afac50437 Made arbitrary sites no longer leak memory and fixed worm epub.
Each `Chapter` object held a reference to the entire page tree, so the program's RAM usage grew substantially.

Converted Worm to use next_selector so that the chapters are correctly ordered, E.2 is not skipped, and the download does not crash due to the `?share=twitter` URL that was matched before.

Fixed Worm titles.
2021-01-23 12:12:48 +02:00
David Lynch
c208e33752 Arbitrary: strip all namespaced elements
This is `fb:like` and similar, which break some epub readers.

Refs: #41, #43
2020-09-08 23:04:47 -05:00
David Lynch
6fbdc8843d Make arbitrary site chapter-title selectors more resilient 2020-04-29 17:55:20 -05:00
David Lynch
532a7c6682 Fix typo of title_element in arbitrary
Fixes #25
2019-07-30 09:37:03 -05:00
David Lynch
2bd5d77715 Helper for URL-joining 2019-05-29 01:55:35 -05:00
David Lynch
02bd6ae0c6 Merge pull request #16 from AlexRaubach/covers
Download cover art from RR and arbitrary sites
2018-10-01 12:18:39 -05:00
David Lynch
929284b67d New features for arbitrary sites
* next_selector: find next content page, if not using chapter selector
* content_title_selector: pull a chapter title from the content
* content_text_selector: pull specific text from the content element

`content_selector` will now fetch all content elements on the page, each
as a Chapter, not just the first one that matches.
2018-10-01 11:18:39 -05:00
Alex Raubach
ff568eef10 Allow arbitrary sites to include a cover url 2018-09-02 22:08:36 -04:00
Alex Raubach
1bfc9b75f7 Remove unneeded whitespace 2018-08-28 23:24:59 -04:00
Alex Raubach
2019616505 Check that the chapter has content before parsing
Trying to select the first element in line 87 will throw a list index out of range error if there is no content matching the selector.
2018-08-28 21:59:16 -04:00
David Lynch
6d52c72c99 Use logging instead of print
Fixes #10
2017-11-04 00:09:09 -05:00
David Lynch
257ab69394 Arbitrary handler: canonicalize URLs 2017-10-22 17:31:10 -05:00
David Lynch
dc0d2162fb Arbitrary handler had misplaced url arg 2017-10-22 17:06:40 -05:00
Will Oursler
5bd07a5b90 Splits out ebook generation logic into a separate module, in anticipation of maybe supporting multiple output formats. 2017-10-12 09:49:32 -04:00
David Lynch
d60c21cae3 Remove TODO from arbitrary
529b85c7 implemented this, so it's good.
2017-10-06 14:08:18 -05:00
David Lynch
529b85c7a6 Adjust Arbitrary so it can handle non-chapter works 2017-04-29 20:59:04 -05:00
David Lynch
17664125f3 Changed mind for arbitrary: JSON definitions 2017-04-24 22:02:16 -05:00
David Lynch
7171d2c9ea Add an arbitrary-site handler 2017-04-24 01:09:43 -05:00