mirror of https://github.com/kemayo/leech synced 2025-12-06 16:33:16 +01:00

Example of a smarter approach to books with a "next" link

Authors are often strangely bad at keeping an accurate table of
contents. (E.g. A Practical Guide to Evil has at least one mislinked
chapter in its table of contents.) Show how to follow a "next" link
instead, stopping when the next link matches a certain URL pattern.

For A Practical Guide to Evil, this also has the benefit of placing the
extra chapters where readers originally encountered them.
David Lynch 2020-09-08 22:02:58 -05:00
parent 91747edb53
commit 9c9877ed26
7 changed files with 65 additions and 24 deletions

@@ -83,7 +83,7 @@ Example `practical.json`:
     "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
     "title": "A Practical Guide To Evil: Book 1",
     "author": "erraticerrata",
-    "chapter_selector": "#main .entry-content > ul > li > a",
+    "chapter_selector": "#main .entry-content > ul:nth-of-type(1) > li > a",
     "content_selector": "#main .entry-content",
     "filter_selector": ".sharedaddy, .wpcnt, style",
     "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
@@ -98,7 +98,27 @@ This tells leech to load `url`, follow the links described by `chapter_selector`
 If `chapter_selector` isn't given, it'll create a single-chapter book by applying `content_selector` to `url`.
-This is a fairly viable way to extract a story from, say, a random Wordpress installation. It's relatively likely to get you at least *most* of the way to the ebook you want, with maybe some manual editing needed.
+This is a fairly viable way to extract a story from, say, a random Wordpress installation with a convenient table of contents. It's relatively likely to get you at least *most* of the way to the ebook you want, with maybe some manual editing needed.
+A more advanced example with JSON would be:
+```
+{
+    "url": "https://practicalguidetoevil.wordpress.com/2015/03/25/prologue/",
+    "title": "A Practical Guide To Evil: Book 1",
+    "author": "erraticerrata",
+    "content_selector": "#main .entry-wrapper",
+    "content_title_selector": "h1.entry-title",
+    "content_text_selector": ".entry-content",
+    "filter_selector": ".sharedaddy, .wpcnt, style",
+    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
+    "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
+}
+```
+Because there's no `chapter_selector` here, leech will keep on looking for a link which it can find with `next_selector` and following that link. *Yes*, it would be easy to make this an endless loop; don't do that. We also see more advanced metadata acquisition here, with `content_title_selector` and `content_text_selector` being used to find specific elements from within the content.
+If multiple matches for `content_selector` are found, leech will assume multiple chapters are present on one page, and will handle that. If you find a story that you want on a site which has all the chapters in the right order and next-page links, this is a notably efficient way to download it. See `examples/dungeonkeeperami.json` for this being used.
 If you need more advanced behavior, consider looking at...
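
The "keep following `next_selector`" behavior described in this hunk can be sketched roughly as below. This is not leech's actual implementation: `NextLinkFinder`, `follow_next_links`, the `fetch` callback, the `max_pages` cap, and the `example.com` pages are all hypothetical, and the standard-library HTML parser only approximates the CSS selector `a[rel="next"]:not([href*="prologue"])`. The visited check is one possible guard against the endless loop the README warns about.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a rel="next"> whose URL lacks stop_word."""

    def __init__(self, stop_word):
        super().__init__()
        self.stop_word = stop_word
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        rel = (attrs.get("rel") or "").split()
        # Rough stand-in for: a[rel="next"]:not([href*="prologue"])
        if "next" in rel and href and self.stop_word not in href:
            self.next_href = href


def follow_next_links(start_url, fetch, stop_word="prologue", max_pages=1000):
    """Return the page URLs reached from start_url by following "next" links.

    fetch is a hypothetical callable mapping a URL to its HTML text.
    """
    seen = []
    url = start_url
    # Stop on a missing link, a revisited page (loop), or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.append(url)
        finder = NextLinkFinder(stop_word)
        finder.feed(fetch(url))
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return seen


# Hypothetical site: the last chapter links on to the next book's prologue.
PAGES = {
    "https://example.com/prologue/": '<a rel="next" href="/chapter-1/">Next</a>',
    "https://example.com/chapter-1/": '<a rel="prev" href="/prologue/">Prev</a>'
                                      '<a rel="next" href="/chapter-2/">Next</a>',
    "https://example.com/chapter-2/": '<a rel="next" href="/prologue-2/">Next</a>',
}

chapters = follow_next_links("https://example.com/prologue/", PAGES.__getitem__)
print(chapters)
```

In this sketch the walk ends at `chapter-2` because its only "next" link contains `prologue`, which mirrors how each book's definition stops before the following book begins.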

@@ -1,9 +1,11 @@
 {
-    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
+    "url": "https://practicalguidetoevil.wordpress.com/2015/03/25/prologue/",
     "title": "A Practical Guide To Evil: Book 1",
     "author": "erraticerrata",
-    "chapter_selector": "#main .entry-content > ul:nth-of-type(1) > li > a",
-    "content_selector": "#main .entry-content",
+    "content_selector": "#main .entry-wrapper",
+    "content_title_selector": "h1.entry-title",
+    "content_text_selector": ".entry-content",
     "filter_selector": ".sharedaddy, .wpcnt, style",
+    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
     "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
 }

@@ -1,9 +1,11 @@
 {
-    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
+    "url": "https://practicalguidetoevil.wordpress.com/2015/11/04/prologue-2/",
     "title": "A Practical Guide To Evil: Book 2",
     "author": "erraticerrata",
-    "chapter_selector": "#main .entry-content > ul:nth-of-type(2) > li > ul > li > a",
-    "content_selector": "#main .entry-content",
+    "content_selector": "#main .entry-wrapper",
+    "content_title_selector": "h1.entry-title",
+    "content_text_selector": ".entry-content",
     "filter_selector": ".sharedaddy, .wpcnt, style",
+    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
     "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
 }

@@ -1,9 +1,11 @@
 {
-    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
+    "url": "https://practicalguidetoevil.wordpress.com/2017/02/08/prologue-3/",
     "title": "A Practical Guide To Evil: Book 3",
     "author": "erraticerrata",
-    "chapter_selector": "#main .entry-content > ul:nth-of-type(3) > li > a",
-    "content_selector": "#main .entry-content",
+    "content_selector": "#main .entry-wrapper",
+    "content_title_selector": "h1.entry-title",
+    "content_text_selector": ".entry-content",
     "filter_selector": ".sharedaddy, .wpcnt, style",
+    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
     "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
 }

@@ -1,9 +1,11 @@
 {
-    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
+    "url": "https://practicalguidetoevil.wordpress.com/2018/04/09/prologue-4/",
     "title": "A Practical Guide To Evil: Book 4",
     "author": "erraticerrata",
-    "chapter_selector": "#main .entry-content > ul:nth-of-type(4) > li > a",
-    "content_selector": "#main .entry-content",
+    "content_selector": "#main .entry-wrapper",
+    "content_title_selector": "h1.entry-title",
+    "content_text_selector": ".entry-content",
     "filter_selector": ".sharedaddy, .wpcnt, style",
+    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
     "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
 }

@@ -1,9 +1,11 @@
 {
-    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
+    "url": "https://practicalguidetoevil.wordpress.com/2019/01/14/prologue-5/",
     "title": "A Practical Guide To Evil: Book 5",
     "author": "erraticerrata",
-    "chapter_selector": "#main .entry-content > ul:nth-of-type(5) > li > a",
-    "content_selector": "#main .entry-content",
+    "content_selector": "#main .entry-wrapper",
+    "content_title_selector": "h1.entry-title",
+    "content_text_selector": ".entry-content",
     "filter_selector": ".sharedaddy, .wpcnt, style",
+    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
     "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
 }

examples/practical6.json

@@ -0,0 +1,11 @@
+{
+    "url": "https://practicalguidetoevil.wordpress.com/2020/01/06/prologue-6/",
+    "title": "A Practical Guide To Evil: Book 6",
+    "author": "erraticerrata",
+    "content_selector": "#main .entry-wrapper",
+    "content_title_selector": "h1.entry-title",
+    "content_text_selector": ".entry-content",
+    "filter_selector": ".sharedaddy, .wpcnt, style",
+    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
+    "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
+}
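
The definitions in this commit pair `content_selector` with `content_title_selector` and `content_text_selector`, and the README change notes that several `content_selector` matches on one page are treated as several chapters. The sketch below illustrates that idea only; it is not how leech does it. It assumes well-formed markup so the standard-library XML parser can stand in for a real HTML parser, uses ElementTree path expressions in place of the CSS selectors, and the `PAGE` sample and `split_chapters` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page holding two chapters, each in its own wrapper element.
PAGE = """
<div id="main">
  <div class="entry-wrapper">
    <h1 class="entry-title">Chapter 1</h1>
    <div class="entry-content"><p>First text.</p></div>
  </div>
  <div class="entry-wrapper">
    <h1 class="entry-title">Chapter 2</h1>
    <div class="entry-content"><p>Second text.</p></div>
  </div>
</div>
"""


def split_chapters(page):
    """Return one (title, text) pair per wrapper element found on the page."""
    root = ET.fromstring(page)
    chapters = []
    # Stand-in for content_selector: "#main .entry-wrapper"
    for wrapper in root.findall(".//div[@class='entry-wrapper']"):
        # Stand-ins for content_title_selector and content_text_selector,
        # both applied inside the matched wrapper rather than the whole page.
        title = wrapper.find(".//h1[@class='entry-title']")
        text = wrapper.find(".//div[@class='entry-content']")
        chapters.append((title.text, "".join(text.itertext()).strip()))
    return chapters


chapters = split_chapters(PAGE)
for title, text in chapters:
    print(title, "|", text)
```

Scoping the title and text lookups to each wrapper is what keeps the two chapters separate; searching the whole page for titles would lose the pairing.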