mirror of https://github.com/kemayo/leech synced 2025-12-06 08:22:56 +01:00

Detect URL loop with next selector

claasjg 2021-03-19 14:47:10 +01:00
parent ce998c84c3
commit d4f3986515
No known key found for this signature in database
GPG key ID: C57EEC36968F8EA0
2 changed files with 5 additions and 2 deletions

@@ -116,7 +116,7 @@ A more advanced example with JSON would be:
 }
 ```
-Because there's no `chapter_selector` here, leech will keep on looking for a link which it can find with `next_selector` and following that link. *Yes*, it would be easy to make this an endless loop; don't do that. We also see more advanced metadata acquisition here, with `content_title_selector` and `content_text_selector` being used to find specific elements from within the content.
+Because there's no `chapter_selector` here, leech will keep on looking for a link which it can find with `next_selector` and following that link. We also see more advanced metadata acquisition here, with `content_title_selector` and `content_text_selector` being used to find specific elements from within the content.
 If multiple matches for `content_selector` are found, leech will assume multiple chapters are present on one page, and will handle that. If you find a story that you want on a site which has all the chapters in the right order and next-page links, this is a notably efficient way to download it. See `examples/dungeonkeeperami.json` for this being used.

@@ -75,8 +75,11 @@ class Arbitrary(Site):
                 for chapter in self._chapter(chapter_url, definition, title=chapter_link.string):
                     story.add(chapter)
         else:
+            # set of already processed urls. Stored to detect loops.
+            found_content_urls = set()
             content_url = definition.url
-            while content_url:
+            while content_url and content_url not in found_content_urls:
+                found_content_urls.add(content_url)
                 for chapter in self._chapter(content_url, definition):
                     story.add(chapter)
                 if definition.next_selector:
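
The guard added in the diff above can be sketched standalone. This is a minimal, hypothetical illustration of the visited-set pattern the commit introduces, not leech's actual code: `crawl` and `fetch_next` are invented names standing in for the `next_selector` logic, which in leech resolves the next link from the fetched page.

```python
def crawl(start_url, fetch_next):
    """Follow next-page links, stopping if a URL repeats."""
    found_content_urls = set()  # already-processed URLs, kept to detect loops
    visited_order = []
    content_url = start_url
    while content_url and content_url not in found_content_urls:
        found_content_urls.add(content_url)
        visited_order.append(content_url)
        content_url = fetch_next(content_url)  # next link, or None at the end
    return visited_order

# A site whose last page links back to the first would previously loop forever;
# with the guard, each chapter is processed exactly once.
links = {"/ch1": "/ch2", "/ch2": "/ch3", "/ch3": "/ch1"}
print(crawl("/ch1", links.get))  # → ['/ch1', '/ch2', '/ch3']
```

A set is the natural choice here: membership testing is O(1) on average, so the loop check adds negligible cost per page fetched.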