Compare commits


No commits in common. "main" and "v2.5.0" have entirely different histories.
main ... v2.5.0

508 changed files with 43,027 additions and 163,391 deletions

.gitignore
View file

@@ -15,13 +15,6 @@
# usually perl -pi.back -e edits.
*.back
*.bak
# pycharm project specific settings files
.idea
# vscode project specific settings file
.vscode
cleanup.sh
FanFictionDownLoader.zip
@@ -33,5 +26,3 @@ build
dist
FanFicFare.egg-info
personal.ini
appcfg_oauth2_tokens
venv/

View file

@@ -1,3 +1 @@
include DESCRIPTION.rst
include README.md
include LICENSE

View file

@@ -1,71 +1,19 @@
[FanFicFare](https://github.com/JimmXinu/FanFicFare)
FanFicFare
==========
FanFicFare makes reading stories from various websites much easier by helping
you download them to EBook files.
This is a repository for the FanFicFare project.
FanFicFare was previously known as FanFictionDownLoader (AKA
FFDL, AKA fanficdownloader).
FanFicFare is the rename and move of the FanFictionDownLoader (AKA
FFDL, AKA fanficdownloader) project previously hosted as a
[code.google project].
Main features:
This program is available as a calibre plugin, a command-line
interface, and a web service.
- Download FanFiction stories from over [100 different sites](https://github.com/JimmXinu/FanFicFare/wiki/SupportedSites) into ebooks.
- Update previously downloaded EPUB format ebooks, downloading only new chapters.
- Get Story URLs from Web Pages.
- Support for downloading images in the story text. (EPUB and HTML
only -- download EPUB and convert to AZW3 for Kindle.) More details on
configuring images in stories and cover images can be found in the
[FAQs] or [this post in the old FFDL thread].
- Support for cover image. (EPUB only)
- Optionally keep an Update Log of past updates (EPUB only).
FanFicFare has now been launched. New versions, features and updates
will all be in FanFicFare.
There's additional info in the project [wiki] pages.
There's also a [FanFicFare maillist] for discussion and announcements and a [discussion thread] for the Calibre plugin.
Getting FanFicFare
==================
### Official Releases
This program is available as:
- A Calibre plugin from within Calibre or directly from the plugin [discussion thread], or;
- A Command Line Interface (CLI) [Python
package](https://pypi.python.org/pypi/FanFicFare) that you can
install with:
```
pip install FanFicFare
```
- _As of late November 2019, the web service version is shutdown. See the [Wiki Home](https://github.com/JimmXinu/FanFicFare/wiki#web-service-version) page for details._
### Test Versions
FanFicFare is released roughly every month, but new test versions are posted more frequently as changes are made.
Test versions are available at:
- The [test plugin] is posted at MobileRead.
- The test version of CLI for pip install is uploaded to the testpypi repository and can be installed with:
```
pip install --extra-index-url https://test.pypi.org/simple/ --upgrade FanFicFare
```
### Other Releases
Other versions may be available depending on your OS. I (JimmXinu) don't directly support these:
- **Arch Linux**: The latest CLI release can be obtained from the [fanficfare](https://aur.archlinux.org/packages/fanficfare) AUR package. It will install the calibre plugin, if calibre is installed.
[this post in the old FFDL thread]: https://www.mobileread.com/forums/showthread.php?p=1982785#post1982785
[FAQs]: https://github.com/JimmXinu/FanFicFare/wiki/FAQs#can-fanficfare-download-a-story-containing-images
[FanFicFare maillist]: https://groups.google.com/group/fanfic-downloader
[code.google project]: http://google-opensource.blogspot.com/2015/03/farewell-to-google-code.html
[wiki]: https://github.com/JimmXinu/FanFicFare/wiki
[discussion thread]: https://www.mobileread.com/forums/showthread.php?t=259221
[test plugin]: https://www.mobileread.com/forums/showthread.php?p=3084025&postcount=2

View file

@@ -1,9 +1,8 @@
[main]
host = https://www.transifex.com
[o:calibre:p:calibre-plugins:r:fanfictiondownloader]
[calibre-plugins.fanfictiondownloader]
file_filter = translations/<lang>.po
source_file = translations/en.po
source_lang = en
type = PO
type = PO

View file

@@ -4,7 +4,7 @@ from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2019, Jim Miller'
__copyright__ = '2016, Jim Miller'
__docformat__ = 'restructuredtext en'
import sys, os
@@ -32,9 +32,6 @@ except NameError:
# The class that all Interface Action plugin wrappers must inherit from
from calibre.customize import InterfaceActionBase
# pulled out from FanFicFareBase for saving in prefs.py
__version__ = (4, 57, 7)
## Apparently the name for this class doesn't matter--it was still
## 'demo' for the first few versions.
class FanFicFareBase(InterfaceActionBase):
@@ -51,8 +48,8 @@ class FanFicFareBase(InterfaceActionBase):
description = _('UI plugin to download FanFiction stories from various sites.')
supported_platforms = ['windows', 'osx', 'linux']
author = 'Jim Miller'
version = __version__
minimum_calibre_version = (2, 85, 1)
version = (2, 5, 0)
minimum_calibre_version = (1, 48, 0)
#: This field defines the GUI plugin class that contains all the code
#: that actually does something. Its format is module_path:class_name
@@ -105,19 +102,8 @@ class FanFicFareBase(InterfaceActionBase):
ac.apply_settings()
def load_actual_plugin(self, gui):
# so the sys.path was modified while loading the plug impl.
with self:
# Make sure the fanficfare module is available globally
# under its simple name, -- This is the only reason other
# plugin files can import fanficfare instead of
# calibre_plugins.fanficfare_plugin.fanficfare.
#
# Added specifically for the benefit of
# eli-schwartz/eschwartz's Arch Linux distro that wants to
# package FFF plugin outside Calibre.
import fanficfare
with self: # so the sys.path was modified while loading the
# plug impl.
return InterfaceActionBase.load_actual_plugin(self,gui)
def cli_main(self,argv):
@@ -125,10 +111,11 @@ class FanFicFareBase(InterfaceActionBase):
with self: # so the sys.path was modified appropriately
# I believe there's no performance hit loading these here when
# CLI--it would load everytime anyway.
from StringIO import StringIO
from calibre.library import db
from fanficfare.cli import main as fff_main
from calibre_plugins.fanficfare_plugin.fanficfare.cli import main as fff_main
from calibre_plugins.fanficfare_plugin.prefs import PrefsFacade
from fanficfare.six import ensure_text
from calibre.utils.config import prefs as calibre_prefs
from optparse import OptionParser
parser = OptionParser('%prog --run-plugin '+self.name+' -- [options] <storyurl>')
@@ -140,11 +127,12 @@ class FanFicFareBase(InterfaceActionBase):
pargs = [x for x in argv if x.startswith('--with-library') or x.startswith('--library-path')
or not x.startswith('-')]
opts, args = parser.parse_args(pargs)
fff_prefs = PrefsFacade(db(path=opts.library_path,
read_only=True))
read_only=True))
fff_main(argv[1:],
parser=parser,
passed_defaultsini=ensure_text(get_resources("fanficfare/defaults.ini")),
passed_personalini=ensure_text(fff_prefs["personal.ini"]),
passed_defaultsini=StringIO(get_resources("fanficfare/defaults.ini")),
passed_personalini=StringIO(fff_prefs["personal.ini"]),
)
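One side of the diff above passes plugin resources through `StringIO(...)`, the other through `ensure_text(...)`: under Python 3, `get_resources()` returns bytes while the downstream parser wants text. A minimal local stand-in for the `six.ensure_text` idea (illustrative only, not the actual `fanficfare.six` implementation):

```python
# Stand-in for six.ensure_text: decode bytes to str, pass str through unchanged.
def ensure_text(s, encoding="utf-8"):
    if isinstance(s, bytes):
        return s.decode(encoding)
    if isinstance(s, str):
        return s
    raise TypeError("not expecting type '%s'" % type(s))

print(ensure_text(b"[defaults]"))  # bytes are decoded to str
print(ensure_text("[defaults]"))   # str passes through as-is
```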

View file

@@ -1,6 +1,6 @@
<hr />
<p>Plugin created by Jim Miller, originally borrowing heavily from Grant Drake's
<p>Plugin created by Jim Miller, borrowing heavily from Grant Drake's
'<a href="http://www.mobileread.com/forums/showthread.php?t=134856">Reading List</a>',
'<a href="http://www.mobileread.com/forums/showthread.php?t=126727">Extract ISBN</a>' and
'<a href="http://www.mobileread.com/forums/showthread.php?t=134000">Count Pages</a>'
@@ -8,12 +8,12 @@
<p>
Calibre officially distributes plugins from the mobileread.com forum site.
The official distro channel and discussion thread for this plugin is there: <a href="http://www.mobileread.com/forums/showthread.php?t=259221">FanFicFare</a>
The official distro channel for this plugin is there: <a href="http://www.mobileread.com/forums/showthread.php?t=259221">FanFicFare</a>
</p>
<p> I also monitor the
<a href="http://groups.google.com/group/fanfic-downloader">general users
group</a> for the downloader CLI, too.
group</a> for the downloader. That covers the web application and CLI, too.
</p>
<p>

View file

@@ -1,20 +0,0 @@
from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2024, Jim Miller'
__docformat__ = 'restructuredtext en'
## References:
## https://www.mobileread.com/forums/showthread.php?p=4435205&postcount=65
## https://www.mobileread.com/forums/showthread.php?p=4102834&postcount=389
from calibre_plugins.action_chains.events import ChainEvent
class FanFicFareDownloadFinished(ChainEvent):
# replace with the name of your event
name = 'FanFicFare Download Finished'
def get_event_signal(self):
return self.gui.iactions['FanFicFare'].download_finished_signal

View file

@@ -1,62 +1,64 @@
# -*- coding: utf-8 -*-
from __future__ import (absolute_import, unicode_literals, division,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2015, Jim Miller'
__docformat__ = 'restructuredtext en'
import re
from PyQt5.Qt import (Qt, QSyntaxHighlighter, QTextCharFormat, QBrush)
from fanficfare.six import string_types
class BasicIniHighlighter(QSyntaxHighlighter):
'''
QSyntaxHighlighter class for use with QTextEdit for highlighting
ini config files.
I looked high and low to find a highlighter for basic ini config
format, so I'm leaving this in the project even though I'm not
using it.
'''
def __init__( self, parent, theme ):
QSyntaxHighlighter.__init__( self, parent )
self.parent = parent
self.highlightingRules = []
# keyword
self.highlightingRules.append( HighlightingRule( r"^[^:=\s][^:=]*[:=]",
Qt.blue,
Qt.SolidPattern ) )
# section
self.highlightingRules.append( HighlightingRule( r"^\[[^\]]+\]",
Qt.darkBlue,
Qt.SolidPattern ) )
# comment
self.highlightingRules.append( HighlightingRule( r"#[^\n]*" ,
Qt.darkYellow,
Qt.SolidPattern ) )
def highlightBlock( self, text ):
for rule in self.highlightingRules:
for match in rule.pattern.finditer(text):
self.setFormat( match.start(), match.end()-match.start(), rule.highlight )
self.setCurrentBlockState( 0 )
class HighlightingRule():
def __init__( self, pattern, color, style ):
if isinstance(pattern, string_types):
self.pattern = re.compile(pattern)
else:
self.pattern=pattern
charfmt = QTextCharFormat()
brush = QBrush(color, style)
charfmt.setForeground(brush)
self.highlight = charfmt
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2015, Jim Miller'
__docformat__ = 'restructuredtext en'
import re
try:
from PyQt5.Qt import (Qt, QSyntaxHighlighter, QTextCharFormat, QBrush)
except ImportError as e:
from PyQt4.Qt import (Qt, QSyntaxHighlighter, QTextCharFormat, QBrush)
class BasicIniHighlighter(QSyntaxHighlighter):
'''
QSyntaxHighlighter class for use with QTextEdit for highlighting
ini config files.
I looked high and low to find a highlighter for basic ini config
format, so I'm leaving this in the project even though I'm not
using it.
'''
def __init__( self, parent, theme ):
QSyntaxHighlighter.__init__( self, parent )
self.parent = parent
self.highlightingRules = []
# keyword
self.highlightingRules.append( HighlightingRule( r"^[^:=\s][^:=]*[:=]",
Qt.blue,
Qt.SolidPattern ) )
# section
self.highlightingRules.append( HighlightingRule( r"^\[[^\]]+\]",
Qt.darkBlue,
Qt.SolidPattern ) )
# comment
self.highlightingRules.append( HighlightingRule( r"#[^\n]*" ,
Qt.darkYellow,
Qt.SolidPattern ) )
def highlightBlock( self, text ):
for rule in self.highlightingRules:
for match in rule.pattern.finditer(text):
self.setFormat( match.start(), match.end()-match.start(), rule.highlight )
self.setCurrentBlockState( 0 )
class HighlightingRule():
def __init__( self, pattern, color, style ):
if isinstance(pattern,basestring):
self.pattern = re.compile(pattern)
else:
self.pattern=pattern
charfmt = QTextCharFormat()
brush = QBrush(color, style)
charfmt.setForeground(brush)
self.highlight = charfmt
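One branch of the diff above checks `isinstance(pattern, basestring)`, which only exists on Python 2; the other uses `six`'s `string_types`. A minimal sketch of the same compatibility shim, assuming no external dependency:

```python
# Minimal string_types shim in the spirit of six: on Python 2 it would be
# (str, unicode); on Python 3 it is just (str,).
import re
import sys

if sys.version_info[0] >= 3:
    string_types = (str,)
else:  # Python 2 only
    string_types = (str, unicode)  # noqa: F821

def compile_pattern(pattern):
    """Accept either a pattern string or an already-compiled regex."""
    if isinstance(pattern, string_types):
        return re.compile(pattern)
    return pattern

print(compile_pattern(r"^\[[^\]]+\]").pattern)  # compiled from the str form
```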

File diff suppressed because it is too large.

File diff suppressed because it is too large.

File diff suppressed because it is too large.

File diff suppressed because it is too large.

View file

@@ -1,116 +1,49 @@
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2020, Jim Miller'
__docformat__ = 'restructuredtext en'
from functools import reduce
from io import StringIO
import logging
logger = logging.getLogger(__name__)
from fanficfare import adapters
from fanficfare.configurable import Configuration
from calibre_plugins.fanficfare_plugin.prefs import prefs
from fanficfare.six import ensure_text
from fanficfare.six.moves import configparser
from fanficfare.six.moves import collections_abc
def get_fff_personalini():
return prefs['personal.ini']
def get_fff_config(url,fileform="epub",personalini=None):
if not personalini:
personalini = get_fff_personalini()
sections=['unknown']
try:
sections = adapters.getConfigSectionsFor(url)
except Exception as e:
logger.debug("Failed trying to get ini config for url(%s): %s, using section %s instead"%(url,e,sections))
configuration = Configuration(sections,fileform)
configuration.read_file(StringIO(ensure_text(get_resources("plugin-defaults.ini"))))
configuration.read_file(StringIO(ensure_text(personalini)))
return configuration
def get_fff_adapter(url,fileform="epub",personalini=None):
return adapters.getAdapter(get_fff_config(url,fileform,personalini),url)
def test_config(initext):
try:
configini = get_fff_config("test1.com?sid=555",
personalini=initext)
errors = configini.test_config()
except configparser.ParsingError as pe:
errors = pe.errors
return errors
class OrderedSet(collections_abc.MutableSet):
def __init__(self, iterable=None):
self.end = end = []
end += [None, end, end] # sentinel node for doubly linked list
self.map = {} # key --> [key, prev, next]
if iterable is not None:
self |= iterable
def __len__(self):
return len(self.map)
def __contains__(self, key):
return key in self.map
def add(self, key):
if key not in self.map:
end = self.end
curr = end[1]
curr[2] = end[1] = self.map[key] = [key, curr, end]
def discard(self, key):
if key in self.map:
key, prev, next = self.map.pop(key)
prev[2] = next
next[1] = prev
def __iter__(self):
end = self.end
curr = end[2]
while curr is not end:
yield curr[0]
curr = curr[2]
def __reversed__(self):
end = self.end
curr = end[1]
while curr is not end:
yield curr[0]
curr = curr[1]
def pop(self, last=True):
if not self:
raise KeyError('set is empty')
key = self.end[1][0] if last else self.end[2][0]
self.discard(key)
return key
def __repr__(self):
if not self:
return '%s()' % (self.__class__.__name__,)
return '%s(%r)' % (self.__class__.__name__, list(self))
def __eq__(self, other):
if isinstance(other, OrderedSet):
return len(self) == len(other) and list(self) == list(other)
return set(self) == set(other)
def get_common_elements(ll):
## returns a list of elements common to all lists in ll
## https://www.tutorialspoint.com/find-common-elements-in-list-of-lists-in-python
return list(reduce(lambda i, j: i & j, (OrderedSet(n) for n in ll)))
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2015, Jim Miller'
__docformat__ = 'restructuredtext en'
from StringIO import StringIO
from ConfigParser import ParsingError
import logging
logger = logging.getLogger(__name__)
from calibre_plugins.fanficfare_plugin.fanficfare import adapters, exceptions
from calibre_plugins.fanficfare_plugin.fanficfare.configurable import Configuration
from calibre_plugins.fanficfare_plugin.prefs import prefs
def get_fff_personalini():
return prefs['personal.ini']
def get_fff_config(url,fileform="epub",personalini=None):
if not personalini:
personalini = get_fff_personalini()
sections=['unknown']
try:
sections = adapters.getConfigSectionsFor(url)
except Exception as e:
logger.debug("Failed trying to get ini config for url(%s): %s, using section %s instead"%(url,e,sections))
configuration = Configuration(sections,fileform)
configuration.readfp(StringIO(get_resources("plugin-defaults.ini")))
configuration.readfp(StringIO(personalini))
return configuration
def get_fff_adapter(url,fileform="epub",personalini=None):
return adapters.getAdapter(get_fff_config(url,fileform,personalini),url)
def test_config(initext):
try:
configini = get_fff_config("test1.com?sid=555",
personalini=initext)
errors = configini.test_config()
except ParsingError as pe:
errors = pe.errors
return errors
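The newer version of this file adds `get_common_elements()`, which intersects `OrderedSet`s so the result keeps a stable order rather than arbitrary `set` order. A sketch of the same ordered-intersection idea using only built-ins (illustrative, not the plugin's implementation; order here follows the first list):

```python
from functools import reduce

def get_common_elements(ll):
    """Elements common to every list in ll, in the order of the first list.

    Assumes ll is non-empty; dict.fromkeys preserves insertion order (3.7+).
    """
    common = reduce(lambda i, j: i & j, (set(n) for n in ll))
    return [x for x in dict.fromkeys(ll[0]) if x in common]

print(get_common_elements([[3, 1, 2, 4], [2, 3, 4], [4, 3]]))  # [3, 4]
```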

Binary file not shown.

(new image file, 24 KiB)

View file

@@ -1,159 +1,124 @@
# -*- coding: utf-8 -*-
from __future__ import (absolute_import, unicode_literals, division,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2020, Jim Miller'
__docformat__ = 'restructuredtext en'
import re
import logging
logger = logging.getLogger(__name__)
from PyQt5.Qt import (QApplication, Qt, QColor, QSyntaxHighlighter,
QTextCharFormat, QBrush, QFont)
try:
# qt6 Calibre v6+
QFontNormal = QFont.Weight.Normal
QFontBold = QFont.Weight.Bold
except:
# qt5 Calibre v2-5
QFontNormal = QFont.Normal
QFontBold = QFont.Bold
from fanficfare.six import string_types
class IniHighlighter(QSyntaxHighlighter):
'''
QSyntaxHighlighter class for use with QTextEdit for highlighting
ini config files.
'''
def __init__( self, parent, sections=[], keywords=[], entries=[], entry_keywords=[] ):
QSyntaxHighlighter.__init__( self, parent )
self.parent = parent
self.highlightingRules = []
colors = {
'knownentries':Qt.darkGreen,
'errors':Qt.red,
'allkeywords':Qt.darkMagenta,
'knownkeywords':Qt.blue,
'knownsections':Qt.darkBlue,
'teststories':Qt.darkCyan,
'storyUrls':Qt.darkMagenta,
'comments':Qt.darkYellow
}
try:
if( hasattr(QApplication.instance(),'is_dark_theme')
and QApplication.instance().is_dark_theme ):
colors = {
'knownentries':Qt.green,
'errors':Qt.red,
'allkeywords':Qt.magenta,
'knownkeywords':QColor(Qt.blue).lighter(150),
'knownsections':Qt.darkCyan,
'teststories':Qt.cyan,
'storyUrls':QColor(Qt.magenta).lighter(150),
'comments':Qt.yellow
}
except Exception as e:
logger.error("Failed to set dark theme highlight colors: %s"%e)
if entries:
# *known* entries
reentries = r'('+(r'|'.join(entries))+r')'
self.highlightingRules.append( HighlightingRule( r"\b"+reentries+r"\b", colors['knownentries'] ) )
# true/false -- just to be nice.
self.highlightingRules.append( HighlightingRule( r"\b(true|false)\b", colors['knownentries'] ) )
# *all* keywords -- change known later.
self.errorRule = HighlightingRule( r"^[^:=\s][^:=]*[:=]", colors['errors'] )
self.highlightingRules.append( self.errorRule )
# *all* entry keywords -- change known later.
reentrykeywords = r'('+(r'|'.join([ e % r'[a-zA-Z0-9_]+' for e in entry_keywords ]))+r')'
self.highlightingRules.append( HighlightingRule( r"^(add_to_)?"+reentrykeywords+r"(_filelist)?\s*[:=]", colors['allkeywords'] ) )
if entries: # separate from known entries so entry named keyword won't be masked.
# *known* entry keywords
reentrykeywords = r'('+(r'|'.join([ e % reentries for e in entry_keywords ]))+r')'
self.highlightingRules.append( HighlightingRule( r"^(add_to_)?"+reentrykeywords+r"(_filelist)?\s*[:=]", colors['knownkeywords'] ) )
# *known* keywords
rekeywords = r'('+(r'|'.join(keywords))+r')'
self.highlightingRules.append( HighlightingRule( r"^(add_to_)?"+rekeywords+r"(_filelist)?\s*[:=]", colors['knownkeywords'] ) )
# *all* sections -- change known later.
self.highlightingRules.append( HighlightingRule( r"^\[[^\]]+\].*?$", colors['errors'], QFontBold, blocknum=1 ) )
if sections:
# *known* sections
resections = r'('+(r'|'.join(sections))+r')'
resections = resections.replace('.',r'\.') #escape dots.
self.highlightingRules.append( HighlightingRule( r"^\["+resections+r"\]\s*$", colors['knownsections'], QFontBold, blocknum=2 ) )
# test story sections
self.teststoryRule = HighlightingRule( r"^\[teststory:([0-9]+|defaults)\]", colors['teststories'], blocknum=3 )
self.highlightingRules.append( self.teststoryRule )
# storyUrl sections
# StoryUrls are *not* checked beyond looking for https?://
self.storyUrlRule = HighlightingRule( r"^\[https?://.*\]", colors['storyUrls'], QFontBold, blocknum=2 )
self.highlightingRules.append( self.storyUrlRule )
# NOT comments -- but can be custom columns, so don't flag.
#self.highlightingRules.append( HighlightingRule( r"(?<!^)#[^\n]*" , colors['errors'] ) )
# comments -- comments must start from column 0.
self.commentRule = HighlightingRule( r"^#[^\n]*" , colors['comments'] )
self.highlightingRules.append( self.commentRule )
def highlightBlock( self, text ):
is_comment = False
blocknum = self.previousBlockState()
for rule in self.highlightingRules:
for match in rule.pattern.finditer(text):
self.setFormat( match.start(), match.end()-match.start(), rule.highlight )
if rule == self.commentRule:
is_comment = True
if rule.blocknum > 0:
blocknum = rule.blocknum
if not is_comment:
# unknown section, error all:
if blocknum == 1 and blocknum == self.previousBlockState():
self.setFormat( 0, len(text), self.errorRule.highlight )
# teststory section rules:
if blocknum == 3:
self.setFormat( 0, len(text), self.teststoryRule.highlight )
## changed storyUrl section to also be blocknum=1 April 2023
## storyUrl section rules:
# if blocknum == 4:
# self.setFormat( 0, len(text), self.storyUrlRule.highlight )
self.setCurrentBlockState( blocknum )
class HighlightingRule():
def __init__( self, pattern, color,
weight=QFontNormal,
style=Qt.SolidPattern,
blocknum=0):
if isinstance(pattern, string_types):
self.pattern = re.compile(pattern)
else:
self.pattern=pattern
charfmt = QTextCharFormat()
brush = QBrush(color, style)
charfmt.setForeground(brush)
charfmt.setFontWeight(weight)
self.highlight = charfmt
self.blocknum=blocknum
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2016, Jim Miller'
__docformat__ = 'restructuredtext en'
import re
try:
from PyQt5.Qt import (Qt, QSyntaxHighlighter, QTextCharFormat, QBrush, QFont)
except ImportError as e:
from PyQt4.Qt import (Qt, QSyntaxHighlighter, QTextCharFormat, QBrush, QFont)
# r'add_to_+key
class IniHighlighter(QSyntaxHighlighter):
'''
QSyntaxHighlighter class for use with QTextEdit for highlighting
ini config files.
'''
def __init__( self, parent, sections=[], keywords=[], entries=[], entry_keywords=[] ):
QSyntaxHighlighter.__init__( self, parent )
self.parent = parent
self.highlightingRules = []
if entries:
# *known* entries
reentries = r'('+(r'|'.join(entries))+r')'
self.highlightingRules.append( HighlightingRule( r"\b"+reentries+r"\b", Qt.darkGreen ) )
# true/false -- just to be nice.
self.highlightingRules.append( HighlightingRule( r"\b(true|false)\b", Qt.darkGreen ) )
# *all* keywords -- change known later.
self.errorRule = HighlightingRule( r"^[^:=\s][^:=]*[:=]", Qt.red )
self.highlightingRules.append( self.errorRule )
# *all* entry keywords -- change known later.
reentrykeywords = r'('+(r'|'.join([ e % r'[a-zA-Z0-9_]+' for e in entry_keywords ]))+r')'
self.highlightingRules.append( HighlightingRule( r"^(add_to_)?"+reentrykeywords+r"\s*[:=]", Qt.darkMagenta ) )
if entries: # separate from known entries so entry named keyword won't be masked.
# *known* entry keywords
reentrykeywords = r'('+(r'|'.join([ e % reentries for e in entry_keywords ]))+r')'
self.highlightingRules.append( HighlightingRule( r"^(add_to_)?"+reentrykeywords+r"\s*[:=]", Qt.blue ) )
# *known* keywords
rekeywords = r'('+(r'|'.join(keywords))+r')'
self.highlightingRules.append( HighlightingRule( r"^(add_to_)?"+rekeywords+r"\s*[:=]", Qt.blue ) )
# *all* sections -- change known later.
self.highlightingRules.append( HighlightingRule( r"^\[[^\]]+\].*?$", Qt.red, QFont.Bold, blocknum=1 ) )
if sections:
# *known* sections
resections = r'('+(r'|'.join(sections))+r')'
resections = resections.replace('.','\.') #escape dots.
self.highlightingRules.append( HighlightingRule( r"^\["+resections+r"\]\s*$", Qt.darkBlue, QFont.Bold, blocknum=2 ) )
# test story sections
self.teststoryRule = HighlightingRule( r"^\[teststory:([0-9]+|defaults)\]", Qt.darkCyan, blocknum=3 )
self.highlightingRules.append( self.teststoryRule )
# storyUrl sections
self.storyUrlRule = HighlightingRule( r"^\[https?://.*\]", Qt.darkMagenta, blocknum=4 )
self.highlightingRules.append( self.storyUrlRule )
# NOT comments -- but can be custom columns, so don't flag.
#self.highlightingRules.append( HighlightingRule( r"(?<!^)#[^\n]*" , Qt.red ) )
# comments -- comments must start from column 0.
self.commentRule = HighlightingRule( r"^#[^\n]*" , Qt.darkYellow )
self.highlightingRules.append( self.commentRule )
def highlightBlock( self, text ):
is_comment = False
blocknum = self.previousBlockState()
for rule in self.highlightingRules:
for match in rule.pattern.finditer(text):
self.setFormat( match.start(), match.end()-match.start(), rule.highlight )
if rule == self.commentRule:
is_comment = True
if rule.blocknum > 0:
blocknum = rule.blocknum
if not is_comment:
# unknown section, error all:
if blocknum == 1 and blocknum == self.previousBlockState():
self.setFormat( 0, len(text), self.errorRule.highlight )
# teststory section rules:
if blocknum == 3:
self.setFormat( 0, len(text), self.teststoryRule.highlight )
# storyUrl section rules:
if blocknum == 4:
self.setFormat( 0, len(text), self.storyUrlRule.highlight )
self.setCurrentBlockState( blocknum )
class HighlightingRule():
def __init__( self, pattern, color,
weight=QFont.Normal,
style=Qt.SolidPattern,
blocknum=0):
if isinstance(pattern,basestring):
self.pattern = re.compile(pattern)
else:
self.pattern=pattern
charfmt = QTextCharFormat()
brush = QBrush(color, style)
charfmt.setForeground(brush)
charfmt.setFontWeight(weight)
self.highlight = charfmt
self.blocknum=blocknum
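The rule regexes in both versions above can be exercised outside Qt; a small sketch that labels matches in ini-style lines (string labels stand in for the Qt colors, purely illustrative):

```python
import re

# (label, pattern) pairs mirroring the highlighter rules, minus Qt formatting.
RULES = [
    ("section", re.compile(r"^\[[^\]]+\]")),
    ("keyword", re.compile(r"^[^:=\s][^:=]*[:=]")),
    ("comment", re.compile(r"^#[^\n]*")),
]

def tag_line(line):
    """Return (label, matched_text) for every rule that fires on the line."""
    hits = []
    for label, pattern in RULES:
        for m in pattern.finditer(line):
            hits.append((label, m.group(0)))
    return hits

print(tag_line("[defaults]"))     # section rule fires
print(tag_line("is_adult:true"))  # keyword rule fires on "is_adult:"
print(tag_line("# a comment"))    # comment rule fires
```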

View file

@@ -1,403 +1,344 @@
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2020, Jim Miller, 2011, Grant Drake <grant.drake@gmail.com>'
__docformat__ = 'restructuredtext en'
import logging
logger = logging.getLogger(__name__)
from time import sleep
from datetime import time
from io import StringIO
from collections import defaultdict
import sys
from calibre.utils.date import local_tz
# pulls in translation files for _() strings
try:
load_translations()
except NameError:
pass # load_translations() added in calibre 1.9
# ------------------------------------------------------------------------------
#
# Functions to perform downloads using worker jobs
#
# ------------------------------------------------------------------------------
def do_download_worker_single(site,
book_list,
options,
merge,
notification=lambda x,y:x):
logger.info(options['version'])
## same info debug calibre prints out at startup. For when users
## give me job output instead of debug log.
from calibre.debug import print_basic_debug_info
print_basic_debug_info(sys.stderr)
notification(0.01, _('Downloading FanFiction Stories'))
from calibre_plugins.fanficfare_plugin import FanFicFareBase
fffbase = FanFicFareBase(options['plugin_path'])
with fffbase: # so the sys.path was modified while loading the
# plug impl.
from fanficfare.fff_profile import do_cprofile
## extra function just so I can easily use the same
## @do_cprofile decorator
@do_cprofile
def profiled_func():
count = 0
totals = {}
# can't do direct assignment in list comprehension? I'm sure it
# makes sense to some pythonista.
# [ totals[x['url']]=0.0 for x in book_list if x['good'] ]
[ totals.update({x['url']:0.0}) for x in book_list if x['good'] ]
# logger.debug(sites_lists.keys())
def do_indiv_notif(percent,msg):
totals[msg] = percent/len(totals)
notification(max(0.01,sum(totals.values())), _('%(count)d of %(total)d stories finished downloading')%{'count':count,'total':len(totals)})
do_list = []
done_list = []
logger.info("\n\n"+_("Downloading FanFiction Stories")+"\n%s\n"%("\n".join([ "%(status)s %(url)s %(comment)s" % book for book in book_list])))
## pass failures from metadata through bg job so all results are
## together.
for book in book_list:
if book['good']:
do_list.append(book)
else:
done_list.append(book)
for book in do_list:
# logger.info("%s"%book['url'])
done_list.append(do_download_for_worker(book,options,merge,do_indiv_notif))
count += 1
return finish_download(done_list)
return profiled_func()
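The `do_indiv_notif` helper above folds each story's progress into one overall fraction (`percent / len(totals)` per story, summed). A standalone sketch of that accumulation (the `notify` callback and values are hypothetical, not the plugin's API):

```python
def make_progress(urls, notify):
    """Accumulate per-URL progress fractions into one overall fraction."""
    totals = {url: 0.0 for url in urls}
    def on_progress(url, percent):
        totals[url] = percent / len(totals)  # each story's share of the whole
        notify(max(0.01, sum(totals.values())))
    return on_progress

seen = []
on_progress = make_progress(["u1", "u2"], seen.append)
on_progress("u1", 1.0)  # first story done -> overall 0.5
on_progress("u2", 0.5)  # second story half done -> overall 0.75
print(seen)  # [0.5, 0.75]
```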
def finish_download(donelist):
book_list = sorted(donelist,key=lambda x : x['listorder'])
logger.info("\n"+_("Download Results:")+"\n%s\n"%("\n".join([ "%(status)s %(url)s %(comment)s" % book for book in book_list])))
good_lists = defaultdict(list)
bad_lists = defaultdict(list)
for book in book_list:
if book['good']:
good_lists[book['status']].append(book)
else:
bad_lists[book['status']].append(book)
order = [_('Add'),
_('Update'),
_('Meta'),
_('Different URL'),
_('Rejected'),
_('Skipped'),
_('Bad'),
_('Error'),
]
stnum = 0
for d in [ good_lists, bad_lists ]:
for status in order:
stnum += 1
if d[status]:
l = d[status]
logger.info("\n"+status+"\n%s\n"%("\n".join([book['url'] for book in l])))
for book in l:
# Add prior listorder to 10000 * status num for
# ordering of accumulated results with multiple bg
# jobs
book['reportorder'] = stnum*10000 + book['listorder']
del d[status]
# just in case a status is added but doesn't appear in order.
for status in d.keys():
logger.info("\n"+status+"\n%s\n"%("\n".join([book['url'] for book in d[status]])))
# return the book list as the job result
return book_list
def do_download_for_worker(book,options,merge,notification=lambda x,y:x):
'''
Child job, to download story when run as a worker job
'''
from calibre_plugins.fanficfare_plugin import FanFicFareBase
fffbase = FanFicFareBase(options['plugin_path'])
with fffbase: # so the sys.path was modified while loading the
# plug impl.
from calibre_plugins.fanficfare_plugin.prefs import (
SAVE_YES, SAVE_YES_UNLESS_SITE, OVERWRITE, OVERWRITEALWAYS, UPDATE,
UPDATEALWAYS, ADDNEW, SKIP, CALIBREONLY, CALIBREONLYSAVECOL)
from calibre_plugins.fanficfare_plugin.wordcount import get_word_count
from fanficfare import adapters, writers
from fanficfare.epubutils import get_update_data
from fanficfare.exceptions import NotGoingToDownload
from fanficfare.six import text_type as unicode
from calibre_plugins.fanficfare_plugin.fff_util import get_fff_config
try:
logger.info("\n\n" + ("-"*80) + " " + book['url'])
## No need to download at all. Can happen now due to
## collision moving into book for CALIBREONLY changing to
## ADDNEW when story URL not in library.
if book['collision'] in (CALIBREONLY, CALIBREONLYSAVECOL):
logger.info("Skipping CALIBREONLY 'update' down inside worker")
return book
book['comment'] = _('Download started...')
configuration = get_fff_config(book['url'],
options['fileform'],
options['personal.ini'])
# images only for epub, html, even if the user mistakenly
# turned it on else where.
if options['fileform'] not in ("epub","html"):
configuration.set("overrides","include_images","false")
adapter = adapters.getAdapter(configuration,book['url'])
adapter.is_adult = book['is_adult']
adapter.username = book['username']
adapter.password = book['password']
adapter.totp = book['totp']
adapter.setChaptersRange(book['begin'],book['end'])
## each site download job starts with a new copy of the
## cookiejar and basic_cache from the FG process. They
## are not shared between different sites' BG downloads
if 'basic_cache' in options:
configuration.set_basic_cache(options['basic_cache'])
else:
options['basic_cache'] = configuration.get_basic_cache()
options['basic_cache'].load_cache(options['basic_cachefile'])
if 'cookiejar' in options:
configuration.set_cookiejar(options['cookiejar'])
else:
options['cookiejar'] = configuration.get_cookiejar()
options['cookiejar'].load_cookiejar(options['cookiejarfile'])
story = adapter.getStoryMetadataOnly()
if not story.getMetadata("series") and 'calibre_series' in book:
adapter.setSeries(book['calibre_series'][0],book['calibre_series'][1])
# logger.debug(merge)
# logger.debug(book.get('epub_for_update','(NONE)'))
# logger.debug(options.get('mergebook','(NOMERGEBOOK)'))
# It's a merge into a pre-existing anthology, and not a pre-existing book in the anthology.
if merge and 'mergebook' in options and 'epub_for_update' not in book:
# internal for plugin anthologies to mark chapters
# (new) in new stories
story.setMetadata("newforanthology","true")
logger.debug("metadata newforanthology:%s"%story.getMetadata("newforanthology"))
# set PI version instead of default.
if 'version' in options:
story.setMetadata('version',options['version'])
book['title'] = story.getMetadata("title", removeallentities=True)
book['author_sort'] = book['author'] = story.getList("author", removeallentities=True)
book['publisher'] = story.getMetadata("publisher")
book['url'] = story.getMetadata("storyUrl", removeallentities=True)
book['comments'] = story.get_sanitized_description()
book['series'] = story.getMetadata("series", removeallentities=True)
if story.getMetadataRaw('datePublished'):
book['pubdate'] = story.getMetadataRaw('datePublished').replace(tzinfo=local_tz)
if story.getMetadataRaw('dateUpdated'):
book['updatedate'] = story.getMetadataRaw('dateUpdated').replace(tzinfo=local_tz)
if story.getMetadataRaw('dateCreated'):
book['timestamp'] = story.getMetadataRaw('dateCreated').replace(tzinfo=local_tz)
else:
book['timestamp'] = datetime.now().replace(tzinfo=local_tz) # need *something* there for calibre.
writer = writers.getWriter(options['fileform'],configuration,adapter)
outfile = book['outfile']
## Checks were done earlier; it's new, not a dup, or newer--just write it.
if book['collision'] in (ADDNEW, SKIP, OVERWRITE, OVERWRITEALWAYS) or \
('epub_for_update' not in book and book['collision'] in (UPDATE, UPDATEALWAYS)):
# preserve logfile even on overwrite.
if 'epub_for_update' in book:
adapter.logfile = get_update_data(book['epub_for_update'])[6]
# change the existing entries id to notid so
# write_epub writes a whole new set to indicate overwrite.
if adapter.logfile:
adapter.logfile = adapter.logfile.replace("span id","span notid")
if book['collision'] == OVERWRITE and 'fileupdated' in book:
lastupdated=story.getMetadataRaw('dateUpdated')
fileupdated=book['fileupdated']
# If the updated timestamp has no time-of-day (or is midnight), compare dates only;
# if it does have a time, compare the full timestamps.
if (lastupdated.time() == time.min and fileupdated.date() > lastupdated.date()) or \
(lastupdated.time() != time.min and fileupdated > lastupdated):
raise NotGoingToDownload(_("Not Overwriting, web site is not newer."),'edit-undo.png',showerror=False)
logger.info("write to %s"%outfile)
inject_cal_cols(book,story,configuration)
writer.writeStory(outfilename=outfile,
forceOverwrite=True,
notification=notification)
if adapter.story.chapter_error_count > 0:
book['comment'] = _('Download %(fileform)s completed, %(failed)s failed chapters, %(total)s total chapters.')%\
{'fileform':options['fileform'],
'failed':adapter.story.chapter_error_count,
'total':story.getMetadata("numChapters")}
book['chapter_error_count'] = adapter.story.chapter_error_count
else:
book['comment'] = _('Download %(fileform)s completed, %(total)s chapters.')%\
{'fileform':options['fileform'],
'total':story.getMetadata("numChapters")}
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
## checks were done earlier, just update it.
elif 'epub_for_update' in book and book['collision'] in (UPDATE, UPDATEALWAYS):
# update now handled by pre-populating the old images and
# chapters in the adapter rather than merging epubs.
#urlchaptercount = int(story.getMetadata('numChapters').replace(',',''))
# returns int adjusted for start-end range.
urlchaptercount = story.getChapterCount()
(url,
chaptercount,
adapter.oldchapters,
adapter.oldimgs,
adapter.oldcover,
adapter.calibrebookmark,
adapter.logfile,
adapter.oldchaptersmap,
adapter.oldchaptersdata) = get_update_data(book['epub_for_update'])[0:9]
# dup handling from fff_plugin needed for anthology updates & BG metadata.
if book['collision'] in (UPDATE,UPDATEALWAYS):
if chaptercount == urlchaptercount and book['collision'] == UPDATE:
if merge:
## Deliberately pass for UPDATEALWAYS merge.
book['comment']=_("Already contains %d chapters. Reuse as is.")%chaptercount
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
book['outfile'] = book['epub_for_update'] # for anthology merge ops.
return book
else:
raise NotGoingToDownload(_("Already contains %d chapters.")%chaptercount,'edit-undo.png',showerror=False)
elif chaptercount > urlchaptercount and not (book['collision'] == UPDATEALWAYS and adapter.getConfig('force_update_epub_always')):
raise NotGoingToDownload(_("Existing epub contains %d chapters, web site only has %d. Use Overwrite or force_update_epub_always to force update.") % (chaptercount,urlchaptercount),'dialog_error.png')
elif chaptercount == 0:
raise NotGoingToDownload(_("FanFicFare doesn't recognize chapters in existing epub, epub is probably from a different source. Use Overwrite to force update."),'dialog_error.png')
if not (book['collision'] == UPDATEALWAYS and chaptercount == urlchaptercount) \
and adapter.getConfig("do_update_hook"):
chaptercount = adapter.hookForUpdates(chaptercount)
logger.info("Do update - epub(%d) vs url(%d)" % (chaptercount, urlchaptercount))
logger.info("write to %s"%outfile)
inject_cal_cols(book,story,configuration)
writer.writeStory(outfilename=outfile,
forceOverwrite=True,
notification=notification)
if adapter.story.chapter_error_count > 0:
book['comment'] = _('Update %(fileform)s completed, added %(added)s chapters, %(failed)s failed chapters, for %(total)s total.')%\
{'fileform':options['fileform'],
'failed':adapter.story.chapter_error_count,
'added':(urlchaptercount-chaptercount),
'total':urlchaptercount}
book['chapter_error_count'] = adapter.story.chapter_error_count
else:
book['comment'] = _('Update %(fileform)s completed, added %(added)s chapters for %(total)s total.')%\
{'fileform':options['fileform'],'added':(urlchaptercount-chaptercount),'total':urlchaptercount}
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
else:
## Shouldn't ever get here, but hey, it happened once
## before with prefs['collision']
raise Exception("Impossible state reached -- Book: %s:\nOptions:%s:"%(book,options))
if options['do_wordcount'] == SAVE_YES or (
options['do_wordcount'] == SAVE_YES_UNLESS_SITE and not story.getMetadataRaw('numWords') ):
try:
wordcount = get_word_count(outfile)
# logger.info("get_word_count:%s"%wordcount)
# clear cache for the rather unusual case of
# numWords affecting other previously cached
# entries.
story.clear_processed_metadata_cache()
story.setMetadata('numWords',wordcount)
writer.writeStory(outfilename=outfile, forceOverwrite=True)
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
except Exception:
logger.error("WordCount failed",exc_info=True)
if options['smarten_punctuation'] and options['fileform'] == "epub":
# for smarten punc
from calibre.ebooks.oeb.polish.main import polish, ALL_OPTS
from calibre.utils.logging import Log
from collections import namedtuple
# do smarten_punctuation from calibre's polish feature
data = {'smarten_punctuation':True}
opts = ALL_OPTS.copy()
opts.update(data)
O = namedtuple('Options', ' '.join(ALL_OPTS.keys()))
opts = O(**opts)
log = Log(level=Log.DEBUG)
polish({outfile:outfile}, opts, log, logger.info)
## here to catch tags set in chapters in literotica for
## both overwrites and updates.
book['tags'] = story.getSubjectTags(removeallentities=True)
except NotGoingToDownload as d:
book['good']=False
book['status']=_('Bad')
book['showerror']=d.showerror
book['comment']=unicode(d)
book['icon'] = d.icon
except Exception as e:
book['good']=False
book['status']=_('Error')
book['comment']=unicode(e)
book['icon']='dialog_error.png'
book['status'] = _('Error')
logger.info("Exception: %s:%s"%(book,book['comment']),exc_info=True)
return book
## calibre's columns for an existing book are passed in and injected
## into the story's metadata. For convenience, we also add labels and
## valid_entries for them in a special [injected] section that has
## even less precedence than [defaults]
def inject_cal_cols(book,story,configuration):
configuration.remove_section('injected')
if 'calibre_columns' in book:
injectini = ['[injected]']
extra_valid = []
for k in book['calibre_columns'].keys():
v = book['calibre_columns'][k]
story.setMetadata(k,v['val'])
injectini.append('%s_label:%s'%(k,v['label']))
extra_valid.append(k)
if extra_valid: # if empty, there's nothing to add.
injectini.append("add_to_extra_valid_entries:,"+','.join(extra_valid))
configuration.read_file(StringIO('\n'.join(injectini)))
#print("added:\n%s\n"%('\n'.join(injectini)))
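For reference, a minimal standalone sketch of the ini text that `inject_cal_cols` builds before feeding it to `configuration.read_file`. The `calibre_columns` values here are hypothetical, and the `story.setMetadata` call is elided since it needs calibre:

```python
# Hypothetical custom-column data, in the shape inject_cal_cols expects.
book = {'calibre_columns': {
    '#read':   {'val': 'Yes', 'label': 'Read'},
    '#rating': {'val': '4',   'label': 'My Rating'},
}}

# Same loop structure as inject_cal_cols above, minus story.setMetadata.
injectini = ['[injected]']
extra_valid = []
for k, v in book['calibre_columns'].items():
    injectini.append('%s_label:%s' % (k, v['label']))
    extra_valid.append(k)
if extra_valid:  # if empty, there's nothing to add.
    injectini.append("add_to_extra_valid_entries:," + ','.join(extra_valid))

print('\n'.join(injectini))
```

The resulting text is a normal ini section, so the plugin can hand it to the configuration parser like any other config source.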
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2016, Jim Miller, 2011, Grant Drake <grant.drake@gmail.com>'
__docformat__ = 'restructuredtext en'
import logging
logger = logging.getLogger(__name__)
import traceback
from datetime import time
from StringIO import StringIO
from calibre.utils.ipc.server import Server
from calibre.utils.ipc.job import ParallelJob
from calibre.constants import numeric_version as calibre_version
from calibre.utils.date import local_tz
from calibre.library.comments import sanitize_comments_html
from calibre_plugins.fanficfare_plugin.wordcount import get_word_count
from calibre_plugins.fanficfare_plugin.prefs import (SAVE_YES, SAVE_YES_UNLESS_SITE)
# pulls in translation files for _() strings
try:
load_translations()
except NameError:
pass # load_translations() added in calibre 1.9
# ------------------------------------------------------------------------------
#
# Functions to perform downloads using worker jobs
#
# ------------------------------------------------------------------------------
def do_download_worker(book_list,
options,
cpus,
merge=False,
notification=lambda x,y:x):
'''
Master job, to launch child jobs to download a set of books.
This is run as a worker job in the background to keep the UI more
responsive and to work around memory leak issues, as it launches
a child job for each book as a worker process.
'''
server = Server(pool_size=cpus)
logger.info(options['version'])
total = 0
alreadybad = []
# Queue all the jobs
logger.info("Adding jobs for URLs:")
for book in book_list:
logger.info("%s"%book['url'])
if book['good']:
total += 1
args = ['calibre_plugins.fanficfare_plugin.jobs',
'do_download_for_worker',
(book,options,merge)]
job = ParallelJob('arbitrary_n',
"url:(%s) id:(%s)"%(book['url'],book['calibre_id']),
done=None,
args=args)
job._book = book
server.add_job(job)
else:
# was already bad before the subprocess ever started.
alreadybad.append(book)
# This server is an arbitrary_n job, so there is a notifier available.
# Set the % complete to a small number to avoid the 'unavailable' indicator
notification(0.01, _('Downloading FanFiction Stories'))
# dequeue the job results as they arrive, saving the results
count = 0
while True:
job = server.changed_jobs_queue.get()
# A job can 'change' when it is not finished, for example if it
# produces a notification. Ignore these.
job.update()
if not job.is_finished:
continue
# A job really finished. Get the information.
book_list.remove(job._book)
book_list.append(job.result)
book_id = job._book['calibre_id']
count = count + 1
notification(float(count)/total, _('%d of %d stories finished downloading')%(count,total))
# Add this job's output to the current log
logger.info('Logfile for book ID %s (%s)'%(book_id, job._book['title']))
logger.info(job.details)
if count >= total:
## ordering first by good vs bad, then by listorder.
good_list = filter(lambda x : x['good'], book_list)
bad_list = filter(lambda x : not x['good'], book_list)
good_list = sorted(good_list,key=lambda x : x['listorder'])
bad_list = sorted(bad_list,key=lambda x : x['listorder'])
logger.info("\n"+_("Download Results:")+"\n%s\n"%("\n".join([ "%(url)s %(comment)s" % book for book in good_list+bad_list])))
logger.info("\n"+_("Successful:")+"\n%s\n"%("\n".join([book['url'] for book in good_list])))
logger.info("\n"+_("Unsuccessful:")+"\n%s\n"%("\n".join([book['url'] for book in bad_list])))
break
server.close()
# return the book list as the job result
return book_list
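The dequeue loop above can be sketched without calibre's `Server`/`ParallelJob` machinery. This simplified stand-in uses a plain `queue.Queue` of pre-finished fake "jobs" and is an illustration of the collect-then-sort pattern, not the plugin's actual API:

```python
from queue import Queue

# Fake book records; 'listorder' drives the final ordering as above.
book_list = [{'url': 'u1', 'good': True,  'listorder': 1},
             {'url': 'u2', 'good': False, 'listorder': 0}]

# Pretend each child job already finished and posted its result.
results = Queue()
for b in book_list:
    results.put(dict(b, comment='done'))

# Dequeue results until all jobs are accounted for.
count, total, finished = 0, len(book_list), []
while count < total:
    finished.append(results.get())
    count += 1

# Order first by good vs bad, then by listorder, as in the real code.
good_list = sorted((b for b in finished if b['good']),
                   key=lambda b: b['listorder'])
bad_list = sorted((b for b in finished if not b['good']),
                  key=lambda b: b['listorder'])
```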
def do_download_for_worker(book,options,merge,notification=lambda x,y:x):
'''
Child job, to download story when run as a worker job
'''
from calibre_plugins.fanficfare_plugin import FanFicFareBase
fffbase = FanFicFareBase(options['plugin_path'])
with fffbase:
from calibre_plugins.fanficfare_plugin.dialogs import (NotGoingToDownload,
OVERWRITE, OVERWRITEALWAYS, UPDATE, UPDATEALWAYS, ADDNEW, SKIP, CALIBREONLY, CALIBREONLYSAVECOL)
from calibre_plugins.fanficfare_plugin.fanficfare import adapters, writers, exceptions
from calibre_plugins.fanficfare_plugin.fanficfare.epubutils import get_update_data
from calibre_plugins.fanficfare_plugin.fff_util import (get_fff_adapter, get_fff_config)
try:
book['comment'] = _('Download started...')
configuration = get_fff_config(book['url'],
options['fileform'],
options['personal.ini'])
if configuration.getConfig('use_ssl_unverified_context'):
## monkey patch to avoid SSL bug. Duplicated from
## fff_plugin.py because bg jobs run in their own
## process space.
import ssl
if hasattr(ssl, '_create_unverified_context'):
ssl._create_default_https_context = ssl._create_unverified_context
if not options['updateepubcover'] and 'epub_for_update' in book and options['collision'] in (UPDATE, UPDATEALWAYS):
configuration.set("overrides","never_make_cover","true")
# images only for epub, html, even if the user mistakenly
# turned it on elsewhere.
if options['fileform'] not in ("epub","html"):
configuration.set("overrides","include_images","false")
adapter = adapters.getAdapter(configuration,book['url'])
adapter.is_adult = book['is_adult']
adapter.username = book['username']
adapter.password = book['password']
adapter.setChaptersRange(book['begin'],book['end'])
adapter.load_cookiejar(options['cookiejarfile'])
#logger.debug("cookiejar:%s"%adapter.cookiejar)
adapter.set_pagecache(options['pagecache'])
story = adapter.getStoryMetadataOnly()
if not story.getMetadata("series") and 'calibre_series' in book:
adapter.setSeries(book['calibre_series'][0],book['calibre_series'][1])
# set PI version instead of default.
if 'version' in options:
story.setMetadata('version',options['version'])
book['title'] = story.getMetadata("title", removeallentities=True)
book['author_sort'] = book['author'] = story.getList("author", removeallentities=True)
book['publisher'] = story.getMetadata("site")
book['url'] = story.getMetadata("storyUrl")
book['tags'] = story.getSubjectTags(removeallentities=True)
if story.getMetadata("description"):
book['comments'] = sanitize_comments_html(story.getMetadata("description"))
else:
book['comments']=''
book['series'] = story.getMetadata("series", removeallentities=True)
if story.getMetadataRaw('datePublished'):
book['pubdate'] = story.getMetadataRaw('datePublished').replace(tzinfo=local_tz)
if story.getMetadataRaw('dateUpdated'):
book['updatedate'] = story.getMetadataRaw('dateUpdated').replace(tzinfo=local_tz)
if story.getMetadataRaw('dateCreated'):
book['timestamp'] = story.getMetadataRaw('dateCreated').replace(tzinfo=local_tz)
else:
book['timestamp'] = None # need *something* there for calibre.
writer = writers.getWriter(options['fileform'],configuration,adapter)
outfile = book['outfile']
## No need to download at all. Shouldn't ever get down here.
if options['collision'] in (CALIBREONLY, CALIBREONLYSAVECOL):
logger.info("Skipping CALIBREONLY 'update' down inside worker--this shouldn't be happening...")
book['comment'] = _('Metadata collected.')
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
## Checks were done earlier; it's new, not a dup, or newer--just write it.
elif options['collision'] in (ADDNEW, SKIP, OVERWRITE, OVERWRITEALWAYS) or \
('epub_for_update' not in book and options['collision'] in (UPDATE, UPDATEALWAYS)):
# preserve logfile even on overwrite.
if 'epub_for_update' in book:
adapter.logfile = get_update_data(book['epub_for_update'])[6]
# change the existing entries id to notid so
# write_epub writes a whole new set to indicate overwrite.
if adapter.logfile:
adapter.logfile = adapter.logfile.replace("span id","span notid")
if options['collision'] == OVERWRITE and 'fileupdated' in book:
lastupdated=story.getMetadataRaw('dateUpdated')
fileupdated=book['fileupdated']
# If the updated timestamp has no time-of-day (or is midnight), compare dates only;
# if it does have a time, compare the full timestamps.
if (lastupdated.time() == time.min and fileupdated.date() > lastupdated.date()) or \
(lastupdated.time() != time.min and fileupdated > lastupdated):
raise NotGoingToDownload(_("Not Overwriting, web site is not newer."),'edit-undo.png',showerror=False)
logger.info("write to %s"%outfile)
inject_cal_cols(book,story,configuration)
writer.writeStory(outfilename=outfile, forceOverwrite=True)
book['comment'] = _('Download %s completed, %s chapters.')%(options['fileform'],story.getMetadata("numChapters"))
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
## checks were done earlier, just update it.
elif 'epub_for_update' in book and options['collision'] in (UPDATE, UPDATEALWAYS):
# update now handled by pre-populating the old images and
# chapters in the adapter rather than merging epubs.
urlchaptercount = int(story.getMetadata('numChapters').replace(',',''))
(url,
chaptercount,
adapter.oldchapters,
adapter.oldimgs,
adapter.oldcover,
adapter.calibrebookmark,
adapter.logfile,
adapter.oldchaptersmap,
adapter.oldchaptersdata) = get_update_data(book['epub_for_update'])[0:9]
# dup handling from fff_plugin needed for anthology updates.
if options['collision'] == UPDATE:
if chaptercount == urlchaptercount:
if merge:
book['comment']=_("Already contains %d chapters. Reuse as is.")%chaptercount
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
book['outfile'] = book['epub_for_update'] # for anthology merge ops.
return book
else: # not merge,
raise NotGoingToDownload(_("Already contains %d chapters.")%chaptercount,'edit-undo.png',showerror=False)
elif chaptercount > urlchaptercount:
raise NotGoingToDownload(_("Existing epub contains %d chapters, web site only has %d. Use Overwrite to force update.") % (chaptercount,urlchaptercount),'dialog_error.png')
elif chaptercount == 0:
raise NotGoingToDownload(_("FanFicFare doesn't recognize chapters in existing epub, epub is probably from a different source. Use Overwrite to force update."),'dialog_error.png')
if not (options['collision'] == UPDATEALWAYS and chaptercount == urlchaptercount) \
and adapter.getConfig("do_update_hook"):
chaptercount = adapter.hookForUpdates(chaptercount)
logger.info("Do update - epub(%d) vs url(%d)" % (chaptercount, urlchaptercount))
logger.info("write to %s"%outfile)
inject_cal_cols(book,story,configuration)
writer.writeStory(outfilename=outfile, forceOverwrite=True)
book['comment'] = _('Update %s completed, added %s chapters for %s total.')%\
(options['fileform'],(urlchaptercount-chaptercount),urlchaptercount)
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
if options['do_wordcount'] == SAVE_YES or (
options['do_wordcount'] == SAVE_YES_UNLESS_SITE and not story.getMetadataRaw('numWords') ):
wordcount = get_word_count(outfile)
logger.info("get_word_count:%s"%wordcount)
story.setMetadata('numWords',wordcount)
writer.writeStory(outfilename=outfile, forceOverwrite=True)
book['all_metadata'] = story.getAllMetadata(removeallentities=True)
if options['savemetacol'] != '':
book['savemetacol'] = story.dump_html_metadata()
if options['smarten_punctuation'] and options['fileform'] == "epub" \
and calibre_version >= (0, 9, 39):
# for smarten punc
from calibre.ebooks.oeb.polish.main import polish, ALL_OPTS
from calibre.utils.logging import Log
from collections import namedtuple
# do smarten_punctuation from calibre's polish feature
data = {'smarten_punctuation':True}
opts = ALL_OPTS.copy()
opts.update(data)
O = namedtuple('Options', ' '.join(ALL_OPTS.iterkeys()))
opts = O(**opts)
log = Log(level=Log.DEBUG)
polish({outfile:outfile}, opts, log, logger.info)
except NotGoingToDownload as d:
book['good']=False
book['showerror']=d.showerror
book['comment']=unicode(d)
book['icon'] = d.icon
except Exception as e:
book['good']=False
book['comment']=unicode(e)
book['icon']='dialog_error.png'
book['status'] = _('Error')
logger.info("Exception: %s:%s"%(book,unicode(e)),exc_info=True)
#time.sleep(10)
return book
## calibre's columns for an existing book are passed in and injected
## into the story's metadata. For convenience, we also add labels and
## valid_entries for them in a special [injected] section that has
## even less precedence than [defaults]
def inject_cal_cols(book,story,configuration):
configuration.remove_section('injected')
if 'calibre_columns' in book:
injectini = ['[injected]']
extra_valid = []
for k, v in book['calibre_columns'].iteritems():
story.setMetadata(k,v['val'])
injectini.append('%s_label:%s'%(k,v['label']))
extra_valid.append(k)
if extra_valid: # if empty, there's nothing to add.
injectini.append("add_to_extra_valid_entries:,"+','.join(extra_valid))
configuration.readfp(StringIO('\n'.join(injectini)))
#print("added:\n%s\n"%('\n'.join(injectini)))
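The "Overwrite if Newer" timestamp rule, used in both versions of the worker above, can be isolated into a small helper. The sample datetimes below are invented:

```python
from datetime import datetime, time

def skip_overwrite(lastupdated, fileupdated):
    # Mirrors the OVERWRITE check above: skip when the local file is
    # newer than the site.  A date-only (midnight) site timestamp is
    # compared by date; otherwise the full timestamps are compared.
    if lastupdated.time() == time.min:
        return fileupdated.date() > lastupdated.date()
    return fileupdated > lastupdated

# Site reports a bare date; local file was written the next morning.
skip_overwrite(datetime(2021, 5, 1), datetime(2021, 5, 2, 8, 0))        # True
# Site reports a full timestamp newer than the local file.
skip_overwrite(datetime(2021, 5, 1, 12, 0), datetime(2021, 5, 1, 9, 0))  # False
```

Note that equal timestamps fall through to overwriting, matching the strict comparisons in the original condition.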
File diff suppressed because it is too large
@@ -3,9 +3,22 @@
[defaults]
## [defaults] section applies to all formats and sites but may be
## overridden at several levels. See
## https://github.com/JimmXinu/FanFicFare/wiki/INI-File for more
## details.
## overridden at several levels. Example:
## [defaults]
## titlepage_entries: category,genre, status
## [www.whofic.com]
## # overrides defaults.
## titlepage_entries: category,genre, status,dateUpdated,rating
## [epub]
## # overrides defaults & site section
## titlepage_entries: category,genre, status,datePublished,dateUpdated,dateCreated
## [www.whofic.com:epub]
## # overrides defaults, site section & format section
## titlepage_entries: category,genre, status,datePublished
## [overrides]
## # overrides all other sections
## titlepage_entries: category
## Some sites also require the user to confirm they are adult for
## adult content. Uncomment by removing '#' in front of is_adult.
@@ -16,32 +29,38 @@
## want to make them all look the same? Strip them off, then add them
## back on with add_chapter_numbers. Don't like the way it strips
## numbers or adds them back? See chapter_title_strip_pattern and
## chapter_title_add_pattern in defaults.ini.
## chapter_title_add_pattern.
#strip_chapter_numbers:true
#add_chapter_numbers:true
## Add this to genre if there's more than one category.
#add_genre_when_multi_category: Crossover
[epub]
## Include images from img tags in the body and summary of stories.
## include images from img tags in the body and summary of stories.
## Images will be converted to jpg for size if possible. Images work
## in epub format only. To get mobi or other format with images,
## download as epub and use Calibre to convert.
## true by default, uncomment and set false to not include images.
#include_images:true
## If set false, the summary will have all html stripped for safety.
## If not set, the summary will have all html stripped for safety.
## Both this and include_images must be true to get images in the
## summary.
## true by default, uncomment and set false to not keep summary html.
#keep_summary_html:true
## If set true, and there isn't a specific cover image, the first
## image found in the story will be made the cover image. If
## keep_summary_html is true, images in the summary will be before any
## If set, the first image found will be made the cover image. If
## keep_summary_html is true, any images in summary will be before any
## in chapters.
## true by default, uncomment and set false to turn off
#make_firstimage_cover:true
## Resize images down to width, height, preserving aspect ratio.
## Nook size, with margin.
#image_max_size: 580, 725
## Change image to grayscale, if graphics library allows, to save
## space.
#grayscale_images: false
## Most common, I expect will be using this to save username/passwords
## for different sites. Here are a few examples. See defaults.ini
@@ -53,6 +72,28 @@
## default is false
#collect_series: true
[ficwad.com]
#username:YourUsername
#password:YourPassword
[www.adastrafanfic.com]
## Some sites do not require a login, but do require the user to
## confirm they are adult for adult content.
#is_adult:true
[www.twcslibrary.net]
#username:YourName
#password:yourpassword
#is_adult:true
## default is false
#collect_series: true
[www.fictionalley.org]
#is_adult:true
[www.harrypotterfanfiction.com]
#is_adult:true
[www.fimfiction.net]
#is_adult:true
#fail_on_password: false
@@ -61,9 +102,8 @@
#is_adult:true
## tth is a little unusual--it doesn't require user/pass, but the site
## keeps track of which chapters you've read and won't send another
## update until it thinks you're up to date. If you set
## username/password, FFF will login to download. Then the site
## thinks you're up to date.
## update until it thinks you're up to date. This way, on download,
## it thinks you're up to date.
#username:YourName
#password:yourpassword
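The section precedence described in the `[defaults]` comments above (overrides, then site:format, then format, then site, then defaults) can be illustrated with a plain `configparser` lookup. This is a sketch following the documented scheme, not FanFicFare's actual `Configuration` class:

```python
from configparser import ConfigParser

INI = """
[defaults]
titlepage_entries: category,genre,status

[epub]
include_images: true

[www.whofic.com]
titlepage_entries: category,genre,status,dateUpdated,rating

[overrides]
titlepage_entries: category
"""

def get_config(parser, site, fileform, key):
    # Check sections from most to least specific, per the comments above.
    for section in ('overrides', '%s:%s' % (site, fileform),
                    fileform, site, 'defaults'):
        if parser.has_section(section) and parser.has_option(section, key):
            return parser.get(section, key)
    return None

parser = ConfigParser()
parser.read_string(INI)
print(get_config(parser, 'www.whofic.com', 'epub', 'titlepage_entries'))  # category
```

`[overrides]` wins for `titlepage_entries`, while `include_images` falls through to the `[epub]` format section.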
@@ -1,282 +1,258 @@
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2021, Jim Miller'
__docformat__ = 'restructuredtext en'
import logging
logger = logging.getLogger(__name__)
import copy
from calibre.gui2.ui import get_gui
# pulls in translation files for _() strings
try:
load_translations()
except NameError:
pass # load_translations() added in calibre 1.9
from calibre_plugins.fanficfare_plugin import __version__ as plugin_version
from calibre_plugins.fanficfare_plugin.common_utils import get_library_uuid
SKIP=_('Skip')
ADDNEW=_('Add New Book')
UPDATE=_('Update EPUB if New Chapters')
UPDATEALWAYS=_('Update EPUB Always')
OVERWRITE=_('Overwrite if Newer')
OVERWRITEALWAYS=_('Overwrite Always')
CALIBREONLY=_('Update Calibre Metadata from Web Site')
CALIBREONLYSAVECOL=_('Update Calibre Metadata from Saved Metadata Column')
collision_order=[SKIP,
ADDNEW,
UPDATE,
UPDATEALWAYS,
OVERWRITE,
OVERWRITEALWAYS,
CALIBREONLY,
CALIBREONLYSAVECOL,]
# Best idea I've had for how to deal with config/pref saving the
# collision name in English.
SAVE_SKIP='Skip'
SAVE_ADDNEW='Add New Book'
SAVE_UPDATE='Update EPUB if New Chapters'
SAVE_UPDATEALWAYS='Update EPUB Always'
SAVE_OVERWRITE='Overwrite if Newer'
SAVE_OVERWRITEALWAYS='Overwrite Always'
SAVE_CALIBREONLY='Update Calibre Metadata Only'
SAVE_CALIBREONLYSAVECOL='Update Calibre Metadata Only(Saved Column)'
save_collisions={
SKIP:SAVE_SKIP,
ADDNEW:SAVE_ADDNEW,
UPDATE:SAVE_UPDATE,
UPDATEALWAYS:SAVE_UPDATEALWAYS,
OVERWRITE:SAVE_OVERWRITE,
OVERWRITEALWAYS:SAVE_OVERWRITEALWAYS,
CALIBREONLY:SAVE_CALIBREONLY,
CALIBREONLYSAVECOL:SAVE_CALIBREONLYSAVECOL,
SAVE_SKIP:SKIP,
SAVE_ADDNEW:ADDNEW,
SAVE_UPDATE:UPDATE,
SAVE_UPDATEALWAYS:UPDATEALWAYS,
SAVE_OVERWRITE:OVERWRITE,
SAVE_OVERWRITEALWAYS:OVERWRITEALWAYS,
SAVE_CALIBREONLY:CALIBREONLY,
SAVE_CALIBREONLYSAVECOL:CALIBREONLYSAVECOL,
}
anthology_collision_order=[UPDATE,
UPDATEALWAYS,
OVERWRITEALWAYS]
# Show translated strings, but save the same string in prefs so your
# prefs are the same in different languages.
YES=_('Yes, Always')
SAVE_YES='Yes'
YES_IF_IMG=_('Yes, if EPUB has a cover image')
SAVE_YES_IF_IMG='Yes, if img'
YES_UNLESS_IMG=_('Yes, unless FanFicFare found a cover image')
SAVE_YES_UNLESS_IMG='Yes, unless img'
YES_UNLESS_SITE=_('Yes, unless found on site')
SAVE_YES_UNLESS_SITE='Yes, unless site'
NO=_('No')
SAVE_NO='No'
prefs_save_options = {
YES:SAVE_YES,
SAVE_YES:YES,
YES_IF_IMG:SAVE_YES_IF_IMG,
SAVE_YES_IF_IMG:YES_IF_IMG,
YES_UNLESS_IMG:SAVE_YES_UNLESS_IMG,
SAVE_YES_UNLESS_IMG:YES_UNLESS_IMG,
NO:SAVE_NO,
SAVE_NO:NO,
YES_UNLESS_SITE:SAVE_YES_UNLESS_SITE,
SAVE_YES_UNLESS_SITE:YES_UNLESS_SITE,
}
updatecalcover_order=[YES,YES_IF_IMG,NO]
gencalcover_order=[YES,YES_UNLESS_IMG,NO]
do_wordcount_order=[YES,YES_UNLESS_SITE,NO]
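The bidirectional dicts above keep saved prefs in English while the UI shows translated strings. A toy version with a fake translator makes the round trip explicit; the `DE:` prefix is invented to simulate a translated label:

```python
def _(s):
    # Stand-in for calibre's translation hook; 'DE:' simulates a
    # translated UI string.
    return 'DE:' + s

SKIP = _('Skip')          # what the UI displays
SAVE_SKIP = 'Skip'        # what gets saved in prefs
save_collisions = {SKIP: SAVE_SKIP, SAVE_SKIP: SKIP}

stored = save_collisions[SKIP]    # English string saved in prefs
shown = save_collisions[stored]   # translated string shown in the UI
```

Because the same dict maps both directions, prefs saved under one UI language display correctly under another.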
PREFS_NAMESPACE = 'FanFicFarePlugin'
PREFS_KEY_SETTINGS = 'settings'
# Set defaults used by all. Library specific settings continue to
# take from here.
default_prefs = {}
default_prefs['last_saved_version'] = (0,0,0)
default_prefs['personal.ini'] = get_resources('plugin-example.ini')
default_prefs['cal_cols_pass_in'] = False
default_prefs['rejecturls'] = '' # removed, but need empty default for fallback
default_prefs['rejectreasons'] = '''Sucked
Boring
Dup from another site'''
default_prefs['reject_always'] = False
default_prefs['reject_delete_default'] = True
default_prefs['updatemeta'] = True
default_prefs['bgmeta'] = False
#default_prefs['updateepubcover'] = True # removed in favor of always True Oct 2022
default_prefs['keeptags'] = False
default_prefs['suppressauthorsort'] = False
default_prefs['suppresstitlesort'] = False
default_prefs['authorcase'] = False
default_prefs['titlecase'] = False
default_prefs['seriescase'] = False
default_prefs['setanthologyseries'] = False
default_prefs['mark'] = False
default_prefs['mark_success'] = True
default_prefs['mark_failed'] = True
default_prefs['mark_chapter_error'] = True
default_prefs['showmarked'] = False
default_prefs['autoconvert'] = False
default_prefs['urlsfromclip'] = True
default_prefs['button_instantpopup'] = False
default_prefs['updatedefault'] = True
default_prefs['fileform'] = 'epub'
default_prefs['collision'] = SAVE_UPDATE
default_prefs['deleteotherforms'] = False
default_prefs['adddialogstaysontop'] = False
default_prefs['lookforurlinhtml'] = False
default_prefs['checkforseriesurlid'] = True
default_prefs['auto_reject_seriesurlid'] = False
default_prefs['mark_series_anthologies'] = False
default_prefs['checkforurlchange'] = True
default_prefs['injectseries'] = False
default_prefs['matchtitleauth'] = True
default_prefs['do_wordcount'] = SAVE_YES_UNLESS_SITE
default_prefs['smarten_punctuation'] = False
default_prefs['show_est_time'] = False
default_prefs['send_lists'] = ''
default_prefs['read_lists'] = ''
default_prefs['addtolists'] = False
default_prefs['addtoreadlists'] = False
default_prefs['addtolistsonread'] = False
default_prefs['autounnew'] = False
default_prefs['updatecalcover'] = SAVE_YES_IF_IMG
default_prefs['covernewonly'] = False
default_prefs['gencalcover'] = SAVE_YES_UNLESS_IMG
default_prefs['updatecover'] = False
default_prefs['calibre_gen_cover'] = True
default_prefs['plugin_gen_cover'] = False
default_prefs['gcnewonly'] = True
default_prefs['gc_site_settings'] = {}
default_prefs['allow_gc_from_ini'] = True
default_prefs['gc_polish_cover'] = False
default_prefs['countpagesstats'] = []
default_prefs['wordcountmissing'] = False
default_prefs['errorcol'] = ''
default_prefs['save_all_errors'] = True
default_prefs['savemetacol'] = ''
default_prefs['lastcheckedcol'] = ''
default_prefs['custom_cols'] = {}
default_prefs['custom_cols_newonly'] = {}
default_prefs['allow_custcol_from_ini'] = True
default_prefs['std_cols_newonly'] = {}
default_prefs['set_author_url'] = True
default_prefs['set_series_url'] = True
default_prefs['includecomments'] = False
default_prefs['anth_comments_newonly'] = True
default_prefs['imapserver'] = ''
default_prefs['imapuser'] = ''
default_prefs['imappass'] = ''
default_prefs['imapsessionpass'] = False
default_prefs['imapfolder'] = 'INBOX'
default_prefs['imaptags'] = ''
default_prefs['imapmarkread'] = True
default_prefs['auto_reject_from_email'] = False
default_prefs['update_existing_only_from_email'] = False
default_prefs['download_from_email_immediately'] = False
#default_prefs['single_proc_jobs'] = True # setting and code removed
default_prefs['site_split_jobs'] = True
default_prefs['reconsolidate_jobs'] = True
def set_library_config(library_config,db,setting=PREFS_KEY_SETTINGS):
db.prefs.set_namespaced(PREFS_NAMESPACE,
setting,
library_config)
def get_library_config(db,setting=PREFS_KEY_SETTINGS,def_prefs=default_prefs):
library_id = get_library_uuid(db)
library_config = None
if library_config is None:
#print("get prefs from db")
library_config = db.prefs.get_namespaced(PREFS_NAMESPACE,
setting)
if library_config is None:
# defaults.
logger.info("Using default settings")
library_config = copy.deepcopy(def_prefs)
return library_config
# fake out so I don't have to change the prefs calls anywhere. The
# Java programmer in me is offended by op-overloading, but it's very
# tidy.
class PrefsFacade():
def _get_db(self):
if self.passed_db:
return self.passed_db
else:
# In the GUI plugin we want current db so we detect when
# it's changed. CLI plugin calls need to pass db in.
return get_gui().current_db
def __init__(self,passed_db=None,setting=PREFS_KEY_SETTINGS,def_prefs=default_prefs):
self.default_prefs = def_prefs
self.setting=setting
self.libraryid = None
self.current_prefs = None
self.passed_db=passed_db
def _get_prefs(self):
libraryid = get_library_uuid(self._get_db())
if self.current_prefs == None or self.libraryid != libraryid:
#print("self.current_prefs == None(%s) or self.libraryid != libraryid(%s)"%(self.current_prefs == None,self.libraryid != libraryid))
self.libraryid = libraryid
self.current_prefs = get_library_config(self._get_db(),
setting=self.setting,
def_prefs=self.default_prefs)
return self.current_prefs
def __getitem__(self,k):
prefs = self._get_prefs()
if k not in prefs:
# pulls from default_prefs.defaults automatically if not set
# in default_prefs
return self.default_prefs[k]
return prefs[k]
def __setitem__(self,k,v):
prefs = self._get_prefs()
prefs[k]=v
# self._save_prefs(prefs)
def __delitem__(self,k):
prefs = self._get_prefs()
if k in prefs:
del prefs[k]
def save_to_db(self):
self['last_saved_version'] = plugin_version
set_library_config(self._get_prefs(),self._get_db(),setting=self.setting)
prefs = PrefsFacade(setting=PREFS_KEY_SETTINGS,
def_prefs=default_prefs)
rejects_data = PrefsFacade(setting="rejects_data",
def_prefs={'rejecturls_data':[]})
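The `PrefsFacade` above gives dict-style access over per-library settings with automatic fallback to defaults. A minimal standalone sketch of that facade pattern outside calibre (the `DictFacade` name and the plain-dict backing store are invented for illustration; the real class persists through `db.prefs`):

```python
# Minimal sketch of the per-library prefs facade pattern used above.
# A plain dict stands in for calibre's namespaced db.prefs store;
# defaults fill in any key that was never explicitly saved.
class DictFacade(object):
    def __init__(self, store, defaults):
        self.store = store          # persisted settings (may be partial)
        self.defaults = defaults    # read-only fallback values

    def __getitem__(self, k):
        # fall back to defaults for keys never saved
        if k not in self.store:
            return self.defaults[k]
        return self.store[k]

    def __setitem__(self, k, v):
        self.store[k] = v

    def __delitem__(self, k):
        if k in self.store:
            del self.store[k]

defaults = {'fileform': 'epub', 'mark': False}
prefs = DictFacade({}, defaults)
print(prefs['fileform'])   # epub -- falls through to the default
prefs['mark'] = True
print(prefs['mark'])       # True -- now read from the store
```

Deleting a key (`del prefs[k]`) reverts it to the default on the next read, which is the same behavior `__delitem__` provides in the plugin.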
# -*- coding: utf-8 -*-
from __future__ import (unicode_literals, division, absolute_import,
print_function)
__license__ = 'GPL v3'
__copyright__ = '2016, Jim Miller'
__docformat__ = 'restructuredtext en'
import logging
logger = logging.getLogger(__name__)
import copy
from calibre.utils.config import JSONConfig
from calibre.gui2.ui import get_gui
from calibre_plugins.fanficfare_plugin.common_utils import get_library_uuid
SKIP=_('Skip')
ADDNEW=_('Add New Book')
UPDATE=_('Update EPUB if New Chapters')
UPDATEALWAYS=_('Update EPUB Always')
OVERWRITE=_('Overwrite if Newer')
OVERWRITEALWAYS=_('Overwrite Always')
CALIBREONLY=_('Update Calibre Metadata from Web Site')
CALIBREONLYSAVECOL=_('Update Calibre Metadata from Saved Metadata Column')
collision_order=[SKIP,
ADDNEW,
UPDATE,
UPDATEALWAYS,
OVERWRITE,
OVERWRITEALWAYS,
CALIBREONLY,
CALIBREONLYSAVECOL,]
# best idea I've had for how to deal with config/pref saving the
# collision name in english.
SAVE_SKIP='Skip'
SAVE_ADDNEW='Add New Book'
SAVE_UPDATE='Update EPUB if New Chapters'
SAVE_UPDATEALWAYS='Update EPUB Always'
SAVE_OVERWRITE='Overwrite if Newer'
SAVE_OVERWRITEALWAYS='Overwrite Always'
SAVE_CALIBREONLY='Update Calibre Metadata Only'
SAVE_CALIBREONLYSAVECOL='Update Calibre Metadata Only(Saved Column)'
save_collisions={
SKIP:SAVE_SKIP,
ADDNEW:SAVE_ADDNEW,
UPDATE:SAVE_UPDATE,
UPDATEALWAYS:SAVE_UPDATEALWAYS,
OVERWRITE:SAVE_OVERWRITE,
OVERWRITEALWAYS:SAVE_OVERWRITEALWAYS,
CALIBREONLY:SAVE_CALIBREONLY,
CALIBREONLYSAVECOL:SAVE_CALIBREONLYSAVECOL,
SAVE_SKIP:SKIP,
SAVE_ADDNEW:ADDNEW,
SAVE_UPDATE:UPDATE,
SAVE_UPDATEALWAYS:UPDATEALWAYS,
SAVE_OVERWRITE:OVERWRITE,
SAVE_OVERWRITEALWAYS:OVERWRITEALWAYS,
SAVE_CALIBREONLY:CALIBREONLY,
SAVE_CALIBREONLYSAVECOL:CALIBREONLYSAVECOL,
}
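The `save_collisions` dict above maps both directions at once, so the same lookup translates a localized UI label to its stable English key for saving, or a saved key back to the display label. A small sketch of that bidirectional-dict pattern (the label strings here are invented stand-ins for the translated `_()` values):

```python
# Display label (would normally come from the _() translation function)
# and the stable English string persisted in prefs.
SKIP_LABEL = 'Skip (translated)'
SAVE_SKIP_KEY = 'Skip'

# One dict, two directions: label -> saved key and saved key -> label.
collisions = {
    SKIP_LABEL: SAVE_SKIP_KEY,
    SAVE_SKIP_KEY: SKIP_LABEL,
}

saved = collisions[SKIP_LABEL]   # what goes into prefs: 'Skip'
shown = collisions[saved]        # what the UI shows again
```

This works because the label and key sets never overlap; the saved value stays identical across interface languages while the displayed value tracks the active translation.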
anthology_collision_order=[UPDATE,
UPDATEALWAYS,
OVERWRITEALWAYS]
# Show translated strings, but save the same string in prefs so your
# prefs are the same in different languages.
YES=_('Yes, Always')
SAVE_YES='Yes'
YES_IF_IMG=_('Yes, if EPUB has a cover image')
SAVE_YES_IF_IMG='Yes, if img'
YES_UNLESS_IMG=_('Yes, unless FanFicFare found a cover image')
SAVE_YES_UNLESS_IMG='Yes, unless img'
YES_UNLESS_SITE=_('Yes, unless found on site')
SAVE_YES_UNLESS_SITE='Yes, unless site'
NO=_('No')
SAVE_NO='No'
prefs_save_options = {
YES:SAVE_YES,
SAVE_YES:YES,
YES_IF_IMG:SAVE_YES_IF_IMG,
SAVE_YES_IF_IMG:YES_IF_IMG,
YES_UNLESS_IMG:SAVE_YES_UNLESS_IMG,
SAVE_YES_UNLESS_IMG:YES_UNLESS_IMG,
NO:SAVE_NO,
SAVE_NO:NO,
YES_UNLESS_SITE:SAVE_YES_UNLESS_SITE,
SAVE_YES_UNLESS_SITE:YES_UNLESS_SITE,
}
updatecalcover_order=[YES,YES_IF_IMG,NO]
gencalcover_order=[YES,YES_UNLESS_IMG,NO]
do_wordcount_order=[YES,YES_UNLESS_SITE,NO]
# if don't have any settings for FanFicFarePlugin, copy from
# predecessor FanFictionDownLoaderPlugin.
FFDL_PREFS_NAMESPACE = 'FanFictionDownLoaderPlugin'
PREFS_NAMESPACE = 'FanFicFarePlugin'
PREFS_KEY_SETTINGS = 'settings'
# Set defaults used by all. Library specific settings continue to
# take from here.
default_prefs = {}
default_prefs['personal.ini'] = get_resources('plugin-example.ini')
default_prefs['cal_cols_pass_in'] = False
default_prefs['rejecturls'] = ''
default_prefs['rejectreasons'] = '''Sucked
Boring
Dup from another site'''
default_prefs['reject_always'] = False
default_prefs['updatemeta'] = True
default_prefs['bgmeta'] = False
default_prefs['updateepubcover'] = False
default_prefs['keeptags'] = False
default_prefs['suppressauthorsort'] = False
default_prefs['suppresstitlesort'] = False
default_prefs['mark'] = False
default_prefs['showmarked'] = False
default_prefs['autoconvert'] = False
default_prefs['urlsfromclip'] = True
default_prefs['updatedefault'] = True
default_prefs['fileform'] = 'epub'
default_prefs['collision'] = SAVE_UPDATE
default_prefs['deleteotherforms'] = False
default_prefs['adddialogstaysontop'] = False
default_prefs['lookforurlinhtml'] = False
default_prefs['checkforseriesurlid'] = True
default_prefs['auto_reject_seriesurlid'] = False
default_prefs['checkforurlchange'] = True
default_prefs['injectseries'] = False
default_prefs['matchtitleauth'] = True
default_prefs['do_wordcount'] = SAVE_YES_UNLESS_SITE
default_prefs['smarten_punctuation'] = False
default_prefs['show_est_time'] = False
default_prefs['send_lists'] = ''
default_prefs['read_lists'] = ''
default_prefs['addtolists'] = False
default_prefs['addtoreadlists'] = False
default_prefs['addtolistsonread'] = False
default_prefs['autounnew'] = False
default_prefs['updatecalcover'] = None
default_prefs['gencalcover'] = SAVE_YES
default_prefs['updatecover'] = False
default_prefs['calibre_gen_cover'] = False
default_prefs['plugin_gen_cover'] = True
default_prefs['gcnewonly'] = False
default_prefs['gc_site_settings'] = {}
default_prefs['allow_gc_from_ini'] = True
default_prefs['gc_polish_cover'] = False
default_prefs['countpagesstats'] = []
default_prefs['wordcountmissing'] = False
default_prefs['errorcol'] = ''
default_prefs['save_all_errors'] = True
default_prefs['savemetacol'] = ''
default_prefs['custom_cols'] = {}
default_prefs['custom_cols_newonly'] = {}
default_prefs['allow_custcol_from_ini'] = True
default_prefs['std_cols_newonly'] = {}
default_prefs['set_author_url'] = True
default_prefs['includecomments'] = False
default_prefs['anth_comments_newonly'] = True
default_prefs['imapserver'] = ''
default_prefs['imapuser'] = ''
default_prefs['imappass'] = ''
default_prefs['imapsessionpass'] = False
default_prefs['imapfolder'] = 'INBOX'
default_prefs['imapmarkread'] = True
default_prefs['auto_reject_from_email'] = False
default_prefs['update_existing_only_from_email'] = False
default_prefs['download_from_email_immediately'] = False
def set_library_config(library_config,db):
db.prefs.set_namespaced(PREFS_NAMESPACE,
PREFS_KEY_SETTINGS,
library_config)
def get_library_config(db):
library_id = get_library_uuid(db)
library_config = None
if library_config is None:
#print("get prefs from db")
library_config = db.prefs.get_namespaced(PREFS_NAMESPACE,
PREFS_KEY_SETTINGS)
# if don't have any settings for FanFicFarePlugin, copy from
# predecessor FanFictionDownLoaderPlugin.
if library_config is None:
logger.info("Attempting to read settings from predecessor--FFDL")
library_config = db.prefs.get_namespaced(FFDL_PREFS_NAMESPACE,
PREFS_KEY_SETTINGS)
if library_config is None:
# defaults.
logger.info("Using default settings")
library_config = copy.deepcopy(default_prefs)
return library_config
# fake out so I don't have to change the prefs calls anywhere. The
# Java programmer in me is offended by op-overloading, but it's very
# tidy.
class PrefsFacade():
def _get_db(self):
if self.passed_db:
return self.passed_db
else:
# In the GUI plugin we want current db so we detect when
# it's changed. CLI plugin calls need to pass db in.
return get_gui().current_db
def __init__(self,passed_db=None):
self.default_prefs = default_prefs
self.libraryid = None
self.current_prefs = None
self.passed_db=passed_db
def _get_prefs(self):
libraryid = get_library_uuid(self._get_db())
if self.current_prefs == None or self.libraryid != libraryid:
#print("self.current_prefs == None(%s) or self.libraryid != libraryid(%s)"%(self.current_prefs == None,self.libraryid != libraryid))
self.libraryid = libraryid
self.current_prefs = get_library_config(self._get_db())
return self.current_prefs
def __getitem__(self,k):
prefs = self._get_prefs()
if k not in prefs:
# pulls from default_prefs.defaults automatically if not set
# in default_prefs
return self.default_prefs[k]
return prefs[k]
def __setitem__(self,k,v):
prefs = self._get_prefs()
prefs[k]=v
# self._save_prefs(prefs)
def __delitem__(self,k):
prefs = self._get_prefs()
if k in prefs:
del prefs[k]
def save_to_db(self):
set_library_config(self._get_prefs(),self._get_db())
prefs = PrefsFacade()

File diff suppressed because it is too large


@@ -18,7 +18,6 @@ logger = logging.getLogger(__name__)
import re
from calibre.ebooks.oeb.iterator import EbookIterator
from fanficfare.six import text_type as unicode
RE_HTML_BODY = re.compile(u'<body[^>]*>(.*)</body>', re.UNICODE | re.DOTALL | re.IGNORECASE)
RE_STRIP_MARKUP = re.compile(u'<[^>]+>', re.UNICODE)
@@ -29,7 +28,7 @@ def get_word_count(book_path):
Estimate a word count
'''
from calibre.utils.localization import get_lang
iterator = _open_epub_file(book_path)
lang = iterator.opf.language
@@ -53,7 +52,7 @@ def _get_epub_standard_word_count(iterator, lang='en'):
'''
book_text = _read_epub_contents(iterator, strip_html=True)
try:
from calibre.spell.break_iterator import count_words
wordcount = count_words(book_text, lang)
@@ -68,7 +67,7 @@ def _get_epub_standard_word_count(iterator, lang='en'):
wordcount = get_wordcount_obj(book_text)
wordcount = wordcount.words
logger.debug('\tWord count - old method:%s'%wordcount)
return wordcount
def _read_epub_contents(iterator, strip_html=False):
@@ -93,3 +92,4 @@ def _extract_body_text(data):
if body:
return RE_STRIP_MARKUP.sub('', body[0]).replace('.','. ')
return ''
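The two regexes above implement the fallback word count: extract the `<body>` contents, strip markup, then count whitespace-separated words (the `'.'` → `'. '` replacement keeps sentences jammed together by tag removal from fusing into one "word"). A standalone sketch of that approach (the sample HTML is invented for illustration):

```python
import re

# Same shape as the plugin's fallback path: pull body text, strip
# markup, then count whitespace-separated words.
RE_HTML_BODY = re.compile(r'<body[^>]*>(.*)</body>', re.DOTALL | re.IGNORECASE)
RE_STRIP_MARKUP = re.compile(r'<[^>]+>')

def estimate_word_count(html):
    body = RE_HTML_BODY.findall(html)
    if not body:
        return 0
    # Replace '.' with '. ' so sentences left jammed together by tag
    # stripping still split on whitespace.
    text = RE_STRIP_MARKUP.sub('', body[0]).replace('.', '. ')
    return len(text.split())

sample = '<html><body><p>One two three.</p><p>Four five.</p></body></html>'
print(estimate_word_count(sample))  # 5
```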


@@ -1,23 +1,7 @@
# -*- coding: utf-8 -*-
# Copyright 2018 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
# coding: utf-8
import re
import codecs
stack = []
@@ -70,4 +54,4 @@ def flush():
del stack[:]
def get_stack():
return stack
return stack


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2015 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2015 Fanficdownloader team, 2016 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the 'License');
# you may not use this file except in compliance with the License.
@@ -14,23 +14,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
try: # just a way to switch between CLI and PI
from calibre.constants import DEBUG
if os.environ.get('CALIBRE_WORKER', None) is not None or DEBUG:
loghandler.setLevel(logging.DEBUG)
logger.setLevel(logging.DEBUG)
else:
loghandler.setLevel(logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
try:
# just a way to switch between web service and CLI/PI
import google.appengine.api
except:
import sys
if sys.version_info >= (2, 7):
import logging
logger = logging.getLogger(__name__)
loghandler=logging.StreamHandler()
loghandler.setFormatter(logging.Formatter("FFF: %(levelname)s: %(asctime)s: %(filename)s(%(lineno)d): %(message)s"))
logger.addHandler(loghandler)
loghandler.setLevel(logging.DEBUG)
logger.setLevel(logging.DEBUG)
try: # just a way to switch between CLI and PI
import calibre.constants
except:
import sys
if sys.version_info >= (2, 7):
import logging
logger = logging.getLogger(__name__)
loghandler=logging.StreamHandler()
loghandler.setFormatter(logging.Formatter("FFF: %(levelname)s: %(asctime)s: %(filename)s(%(lineno)d): %(message)s"))
logger.addHandler(loghandler)
loghandler.setLevel(logging.DEBUG)
logger.setLevel(logging.DEBUG)
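The `try`/`except` import probes above are how the shared code detects which environment it is running in (calibre plugin, CLI, or web service) and configures logging to match. A minimal sketch of the same probe-and-configure pattern; the `probe_module` default is a made-up name standing in for imports like `calibre.constants` or `google.appengine.api`:

```python
import logging

def configure_logging(probe_module='some_host_module'):
    """Pick a log level based on whether a host environment is importable.

    probe_module is a stand-in for the environment-detection imports
    in the real code; it is not part of FanFicFare.
    """
    logger = logging.getLogger('fff-sketch')
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "FFF: %(levelname)s: %(asctime)s: %(filename)s(%(lineno)d): %(message)s"))
    logger.addHandler(handler)
    try:
        __import__(probe_module)    # present only inside the host app
        level = logging.CRITICAL    # quiet by default when embedded
    except ImportError:
        level = logging.DEBUG       # verbose for standalone runs
    handler.setLevel(level)
    logger.setLevel(level)
    return logger

log = configure_logging()
print(log.level == logging.DEBUG)  # True when the probe module is absent
```

Probing with `__import__` and catching `ImportError` is the same trick as the `try: import calibre.constants` blocks above, just parameterized for demonstration.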


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2020 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2016 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,132 +15,135 @@
# limitations under the License.
#
from __future__ import absolute_import
import os, re, sys, types
from contextlib import contextmanager
import os, re, sys, glob, types
from os.path import dirname, basename, normpath
import logging
# py2 vs py3 transition
from ..six.moves.urllib.parse import urlparse
import urlparse as up
logger = logging.getLogger(__name__)
from .. import exceptions as exceptions
from .. import configurable as configurable
from ..configurable import Configuration
## must import each adapter here.
from . import base_adapter
from . import base_efiction_adapter
from . import adapter_test1
from . import adapter_test2
from . import adapter_test3
from . import adapter_test4
from . import adapter_fanfictionnet
from . import adapter_fictionalleyarchiveorg
from . import adapter_fictionpresscom
from . import adapter_ficwadcom
from . import adapter_fimfictionnet
from . import adapter_mediaminerorg
from . import adapter_potionsandsnitches
from . import adapter_tenhawkpresents
from . import adapter_adastrafanficcom
from . import adapter_tthfanficorg
from . import adapter_twilightednet
from . import adapter_whoficcom
from . import adapter_siyecouk
from . import adapter_archiveofourownorg
from . import adapter_ficbooknet
from . import adapter_midnightwhispers
from . import adapter_ksarchivecom
from . import adapter_libraryofmoriacom
from . import adapter_ashwindersycophanthexcom
from . import adapter_chaossycophanthexcom
from . import adapter_erosnsapphosycophanthexcom
from . import adapter_lumossycophanthexcom
from . import adapter_occlumencysycophanthexcom
from . import adapter_phoenixsongnet
from . import adapter_walkingtheplankorg
from . import adapter_dokugacom
from . import adapter_storiesofardacom
from . import adapter_ncisfictioncom
from . import adapter_fanfiktionde
from . import adapter_themasquenet
from . import adapter_pretendercentrecom
from . import adapter_darksolaceorg
from . import adapter_storyroomcom
from . import adapter_dracoandginnycom
from . import adapter_wolverineandroguecom
from . import adapter_thehookupzonenet
from . import adapter_efpfanficnet
from . import adapter_imagineeficcom
from . import adapter_storiesonlinenet
from . import adapter_literotica
from . import adapter_voracity2eficcom
from . import adapter_spikeluvercom
from . import adapter_bloodshedversecom
from . import adapter_fictionmaniatv
from . import adapter_sheppardweircom
from . import adapter_samandjacknet
from . import adapter_tgstorytimecom
from . import adapter_forumsspacebattlescom
from . import adapter_forumssufficientvelocitycom
from . import adapter_forumquestionablequestingcom
from . import adapter_ninelivesarchivecom
from . import adapter_masseffect2in
from . import adapter_quotevcom
from . import adapter_mcstoriescom
from . import adapter_naiceanilmenet
from . import adapter_adultfanfictionorg
from . import adapter_fictionhuntcom
from . import adapter_royalroadcom
from . import adapter_chosentwofanficcom
from . import adapter_bdsmlibrarycom
from . import adapter_asexstoriescom
from . import adapter_gluttonyfictioncom
from . import adapter_valentchambercom
from . import adapter_wwwgiantessworldnet
from . import adapter_starslibrarynet
from . import adapter_fanficauthorsnet
from . import adapter_fireflyfansnet
from . import adapter_trekfanfictionnet
from . import adapter_wwwutopiastoriescom
from . import adapter_sinfuldreamscomunicornfic
from . import adapter_sinfuldreamscomwickedtemptation
from . import adapter_asianfanficscom
from . import adapter_mttjustoncenet
from . import adapter_narutoficorg
from . import adapter_thedelphicexpansecom
from . import adapter_wwwaneroticstorycom
from . import adapter_lcfanficcom
from . import adapter_inkbunnynet
from . import adapter_alternatehistorycom
from . import adapter_wattpadcom
from . import adapter_novelonlinefullcom
from . import adapter_wwwnovelallcom
from . import adapter_hentaifoundrycom
from . import adapter_mugglenetfanfictioncom
from . import adapter_fanficsme
from . import adapter_fanfictalkcom
from . import adapter_scifistoriescom
from . import adapter_chireadscom
from . import adapter_scribblehubcom
from . import adapter_fictionlive
from . import adapter_thesietchcom
from . import adapter_squidgeworldorg
from . import adapter_novelfull
from . import adapter_psychficcom
from . import adapter_deviantartcom
from . import adapter_readonlymindcom
from . import adapter_wwwsunnydaleafterdarkcom
from . import adapter_syosetucom
from . import adapter_kakuyomujp
from . import adapter_fanfictionsfr
from . import adapter_touchfluffytail
from . import adapter_spiritfanfictioncom
from . import adapter_superlove
from . import adapter_cfaa
from . import adapter_althistorycom
import adapter_test1
import adapter_fanfictionnet
import adapter_fanficcastletvnet
import adapter_fictionalleyorg
import adapter_fictionpresscom
import adapter_ficwadcom
import adapter_fimfictionnet
import adapter_harrypotterfanfictioncom
import adapter_mediaminerorg
import adapter_potionsandsnitches
import adapter_tenhawkpresentscom
import adapter_adastrafanficcom
import adapter_twcslibrarynet
import adapter_tthfanficorg
import adapter_twilightednet
import adapter_whoficcom
import adapter_siyecouk
import adapter_archiveofourownorg
import adapter_ficbooknet
import adapter_portkeyorg
import adapter_mugglenetcom
import adapter_hpfandomnet
import adapter_nfacommunitycom
import adapter_midnightwhispersca
import adapter_ksarchivecom
import adapter_archiveskyehawkecom
import adapter_squidgeorgpeja
import adapter_libraryofmoriacom
import adapter_wraithbaitcom
import adapter_chaossycophanthexcom
import adapter_dramioneorg
import adapter_erosnsapphosycophanthexcom
import adapter_lumossycophanthexcom
import adapter_occlumencysycophanthexcom
import adapter_phoenixsongnet
import adapter_walkingtheplankorg
import adapter_ashwindersycophanthexcom
import adapter_thehexfilesnet
import adapter_dokugacom
import adapter_iketernalnet
import adapter_onedirectionfanfictioncom
import adapter_storiesofardacom
import adapter_samdeanarchivenu
import adapter_destinysgatewaycom
import adapter_ncisfictionnet
import adapter_thealphagatecom
import adapter_fanfiktionde
import adapter_ponyfictionarchivenet
import adapter_ncisficcom
import adapter_nationallibrarynet
import adapter_themasquenet
import adapter_pretendercentrecom
import adapter_darksolaceorg
import adapter_finestoriescom
import adapter_hpfanficarchivecom
import adapter_twilightarchivescom
import adapter_nhamagicalworldsus
import adapter_hlfictionnet
import adapter_dracoandginnycom
import adapter_scarvesandcoffeenet
import adapter_thepetulantpoetesscom
import adapter_wolverineandroguecom
import adapter_sinfuldesireorg
import adapter_merlinficdtwinscouk
import adapter_thehookupzonenet
import adapter_bloodtiesfancom
import adapter_indeathnet
import adapter_qafficcom
import adapter_efpfanficnet
import adapter_potterficscom
import adapter_efictionestelielde
import adapter_pommedesangcom
import adapter_restrictedsectionorg
import adapter_imagineeficcom
import adapter_psychficcom
import adapter_asr3slashzoneorg
import adapter_potterheadsanonymouscom
import adapter_fictionpadcom
import adapter_storiesonlinenet
import adapter_trekiverseorg
import adapter_literotica
import adapter_voracity2eficcom
import adapter_spikeluvercom
import adapter_bloodshedversecom
import adapter_nocturnallightnet
import adapter_fanfichu
import adapter_fanfictioncsodaidokhu
import adapter_fictionmaniatv
import adapter_tolkienfanfiction
import adapter_themaplebookshelf
import adapter_fannation
import adapter_sheppardweircom
import adapter_samandjacknet
import adapter_csiforensicscom
import adapter_lotrfanfictioncom
import adapter_fhsarchivecom
import adapter_fanfictionjunkiesde
import adapter_tgstorytimecom
import adapter_itcouldhappennet
import adapter_forumsspacebattlescom
import adapter_forumssufficientvelocitycom
import adapter_forumquestionablequestingcom
import adapter_ninelivesarchivecom
import adapter_masseffect2in
import adapter_quotevcom
import adapter_mcstoriescom
import adapter_lucifaelff
import adapter_buffygilescom
import adapter_andromedawebcom
import adapter_artemisfowlcom
import adapter_naiceanilmenet
import adapter_deepinmysoulnet
import adapter_haremlucifaelcom
import adapter_kiarepositorymujajinet
import adapter_fanfictionlucifaelcom
import adapter_adultfanfictionorg
import adapter_fictionhuntcom
## This bit of complexity allows adapters to be added by just adding
## importing. It eliminates the long if/else clauses we used to need
@@ -151,11 +154,9 @@ __class_list = []
__domain_map = {}
def imports():
out = []
for name, val in globals().items():
if isinstance(val, types.ModuleType):
out.append(val.__name__)
return out
yield val.__name__
for x in imports():
if "fanficfare.adapters.adapter_" in x:
@@ -163,35 +164,7 @@ for x in imports():
cls = sys.modules[x].getClass()
__class_list.append(cls)
for site in cls.getAcceptDomains():
l = __domain_map.get(site,[])
l.append(cls)
__domain_map[site]=l
def get_url_chapter_range(url_in):
# Allow chapter range with URL.
# like test1.com?sid=5[4-6] or [4,6]
mc = re.match(r"^(?P<url>.*?)(?:\[(?P<begin>\d+)?(?P<comma>[,-])?(?P<end>\d+)?\])?$",url_in)
#print("url:(%s) begin:(%s) end:(%s)"%(mc.group('url'),mc.group('begin'),mc.group('end')))
url = mc.group('url')
ch_begin = mc.group('begin')
ch_end = mc.group('end')
if ch_begin and not mc.group('comma'):
ch_end = ch_begin
return url,ch_begin,ch_end
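The regex in `get_url_chapter_range` above accepts an optional `[begin-end]`, `[begin,end]`, or `[n]` suffix on a story URL and splits it off. A small standalone sketch of the same parsing (the example URLs use the `test1.com` test adapter pattern from the comment above):

```python
import re

# Mirrors get_url_chapter_range above: strip an optional trailing
# [begin-end] / [begin,end] / [n] chapter-range suffix from a URL.
RANGE_RE = re.compile(
    r"^(?P<url>.*?)(?:\[(?P<begin>\d+)?(?P<comma>[,-])?(?P<end>\d+)?\])?$")

def split_chapter_range(url_in):
    mc = RANGE_RE.match(url_in)
    url, begin, end = mc.group('url'), mc.group('begin'), mc.group('end')
    if begin and not mc.group('comma'):
        end = begin                 # [4] means just chapter 4
    return url, begin, end

print(split_chapter_range('http://test1.com?sid=5[4-6]'))
# ('http://test1.com?sid=5', '4', '6')
print(split_chapter_range('http://test1.com?sid=5[4]'))
# ('http://test1.com?sid=5', '4', '4')
print(split_chapter_range('http://test1.com?sid=5'))
# ('http://test1.com?sid=5', None, None)
```

The lazy `.*?` plus the anchored optional group means a URL with no suffix passes through unchanged with `None` bounds, so callers can treat "no range" and "range given" uniformly.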
# Call as: 'with lightweight_adapter(url) as adapter:'
@contextmanager
def lightweight_adapter(url):
adapter = None
try:
if not getNormalStoryURL.__dummyconfig:
getNormalStoryURL.__dummyconfig = configurable.Configuration(["test1.com"],"EPUB",lightweight=True)
adapter = getAdapter(getNormalStoryURL.__dummyconfig,url)
yield adapter
except:
yield None
finally:
del adapter
__domain_map[site]=cls
def getNormalStoryURL(url):
r = getNormalStoryURLSite(url)
@@ -200,49 +173,28 @@ def getNormalStoryURL(url):
else:
return None
# kludgey function static/singleton
# Note it's *not* on lightweight_adapter because it can't reference
# itself in its definition.
getNormalStoryURL.__dummyconfig = None
def getNormalStoryURLSite(url):
with lightweight_adapter(url) as adapter:
if adapter:
return (adapter.url,adapter.getSiteDomain())
else:
return None
# print("getNormalStoryURLSite:%s"%url)
if not getNormalStoryURL.__dummyconfig:
getNormalStoryURL.__dummyconfig = Configuration(["test1.com"],"EPUB",lightweight=True)
# pulling up an adapter is pretty low over-head. If
# it fails, it's a bad url.
try:
adapter = getAdapter(getNormalStoryURL.__dummyconfig,url)
url = adapter.url
site = adapter.getSiteDomain()
del adapter
return (url,site)
except:
return None
## Originally defined for INI [storyUrl] sections where story URL
## contains a title that can change, now also used for reject list.
## waaaay faster with classmethod.
def get_section_url(url):
cls = _get_class_for(url)[0]
if cls:
return cls.get_section_url(url)
else:
## might be a url from a removed adapter.
## return unchanged in that case.
return url
def get_url_search(url):
'''
For adapters that have story URLs that can change. This is
used for searching the Calibre library by identifiers:url for
sites (generally) that contain author or title that can
change, but also have a unique identifier that doesn't.
returns a string containing a regexp, not a compiled re object.
'''
cls = _get_class_for(url)[0]
if not cls:
## still apply common processing.
cls = base_adapter.BaseSiteAdapter
return cls.get_url_search(url)
# kludgey function static/singleton
getNormalStoryURL.__dummyconfig = None
def getAdapter(config,url,anyurl=False):
#logger.debug("trying url:"+url)
(cls,fixedurl) = _get_class_for(url)
(cls,fixedurl) = getClassFor(url)
#logger.debug("fixedurl:"+fixedurl)
if cls:
if anyurl:
@@ -266,7 +218,8 @@ def getConfigSections():
def get_bulk_load_sites():
# for now, all eFiction Base adapters are assumed to allow bulk_load.
sections = set()
for cls in [x for x in __class_list if issubclass(x,base_efiction_adapter.BaseEfictionAdapter) ]:
for cls in filter( lambda x : issubclass(x,base_efiction_adapter.BaseEfictionAdapter),
__class_list):
sections.update( [ x.replace('www.','') for x in cls.getConfigSections() ] )
return sections
@@ -277,60 +230,48 @@ def getSiteExamples():
return l
def getConfigSectionsFor(url):
(cls,fixedurl) = _get_class_for(url)
(cls,fixedurl) = getClassFor(url)
if cls:
return cls.getConfigSections()
# No adapter found.
raise exceptions.UnknownSite( url, [cls.getSiteDomain() for cls in __class_list] )
def _get_class_for(url):
def getClassFor(url):
## fix up leading protocol.
fixedurl = re.sub(r"(?i)^[htp]+(s?)[:/]+",r"http\1://",url.strip())
if fixedurl.startswith("//"):
fixedurl = "http:%s"%url
if not fixedurl.startswith("http"):
fixedurl = "http://%s"%url
## remove any trailing '#' locations, except for #post-12345 for
## XenForo
if not "#post-" in fixedurl:
fixedurl = re.sub(r"#.*$","",fixedurl)
parsedUrl = urlparse(fixedurl)
parsedUrl = up.urlparse(fixedurl)
domain = parsedUrl.netloc.lower()
if( domain != parsedUrl.netloc ):
fixedurl = fixedurl.replace(parsedUrl.netloc,domain)
clslst = _get_classlist_fromlist(domain)
## assumes all adapters for a domain will have www or not have www
## but not mixed.
if not clslst and domain.startswith("www."):
cls = getClassFromList(domain)
if not cls and domain.startswith("www."):
domain = domain.replace("www.","")
#logger.debug("trying site:without www: "+domain)
clslst = _get_classlist_fromlist(domain)
cls = getClassFromList(domain)
fixedurl = re.sub(r"^http(s?)://www\.",r"http\1://",fixedurl)
if not clslst:
if not cls:
#logger.debug("trying site:www."+domain)
clslst =_get_classlist_fromlist("www."+domain)
cls = getClassFromList("www."+domain)
fixedurl = re.sub(r"^http(s?)://",r"http\1://www.",fixedurl)
cls = None
if clslst:
if len(clslst) == 1:
cls = clslst[0]
elif len(clslst) > 1:
for c in clslst:
if c.getSiteURLFragment() in fixedurl:
cls = c
break
if cls:
fixedurl = cls.stripURLParameters(fixedurl)
return (cls,fixedurl)
def _get_classlist_fromlist(domain):
def getClassFromList(domain):
try:
return __domain_map[domain]
except KeyError:


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,24 +15,222 @@
# limitations under the License.
#
from __future__ import absolute_import
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib
import urllib2
from .base_otw_adapter import BaseOTWAdapter
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
def getClass():
return AdastrafanficComAdapter
from base_adapter import BaseSiteAdapter, makeDate
class AdastrafanficComAdapter(BaseOTWAdapter):
class AdAstraFanficComSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseOTWAdapter.__init__(self, config, url)
# Each adapter needs to have a unique site abbreviation.
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','aaff')
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.is_adult=False
@staticmethod # must be @staticmethod, don't remove it.
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
@staticmethod
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.adastrafanfic.com'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
def use_pagecache(self):
'''
adapters that will work with the page cache need to implement
this and change it to True.
'''
return True
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
addurl = "&warning=5"
else:
addurl=""
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if "Content is only suitable for mature adults. May contain explicit language and adult themes. Equivalent of NC-17." in data:
raise exceptions.AdultCheckRequired(self.url)
# problems with some stories, but only in calibre. I suspect
# issues with different SGML parsers in python. This is a
# nasty hack, but it works.
data = data[data.index("<body"):]
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
## Title
a = soup.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
## <meta name='description' content='&lt;p&gt;Description&lt;/p&gt; ...' >
## Summary, strangely, is in the content attr of a <meta name='description'> tag
## which is escaped HTML. Unfortunately, we can't use it because they don't
## escape (') chars in the desc, breaking the tag.
#meta_desc = soup.find('meta',{'name':'description'})
#metasoup = bs.BeautifulStoneSoup(meta_desc['content'])
#self.story.setMetadata('description',stripHTML(metasoup))
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ''
while value and 'label' not in defaultGetattr(value,'class'):
svalue += unicode(value)
value = value.nextSibling
# sometimes poorly formatted desc (<p> w/o </p>) leads
# to all labels being included.
svalue=svalue[:svalue.find('<span class="label">')]
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
catstext = [cat.string for cat in cats]
for cat in catstext:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
charstext = [char.string for char in chars]
for char in charstext:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
genrestext = [genre.string for genre in genres]
self.genre = ', '.join(genrestext)
for genre in genrestext:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
warningstext = [warning.string for warning in warnings]
self.warning = ', '.join(warningstext)
for warning in warningstext:
self.story.addToList('warnings',warning.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(value.strip(), "%d %b %Y"))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(value.strip(), "%d %b %Y"))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self._fetchUrl(url)
# problems with some stories, but only in calibre. I suspect
# issues with different SGML parsers in python. This is a
# nasty hack, but it works.
data = data[data.index("<body"):]
soup = self.make_soup(data)
span = soup.find('div', {'id' : 'story'})
if None == span:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,span)
def getClass():
return AdAstraFanficComSiteAdapter

View file

@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2020 FanFicFare team
# Copyright 2013 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -17,19 +17,16 @@
################################################################################
### Written by GComyn
################################################################################
from __future__ import absolute_import
from __future__ import unicode_literals
import time
import logging
logger = logging.getLogger(__name__)
import re
import sys
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
################################################################################
@@ -42,7 +39,13 @@ class AdultFanFictionOrgAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
# logger.debug("AdultFanFictionOrgAdapter.__init__ - url='{0}'".format(url))
logger.debug("AdultFanFictionOrgAdapter.__init__ - url='{0}'".format(url))
self.decode = ["utf8",
"Windows-1252"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
@@ -54,11 +57,10 @@ class AdultFanFictionOrgAdapter(BaseSiteAdapter):
#Setting the 'Zone' for each "Site"
self.zone = self.parsedUrl.netloc.split('.')[0]
# normalized story URL.(checking self.zone against list
# normalized story URL. (checking self.zone against list
# removed--it was redundant w/getAcceptDomains and
# getSiteURLPattern both)
self._setURL('https://{0}.{1}/story.php?no={2}'.format(self.zone, self.getBaseDomain(), self.story.getMetadata('storyId')))
#self._setURL('https://' + self.zone + '.' + self.getBaseDomain() + '/story.php?no='+self.story.getMetadata('storyId'))
self._setURL('http://' + self.zone + '.' + self.getBaseDomain() + '/story.php?no='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
#self.story.setMetadata('siteabbrev',self.getSiteAbbrev())
@@ -68,7 +70,13 @@ class AdultFanFictionOrgAdapter(BaseSiteAdapter):
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%B %d, %Y"
self.dateformat = "%Y-%m-%d"
##This method will be moved to the sub-adapters
# @classmethod
# def getSiteAbbrev(self):
# return self.zone+'aff'
## Added because adult-fanfiction.org does send you to
## www.adult-fanfiction.org when you go to it and it also moves
@@ -111,31 +119,79 @@ class AdultFanFictionOrgAdapter(BaseSiteAdapter):
@classmethod
def getSiteExampleURLs(self):
return ("https://anime.adult-fanfiction.org/story.php?no=123456789 "
+ "https://anime2.adult-fanfiction.org/story.php?no=123456789 "
+ "https://bleach.adult-fanfiction.org/story.php?no=123456789 "
+ "https://books.adult-fanfiction.org/story.php?no=123456789 "
+ "https://buffy.adult-fanfiction.org/story.php?no=123456789 "
+ "https://cartoon.adult-fanfiction.org/story.php?no=123456789 "
+ "https://celeb.adult-fanfiction.org/story.php?no=123456789 "
+ "https://comics.adult-fanfiction.org/story.php?no=123456789 "
+ "https://ff.adult-fanfiction.org/story.php?no=123456789 "
+ "https://games.adult-fanfiction.org/story.php?no=123456789 "
+ "https://hp.adult-fanfiction.org/story.php?no=123456789 "
+ "https://inu.adult-fanfiction.org/story.php?no=123456789 "
+ "https://lotr.adult-fanfiction.org/story.php?no=123456789 "
+ "https://manga.adult-fanfiction.org/story.php?no=123456789 "
+ "https://movies.adult-fanfiction.org/story.php?no=123456789 "
+ "https://naruto.adult-fanfiction.org/story.php?no=123456789 "
+ "https://ne.adult-fanfiction.org/story.php?no=123456789 "
+ "https://original.adult-fanfiction.org/story.php?no=123456789 "
+ "https://tv.adult-fanfiction.org/story.php?no=123456789 "
+ "https://xmen.adult-fanfiction.org/story.php?no=123456789 "
+ "https://ygo.adult-fanfiction.org/story.php?no=123456789 "
+ "https://yuyu.adult-fanfiction.org/story.php?no=123456789")
return ("http://anime.adult-fanfiction.org/story.php?no=123456789 "
+ "http://anime2.adult-fanfiction.org/story.php?no=123456789 "
+ "http://bleach.adult-fanfiction.org/story.php?no=123456789 "
+ "http://books.adult-fanfiction.org/story.php?no=123456789 "
+ "http://buffy.adult-fanfiction.org/story.php?no=123456789 "
+ "http://cartoon.adult-fanfiction.org/story.php?no=123456789 "
+ "http://celeb.adult-fanfiction.org/story.php?no=123456789 "
+ "http://comics.adult-fanfiction.org/story.php?no=123456789 "
+ "http://ff.adult-fanfiction.org/story.php?no=123456789 "
+ "http://games.adult-fanfiction.org/story.php?no=123456789 "
+ "http://hp.adult-fanfiction.org/story.php?no=123456789 "
+ "http://inu.adult-fanfiction.org/story.php?no=123456789 "
+ "http://lotr.adult-fanfiction.org/story.php?no=123456789 "
+ "http://manga.adult-fanfiction.org/story.php?no=123456789 "
+ "http://movies.adult-fanfiction.org/story.php?no=123456789 "
+ "http://naruto.adult-fanfiction.org/story.php?no=123456789 "
+ "http://ne.adult-fanfiction.org/story.php?no=123456789 "
+ "http://original.adult-fanfiction.org/story.php?no=123456789 "
+ "http://tv.adult-fanfiction.org/story.php?no=123456789 "
+ "http://xmen.adult-fanfiction.org/story.php?no=123456789 "
+ "http://ygo.adult-fanfiction.org/story.php?no=123456789 "
+ "http://yuyu.adult-fanfiction.org/story.php?no=123456789")
def getSiteURLPattern(self):
return r'https?://(anime|anime2|bleach|books|buffy|cartoon|celeb|comics|ff|games|hp|inu|lotr|manga|movies|naruto|ne|original|tv|xmen|ygo|yuyu)\.adult-fanfiction\.org/story\.php\?no=\d+$'
return r'http?://(anime|anime2|bleach|books|buffy|cartoon|celeb|comics|ff|games|hp|inu|lotr|manga|movies|naruto|ne|original|tv|xmen|ygo|yuyu)\.adult-fanfiction\.org/story\.php\?no=\d+$'
##This is not working right now, so I'm commenting it out, but leaving it for future testing
## Login seems to be reasonably standard across eFiction sites.
#def needToLoginCheck(self, data):
##This adapter will always require a login
# return True
# <form name="login" method="post" action="">
# <div class="top">E-mail: <span id="sprytextfield1">
# <input name="email" type="text" id="email" size="20" maxlength="255" />
# <span class="textfieldRequiredMsg">Email is required.</span><span class="textfieldInvalidFormatMsg">Invalid E-mail.</span></span></div>
# <div class="top">Password: <span id="sprytextfield2">
# <input name="pass1" type="password" id="pass1" size="20" maxlength="32" />
# <span class="textfieldRequiredMsg">password is required.</span><span class="textfieldMinCharsMsg">Minimum 8 characters8.</span><span class="textfieldMaxCharsMsg">Exceeded 32 characters.</span></span></div>
# <div class="top"><br /> <input name="loginsubmittop" type="hidden" id="loginsubmit" value="TRUE" />
# <input type="submit" value="Login" />
# </div>
# </form>
##This is not working right now, so I'm commenting it out, but leaving it for future testing
#def performLogin(self, url, soup):
# params = {}
# if self.password:
# params['email'] = self.username
# params['pass1'] = self.password
# else:
# params['email'] = self.getConfig("username")
# params['pass1'] = self.getConfig("password")
# params['submit'] = 'Login'
# # copy all hidden input tags to pick up appropriate tokens.
# for tag in soup.findAll('input',{'type':'hidden'}):
# params[tag['name']] = tag['value']
# logger.debug("Will now login to URL {0} as {1} with password: {2}".format(url, params['email'],params['pass1']))
# d = self._postUrl(url, params, usecache=False)
# d = self._fetchUrl(url, params, usecache=False)
# soup = self.make_soup(d)
#if not (soup.find('form', {'name' : 'login'}) == None):
# logger.info("Failed to login to URL %s as %s" % (url, params['email']))
# raise exceptions.FailedToLogin(url,params['email'])
# return False
#else:
# return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def doExtractChapterUrlsAndMetadata(self, get_cover=True):
@@ -143,109 +199,177 @@ class AdultFanFictionOrgAdapter(BaseSiteAdapter):
## You need to have your is_adult set to true to get this story
if not (self.is_adult or self.getConfig("is_adult")):
raise exceptions.AdultCheckRequired(self.url)
else:
d = self.post_request('https://www.adult-fanfiction.org/globals/ajax/age-verify.php', {"verify":"1"})
if "Age verified successfully" not in d:
raise exceptions.FailedToDownload("Failed to Verify Age: {0}".format(d))
url = self.url
logger.debug("URL: "+url)
data = self.get_request(url)
# logger.debug(data)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist("Code: 404. %s"%self.url)
elif e.code == 410:
raise exceptions.StoryDoesNotExist("Code: 410. %s"%self.url)
elif e.code == 401:
self.needToLogin = True
data = ''
else:
raise e
if "The dragons running the back end of the site can not seem to find the story you are looking for." in data:
raise exceptions.StoryDoesNotExist("{0}.{1} says: The dragons running the back end of the site can not seem to find the story you are looking for.".format(self.zone, self.getBaseDomain()))
raise exceptions.StoryDoesNotExist(self.zone+'.'+self.getBaseDomain()
+" says: The dragons running the back end of the site can not seem to find the story you are looking for.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
##This is not working right now, so I'm commenting it out, but leaving it for future testing
#self.performLogin(url, soup)
# Now go hunting for all the meta data and the chapter list.
## Title
## Some of the titles have a backslash on the story page, but not on the Author's page
## So I am removing it from the title, so it can be found on the Author's page further in the code.
## Also, some titles may have extra spaces ' ', and the search on the Author's page removes them,
## so I have to here as well. I used multiple replaces to make sure, since I did the same below.
h1 = soup.find('h1')
# logger.debug("Title:%s"%h1)
self.story.setMetadata('title',stripHTML(h1).replace('\\','').replace('  ',' ').replace('  ',' ').replace('  ',' ').strip())
# Find the chapters from first list only
chapters = soup.select_one('select.chapter-select').select('option')
for chapter in chapters:
self.add_chapter(chapter,self.url+'&chapter='+chapter['value'])
a = soup.find('a', href=re.compile(r'story.php\?no='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a).replace('\\','').replace('  ',' ').replace('  ',' ').replace('  ',' ').strip())
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"profile.php\?id=\d+"))
if a == None:
# I know that the original author of fanficfare wants to always have metadata,
# but I posit that if the story is there, even if we can't get the metadata from the
# author page, the story should still be able to be downloaded, which is what I've done here.
self.story.setMetadata('authorId','000000000')
self.story.setMetadata('authorUrl','https://www.adult-fanfiction.org')
self.story.setMetadata('author','Unknown')
logger.warning('There was no author found for the story... Metadata will not be retrieved.')
self.setDescription(url,'>>>>>>>>>> No Summary Given, Unknown Author <<<<<<<<<<')
a = soup.find('a', href=re.compile(r"profile.php\?no=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl',a['href'])
self.story.setMetadata('author',stripHTML(a))
# Find the chapters:
chapters = soup.find('div',{'id':'snav'})
for i, chapter in enumerate(chapters.findAll('a')):
self.chapterUrls.append((stripHTML(chapter),self.url+'&chapter='+str(i+1)))
self.story.setMetadata('numChapters', len(self.chapterUrls))
##The story page does not give much Metadata, so we go to the Author's page
##Get the first Author page to see if there are multiple pages.
##AFF doesn't care if the page number is larger than the actual pages,
##it will continue to show the last page even if the variable is larger than the actual page
author_Url = self.story.getMetadata('authorUrl')+'&view=story&zone='+self.zone+'&page=1'
##I'm resetting the author page to the zone for this story
self.story.setMetadata('authorUrl',author_Url)
logger.debug('Getting the author page: {0}'.format(author_Url))
try:
adata = self._fetchUrl(author_Url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist("Author Page: Code: 404. %s"%author_Url)
elif e.code == 410:
raise exceptions.StoryDoesNotExist("Author Page: Code: 410. %s"%author_Url)
else:
raise e
if "The member you are looking for does not exist." in adata:
raise exceptions.StoryDoesNotExist(self.zone+'.'+self.getBaseDomain() +" says: The member you are looking for does not exist.")
asoup = self.make_soup(adata)
##Getting the number of pages
pages=asoup.find('div',{'class' : 'pagination'}).findAll('li')[-1].find('a')
if not pages == None:
pages = pages['href'].split('=')[-1]
else:
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl',a['href'])
self.story.setMetadata('author',stripHTML(a))
pages = 0
logger.info(pages)
##If there is only 1 page of stories, check it to get the Metadata,
if pages == 0:
a = asoup.findAll('li')
for lc2 in a:
if lc2.find('a', href=re.compile(r'story.php\?no='+self.story.getMetadata('storyId')+"$")):
break
## otherwise go through the pages
else:
page=1
i=0
while i == 0:
##We already have the first page, so if this is the first time through, skip getting the page
if page != 1:
author_Url = self.story.getMetadata('authorUrl')+'&view=story&zone='+self.zone+'&page='+str(page)
logger.debug('Getting the author page: {0}'.format(author_Url))
try:
adata = self._fetchUrl(author_Url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist("Author Page: Code: 404. %s"%author_Url)
elif e.code == 410:
raise exceptions.StoryDoesNotExist("Author Page: Code: 410. %s"%author_Url)
else:
raise e
##This will probably never be needed, since AFF doesn't seem to care what number you put as
## the page number, it will default to the last page, even if you use 1000, for an author
## that only has 5 pages of stories, but I'm keeping it in to appease Saint Justin Case (just in case).
if "The member you are looking for does not exist." in adata:
raise exceptions.StoryDoesNotExist(self.zone+'.'+self.getBaseDomain() +" says: The member you are looking for does not exist.")
asoup = self.make_soup(adata)
a = asoup.findAll('li')
for lc2 in a:
if lc2.find('a', href=re.compile(r'story.php\?no='+self.story.getMetadata('storyId')+"$")):
i=1
break
page = page + 1
if page > pages:
break
## The story page does not give much Metadata, so we go to
## the Author's page. Except it's actually a sub-request for
## the list of the author's stories for that subdomain
author_Url = 'https://members.{0}/load-user-stories.php?subdomain={1}&uid={2}'.format(
self.getBaseDomain(),
self.zone,
self.story.getMetadata('authorId'))
##Split the Metadata up into a list
##We have to change the soup type to a string, then remove the newlines, and double spaces,
##then change the <br/> to '-:-', which separates the different elements.
##Then we strip the HTML elements from the string.
##There is also a double <br/>, so we have to fix that, then remove the leading and trailing '-:-'.
##They are always in the same order.
liMetadata = stripHTML(str(lc2).replace('\n','').replace('\r','').replace('\t',' ').replace('  ',' ').replace('  ',' ').replace('  ',' ').replace(r'<br/>','-:-'))
liMetadata = liMetadata.replace(r'-:--:-','-:-').strip('-:-').strip('-:-')
logger.debug('Getting the load-user-stories page: {0}'.format(author_Url))
adata = self.get_request(author_Url)
none_found = "No stories found in this category."
if none_found in adata:
raise exceptions.StoryDoesNotExist("{0}.{1} says: {2}".format(self.zone, self.getBaseDomain(), none_found))
asoup = self.make_soup(adata)
# logger.debug(asoup)
story_card = asoup.select_one('div.story-card:has(a[href="{0}"])'.format(url))
# logger.debug(story_card)
## Category
## I've only seen one category per story so far, but just in case:
for cat in story_card.select('div.story-card-category'):
# remove Category:, old code suggests Located: is also
# possible, so removing by <strong>
cat.find("strong").decompose()
self.story.addToList('category',stripHTML(cat))
self.setDescription(url,story_card.select_one('div.story-card-description'))
for tag in story_card.select('span.story-tag'):
self.story.addToList('eroticatags',stripHTML(tag))
## created/updates share formatting
for meta in story_card.select('div.story-card-meta-item span:last-child'):
meta = stripHTML(meta)
if 'Created: ' in meta:
meta = meta.replace('Created: ','')
self.story.setMetadata('datePublished', makeDate(meta, self.dateformat))
if 'Updated: ' in meta:
meta = meta.replace('Updated: ','')
self.story.setMetadata('dateUpdated', makeDate(meta, self.dateformat))
for i, value in enumerate(liMetadata.split('-:-')):
##The item 6 is the reviews... We are disregarding them.
##The item 7 is the 'Dragon Prints'... not sure what they are, so disregarding them.
##The 0 item is the title
if i == 0:
if value <> self.story.getMetadata('title'):
raise exceptions.StoryDoesNotExist('Did not find story in author story list: {0}'.format(author_Url))
elif i == 1:
##Get the description
self.story.setMetadata('description',stripHTML(value.strip()))
elif i == 2:
##The Get the Category
self.story.setMetadata('category',value.replace(r'&gt;',r'>').replace(r'Located :',r'').strip())
elif i == 3:
##Get the Erotic Tags
value = stripHTML(value.replace(r'Content Tags :',r'')).strip()
for code in re.split(r'\s',value):
self.story.addToList('eroticatags',code)
elif i == 4:
##Get the Posted Date
value = value.replace(r'Posted :',r'').strip()
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
elif i == 5:
##Get the 'Updated' Edited date
##AFF has the time for the Updated date, and we only want the date,
##so we take the first 10 characters only
value = value.replace(r'Edited :',r'').strip()[0:10]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
# grab the text for an individual chapter.
def getChapterText(self, url):
#Since each chapter is on 1 page, we don't need to do anything special, just get the content of the page.
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
chaptertag = soup.select_one('div.chapter-body')
soup = self.make_soup(self._fetchUrl(url))
chaptertag = soup.find('div',{'class' : 'pagination'}).parent.findNext('td')
if None == chaptertag:
raise exceptions.FailedToDownload("Error downloading Chapter: {0}! Missing required element!".format(url))
## chapter text includes a copy of story title, author,
## chapter title, & eroticatags specific to the chapter. Did
## before, too.
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,chaptertag)

View file

@@ -1,46 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2020 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
from .base_xenforo2forum_adapter import BaseXenForo2ForumAdapter
import logging
logger = logging.getLogger(__name__)
def getClass():
return WWWAlternatehistoryComAdapter
class WWWAlternatehistoryComAdapter(BaseXenForo2ForumAdapter):
def __init__(self, config, url):
BaseXenForo2ForumAdapter.__init__(self, config, url)
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ah')
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.alternatehistory.com'
@classmethod
def getPathPrefix(cls):
# in case it needs more than just site/
return '/forum/'
def get_post_created_date(self,souptag):
return self.make_date(souptag.find('div', {'class':'message-inner'}))

View file

@@ -1,40 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2026 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import re
from .base_xenforo2forum_adapter import BaseXenForo2ForumAdapter
def getClass():
return AltHistoryComAdapter
## NOTE: This is a different site than www.alternatehistory.com.
class AltHistoryComAdapter(BaseXenForo2ForumAdapter):
def __init__(self, config, url):
BaseXenForo2ForumAdapter.__init__(self, config, url)
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ahc')
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'althistory.com'

View file

@@ -0,0 +1,302 @@
# -*- coding: utf-8 -*-
# Copyright 2016 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ####### Not all labels are captured. They are not formatted correctly on the
# ####### webpage.
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return AndromedaWebComAdapter # XXX
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class AndromedaWebComAdapter(BaseSiteAdapter): # XXX
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
# XXX Most sites don't have the /fiction part. Replace all to remove it usually.
self._setURL('http://' + self.getSiteDomain() + '/fiction/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','awc') # XXX
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d %b %Y" # XXX
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.andromeda-web.com' # XXX
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/fiction/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/fiction/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&warning=2"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self._fetchUrl(url)
# Since the warning text can change by warning level, let's
# look for the warning pass url. ksarchive uses
# &amp;warning= -- actually, so do other sites. Must be an
# eFiction book.
# fiction/viewstory.php?sid=1882&amp;warning=4
# fiction/viewstory.php?sid=1654&amp;ageconsent=ok&amp;warning=2
#print data
#m = re.search(r"'fiction/viewstory.php\?sid=10(&amp;warning=2)'",data) # example of a concrete match; real search below
m = re.search(r"'fiction/viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.FailedToDownload(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
pagetitle = soup.find('div',{'id':'content'})
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = pagetitle.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/fiction/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ""
while 'label' not in defaultGetattr(value,'class'):
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=3'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"fiction/viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^fiction/viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('fiction/viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'class' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
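The adult-check retry in extractChapterUrlsAndMetadata above (find the 'click here to continue' link, lift its warning suffix, and unescape the `&amp;` entities) can be sketched standalone. This is an illustrative Python 3 sketch, not adapter code; `extract_warning_suffix` and the sample HTML are hypothetical:

```python
import re

def extract_warning_suffix(page_html):
    """Return the '&ageconsent=ok&warning=N' suffix from the
    'continue' link, or '' if the page shows no warning link."""
    m = re.search(
        r"'fiction/viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",
        page_html)
    if m is None:
        return ""
    # correct the stupid &amp; error in the extracted url fragment.
    return m.group(1).replace("&amp;", "&")

sample = "<a href='fiction/viewstory.php?sid=1654&amp;ageconsent=ok&amp;warning=2'>continue</a>"
print(extract_warning_suffix(sample))  # &ageconsent=ok&warning=2
```

The suffix is then appended to the story URL (along with `&index=1`) for the second fetch attempt.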


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2014 Fanficdownloader team, 2020 FanFicFare team
# Copyright 2014 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,55 +15,383 @@
# limitations under the License.
#
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from .base_otw_adapter import BaseOTWAdapter
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return ArchiveOfOurOwnOrgAdapter
class ArchiveOfOurOwnOrgAdapter(BaseOTWAdapter):
logger = logging.getLogger(__name__)
class ArchiveOfOurOwnOrgAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseOTWAdapter.__init__(self, config, url)
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["utf8",
"Windows-1252"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.path.split('/',)[2])
# get storyId from url--url validation guarantees query correct
m = re.match(self.getSiteURLPattern(),url)
if m:
self.story.setMetadata('storyId',m.group('id'))
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/works/'+self.story.getMetadata('storyId'))
else:
raise exceptions.InvalidStoryURL(url,
self.getSiteDomain(),
self.getSiteExampleURLs())
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ao3')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%Y-%b-%d"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'archiveofourown.org'
# The certificate is only valid for the following names:
# ao3.org,
# archiveofourown.com,
# archiveofourown.net,
# archiveofourown.org,
# www.ao3.org,
@classmethod
def getAcceptDomains(cls):
return ['archiveofourown.org',
'archiveofourown.com',
'archiveofourown.net',
'archiveofourown.gay',
'download.archiveofourown.org',
'download.archiveofourown.com',
'download.archiveofourown.net',
'ao3.org',
]
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/works/123456 http://"+cls.getSiteDomain()+"/collections/Some_Archive/works/123456 http://"+cls.getSiteDomain()+"/works/123456/chapters/78901"
def mod_url_request(self, url):
return url
def getSiteURLPattern(self):
# http://archiveofourown.org/collections/Smallville_Slash_Archive/works/159770
# Discard leading zeros from story ID numbers--AO3 doesn't use them in its own chapter URLs.
return r"https?://"+re.escape(self.getSiteDomain())+r"(/collections/[^/]+)?/works/0*(?P<id>\d+)"
def mod_url_request(self, url):
## add / to *not* replace media.archiveofourown.org
if self.getConfig("use_archive_transformativeworks_org",False):
return url.replace("/archiveofourown.org","/archive.transformativeworks.org")
elif self.getConfig("use_archiveofourown_gay",False):
return url.replace("/archiveofourown.org","/archiveofourown.gay")
## Login
def needToLoginCheck(self, data):
if 'This work is only available to registered users of the Archive.' in data \
or "The password or user name you entered doesn't match our records" in data:
return True
else:
return url
return False
def performLogin(self, url, data):
params = {}
if self.password:
params['user_session[login]'] = self.username
params['user_session[password]'] = self.password
else:
params['user_session[login]'] = self.getConfig("username")
params['user_session[password]'] = self.getConfig("password")
params['user_session[remember_me]'] = '1'
params['commit'] = 'Log in'
#params['utf8'] = u'✓'#u'\x2713' # gets along without it, and it confuses the encoder.
params['authenticity_token'] = data.split('input name="authenticity_token" type="hidden" value="')[1].split('"')[0]
loginUrl = 'http://' + self.getSiteDomain() + '/user_sessions'
logger.info("Will now login to URL (%s) as (%s)" % (loginUrl,
params['user_session[login]']))
d = self._postUrl(loginUrl, params)
#logger.info(d)
if "Successfully logged in" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['user_session[login]']))
raise exceptions.FailedToLogin(url,params['user_session[login]'])
return False
else:
return True
def use_pagecache(self):
'''
adapters that will work with the page cache need to implement
this and change it to True.
'''
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
addurl = "?view_adult=true"
else:
addurl=""
metaurl = self.url+addurl
url = self.url+'/navigate'+addurl
logger.info("url: "+url)
logger.info("metaurl: "+metaurl)
try:
data = self._fetchUrl(url)
meta = self._fetchUrl(metaurl)
if "This work could have adult content. If you proceed you have agreed that you are willing to see such content." in meta:
raise exceptions.AdultCheckRequired(self.url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if "Sorry, we couldn&#x27;t find the work you were looking for." in data:
raise exceptions.StoryDoesNotExist(self.url)
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url,data)
data = self._fetchUrl(url,usecache=False)
meta = self._fetchUrl(metaurl,usecache=False)
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
for tag in soup.findAll('div',id='admin-banner'):
tag.extract()
metasoup = self.make_soup(meta)
for tag in metasoup.findAll('div',id='admin-banner'):
tag.extract()
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r"/works/\d+$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
alist = soup.findAll('a', href=re.compile(r"/users/\w+/pseuds/\w+"))
if len(alist) < 1: # ao3 allows for author 'Anonymous' with no author link.
self.story.setMetadata('author','Anonymous')
self.story.setMetadata('authorUrl','http://archiveofourown.org/')
self.story.setMetadata('authorId','0')
else:
for a in alist:
self.story.addToList('authorId',a['href'].split('/')[-1])
self.story.addToList('authorUrl','http://'+self.host+a['href'])
self.story.addToList('author',a.text)
byline = metasoup.find('h3',{'class':'byline'})
if byline:
self.story.setMetadata('byline',stripHTML(byline))
newestChapter = None
self.newestChapterNum = None # save for comparing during update.
# Scan all chapters to find the oldest and newest, on AO3 it's
# possible for authors to insert new chapters out-of-order or
# change the dates of earlier ones by editing them--That WILL
# break epub update.
# Find the chapters:
chapters=soup.findAll('a', href=re.compile(r'/works/'+self.story.getMetadata('storyId')+"/chapters/\d+$"))
self.story.setMetadata('numChapters',len(chapters))
logger.debug("numChapters: (%s)"%self.story.getMetadata('numChapters'))
if len(chapters)==1:
self.chapterUrls.append((self.story.getMetadata('title'),'http://'+self.host+chapters[0]['href']+addurl))
else:
for index, chapter in enumerate(chapters):
# strip just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+chapter['href']+addurl))
# (2013-09-21)
date = stripHTML(chapter.findNext('span'))[1:-1]
chapterDate = makeDate(date,self.dateformat)
if newestChapter == None or chapterDate > newestChapter:
newestChapter = chapterDate
self.newestChapterNum = index
a = metasoup.find('blockquote',{'class':'userstuff'})
if a != None:
self.setDescription(url,a)
#self.story.setMetadata('description',a.text)
a = metasoup.find('dd',{'class':"rating tags"})
if a != None:
self.story.setMetadata('rating',stripHTML(a.text))
d = metasoup.find('dd',{'class':"language"})
if d != None:
self.story.setMetadata('language',stripHTML(d.text))
a = metasoup.find('dd',{'class':"fandom tags"})
fandoms = a.findAll('a',{'class':"tag"})
for fandom in fandoms:
self.story.addToList('fandoms',fandom.string)
a = metasoup.find('dd',{'class':"warning tags"})
if a != None:
warnings = a.findAll('a',{'class':"tag"})
for warning in warnings:
self.story.addToList('warnings',warning.string)
a = metasoup.find('dd',{'class':"freeform tags"})
if a != None:
genres = a.findAll('a',{'class':"tag"})
for genre in genres:
self.story.addToList('freeformtags',genre.string)
a = metasoup.find('dd',{'class':"category tags"})
if a != None:
genres = a.findAll('a',{'class':"tag"})
for genre in genres:
if genre != "Gen":
self.story.addToList('ao3categories',genre.string)
a = metasoup.find('dd',{'class':"character tags"})
if a != None:
chars = a.findAll('a',{'class':"tag"})
for char in chars:
self.story.addToList('characters',char.string)
a = metasoup.find('dd',{'class':"relationship tags"})
if a != None:
ships = a.findAll('a',{'class':"tag"})
for ship in ships:
self.story.addToList('ships',ship.string)
a = metasoup.find('dd',{'class':"collections"})
if a != None:
collections = a.findAll('a')
for collection in collections:
self.story.addToList('collections',collection.string)
stats = metasoup.find('dl',{'class':'stats'})
dt = stats.findAll('dt')
dd = stats.findAll('dd')
for x in range(0,len(dt)):
label = dt[x].text
value = dd[x].text
if 'Words:' in label:
self.story.setMetadata('numWords', value)
if 'Comments:' in label:
self.story.setMetadata('comments', value)
if 'Kudos:' in label:
self.story.setMetadata('kudos', value)
if 'Hits:' in label:
self.story.setMetadata('hits', value)
if 'Bookmarks:' in label:
self.story.setMetadata('bookmarks', value)
if 'Chapters:' in label:
if value.split('/')[0] == value.split('/')[1]:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
if 'Completed' in label:
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
# Find Series name from series URL.
ddseries = metasoup.find('dd',{'class':"series"})
if ddseries:
for i, a in enumerate(ddseries.findAll('a', href=re.compile(r"/series/\d+"))):
series_name = stripHTML(a)
series_url = 'http://'+self.host+a['href']
series_index = int(stripHTML(a.previousSibling).replace(', ','').split(' ')[1]) # "Part # of" or ", Part #"
self.story.setMetadata('series%02d'%i,"%s [%s]"%(series_name,series_index))
self.story.setMetadata('series%02dUrl'%i,series_url)
if i == 0:
self.setSeries(series_name, series_index)
self.story.setMetadata('seriesUrl',series_url)
def hookForUpdates(self,chaptercount):
if self.oldchapters and len(self.oldchapters) > self.newestChapterNum:
logger.info("Existing epub has %s chapters\nNewest chapter is %s. Discarding old chapters from there on."%(len(self.oldchapters), self.newestChapterNum+1))
self.oldchapters = self.oldchapters[:self.newestChapterNum]
return len(self.oldchapters)
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
chapter=self.make_soup('<div class="story"></div>').find('div')
data = self._fetchUrl(url)
soup = self.make_soup(data)
exclude_notes=self.getConfigList('exclude_notes')
def append_tag(elem,tag,string):
'''bs4 requires tags be added separately.'''
new_tag = soup.new_tag(tag)
new_tag.string=string
elem.append(new_tag)
if 'authorheadnotes' not in exclude_notes:
headnotes = soup.find('div', {'class' : "preface group"}).find('div', {'class' : "notes module"})
if headnotes != None:
headnotes = headnotes.find('blockquote', {'class' : "userstuff"})
if headnotes != None:
append_tag(chapter,'b',"Author's Note:")
chapter.append(headnotes)
if 'chaptersummary' not in exclude_notes:
chapsumm = soup.find('div', {'id' : "summary"})
if chapsumm != None:
chapsumm = chapsumm.find('blockquote')
append_tag(chapter,'b',"Summary for the Chapter:")
chapter.append(chapsumm)
if 'chapterheadnotes' not in exclude_notes:
chapnotes = soup.find('div', {'id' : "notes"})
if chapnotes != None:
chapnotes = chapnotes.find('blockquote')
if chapnotes != None:
append_tag(chapter,'b',"Notes for the Chapter:")
chapter.append(chapnotes)
text = soup.find('div', {'class' : "userstuff module"})
chtext = text.find('h3', {'class' : "landmark heading"})
if chtext:
chtext.extract()
chapter.append(text)
if 'chapterfootnotes' not in exclude_notes:
chapfoot = soup.find('div', {'class' : "end notes module", 'role' : "complementary"})
if chapfoot != None:
chapfoot = chapfoot.find('blockquote')
append_tag(chapter,'b',"Notes for the Chapter:")
chapter.append(chapfoot)
if 'authorfootnotes' not in exclude_notes:
footnotes = soup.find('div', {'id' : "work_endnotes"})
if footnotes != None:
footnotes = footnotes.find('blockquote')
append_tag(chapter,'b',"Author's Note:")
chapter.append(footnotes)
if None == soup:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,chapter)
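The newest-chapter scan feeding hookForUpdates above can be sketched in isolation: AO3 authors may insert chapters out of order or re-date earlier ones, so the adapter records the index of the newest-dated chapter and discards previously downloaded chapters from there on. A minimal Python 3 sketch, assuming the `%Y-%b-%d` date format used by the adapter; `newest_chapter_index` is a hypothetical name:

```python
from datetime import datetime

def newest_chapter_index(chapter_dates, dateformat="%Y-%b-%d"):
    """Return the index of the most recently dated chapter, or None
    for an empty list."""
    newest = None
    newest_index = None
    for index, text in enumerate(chapter_dates):
        chapter_date = datetime.strptime(text, dateformat)
        # keep the *latest* date seen, wherever it appears in the list.
        if newest is None or chapter_date > newest:
            newest = chapter_date
            newest_index = index
    return newest_index

dates = ["2013-Sep-21", "2014-Jan-05", "2013-Dec-31"]
print(newest_chapter_index(dates))  # 1
```

Truncating the cached chapter list at this index (as hookForUpdates does) forces a re-download of anything the author may have edited after it.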


@@ -0,0 +1,190 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return ArchiveSkyeHawkeComAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class ArchiveSkyeHawkeComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/story.php?no='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ash')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%Y-%m-%d"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'archive.skyehawke.com'
@classmethod
def getAcceptDomains(cls):
return ['archive.skyehawke.com','www.skyehawke.com']
@classmethod
def getSiteExampleURLs(cls):
return "http://archive.skyehawke.com/story.php?no=1234 http://www.skyehawke.com/archive/story.php?no=1234 http://skyehawke.com/archive/story.php?no=1234"
def getSiteURLPattern(self):
return re.escape("http://")+r"(archive|www)\.skyehawke\.com/(archive/)?story\.php\?no=\d+$"
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
url = self.url
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('div', {'class':"story border"}).find('span',{'class':'left'})
title=stripHTML(a).split('"')[1]
self.story.setMetadata('title',title)
# Find authorid and URL from... author url.
author = a.find('a')
self.story.setMetadata('authorId',author['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+author['href'])
self.story.setMetadata('author',author.string)
authorSoup = self.make_soup(self._fetchUrl(self.story.getMetadata('authorUrl')))
chapter=soup.find('select',{'name':'chapter'}).findAll('option')
for i in range(1,len(chapter)):
ch=chapter[i]
self.chapterUrls.append((stripHTML(ch),ch['value']))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
box=soup.find('div', {'class': "container borderridge"})
sum=box.find('span').text
self.setDescription(url,sum)
boxes=soup.findAll('div', {'class': "container bordersolid"})
for box in boxes:
if box.find('b') != None and box.find('b').text == "History and Story Information":
for b in box.findAll('b'):
if "words" in b.nextSibling:
self.story.setMetadata('numWords', b.text)
if "archived" in b.previousSibling:
self.story.setMetadata('datePublished', makeDate(stripHTML(b.text), self.dateformat))
if "updated" in b.previousSibling:
self.story.setMetadata('dateUpdated', makeDate(stripHTML(b.text), self.dateformat))
if "fandom" in b.nextSibling:
self.story.addToList('category', b.text)
for br in box.findAll('br'):
br.replaceWith('split')
genre=box.text.split("Genre:")[1].split("split")[0]
if not "Unspecified" in genre:
self.story.addToList('genre',genre)
if box.find('span') != None and box.find('span').text == "WARNING":
rating=box.findAll('span')[1]
rating.find('br').replaceWith('split')
rating=rating.text.replace("This story is rated",'').split('split')[0]
self.story.setMetadata('rating',rating)
logger.debug(self.story.getMetadata('rating'))
warnings=box.find('ol')
if warnings != None:
warnings=warnings.text.replace(']', '').replace('[', '').split(' ')
for warning in warnings:
self.story.addToList('warnings',warning)
for asoup in authorSoup.findAll('div', {'class':"story bordersolid"}):
if asoup.find('a')['href'] == 'story.php?no='+self.story.getMetadata('storyId'):
if '[ Completed ]' in asoup.text:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
chars=asoup.findNext('div').text.split('Characters')[1].split(']')[0]
for char in chars.split(','):
if not "None" in char:
self.story.addToList('characters',char)
break
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div',{'class':"chapter bordersolid"}).findNext('div').findNext('div')
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
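The series-position loop these adapters share (walk the story links scraped from the series or author page in order, and take the 1-based index of the current story) reduces to a few lines. An illustrative Python 3 sketch; `series_position` and the sample hrefs are hypothetical:

```python
def series_position(story_hrefs, story_href):
    """Return the 1-based position of story_href among the series'
    story links, or None if it isn't in the series."""
    for i, href in enumerate(story_hrefs, start=1):
        if href == story_href:
            return i
    return None

hrefs = ['fiction/viewstory.php?sid=10',
         'fiction/viewstory.php?sid=1654',
         'fiction/viewstory.php?sid=1882']
print(series_position(hrefs, 'fiction/viewstory.php?sid=1654'))  # 2
```

The adapters then pass this index to setSeries along with the series name; a miss is deliberately harmless, matching the bare `except: pass` around the series parsing above.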


@@ -0,0 +1,302 @@
# -*- coding: utf-8 -*-
# Copyright 2016 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ####### Not all labels are captured; they are not formatted
# ####### correctly on the webpage.
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return ArtemisFowlComAdapter # XXX
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class ArtemisFowlComAdapter(BaseSiteAdapter): # XXX
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
# XXX Most sites don't have the /fiction part. Replace all to remove it usually.
self._setURL('http://' + self.getSiteDomain() + '/fanfiction/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','afcff') # XXX
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d/%m/%y" # XXX
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.artemis-fowl.com' # XXX
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/fanfiction/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/fanfiction/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&warning=5"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self._fetchUrl(url)
# Since the warning text can change by warning level, let's
# look for the warning pass url. ksarchive uses
# &amp;warning= -- actually, so do other sites. Must be an
# eFiction book.
# fanfiction/viewstory.php?sid=1882&amp;warning=4
# fanfiction/viewstory.php?sid=1654&amp;ageconsent=ok&amp;warning=2
#print data
#m = re.search(r"'fanfiction/viewstory.php\?sid=10(&amp;warning=5)'",data) # example of a concrete match; real search below
m = re.search(r"'fanfiction/viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.FailedToDownload(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
pagetitle = soup.find('div',{'id':'pagetitle'})
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = pagetitle.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/fanfiction/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ""
while 'label' not in defaultGetattr(value,'class'):
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=3'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"fanfiction/viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^fanfiction/viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('fanfiction/viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)


@ -1,160 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2018 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
import os
from bs4.element import Comment
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six.moves.urllib import parse as urlparse
from .base_adapter import BaseSiteAdapter, makeDate
def getClass():
return ASexStoriesComAdapter
class ASexStoriesComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','asscom')
# Extract story ID from base URL, http://www.asexstories.com/Halloween-party-with-the-phantom/
storyId = self.parsedUrl.path.split('/',)[1]
self.story.setMetadata('storyId', storyId)
## set url
self._setURL(url)
@staticmethod
def getSiteDomain():
return 'www.asexstories.com'
@classmethod
def getAcceptDomains(cls):
return ['www.asexstories.com']
@classmethod
def getSiteExampleURLs(cls):
return "http://www.asexstories.com/StoryTitle/"
def getSiteURLPattern(self):
return r"https?://(www\.)?asexstories\.com/([a-zA-Z0-9_-]+)/"
def extractChapterUrlsAndMetadata(self):
"""
Chapters are located at /StoryName/ (for single-chapter
stories), or /StoryName/index#.html for multiple chapters (# is a
non-padded incrementing number, like index1.html, index2.html, ...,
index10.html).
This site doesn't have much in the way of metadata, except on the
Category and Tags index pages, so we will get what we can.
Also, as this is an Adult site, the is_adult check is mandatory.
"""
if not (self.is_adult or self.getConfig("is_adult")):
raise exceptions.AdultCheckRequired(self.url)
data1 = self.get_request(self.url)
soup1 = self.make_soup(data1)
#strip comments from soup
[comment.extract() for comment in soup1.find_all(string=lambda text:isinstance(text, Comment))]
if 'Page Not Found.' in data1:
raise exceptions.StoryDoesNotExist(self.url)
url = self.url
# Extract metadata
# Title
title = soup1.find('div',{'class':'story-top-block'}).find('h1')
self.story.setMetadata('title', title.string)
# Author
author = soup1.find('div',{'class':'story-info'}).find_all('div',{'class':'story-info-bl'})[1].find('a')
authorurl = author['href']
self.story.setMetadata('author', author.string)
self.story.setMetadata('authorUrl', authorurl)
authorid = os.path.splitext(os.path.basename(authorurl))[0]
self.story.setMetadata('authorId', authorid)
# Description
### The only way to get the Description (summary) is to
### parse through the Category and/or Tags index pages.
### To get a summary, I've taken the first 150 characters
### from the story.
description = soup1.find('div',{'class':'story-block'}).get_text(strip=True)
description = description.encode('utf-8','ignore').strip()[0:150].decode('utf-8','ignore')
self.setDescription(url,'Excerpt from beginning of story: '+description+'...')
### The first 'chapter' is not listed in the links, so we have to
### add it before the rest of the pages, if any
self.add_chapter('1', self.url)
chapterTable = soup1.find('div',{'class':'pages'}).find_all('a')
if chapterTable is not None:
# Multi-chapter story
for page in chapterTable:
chapterTitle = page.string
chapterUrl = urlparse.urljoin(self.url, page['href'])
if chapterUrl.startswith(self.url): # there are other URLs in the pages block now.
self.add_chapter(chapterTitle, chapterUrl)
rated = soup1.find('div',{'class':'story-info'}).find_all('div',{'class':'story-info-bl5'})[0].find('img')['title'].replace('- Rate','').strip()
self.story.setMetadata('rating',rated)
self.story.setMetadata('dateUpdated', makeDate('01/01/2001', '%m/%d/%Y'))
logger.debug("Story: <%s>", self.story)
return
def getChapterText(self, url):
logger.debug('Getting chapter text from <%s>' % url)
#logger.info('Getting chapter text from <%s>' % url)
data1 = self.get_request(url)
soup1 = self.make_soup(data1)
# get story text
story1 = soup1.find('div', {'class':'story-block'})
### This site has links embedded in the text that lead
### to either a video site or to a tags index page.
### The default is to remove them, but you can set
### strip_text_links to false to keep them in the text.
if self.getConfig('strip_text_links'):
for anchor in story1('a', {'target': '_blank'}):
anchor.replaceWith(anchor.string)
## remove ad links in the story text and their following <br>
for anchor in story1('a', {'rel': 'nofollow'}):
br = anchor.find_next_sibling('br')
if br:
br.extract()
anchor.extract()
return self.utf8FromSoup(url, story1)


@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -16,17 +16,17 @@
#
# Software: eFiction
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return AshwinderSycophantHexComAdapter
@ -38,6 +38,11 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@ -45,10 +50,10 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('https://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','asph')
@ -64,10 +69,10 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return r"https?://"+re.escape(self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
@ -92,11 +97,11 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
params['intent'] = ''
params['submit'] = 'Submit'
loginUrl = 'https://' + self.getSiteDomain() + '/user.php'
loginUrl = 'http://' + self.getSiteDomain() + '/user.php'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self.post_request(loginUrl, params)
d = self._fetchUrl(loginUrl, params)
if "Logout" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
@ -113,52 +118,61 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
url = self.url
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self.get_request(url)
data = self._fetchUrl(url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','https://'+self.host+'/'+a['href'])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
asoup = self.make_soup(self.get_request(self.story.getMetadata('authorUrl')))
asoup = self.make_soup(self._fetchUrl(self.story.getMetadata('authorUrl')))
try:
# in case link points somewhere other than the first chapter
a = soup.find_all('option')[1]['value']
a = soup.findAll('option')[1]['value']
self.story.setMetadata('storyId',a.split('=',)[1])
url = 'https://'+self.host+'/'+a
soup = self.make_soup(self.get_request(url))
url = 'http://'+self.host+'/'+a
soup = self.make_soup(self._fetchUrl(url))
except:
pass
for info in asoup.find_all('table', {'width' : '100%', 'bordercolor' : re.compile(r'#')}):
for info in asoup.findAll('table', {'width' : '100%', 'bordercolor' : re.compile(r'#')}):
a = info.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
if a != None:
self.story.setMetadata('title',stripHTML(a))
break
# Find the chapters:
chapters=soup.find_all('a', href=re.compile(r'viewstory.php\?sid=\d+&i=1$'))
chapters=soup.findAll('a', href=re.compile(r'viewstory.php\?sid=\d+&i=1$'))
if len(chapters) == 0:
self.add_chapter(self.story.getMetadata('title'),url)
self.chapterUrls.append((self.story.getMetadata('title'),url))
else:
for chapter in chapters:
# just in case there's tags, like <i> in chapter titles.
self.add_chapter(chapter,'https://'+self.host+'/'+chapter['href'])
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their metadata
# formatting, so it's a little ugly.
@ -169,11 +183,11 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
return d.name
except:
return ""
cats = info.find_all('a',href=re.compile('categories.php'))
cats = info.findAll('a',href=re.compile('categories.php'))
for cat in cats:
self.story.addToList('category',cat.string)
a = info.find('a', href=re.compile(r'reviews.php\?sid='+self.story.getMetadata('storyId')))
val = a.nextSibling
svalue = ""
@ -185,10 +199,8 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
val = val.nextSibling
self.setDescription(url,svalue)
## <td><span class="sb"><b>Published:</b> 04/08/2007</td>
## one story had <b>Updated...</b> in the description. Restrict to sub-table
labels = info.find('table').find_all('b')
# <span class="label">Rated:</span> NC-17<br /> etc
labels = info.findAll('b')
for labelspan in labels:
value = labelspan.nextSibling
label = stripHTML(labelspan)
@ -230,8 +242,8 @@ class AshwinderSycophantHexComAdapter(BaseSiteAdapter):
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self.get_request(url)
data = self._fetchUrl(url)
soup = self.make_soup(data) # some chapters seem to be hanging up on those tags, so it is safer to close them


@ -1,290 +0,0 @@
# -*- coding: utf-8 -*-
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
import re
import json
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
def getClass():
return AsianFanFicsComAdapter
logger = logging.getLogger(__name__)
class AsianFanFicsComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.username = ""
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.path.split('/',)[3])
# get storyId from url--url validation guarantees query correct
m = re.match(self.getSiteURLPattern(),url)
if m:
self.story.setMetadata('storyId',m.group('id'))
# normalized story URL.
self._setURL('https://' + self.getSiteDomain() + '/story/view/'+self.story.getMetadata('storyId'))
else:
raise exceptions.InvalidStoryURL(url,
self.getSiteDomain(),
self.getSiteExampleURLs())
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','asnff')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%Y-%b-%d"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.asianfanfics.com'
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/story/view/123456 https://"+cls.getSiteDomain()+"/story/view/123456/story-title-here https://"+cls.getSiteDomain()+"/story/view/123456/1"
def getSiteURLPattern(self):
return r"https?://"+re.escape(self.getSiteDomain())+r"/story/view/0*(?P<id>\d+)"
def performLogin(self, url, data):
params = {}
if self.password:
params['username'] = self.username
params['password'] = self.password
else:
params['username'] = self.getConfig("username")
params['password'] = self.getConfig("password")
if not params['username']:
raise exceptions.FailedToLogin(url,params['username'])
params['from_url'] = url
# capture token from JS script, not appearing in form now.
csrf_token_search = 'csrfToken = "'
params['csrf_aff_token'] = data[data.index(csrf_token_search)+len(csrf_token_search):]
params['csrf_aff_token'] = params['csrf_aff_token'][:params['csrf_aff_token'].index('"')]
loginUrl = 'https://' + self.getSiteDomain() + '/login/index'
logger.info("Will now login to URL (%s) as (%s)" % (loginUrl, params['username']))
data = self.post_request(loginUrl, params)
soup = self.make_soup(data)
if self.loginNeededCheck(data):
logger.info('Failed to login to URL %s as %s' % (loginUrl, params['username']))
raise exceptions.FailedToLogin(url,params['username'])
def loginNeededCheck(self,data):
return "isLoggedIn = false" in data
def doStorySubscribe(self, url, soup):
subHref = soup.find('a',{'id':'subscribe'})
if subHref:
#does not work when using https - 403
subUrl = 'http://' + self.getSiteDomain() + subHref['href']
self.get_request(subUrl)
data = self.get_request(url,usecache=False)
soup = self.make_soup(data)
check = soup.find('div',{'class':'click-to-read-full'})
if check:
return False
else:
return soup
else:
return False
## Getting the chapter list and the meta data, plus 'is adult' checking.
def doExtractChapterUrlsAndMetadata(self,get_cover=True):
url = self.url
logger.info("url: "+url)
soup = None
try:
data = self.get_request(url)
soup = self.make_soup(data)
except exceptions.HTTPErrorFFF as e:
if e.status_code != 404:
raise
data = self.decode_data(e.data)
# logger.debug(data)
if not soup or self.loginNeededCheck(data):
# always login if not already to avoid lots of headaches
self.performLogin(url,data)
# refresh website after logging in
data = self.get_request(url,usecache=False)
soup = self.make_soup(data)
# subscription check
# logger.debug(soup)
subCheck = soup.find('div',{'class':'click-to-read-full'})
if subCheck and self.getConfig("auto_sub"):
subSoup = self.doStorySubscribe(url,soup)
if subSoup:
soup = subSoup
else:
raise exceptions.FailedToDownload("Error when subscribing to story. This usually means a change in the website code.")
elif subCheck and not self.getConfig("auto_sub"):
raise exceptions.FailedToDownload("This story is only available to subscribers. You can subscribe manually on the web site, or set auto_sub:true in personal.ini.")
## Title
a = soup.find('h1', {'id': 'story-title'})
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
mainmeta = soup.find('footer', {'class': 'main-meta'})
alist = mainmeta.find('span', string='Author(s)')
alist = alist.parent.find_all('a', href=re.compile(r"/profile/u/[^/]+"))
for a in alist:
self.story.addToList('authorId',a['href'].split('/')[-1])
self.story.addToList('authorUrl','https://'+self.host+a['href'])
self.story.addToList('author',a.text)
newestChapter = None
self.newestChapterNum = None
# Find the chapters:
chapters=soup.find('select',{'name':'chapter-nav'})
hrefattr=None
if chapters:
chapters=chapters.find_all('option')
hrefattr='value'
else: # didn't find <select name='chapter-nav', look for alternative
chapters=soup.find('div',{'class':'widget--chapters'}).find_all('a')
hrefattr='href'
for index, chapter in enumerate(chapters):
if chapter.text != 'Foreword' and 'Collapse chapters' not in chapter.text:
self.add_chapter(chapter.text,'https://' + self.getSiteDomain() + chapter[hrefattr])
# note: AFF truncates chapter names in the list; this is partially fixed later on
# find timestamp
a = soup.find('span', string='Updated')
if a == None:
a = soup.find('span', string='Published') # use published date if work was never updated
a = a.parent.find('time')
chapterDate = makeDate(a['datetime'],self.dateformat)
if newestChapter == None or chapterDate > newestChapter:
newestChapter = chapterDate
self.newestChapterNum = index
# story status
a = mainmeta.find('span', string='Completed')
if a:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
# story description
try:
jsonlink = soup.find('script',string=re.compile(r'/api/forewords/[0-9]+/foreword_[0-9a-z]+.json')).get_text().split('"')[1] # grabs url from quotation marks
fore_json = json.loads(self.get_request(jsonlink))
content = self.make_soup(fore_json['post']).find('body') # BS4 adds <html><body> if not present.
a = content.find('div', {'id':'story-description'})
except:
# not all stories have a foreword link.
a = soup.find('div', {'id':'story-description'})
if a:
self.setDescription(url,a)
# story tags
a = mainmeta.find('span',string='Tags')
if a:
tags = a.parent.find_all('a')
for tag in tags:
self.story.addToList('tags', tag.text)
# story tags
a = mainmeta.find('span',string='Characters')
if a:
self.story.addToList('characters', a.nextSibling)
# published on
a = soup.find('span', string='Published')
a = a.parent.find('time')
self.story.setMetadata('datePublished', makeDate(a['datetime'], self.dateformat))
# updated on
a = soup.find('span', string='Updated')
if a:
a = a.parent.find('time')
self.story.setMetadata('dateUpdated', makeDate(a['datetime'], self.dateformat))
# word count
a = soup.find('span', string='Total Word Count')
if a:
a = a.find_next('span')
self.story.setMetadata('numWords', int(a.text.split()[0]))
# upvote, subs, and views
a = soup.find('div',{'class':'title-meta'})
spans = a.find_all('span', recursive=False)
self.story.setMetadata('upvotes', re.search(r'\(([^)]+)', spans[0].find('span').text).group(1))
self.story.setMetadata('subscribers', re.search(r'\(([^)]+)', spans[1].find('span').text).group(1))
if len(spans) > 2: # views can be private
self.story.setMetadata('views', spans[2].text.split()[0])
# cover art in the form of a div before chapter content
if get_cover:
cover_url = ""
a = soup.find('div',{'id':'bodyText'})
if a:
a = a.find('div',{'class':'text-center'})
if a:
cover_url = a.find('img')['src']
self.setCoverImage(url,cover_url)
# grab the text for an individual chapter
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self.get_request(url)
soup = self.make_soup(data)
# logger.debug(data)
ageform = soup.select_one('form[action="/account/toggle_age"]')
# logger.debug(ageform)
if ageform and (self.is_adult or self.getConfig("is_adult")):
params = {}
params['is_of_age']=ageform.select_one('input#is_of_age')['value']
params['current_url']=ageform.select_one('input#current_url')['value']
params['csrf_aff_token']=ageform.select_one('input[name="csrf_aff_token"]')['value']
loginUrl = 'https://' + self.getSiteDomain() + '/account/mark_over_18'
logger.info("Will now toggle age to URL (%s)" % (loginUrl))
# logger.debug(params)
data = self.post_request(loginUrl, params)
soup = self.make_soup(data)
# logger.debug(data)
content = soup.find('div', {'id': 'user-submitted-body'})
if self.getConfig('inject_chapter_image'):
logger.debug("Injecting chapter image")
imgdiv = soup.select_one('div#bodyText div.bot-spacer')
if imgdiv:
content.insert(0, "\n")
content.insert(0, imgdiv)
content.insert(0, "\n")
if self.getConfig('inject_chapter_title'):
logger.debug("Injecting full-length chapter title")
title = soup.find('h1', {'id' : 'chapter-title'}).text
newTitle = soup.new_tag('h3')
newTitle.string = title
content.insert(0, "\n")
content.insert(0, newTitle)
content.insert(0, "\n")
return self.utf8FromSoup(url,content)


@ -0,0 +1,227 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return Asr3SlashzoneOrgAdapter
class Asr3SlashzoneOrgAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/archive/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','asr3')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d/%m/%y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'asr3.slashzone.org'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/archive/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/archive/viewstory.php?sid=")+r"\d+$"
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&ageconsent=ok&warning=3"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
m = re.search(r"'viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
#print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/archive/'+a['href'])
self.story.setMetadata('author',a.string)
# Rating
rate = stripHTML(soup.find('div',{'id':'pagetitle'}))
rate = rate[rate.rindex('[')+1:rate.rindex(']')]
self.story.setMetadata('rating', rate)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/archive/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their metadata
# formatting, so it's a little ugly.
metadiv = soup.find('div',{'class':'content'})
smalldiv = metadiv.find('div',{'class':'small'})
categorys = smalldiv.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for category in categorys:
self.story.addToList('category',category.string)
chars = smalldiv.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
ships = smalldiv.parent.findAll('a',href=re.compile(r'browse\.php\?type=class&type_id=2&classid=1'))
for ship in ships:
self.story.addToList('ships',ship.string)
metatext = stripHTML(smalldiv)
if 'Completed: Yes' in metatext:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
wordstart=metatext.rindex('Word count:')+12
words = metatext[wordstart:metatext.index(' ',wordstart)]
self.story.setMetadata('numWords', words)
datesdiv = soup.find('div',{'class':'bottom'})
dates = stripHTML(datesdiv).split()
# Published: 04/26/2011 Updated: 03/06/2013
self.story.setMetadata('datePublished', makeDate(dates[1], self.dateformat))
self.story.setMetadata('dateUpdated', makeDate(dates[3], self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/archive/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
# can't use ^viewstory...$ in case of higher rated stories with javascript href.
storyas = seriessoup.findAll('a', href=re.compile(r'viewstory.php\?sid=\d+'))
i=1
for a in storyas:
# skip 'report this' and 'TOC' links
if 'contact.php' not in a['href'] and 'index' not in a['href']:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# remove 'small' leaving only summary.
smalldiv.extract()
self.setDescription(url,metadiv)
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)


@ -1,188 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
###########################################################################
### written by GComyn - 10/06/2016
### updated by GComyn - 10/24/2016
### updated by GComyn - November 25, 2016
### Fixed the re.compile problem with the chapters
### Removed the slash '\' from the title
### Fixed the removal of the extra tags from some of the stories and
### removed the attributes from the paragraph and span tags
###########################################################################
from __future__ import absolute_import
'''
This works, but some of the stories have abysmal formatting, so it would
probably need to be edited for reading.
I've seen one story that downloaded at 25M, but was only 201K after the
formatting was corrected.
Right now it is written to download each chapter separately, but I may change
that to get the whole story. It will still have formatting problems, but should
be able to get the longer stories this way.
[Edited November 25, 2016] After looking at the single page story, I've come to
the conclusion that I (at this time) can't figure out a way to use it to download
the stories. There is no designation within the page to denote which chapter is
which. So, I'm going to leave it as is.
Also, the site is notorious for lagging, so some of the longer stories will
probably not be downloadable, since this program doesn't wait long enough
for the site to catch up.
'''
import time
import logging
logger = logging.getLogger(__name__)
import re
import sys
from bs4 import Comment
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from ..six.moves.urllib import parse as urlparse
from .base_adapter import BaseSiteAdapter, makeDate
def getClass():
return BDSMLibraryComSiteAdapter
class BDSMLibraryComSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only storyid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
self._setURL('https://{0}/stories/story.php?storyid={1}'.format(self.getSiteDomain(), self.story.getMetadata('storyId')))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','bdsmlib')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%b %d, %Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.bdsmlibrary.com'
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/stories/story.php?storyid=1234"
def getSiteURLPattern(self):
return r"https?://"+re.escape(self.getSiteDomain()+"/stories/story.php?storyid=")+r"\d+$"
def extractChapterUrlsAndMetadata(self):
if not (self.is_adult or self.getConfig("is_adult")):
raise exceptions.AdultCheckRequired(self.url)
data = self.get_request(self.url)
if 'The story does not exist' in data:
raise exceptions.StoryDoesNotExist(self.url)
soup = self.make_soup(data)
# Extract metadata
title=soup.title.text.replace('BDSM Library - Story: ','').replace('\\','')
self.story.setMetadata('title', title)
# Author
author = soup.find('a', href=re.compile(r"/stories/author.php\?authorid=\d+"))
if author:
authorurl = urlparse.urljoin(self.url, author['href'])
self.story.setMetadata('author', author.text)
self.story.setMetadata('authorUrl', authorurl)
authorid = author['href'].split('=')[1]
self.story.setMetadata('authorId', authorid)
else:
logger.info("Failed to find Author, setting to Anonymous")
self.story.setMetadata('author','Anonymous')
self.story.setMetadata('authorUrl','https://' + self.getSiteDomain() + '/')
self.story.setMetadata('authorId','0')
# Find the chapters:
# The update date is with the chapter links... so we will update it here as well
for chapter in soup.find_all('a', href=re.compile(r'/stories/chapter.php\?storyid='+self.story.getMetadata('storyId')+r"&chapterid=\d+$")):
value = chapter.findNext('td').findNext('td').string.replace('(added on','').replace(')','').strip()
self.story.setMetadata('dateUpdated', makeDate(value, self.dateformat))
self.add_chapter(chapter,'https://'+self.getSiteDomain()+chapter['href'])
# Get the MetaData
# Erotica Tags
tags = soup.find_all('a',href=re.compile(r'/stories/search.php\?selectedcode'))
for tag in tags:
self.story.addToList('eroticatags',tag.text)
for td in soup.find_all('td'):
if len(td.text)>0:
if 'Added on:' in td.text and '<table' not in unicode(td):
value = td.text.replace('Added on:','').strip()
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
elif 'Synopsis:' in td.text and '<table' not in unicode(td):
value = td.text.replace('\n','').replace('Synopsis:','').strip()
self.setDescription(self.url,stripHTML(value))
elif 'Size:' in td.text and '<table' not in unicode(td):
value = td.text.replace('\n','').replace('Size:','').strip()
self.story.setMetadata('size',stripHTML(value))
elif 'Comments:' in td.text and '<table' not in unicode(td):
value = td.text.replace('\n','').replace('Comments:','').strip()
self.story.setMetadata('comments',stripHTML(value))
# grab the text for an individual chapter.
def getChapterText(self, url):
#Since each chapter is on 1 page, we don't need to do anything special, just get the content of the page.
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
chaptertag = soup.find('div',{'class' : 'storyblock'})
# Some of the stories have the chapters in <pre> sections, so have to check for that
if chaptertag == None:
chaptertag = soup.find('pre')
if chaptertag == None:
raise exceptions.FailedToDownload("Error downloading Chapter: {0}! Missing required element!".format(url))
#strip comments from soup
[comment.extract() for comment in chaptertag.find_all(string=lambda text:isinstance(text, Comment))]
# BDSM Library basically wraps its own html around the document,
# so we will be removing the script, title and meta content from the
# storyblock
for tag in chaptertag.find_all('head') + chaptertag.find_all('style') + chaptertag.find_all('title') + chaptertag.find_all('meta') + chaptertag.find_all('o:p') + chaptertag.find_all('link'):
tag.extract()
for tag in chaptertag.find_all('o:smarttagtype'):
tag.name = 'span'
## I'm going to take the attributes off all of the tags
## because they usually refer to the style that we removed above.
for tag in chaptertag.find_all(True):
tag.attrs = None
return self.utf8FromSoup(url,chaptertag)
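The chapter-table scraping above trims the "(added on ...)" text down to a bare date before handing it to makeDate with dateformat "%b %d, %Y"; a minimal standalone sketch using plain strptime (the sample string is fabricated):

```python
from datetime import datetime

# The adapter declares dateformat = "%b %d, %Y" and strips the surrounding
# "(added on ...)" text before parsing; the input string here is a made-up
# example of what the chapter table contains.
raw = "(added on Oct 24, 2016)"
cleaned = raw.replace("(added on", "").replace(")", "").strip()
updated = datetime.strptime(cleaned, "%b %d, %Y")
print(updated.date())  # 2016-10-24
```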


@@ -1,15 +1,12 @@
from __future__ import absolute_import
from datetime import timedelta
import re
import urllib2
import urlparse
import logging
logger = logging.getLogger(__name__)
from bs4 import BeautifulSoup
from ..htmlcleanup import stripHTML
# py2 vs py3 transition
from ..six.moves.urllib import parse as urlparse
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
from .. import exceptions
@@ -27,7 +24,7 @@ class BloodshedverseComAdapter(BaseSiteAdapter):
SITE_ABBREVIATION = 'bvc'
SITE_DOMAIN = 'bloodshedverse.com'
BASE_URL = 'https://' + SITE_DOMAIN + '/'
BASE_URL = 'http://' + SITE_DOMAIN + '/'
READ_URL_TEMPLATE = BASE_URL + 'stories.php?go=read&no=%s'
STARTED_DATETIME_FORMAT = '%m/%d/%Y'
@@ -43,6 +40,19 @@ class BloodshedverseComAdapter(BaseSiteAdapter):
self._setURL(self.READ_URL_TEMPLATE % story_no)
self.story.setMetadata('siteabbrev', self.SITE_ABBREVIATION)
def _customized_fetch_url(self, url, exception=None, parameters=None):
if exception:
try:
data = self._fetchUrl(url, parameters)
except urllib2.HTTPError:
raise exception(self.url)
# Just let self._fetchUrl throw the exception, don't catch and
# customize it.
else:
data = self._fetchUrl(url, parameters)
return self.make_soup(data)
@staticmethod
def getSiteDomain():
return BloodshedverseComAdapter.SITE_DOMAIN
@@ -52,7 +62,7 @@ class BloodshedverseComAdapter(BaseSiteAdapter):
return cls.READ_URL_TEMPLATE % 1234
def getSiteURLPattern(self):
return r'https?://' + re.escape(self.SITE_DOMAIN + '/stories.php?go=') + r'(read|chapters)\&(amp;)?no=\d+$'
return re.escape(self.BASE_URL + 'stories.php?go=') + r'(read|chapters)\&(amp;)?no=\d+$'
# Override stripURLParameters so the "no" parameter won't get stripped
@classmethod
@@ -60,9 +70,7 @@ class BloodshedverseComAdapter(BaseSiteAdapter):
return url
def extractChapterUrlsAndMetadata(self):
logger.debug("URL: "+self.url)
soup = self.make_soup(self.get_request(self.url))
soup = self._customized_fetch_url(self.url)
# Since no 404 error code we have to raise the exception ourselves.
# A title that is just 'by' indicates that there is no author name
@@ -73,24 +81,14 @@ class BloodshedverseComAdapter(BaseSiteAdapter):
for option in soup.find('select', {'name': 'chapter'}):
title = stripHTML(option)
url = self.READ_URL_TEMPLATE % option['value']
self.add_chapter(title, url)
self.chapterUrls.append((title, url))
# Reset the storyId to be the first chapter no. Needed
# because emails contain link to later chapters instead.
query_data = urlparse.parse_qs(self.get_chapter(0,'url'))
story_no = query_data['no'][0]
self.story.setMetadata('storyId', story_no)
self._setURL(self.READ_URL_TEMPLATE % story_no)
logger.info("updated storyId:%s"%story_no)
logger.info("updated storyUrl:%s"%self.url)
story_no = self.story.getMetadata('storyId')
# Get the URL to the author's page and find the correct story entry to
# scrape the metadata
author_url = urlparse.urljoin(self.url, soup.find('a', {'class': 'headline'})['href'])
soup = self.make_soup(self.get_request(author_url))
soup = self._customized_fetch_url(author_url)
story_no = self.story.getMetadata('storyId')
# Ignore first list_box div, it only contains the author information
for list_box in soup('div', {'class': 'list_box'})[1:]:
url = list_box.find('a', {'class': 'fictitle'})['href']
@@ -117,7 +115,7 @@ class BloodshedverseComAdapter(BaseSiteAdapter):
summary_div = list_box.find('div', {'class': 'list_summary'})
if not self.getConfig('keep_summary_html'):
summary = ''.join(summary_div(string=True))
summary = ''.join(summary_div(text=True))
else:
summary = self.utf8FromSoup(author_url, summary_div)
@@ -157,6 +155,9 @@ class BloodshedverseComAdapter(BaseSiteAdapter):
self.story.addToList('warnings', warning)
elif key == 'Chapters':
self.story.setMetadata('numChapters', int(value))
elif key == 'Words':
# Apparently only numChapters need to be an integer for
# some strange reason. Remove possible ',' characters as to
@@ -171,16 +172,16 @@
# ugly %p(am/pm) hack moved into makeDate so other sites can use it.
self.story.setMetadata('dateUpdated', date)
if self.story.getMetadataRaw('rating') == 'NC-17' and not (self.is_adult or self.getConfig('is_adult')):
if self.story.getMetadata('rating') == 'NC-17' and not (self.is_adult or self.getConfig('is_adult')):
raise exceptions.AdultCheckRequired(self.url)
def getChapterText(self, url):
soup = self.make_soup(self.get_request(url))
storytext_div = soup.find('div', {'class': 'tl'})
storytext_div = storytext_div.find('div', {'class': ''})
soup = self._customized_fetch_url(url)
storytext_div = soup.find('div', {'class': 'storytext'})
if self.getConfig('strip_text_links'):
for anchor in storytext_div('a', {'class': 'FAtxtL'}):
anchor.replaceWith(anchor.string)
navigable_string = BeautifulSoup.NavigableString(anchor.string)
anchor.replaceWith(navigable_string)
return self.utf8FromSoup(url, storytext_div)
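The storyId reset above re-reads the "no" query parameter from the first chapter's URL, because emails link to later chapters instead; a standalone sketch of that lookup (the adapter imports this via six.moves, so Python 3 names and a fabricated URL are used here):

```python
from urllib.parse import parse_qs, urlparse

# Pull the 'no' parameter back out of a chapter URL, the way the adapter
# resets storyId to the first chapter number. Example URL is fabricated.
first_chapter_url = "https://bloodshedverse.com/stories.php?go=read&no=6789"
query_data = parse_qs(urlparse(first_chapter_url).query)
story_no = query_data['no'][0]
print(story_no)  # 6789
```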


@@ -0,0 +1,330 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from bs4.element import Tag
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
# By virtue of being recent and requiring both is_adult and user/pass,
# adapter_fanficcastletvnet.py is the best choice for learning to
# write adapters--especially for sites that use the eFiction system.
# Most sites that have ".../viewstory.php?sid=123" in the story URL
# are eFiction.
# For non-eFiction sites, it can be considerably more complex, but
# this is still a good starting point.
# In general an 'adapter' needs to do these five things:
# - 'Register' correctly with the downloader
# - Site Login (if needed)
# - 'Are you adult?' check (if needed--some do one, some the other, some both)
# - Grab the chapter list
# - Grab the story meta-data (some (non-eFiction) adapters have to get it from the author page)
# - Grab the chapter texts
# Search for XXX comments--that's where things are most likely to need changing.
# This function is called by the downloader in all adapter_*.py files
# in this dir to register the adapter class. So it needs to be
# updated to reflect the class below it. That, plus getSiteDomain()
# take care of 'Registering'.
def getClass():
return BloodTiesFansComAdapter # XXX
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class BloodTiesFansComAdapter(BaseSiteAdapter): # XXX
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
# XXX Most sites don't have the /fanfic part. Replace all to remove it usually.
self._setURL('http://' + self.getSiteDomain() + '/fiction/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','btf') # XXX
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d %b %Y" # XXX
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'bloodties-fans.com' # XXX
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/fiction/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/fiction/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/fiction/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
# Furthermore, there's a couple sites now with more than
# one warning level for different ratings. And they're
# fussy about it. midnightwhispers has three: 4, 2 & 1.
# we'll try 1 first.
addurl = "&ageconsent=ok&warning=4" # XXX
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# The actual text that is used to announce you need to be an
# adult varies from site to site. Again, print data before
# the title search to troubleshoot.
# Since the warning text can change by warning level, let's
# look for the warning pass url. nfacommunity uses
# &amp;warning= -- actually, so do other sites. Must be an
# eFiction book.
# viewstory.php?sid=561&amp;warning=4
# viewstory.php?sid=561&amp;warning=1
# viewstory.php?sid=561&amp;warning=2
#print data
#m = re.search(r"'viewstory.php\?sid=1882(&amp;warning=4)'",data)
m = re.search(r"'viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/fiction/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/fiction/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
listbox = soup.find('div',{'class':'listbox'})
# <strong>Rating:</strong> M<br /> etc
labels = listbox.findAll('strong')
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next strong tag.
svalue = ""
while not isinstance(value,Tag) or value.name != 'strong':
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rating' in label:
self.story.setMetadata('rating', value)
if 'Words' in label:
value=re.sub(r"\|",r"",value)
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
catstext = [cat.string for cat in cats]
for cat in catstext:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
charstext = [char.string for char in chars]
for char in charstext:
self.story.addToList('characters',char.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
value=re.sub(r"\|",r"",value)
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
value=re.sub(r"\|",r"",value)
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
# moved outside because they changed *most*, but not *all* labels to <strong>
ships = listbox.findAll('a',href=re.compile(r'browse.php.type=class&(amp;)?type_id=2')) # crappy html: & vs &amp; in url.
shipstext = [ship.string for ship in ships]
for ship in shipstext:
self.story.addToList('ships',ship.string)
genres = listbox.findAll('a',href=re.compile(r'browse.php\?type=class&(amp;)?type_id=1')) # crappy html: & vs &amp; in url.
genrestext = [genre.string for genre in genres]
for genre in genrestext:
self.story.addToList('genre',genre.string)
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/fiction/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
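getSiteURLPattern() above builds its validation regex with re.escape so the literal "?" and "." in the path don't act as metacharacters; a standalone sketch with fabricated example URLs (nothing is fetched from the site):

```python
import re

# Rebuild the pattern the way getSiteURLPattern() does; re.escape neutralizes
# the '?' and '.' in the literal prefix, and \d+$ anchors the sid value.
domain = 'bloodties-fans.com'
pattern = re.escape("http://" + domain + "/fiction/viewstory.php?sid=") + r"\d+$"

print(bool(re.match(pattern, "http://bloodties-fans.com/fiction/viewstory.php?sid=1234")))
# True -- a bare story URL matches
print(bool(re.match(pattern, "http://bloodties-fans.com/fiction/viewstory.php?sid=1234&chapter=2")))
# False -- trailing chapter parameters are rejected by the $ anchor
```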


@@ -0,0 +1,300 @@
# -*- coding: utf-8 -*-
# Copyright 2016 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return BuffyGilesComAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class BuffyGilesComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
# XXX Most sites don't have the /efiction part. Replace all to remove it usually.
self._setURL('http://' + self.getSiteDomain() + '/efiction/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','bufg')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d/%m/%y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'buffygiles.velocitygrass.com'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/efiction/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/efiction/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&warning=5"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self._fetchUrl(url)
# Since the warning text can change by warning level, let's
# look for the warning pass url. ksarchive uses
# &amp;warning= -- actually, so do other sites. Must be an
# eFiction book.
# efiction/viewstory.php?sid=1882&amp;warning=4
# efiction/viewstory.php?sid=1654&amp;ageconsent=ok&amp;warning=5
#print data
m = re.search(r"'efiction/viewstory.php\?sid=542(&amp;warning=5)'",data)
m = re.search(r"'efiction/viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.FailedToDownload(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
pagetitle = soup.find('div',{'id':'pagetitle'})
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = pagetitle.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/efiction/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ""
while 'label' not in defaultGetattr(value,'class'):
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=3'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"efiction/viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^efiction/viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('efiction/viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
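The 'age consent' retry above pulls the warning query string out of the 'click here to continue' link and un-escapes the &amp;amp; entities before re-fetching; a standalone sketch of that extraction (the HTML snippet is a fabricated example of what an eFiction site emits):

```python
import re

# Find the warning/ageconsent suffix in the 'continue' link, then fix the
# HTML-escaped '&amp;' so it can be appended to the story URL.
html = "<a href='efiction/viewstory.php?sid=542&amp;ageconsent=ok&amp;warning=5'>continue</a>"
m = re.search(r"'efiction/viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'", html)
addurl = m.group(1).replace("&amp;", "&")
print(addurl)  # &ageconsent=ok&warning=5
```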


@@ -1,38 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2024 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
from .base_otw_adapter import BaseOTWAdapter
def getClass():
return CFAAAdapter
class CFAAAdapter(BaseOTWAdapter):
def __init__(self, config, url):
BaseOTWAdapter.__init__(self, config, url)
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','cfaa')
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.cfaarchive.org'
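The deleted CFAA adapter above shows the registration convention every adapter module follows: a module-level `getClass()` factory plus a `@staticmethod` `getSiteDomain()` the loader uses to route URLs. A stripped-down sketch of that pattern (class names here are illustrative stand-ins):

```python
class BaseSiteAdapter:
    # Stand-in for fanficfare's base class: subclasses must say
    # which domain they handle.
    @staticmethod
    def getSiteDomain():
        raise NotImplementedError

class CFAALikeAdapter(BaseSiteAdapter):
    @staticmethod  # must be @staticmethod so it works without an instance.
    def getSiteDomain():
        return 'www.cfaarchive.org'

def getClass():
    # Each adapter module exposes getClass() so the loader can map a
    # story URL's domain to the right adapter class.
    return CFAALikeAdapter
```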


@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -16,17 +16,17 @@
#
# Software: eFiction
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return ChaosSycophantHexComAdapter
@ -38,6 +38,11 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@ -45,7 +50,7 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
@ -86,7 +91,13 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# The actual text that is used to announce you need to be an
# adult varies from site to site. Again, print data before
@ -97,9 +108,11 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
pt = soup.find('div', {'id' : 'pagetitle'})
@ -116,10 +129,11 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
self.story.setMetadata('rating', rating)
# Find the chapters:
for chapter in soup.find_all('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+r"&chapter=\d+$")):
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.add_chapter(chapter,'http://'+self.host+'/'+chapter['href']+addurl)
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
@ -130,12 +144,12 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.find_all('span',{'class':'label'})
labels = soup.findAll('span',{'class':'label'})
value = labels[0].previousSibling
svalue = ""
while value != None:
@ -145,7 +159,7 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
svalue += unicode(val)
val = val.nextSibling
self.setDescription(url,svalue)
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
@ -154,22 +168,22 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
self.story.setMetadata('numWords', value.split(' -')[0])
if 'Categories' in label:
cats = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=categories'))
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=characters'))
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
@ -193,8 +207,9 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
seriessoup = self.make_soup(self.get_request(series_url))
storyas = seriessoup.find_all('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
@ -212,7 +227,7 @@ class ChaosSycophantHexComAdapter(BaseSiteAdapter):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
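The chapter discovery in this adapter hinges on one regex: match only links for this story's numbered chapters, anchored with `$` so unrelated query parameters don't slip through. A self-contained sketch of that filter without BeautifulSoup (sample hrefs are illustrative):

```python
import re

def chapter_links(hrefs, story_id):
    # Same idea as soup.find_all(href=re.compile(...)) above: keep only
    # this story's chapter links, rejecting other stories and user pages.
    pat = re.compile(r'viewstory\.php\?sid=' + re.escape(story_id)
                     + r'&chapter=\d+$')
    return [h for h in hrefs if pat.search(h)]

links = chapter_links(
    ['viewstory.php?sid=123&chapter=1',
     'viewstory.php?sid=123&chapter=2',
     'viewstory.php?sid=999&chapter=1',
     'viewuser.php?uid=7'],
    '123')
```

Because the pattern requires `&` immediately after the escaped id, `sid=1234` can never match a search for `sid=123`.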


@ -1,107 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2020 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import logging
import re
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
from fanficfare.htmlcleanup import stripHTML
from .. import exceptions as exceptions
logger = logging.getLogger(__name__)
def getClass():
return ChireadsComSiteAdapter
class ChireadsComSiteAdapter(BaseSiteAdapter):
NEW_DATE_FORMAT = '%Y/%m/%d %H:%M:%S'
OLD_DATE_FORMAT = '%m/%d/%Y %I:%M:%S %p'
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev', 'chireads')
# get storyId from url--url validation guarantees query correct
match = re.match(self.getSiteURLPattern(), url)
if not match:
raise exceptions.InvalidStoryURL(url, self.getSiteDomain(), self.getSiteExampleURLs())
story_id = match.group('id')
self.story.setMetadata('storyId', story_id)
self._setURL('https://%s/category/translatedtales/%s/' % (self.getSiteDomain(), story_id))
@staticmethod
def getSiteDomain():
return 'chireads.com'
@classmethod
def getSiteExampleURLs(cls):
return 'https://%s/category/translatedtales/story-name' % cls.getSiteDomain()
def getSiteURLPattern(self):
return r'https?://chireads\.com/category/translatedtales/(?P<id>[^/]+)(/)?'
def extractChapterUrlsAndMetadata(self):
logger.debug('URL: %s', self.url)
data = self.get_request(self.url)
soup = self.make_soup(data)
info = soup.select_one('.inform-inform-data')
self.story.setMetadata('title', stripHTML(info.h3).split(' | ')[0])
self.setCoverImage(self.url, soup.select_one('.inform-product > img')['src'])
# Unicode strings because '：' (the fullwidth colon, \xef\xbc\x9a in UTF-8) isn't ':'.
# author = stripHTML(info.h6).split(u' ')[0].replace(u'Auteur : ', '', 1)
author = stripHTML(info.h6).split('Babelcheck')[0].replace('Auteur : ', '').replace('\xc2\xa0', '')
# author = stripHTML(info.h6).split('\xa0')[0].replace(u'Auteur : ', '', 1)
self.story.setMetadata('author', author)
self.story.setMetadata('authorId', author)
## site doesn't have authorUrl links.
datestr = stripHTML(soup.select_one('.newestchapitre > div > a')['href'])[-11:-1]
date = makeDate(datestr, '%Y/%m/%d')
if date:
self.story.setMetadata('dateUpdated', date)
intro = stripHTML(info.select_one('.inform-inform-txt').span)
self.setDescription(self.url, intro)
for content in soup.find_all('div', {'id': 'content'}):
for a in content.find_all('a'):
self.add_chapter(a.get_text(), a['href'])
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self.get_request(url)
soup = self.make_soup(data)
content = soup.select_one('#content')
if None == content:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,content)
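The Chireads adapter above validates and normalizes URLs with a named capture group: `getSiteURLPattern()` captures the story slug as `(?P<id>...)`, and `__init__` reads it back with `match.group('id')`. A minimal sketch of that extraction using the same pattern:

```python
import re

# Same pattern as the adapter's getSiteURLPattern().
PATTERN = r'https?://chireads\.com/category/translatedtales/(?P<id>[^/]+)(/)?'

def story_id(url):
    # The story id is the path segment captured by the named group;
    # a non-matching URL is rejected, as InvalidStoryURL is in the adapter.
    m = re.match(PATTERN, url)
    if not m:
        raise ValueError('not a chireads story URL: %r' % url)
    return m.group('id')
```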


@ -0,0 +1,237 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return CSIForensicsComAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class CSIForensicsComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','csiforensics')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d %b %Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'csi-forensics.com'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&ageconsent=ok&warning=5&skin=elegantcsi"
else:
addurl="&skin=elegantcsi"
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# The actual text that is used to announce you need to be an
# adult varies from site to site. Again, print data before
# the title search to troubleshoot.
if "This story is rated NC-17, and therefore is not suitable for minors. If you are below the age required to view such material in your locality, please return from whence you came." in data: # XXX
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
pt = soup.find('div', {'id' : 'pagetitle'})
a = pt.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',a.string)
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Rating
rate = stripHTML(soup.find('div',{'id':'pagetitle'}))
rate = rate[rate.rindex('[')+1:rate.rindex(']')]
self.story.setMetadata('rating', rate)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
smalldiv = soup.find('div', {'class' : 'small'})
chars = smalldiv.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
metatext = stripHTML(smalldiv)
if 'Completed: Yes' in metatext:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
word=soup.find(text=re.compile("Word count:")).split(':')
self.story.setMetadata('numWords', word[1])
cats = smalldiv.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
warnings = smalldiv.findAll('a',href=re.compile(r'browse.php\?type=class(&amp;)type_id=2(&amp;)classid=\d+'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
date=soup.find('div',{'class' : 'bottom'})
pd=date.find(text=re.compile("Published:")).string.split(': ')
self.story.setMetadata('datePublished', makeDate(stripHTML(pd[1].split(' U')[0]), self.dateformat))
self.story.setMetadata('dateUpdated', makeDate(stripHTML(pd[2]), self.dateformat))
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
pub=0
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Genres' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
smalldiv.extract()
# Summary
summary = soup.find('div', {'class' : 'content'})
self.setDescription(url,summary)
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
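The CSI-Forensics adapter above pulls the rating out of the page title by slicing between the last `[` and last `]` with `rindex`, so bracketed text earlier in the title doesn't confuse it. A small sketch of that slice (function name and sample titles are illustrative):

```python
def rating_from_pagetitle(text):
    # Slice between the LAST '[' and LAST ']' pair, as the adapter does
    # with rindex, e.g. "Some Story by Author [NC-17]" -> "NC-17".
    return text[text.rindex('[') + 1:text.rindex(']')]
```

Using `rindex` rather than `index` is what makes `"A [WIP] Tale by X [PG]"` yield `PG` instead of `WIP`.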


@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -15,21 +15,11 @@
# limitations under the License.
#
from __future__ import absolute_import
from ..htmlcleanup import stripHTML
# Software: eFiction
from .base_efiction_adapter import BaseEfictionAdapter
from base_efiction_adapter import BaseEfictionAdapter
class DarkSolaceOrgAdapter(BaseEfictionAdapter):
@classmethod
def getProtocol(self):
"""
Some, but not all site now require https.
"""
return "https"
@staticmethod
def getSiteDomain():
return 'dark-solace.org'
@ -46,18 +36,6 @@ class DarkSolaceOrgAdapter(BaseEfictionAdapter):
def getDateFormat(self):
return "%B %d, %Y"
def extractChapterUrlsAndMetadata(self):
## Call super of extractChapterUrlsAndMetadata().
## base_efiction leaves the soup in self.html.
super(DarkSolaceOrgAdapter, self).extractChapterUrlsAndMetadata()
## attempt to fetch rating from title line:
## "Do You Think This Is Love? by Supernatural Beings [PG]"
r = stripHTML(self.html.find("div", {"id": "pagetitle"}))
if '[' in r and ']' in r:
self.story.setMetadata('rating',
r[r.index('[')+1:r.index(']')])
def getClass():
return DarkSolaceOrgAdapter
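The newer DarkSolace code above extends `extractChapterUrlsAndMetadata()` rather than replacing it: it calls `super()` first, then parses the rating out of the title line the base class leaves behind. A simplified sketch of that extension pattern (the base class here is a stand-in, not the real `BaseEfictionAdapter`):

```python
class BaseEfictionAdapter:
    def extractChapterUrlsAndMetadata(self):
        # Stand-in for the base class: fetch the page and keep the
        # stripped title line for subclasses to inspect.
        self.pagetitle = 'Do You Think This Is Love? by Supernatural Beings [PG]'

class DarkSolaceLike(BaseEfictionAdapter):
    def extractChapterUrlsAndMetadata(self):
        # Run the base implementation first, then layer on the
        # site-specific rating parse, guarded like the adapter's.
        super().extractChapterUrlsAndMetadata()
        r = self.pagetitle
        if '[' in r and ']' in r:
            self.rating = r[r.index('[') + 1:r.index(']')]
```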


@ -0,0 +1,300 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return DeepInMySoulNetAdapter ## XXX
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class DeepInMySoulNetAdapter(BaseSiteAdapter): # XXX
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
# XXX Most sites don't have the /fiction part. Replace all to remove it usually.
self._setURL('http://' + self.getSiteDomain() + '/fiction/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','dimsn') ## XXX
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%B %d, %Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.deepinmysoul.net' # XXX
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/fiction/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/fiction/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/fiction/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
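The credential handling in `performLogin()` above has a simple fallback rule: explicit per-call credentials win, otherwise the username and password come from the user's ini config. That rule can be sketched as a pure function (name and dict-based config are illustrative):

```python
def login_params(username, password, config):
    # Mirror performLogin(): an explicit password selects the explicit
    # credentials; otherwise fall back to the ini-config values.
    if password:
        penname, pw = username, password
    else:
        penname, pw = config.get('username'), config.get('password')
    return {'penname': penname,
            'password': pw,
            'cookiecheck': '1',
            'submit': 'Submit'}
```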
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&warning=4"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self._fetchUrl(url)
# Since the warning text can change by warning level, let's
# look for the warning pass url. ksarchive uses
# &amp;warning= -- actually, so do other sites. Must be an
# eFiction book.
# fiction/viewstory.php?sid=1882&amp;warning=4
# fiction/viewstory.php?sid=1654&amp;ageconsent=ok&amp;warning=5
#print data
#m = re.search(r"'fiction/viewstory.php\?sid=29(&amp;warning=4)'",data) # leftover site-specific test; superseded by the general search below
m = re.search(r"'fiction/viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.FailedToDownload(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# Now go hunting for all the meta data and the chapter list.
pagetitle = soup.find('div',{'id':'pagecontent'})
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = pagetitle.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/fiction/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ""
while 'label' not in defaultGetattr(value,'class'):
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=3'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"fiction/viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^fiction/viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('fiction/viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
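The adult-check retry in the adapter above works by scraping the "click here to continue" link: a regex captures the age/warning query suffix, and the HTML-escaped `&amp;` must be turned back into `&` before refetching. A self-contained sketch of that extraction (function name and sample markup are illustrative):

```python
import re

def warning_suffix(html):
    # Pull the age/warning query suffix out of the quoted 'continue'
    # link, then unescape the ampersands, as the adapter does before
    # its second fetch attempt. Returns '' when no such link exists.
    m = re.search(r"'fiction/viewstory\.php\?sid=\d+"
                  r"((?:&amp;ageconsent=ok)?&amp;warning=\d+)'", html)
    if m is None:
        return ''
    return m.group(1).replace('&amp;', '&')

suffix = warning_suffix(
    "<a href='fiction/viewstory.php?sid=1654&amp;ageconsent=ok&amp;warning=5'>")
```

The captured suffix is then appended to the story URL (`self.url + '&index=1' + addurl`) for the retry.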


@ -0,0 +1,243 @@
# -*- coding: utf-8 -*-
# Copyright 2012 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return DestinysGatewayComAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class DestinysGatewayComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','dgrfa')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%b %d %Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'www.destinysgateway.com'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&warning=4"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
m = re.search(r"'viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formating, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ""
while value and 'label' not in defaultGetattr(value,'class'):
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Genre' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
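The `self.decode` comment near the top of this adapter (Windows-1252 as a superset of iso-8859-1) describes a try-each-encoding-in-order fallback. A minimal stdlib sketch of that idea — the function name and error handling here are illustrative, not FanFicFare's actual decode API:

```python
# Try each candidate encoding in order; cp1252 ("Windows-1252") is listed
# first because many sites that declare iso-8859-1 (or even utf-8) really
# serve cp1252 bytes. Illustrative sketch only, not FanFicFare's real API.
def decode_with_fallback(raw, encodings=("cp1252", "utf-8")):
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # last resort: substitute U+FFFD rather than fail the whole download
    return raw.decode(encodings[-1], errors="replace")
```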


@@ -1,256 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2021 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import logging
import re
# py2 vs py3 transition
from ..six.moves.urllib.parse import urlparse
from .base_adapter import BaseSiteAdapter, makeDate
from fanficfare.htmlcleanup import stripHTML
from .. import exceptions as exceptions
from fanficfare.dateutils import parse_relative_date_string
logger = logging.getLogger(__name__)
def getClass():
return DeviantArtComSiteAdapter
class DeviantArtComSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev', 'dac')
self.username = 'NoneGiven'
self.password = ''
self.is_adult = False
match = re.match(self.getSiteURLPattern(), url)
if not match:
raise exceptions.InvalidStoryURL(url, self.getSiteDomain(), self.getSiteExampleURLs())
story_id = match.group('id')
author = match.group('author')
self.story.setMetadata('author', author)
self.story.setMetadata('authorId', author)
self.story.setMetadata('authorUrl', 'https://www.deviantart.com/' + author)
self._setURL(url)
@staticmethod
def getSiteDomain():
return 'www.deviantart.com'
@classmethod
def getAcceptDomains(cls):
return ['www.deviantart.com']
@classmethod
def getProtocol(cls):
return 'https'
@classmethod
def getSiteExampleURLs(cls):
return 'https://%s/<author>/art/<work-name>' % cls.getSiteDomain()
def getSiteURLPattern(self):
return r'https?://www\.deviantart\.com/(?P<author>[^/]+)/art/(?P<id>[^/]+)/?'
def performLogin(self, url):
if self.username and self.username != 'NoneGiven':
username = self.username
else:
username = self.getConfig('username')
# logger.debug("\n\nusername:(%s)\n\n"%username)
if not username:
logger.info("Login Required for URL %s" % url)
raise exceptions.FailedToLogin(url,username)
data = self.get_request_raw('https://www.deviantart.com/users/login', referer=url, usecache=False)
data = self.decode_data(data)
soup = self.make_soup(data)
params = {
'referer': 'https://www.deviantart.com/_sisu/do/signin', # soup.find('input', {'name': 'referer'})['value'],
'referer_type': soup.find('input', {'name': 'referer_type'})['value'],
'csrf_token': soup.find('input', {'name': 'csrf_token'})['value'],
'challenge': soup.find('input', {'name': 'challenge'})['value'],
'lu_token': soup.find('input', {'name': 'lu_token'})['value'],
'remember': 'on',
'username': username
}
loginUrl = 'https://' + self.getSiteDomain() + '/_sisu/do/step2'
logger.debug('Will now login to deviantArt as (%s)' % username)
result = self.post_request(loginUrl, params, usecache=False)
soup = self.make_soup(result)
if not soup.find('input', {'name': 'lu_token2'}):
logger.info("Login Failed for URL %s (no lu_token2 found)" % url)
raise exceptions.FailedToLogin(url,username)
params = {
'referer': 'https://www.deviantart.com/_sisu/do/signin', # soup.find('input', {'name': 'referer'})['value'],
'referer_type': soup.find('input', {'name': 'referer_type'})['value'],
'csrf_token': soup.find('input', {'name': 'csrf_token'})['value'],
'challenge': soup.find('input', {'name': 'challenge'})['value'],
'lu_token': soup.find('input', {'name': 'lu_token'})['value'],
'lu_token2': soup.find('input', {'name': 'lu_token2'})['value'],
'remember': 'on',
'username': ''
}
if self.password:
params['password'] = self.password
else:
params['password'] = self.getConfig('password')
# logger.debug("\n\nparams['password']:(%s)\n\n"%params['password'])
loginUrl = 'https://' + self.getSiteDomain() + '/_sisu/do/signin'
logger.debug('Will now send password to deviantArt')
result = self.post_request(loginUrl, params, usecache=False)
if 'Log In | DeviantArt' in result:
logger.error('Failed to login to deviantArt as %s' % username)
raise exceptions.FailedToLogin('https://www.deviantart.com', username)
else:
return True
def requiresLogin(self, data):
return '</a> has limited the viewing of this artwork to members of the DeviantArt community only' in data
def isLoggedIn(self, data):
return '<form id="logout-form" action="https://www.deviantart.com/users/logout" method="POST">' in data
def isWatchersOnly(self, data):
return '>Watchers-Only Deviation<' in data
def requiresMatureContentEnabled(self, data):
return (
'>This content is intended for mature audiences<' in data
or '>This deviation is intended for mature audiences<' in data
or '>This filter hides content that may be inappropriate for some viewers<' in data
or '>May contain sensitive content<' in data
or '>Log in to view<' in data
or '>This deviation has been labeled as containing themes not suitable for all deviants.<' in data
)
def extractChapterUrlsAndMetadata(self):
logger.debug('URL: %s', self.url)
data = self.get_request(self.url)
soup = self.make_soup(data)
## story can require login outright, or it can show up as
## watchers-only or mature-enabled without the same 'requires
## login' strings.
if self.requiresLogin(data) or ( not self.isLoggedIn(data) and
(self.isWatchersOnly(data) or
self.requiresMatureContentEnabled(data)) ):
if self.performLogin(self.url):
data = self.get_request(self.url, usecache=False)
soup = self.make_soup(data)
## Check watchers only and mature enabled again, separately,
## after login because they can still apply after login.
if self.isWatchersOnly(data):
raise exceptions.FailedToDownload(
'Deviation is only available for watchers. ' +
'You must watch this author before you can download it.'
)
if self.requiresMatureContentEnabled(data):
raise exceptions.FailedToDownload(
'Deviation is set as mature, you must go into your account ' +
'and enable showing of mature content.'
)
appurl = soup.select_one('meta[property="og:url"]')['content']
if appurl:
story_id = urlparse(appurl).path.lstrip('/')
else:
logger.debug("Looking for JS story id")
## after login, this is only found in a JS block. Dunno why.
## F875A309-B0DB-860E-5079-790D0FBE5668
match = re.search(r'\\"deviationUuid\\":\\"(?P<id>[A-Z0-9-]+)\\",',data)
if match:
story_id = match.group('id')
else:
raise exceptions.FailedToDownload('Failed to find Story ID.')
self.story.setMetadata('storyId', story_id)
title = soup.select_one('h1').get_text()
self.story.setMetadata('title', stripHTML(title))
## dA has no concept of status
# self.story.setMetadata('status', 'Completed')
pubdate = soup.select_one('time').get_text()
# Maybe do this better, but this works
try:
self.story.setMetadata('datePublished', makeDate(pubdate, '%b %d, %Y'))
except:
self.story.setMetadata('datePublished', parse_relative_date_string(pubdate))
# do description here if appropriate
story_tags = soup.select('a[href^="https://www.deviantart.com/tag"] span')
if story_tags is not None:
for tag in story_tags:
self.story.addToList('genre', tag.get_text())
self.add_chapter(title, self.url)
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s', url)
data = self.get_request(url)
# logger.debug(data)
soup = self.make_soup(data)
# remove comments section to avoid false matches
comments = soup.select_one('[data-hook=comments_thread]')
if comments:
comments.decompose()
# previous search not always found in some stories.
# <div id="comments"></div> inside the real containing
# div seems more common
commentsdiv = soup.select_one('div#comments')
if commentsdiv:
commentsdiv.parent.decompose()
# three different 'content' tags to look for.
# This is the current in Oct 2024
content = soup.select_one('[data-editor-viewer="1"]')
if content is None:
# older story? I can't find any of this style in Oct2024
content = soup.select_one('[data-id="rich-content-viewer"]')
if content is None:
# olderer story, but used by some older (2018) posts
content = soup.select_one('.legacy-journal')
if content is None:
raise exceptions.FailedToDownload(
'Could not find story text. Please open a bug with the URL %s' % self.url
)
return self.utf8FromSoup(url, content)
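The `datePublished` handling above tries an absolute `%b %d, %Y` parse first and falls back to `parse_relative_date_string` for strings like "3 days ago". A stdlib-only sketch of that two-step fallback — `parse_relative` is a simplified stand-in, not FanFicFare's real relative-date parser:

```python
import re
from datetime import datetime, timedelta

def parse_relative(text, now=None):
    # handles only "N <unit>s ago" forms; a simplified stand-in for
    # fanficfare.dateutils.parse_relative_date_string
    now = now or datetime.now()
    m = re.match(r"(\d+)\s+(minute|hour|day|week)s?\s+ago", text.strip())
    if not m:
        raise ValueError("unrecognized relative date: %r" % text)
    n, unit = int(m.group(1)), m.group(2)
    delta = {"minute": timedelta(minutes=n), "hour": timedelta(hours=n),
             "day": timedelta(days=n), "week": timedelta(weeks=n)}[unit]
    return now - delta

def parse_pubdate(text, now=None):
    # absolute format first, relative wording as the fallback
    try:
        return datetime.strptime(text, "%b %d, %Y")
    except ValueError:
        return parse_relative(text, now)
```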


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,16 +15,17 @@
# limitations under the License.
#
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return DokugaComAdapter
@@ -36,6 +37,11 @@ class DokugaComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@@ -74,7 +80,7 @@ class DokugaComAdapter(BaseSiteAdapter):
return "http://"+cls.getSiteDomain()+"/fanfiction/story/1234/1 http://"+cls.getSiteDomain()+"/spark/story/1234/1"
def getSiteURLPattern(self):
return r"http://"+self.getSiteDomain()+r"/(fanfiction|spark)?/story/\d+/?\d+?$"
return r"http://"+self.getSiteDomain()+"/(fanfiction|spark)?/story/\d+/?\d+?$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
@@ -95,17 +101,17 @@ class DokugaComAdapter(BaseSiteAdapter):
params['Submit'] = 'Submit'
# copy all hidden input tags to pick up appropriate tokens.
for tag in soup.find_all('input',{'type':'hidden'}):
for tag in soup.findAll('input',{'type':'hidden'}):
params[tag['name']] = tag['value']
loginUrl = 'http://' + self.getSiteDomain() + '/fanfiction'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['username']))
d = self.post_request(loginUrl, params)
d = self._postUrl(loginUrl, params)
if "Your session has expired. Please log in again." in d:
d = self.post_request(loginUrl, params)
d = self._postUrl(loginUrl, params)
if "Logout" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
@@ -123,20 +129,28 @@ class DokugaComAdapter(BaseSiteAdapter):
url = self.url
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url,soup)
data = self.get_request(url)
data = self._fetchUrl(url)
soup = self.make_soup(data)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# print data
# Now go hunting for all the meta data and the chapter list.
## Title and author
a = soup.find('div', {'align' : 'center'}).find('h3')
@@ -153,22 +167,23 @@ class DokugaComAdapter(BaseSiteAdapter):
self.story.setMetadata('title',stripHTML(a))
# Find the chapters:
chapters = soup.find('select').find_all('option')
chapters = soup.find('select').findAll('option')
if len(chapters)==1:
self.add_chapter(self.story.getMetadata('title'),'http://'+self.host+'/'+self.section+'/story/'+self.story.getMetadata('storyId')+'/1')
self.chapterUrls.append((self.story.getMetadata('title'),'http://'+self.host+'/'+self.section+'/story/'+self.story.getMetadata('storyId')+'/1'))
else:
for chapter in chapters:
# just in case there's tags, like <i> in chapter titles. /fanfiction/story/7406/1
self.add_chapter(chapter,'http://'+self.host+'/'+self.section+'/story/'+self.story.getMetadata('storyId')+'/'+chapter['value'])
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+self.section+'/story/'+self.story.getMetadata('storyId')+'/'+chapter['value']))
self.story.setMetadata('numChapters',len(self.chapterUrls))
asoup = self.make_soup(self.get_request(alink))
asoup = self.make_soup(self._fetchUrl(alink))
if 'fanfiction' in self.section:
asoup=asoup.find('div', {'id' : 'cb_tabid_52'}).find('div')
#grab the rest of the metadata from the author's page
for div in asoup.find_all('div'):
for div in asoup.findAll('div'):
nav=div.find('a', href=re.compile(r'/fanfiction/story/'+self.story.getMetadata('storyId')+"/1$"))
if nav != None:
break
@@ -208,7 +223,7 @@ class DokugaComAdapter(BaseSiteAdapter):
else:
asoup=asoup.find('div', {'id' : 'maincol'}).find('div', {'class' : 'padding'})
for div in asoup.find_all('div'):
for div in asoup.findAll('div'):
nav=div.find('a', href=re.compile(r'/spark/story/'+self.story.getMetadata('storyId')+"/1$"))
if nav != None:
break
@@ -252,7 +267,7 @@ class DokugaComAdapter(BaseSiteAdapter):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'chtext'})
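The login step in this adapter copies every hidden `<input>` tag into the POST params to pick up CSRF-style session tokens. The adapter uses BeautifulSoup for this; the sketch below shows the same idea with only stdlib `html.parser`, so the class and function names are illustrative, not FanFicFare's:

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""
    def __init__(self):
        super().__init__()
        self.params = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.params[a["name"]] = a.get("value", "")

def hidden_params(html):
    # returns a dict ready to merge into the login POST params
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.params
```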


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2012 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2012 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -16,17 +16,17 @@
#
# Software: eFiction
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return DracoAndGinnyComAdapter
@@ -38,6 +38,11 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@@ -93,7 +98,7 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self.post_request(loginUrl, params)
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
@@ -120,12 +125,18 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self.get_request(url)
data = self._fetchUrl(url)
m = re.search(r"'viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
@@ -139,16 +150,24 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
@@ -161,10 +180,11 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.find_all('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+r"&chapter=\d+$")):
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.add_chapter(chapter,'http://'+self.host+'/'+chapter['href']+addurl)
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formating, so it's a little ugly.
@@ -181,13 +201,13 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
self.setDescription(url,content.find('blockquote'))
for genre in content.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=1')):
for genre in content.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1')):
self.story.addToList('genre',genre.string)
for warning in content.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=2')):
for warning in content.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2')):
self.story.addToList('warnings',warning.string)
labels = content.find_all('b')
labels = content.findAll('b')
for labelspan in labels:
value = labelspan.nextSibling
@@ -208,22 +228,22 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
self.story.setMetadata('rating', value)
if 'Categories' in label:
cats = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=categories'))
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=characters'))
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
@@ -245,9 +265,10 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
seriessoup = self.make_soup(self.get_request(series_url))
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
# can't use ^viewstory...$ in case of higher rated stories with javascript href.
storyas = seriessoup.find_all('a', href=re.compile(r'viewstory.php\?sid=\d+'))
storyas = seriessoup.findAll('a', href=re.compile(r'viewstory.php\?sid=\d+'))
i=1
for a in storyas:
# skip 'report this' and 'TOC' links
@@ -267,7 +288,7 @@ class DracoAndGinnyComAdapter(BaseSiteAdapter):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'class' : 'listbox'})
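Several of the eFiction adapters in this diff handle the adult check by pulling the site-specific warning number out of the "click here to continue" link and re-requesting with it appended, after un-escaping `&amp;`. A sketch of that extraction step, assuming the same link shape the adapters match (the function name is illustrative):

```python
import re

def extract_warning_suffix(page):
    # look for the quoted 'viewstory.php?sid=NNN&amp;...&amp;warning=N' link
    m = re.search(
        r"'viewstory\.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",
        page)
    if not m:
        return None
    # the link is HTML-escaped in the page source; un-escape &amp;
    # before appending the suffix to the real request URL
    return m.group(1).replace("&amp;", "&")
```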


@@ -0,0 +1,311 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from bs4.element import Tag
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return DramioneOrgAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class DramioneOrgAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["utf8",
"Windows-1252",]
# 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','drmn')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d %B %Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'dramione.org'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&warning=5"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self._fetchUrl(url)
# The actual text that is used to announce you need to be an
# adult varies from site to site. Again, print data before
# the title search to troubleshoot.
if "Stories that are suitable for ages 16 and older" in data:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Use banner as cover if found
coverurl = ''
img = soup.find('img',{'class':'banner'})
if img:
coverurl = img['src']
#print "Cover: "+coverurl
a = soup.find(text="This story has a banner; click to view.")
if a:
#print "A: "+ ', '.join("(%s, %s)" %tup for tup in a.parent.attrs)
coverurl = a.parent['href']
#print "Cover: "+coverurl
if coverurl:
self.setCoverImage(url,coverurl)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formating, so it's a little ugly.
genres=soup.findAll('a', {'class' : "tag-1"})
for genre in genres:
self.story.addToList('genre',genre.string)
warnings=soup.findAll('a', {'class' : "tag-2"})
for warning in warnings:
self.story.addToList('warnings',warning.string)
themes=soup.findAll('a', {'class' : "tag-3"})
for theme in themes:
self.story.addToList('themes',theme.string)
hermiones=soup.findAll('a', {'class' : "tag-4"})
for hermione in hermiones:
self.story.addToList('hermiones',hermione.string)
dracos=soup.findAll('a', {'class' : "tag-5"})
for draco in dracos:
self.story.addToList('dracos',draco.string)
timelines=soup.findAll('a', {'class' : "tag-6"})
for timeline in timelines:
self.story.addToList('timeline',timeline.string)
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
listbox = soup.find('div',{'class':'listbox'})
# <strong>Rated:</strong> M<br /> etc
labels = listbox.findAll('strong')
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next strong tag.
svalue = ""
while not isinstance(value,Tag) or value.name != 'strong':
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Read' in label:
self.story.setMetadata('read', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
value=re.sub(r"(\d+)(st|nd|rd|th)",r"\1",value)
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
value=re.sub(r"(\d+)(st|nd|rd|th)",r"\1",value)
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
try:
self.story.setMetadata('reviews',
stripHTML(soup.find('h2',{'id':'pagetitle'}).
findAll('a', href=re.compile(r'^reviews.php'))[1]))
except:
# I find it hard to care if the reviews parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
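The Published/Updated handling above strips English ordinal suffixes before handing the string to makeDate. A standalone stdlib sketch of that step (the function name, the "%B %d, %Y" default, and the sample date are illustrative assumptions, not taken from any one site):

```python
import re
from datetime import datetime

def parse_efiction_date(text, dateformat="%B %d, %Y"):
    # strptime has no directive for ordinal suffixes ("June 3rd, 2015"),
    # so drop "st"/"nd"/"rd"/"th" after digits first, as the adapter does.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text.strip())
    return datetime.strptime(cleaned, dateformat)

parsed = parse_efiction_date("June 3rd, 2015")
```

The same re.sub appears verbatim before both the datePublished and dateUpdated calls; only the dateformat string varies per site.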


@@ -0,0 +1,223 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return EfictionEstelielDeAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class EfictionEstelielDeAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','eesd')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%B %d, %Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'efiction.esteliel.de'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# Now go hunting for all the meta data and the chapter list.
## Title and author
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
pagetitle = soup.find('div',{'id':'pagetitle'})
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = pagetitle.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there are tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their metadata
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
list = soup.find('div', {'class':'listbox'})
labelspan=list.find('span',{'class':'label'})
value = labelspan.nextSibling
label = labelspan.string
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
labels = list.findAll('b')
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ""
while 'Rating' not in unicode(value):
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rating' in label:
self.story.setMetadata('rating', value)
if 'Words' in label:
self.story.setMetadata('numWords', value)
if 'Category' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
if list.find('a', href=re.compile(r"series.php")) != None:
for series in list.findAll('a', href=re.compile(r"series.php\?seriesid=\d+")):
# Find Series name from series URL.
series_url = 'http://'+self.host+'/'+series['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
name=seriessoup.find('div', {'id' : 'pagetitle'})
name.find('a').extract()
self.setSeries(name.text.split(' by[')[0], i)
self.story.setMetadata('seriesUrl',series_url)
i=0
break
i+=1
if i == 0:
break
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
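The `self.decode` list set in `__init__` above tries Windows-1252 before utf-8 because many sites mislabel their encoding. A minimal sketch of that ordered-fallback idea (the function name and the permissive last resort are assumptions, not FanFicFare's actual implementation):

```python
def decode_page(raw_bytes, encodings=("cp1252", "utf-8")):
    # Try each candidate in order; cp1252 is a superset of iso-8859-1,
    # and most pages claiming iso-8859-1 are really windows-1252.
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: salvage what we can instead of failing the download.
    return raw_bytes.decode(encodings[-1], errors="replace")

text = decode_page(b"caf\xe9")  # 0xE9 is valid cp1252, invalid as a lone utf-8 byte
```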


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2012 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2012 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -16,16 +16,17 @@
#
# Software: eFiction
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return EFPFanFicNet
@@ -37,6 +38,11 @@ class EFPFanFicNet(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@@ -46,7 +52,7 @@ class EFPFanFicNet(BaseSiteAdapter):
# normalized story URL.
self._setURL('https://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','efp')
@@ -58,14 +64,14 @@ class EFPFanFicNet(BaseSiteAdapter):
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'efpfanfic.net'
return 'www.efpfanfic.net'
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return r"https?://(www\.)?"+re.escape(self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
@@ -87,11 +93,11 @@ class EFPFanFicNet(BaseSiteAdapter):
params['cookiecheck'] = '1'
params['submit'] = 'Invia'
loginUrl = 'https://' + self.getSiteDomain() + '/user.php?sid='+self.story.getMetadata('storyId')
loginUrl = 'http://' + self.getSiteDomain() + '/user.php?sid='+self.story.getMetadata('storyId')
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self.post_request(loginUrl, params)
d = self._fetchUrl(loginUrl, params)
if '<a class="menu" href="newaccount.php">' in d : # register for new account link
logger.info("Failed to login to URL %s as %s" % (loginUrl,
@@ -107,19 +113,27 @@ class EFPFanFicNet(BaseSiteAdapter):
url = self.url
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self.get_request(url)
data = self._fetchUrl(url)
# if "Access denied. This story has not been validated by the adminstrators of this site." in data:
# raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r'^viewstory\.php\?sid='+self.story.getMetadata('storyId')+"$"))
@@ -128,28 +142,29 @@ class EFPFanFicNet(BaseSiteAdapter):
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','https://'+self.host+'/'+a['href'])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Find the chapter selector
select = soup.find('select', { 'name' : 'sid' } )
if select is None:
# no selector found, so it's a one-chapter story.
self.add_chapter(self.story.getMetadata('title'),url)
# no selector found, so it's a one-chapter story.
self.chapterUrls.append((self.story.getMetadata('title'),url))
else:
allOptions = select.find_all('option', {'value' : re.compile(r'viewstory')})
allOptions = select.findAll('option', {'value' : re.compile(r'viewstory')})
for o in allOptions:
url = u'https://%s/%s' % ( self.getSiteDomain(),
url = u'http://%s/%s' % ( self.getSiteDomain(),
o['value'])
# just in case there are tags, like <i> in chapter titles.
title = stripHTML(o)
self.add_chapter(title,url)
self.chapterUrls.append((title,url))
self.story.setMetadata('numChapters',len(self.chapterUrls))
self.story.setMetadata('language','Italian')
# normalize story URL to first chapter if later chapter URL was given:
url = self.get_chapter(0,'url').replace('&i=1','')
url = self.chapterUrls[0][1].replace('&i=1','')
logger.debug("Normalizing to URL: "+url)
self._setURL(url)
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
@@ -169,15 +184,15 @@ class EFPFanFicNet(BaseSiteAdapter):
# no storya, but do have authsoup--we're looping on author pages.
if authsoup != None:
# last author link with offset should be the 'next' link.
authurl = u'https://%s/%s' % ( self.getSiteDomain(),
authsoup.find_all('a',href=re.compile(r'viewuser\.php\?uid=\d+&catid=&offset='))[-1]['href'] )
authurl = u'http://%s/%s' % ( self.getSiteDomain(),
authsoup.findAll('a',href=re.compile(r'viewuser\.php\?uid=\d+&catid=&offset='))[-1]['href'] )
# Need author page for most of the metadata.
logger.debug("fetching author page: (%s)"%authurl)
authsoup = self.make_soup(self.get_request(authurl))
authsoup = self.make_soup(self._fetchUrl(authurl))
#print("authsoup:%s"%authsoup)
storyas = authsoup.find_all('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+r'&i=1$'))
storyas = authsoup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+r'&i=1$'))
for storya in storyas:
#print("======storya:%s"%storya)
storyblock = storya.findParent('div',{'class':'storybloc'})
@@ -194,7 +209,7 @@ class EFPFanFicNet(BaseSiteAdapter):
# Tipo di coppia: Het | Personaggi: Akasuna no Sasori , Akatsuki, Nuovo Personaggio | Note: OOC | Avvertimenti: Tematiche delicate<br />
# Categoria: <a href="categories.php?catid=1&amp;parentcatid=1">Anime & Manga</a> > <a href="categories.php?catid=108&amp;parentcatid=108">Naruto</a> | Contesto: Naruto Shippuuden | Leggi le <a href="reviews.php?sid=1331275&amp;a=">3</a> recensioni</div>
cats = noteblock.find_all('a',href=re.compile(r'browse.php\?type=categories'))
cats = noteblock.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
@@ -258,11 +273,12 @@ class EFPFanFicNet(BaseSiteAdapter):
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?ssid=\d+&i=1"))
series_name = a.string
series_url = 'https://'+self.host+'/'+a['href']
series_url = 'http://'+self.host+'/'+a['href']
seriessoup = self.make_soup(self.get_request(series_url))
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
# can't use ^viewstory...$ in case of higher rated stories with javascript href.
storyas = seriessoup.find_all('a', href=re.compile(r'viewstory.php\?sid=\d+&i=1'))
storyas = seriessoup.findAll('a', href=re.compile(r'viewstory.php\?sid=\d+&i=1'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId'))+'&i=1':
@@ -280,7 +296,7 @@ class EFPFanFicNet(BaseSiteAdapter):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'class' : 'storia'})
@@ -288,11 +304,11 @@ class EFPFanFicNet(BaseSiteAdapter):
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
# remove any header and 'o:p' tags.
for tag in div.find_all("head") + div.find_all("o:p"):
for tag in div.findAll("head") + div.findAll("o:p"):
tag.extract()
# change any html and body tags to div.
for tag in div.find_all("html") + div.find_all("body"):
for tag in div.findAll("html") + div.findAll("body"):
tag.name='div'
# remove extra bogus doctype.
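The getSiteURLPattern change in this file relies on re.escape to protect the literal URL prefix. A small sketch of why (the domain here is hypothetical; adapters substitute getSiteDomain()):

```python
import re

domain = "example.efiction.site"  # hypothetical domain, for illustration only
# Without re.escape, the '?' and '.' in the prefix would be regex
# metacharacters; only the story id should be allowed to vary.
pattern = re.escape("http://" + domain + "/viewstory.php?sid=") + r"\d+$"

ok = re.match(pattern, "http://example.efiction.site/viewstory.php?sid=1234")
bad = re.match(pattern, "http://example.efiction.site/viewstory.php?sid=1234&chapter=2")
```

The trailing `\d+$` is what rejects chapter URLs, which is why the EFP adapter normalizes a later-chapter URL back to the first chapter before setting it.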


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -16,17 +16,17 @@
#
# Software: eFiction
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return ErosnSapphoSycophantHexComAdapter
@@ -38,6 +38,11 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@@ -45,7 +50,7 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
@@ -86,7 +91,13 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
m = re.search(r"'viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
@@ -100,16 +111,24 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
pt = soup.find('div', {'id' : 'pagetitle'})
@@ -126,10 +145,11 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
self.story.setMetadata('rating', rating)
# Find the chapters:
for chapter in soup.find_all('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+r"&chapter=\d+$")):
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there are tags, like <i> in chapter titles.
self.add_chapter(chapter,'http://'+self.host+'/'+chapter['href']+addurl)
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their metadata
# formatting, so it's a little ugly.
@@ -140,12 +160,12 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.find_all('span',{'class':'label'})
labels = soup.findAll('span',{'class':'label'})
value = labels[0].previousSibling
svalue = ""
while value != None:
@@ -155,7 +175,7 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
svalue += unicode(val)
val = val.nextSibling
self.setDescription(url,svalue)
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
@@ -164,22 +184,22 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
self.story.setMetadata('numWords', value.split(' -')[0])
if 'Categories' in label:
cats = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=categories'))
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Characters' in label:
chars = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=characters'))
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Genre' in label:
genres = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=1'))
for genre in genres:
self.story.addToList('genre',genre.string)
if 'Warnings' in label:
warnings = labelspan.parent.find_all('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2'))
for warning in warnings:
self.story.addToList('warnings',warning.string)
@@ -203,8 +223,9 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
seriessoup = self.make_soup(self.get_request(series_url))
storyas = seriessoup.find_all('a', href=re.compile(r'viewstory.php\?sid=\d+'))
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'viewstory.php\?sid=\d+'))
i=1
for a in storyas:
# skip 'report this' and 'TOC' links
@@ -224,7 +245,7 @@ class ErosnSapphoSycophantHexComAdapter(BaseSiteAdapter):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
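The chapter loops in these adapters match anchors whose href carries the current sid plus a `&chapter=N` suffix. A BeautifulSoup-free sketch of the same href pattern over a made-up index fragment (the page text and host are hypothetical):

```python
import re

story_id = "1234"
page = (  # hypothetical eFiction chapter index fragment
    '<a href="viewstory.php?sid=1234&chapter=1">Chapter 1</a>\n'
    '<a href="viewstory.php?sid=1234&chapter=2">Chapter 2</a>\n'
    '<a href="viewstory.php?sid=9999&chapter=1">Another story</a>\n'
    '<a href="viewstory.php?sid=1234">Index link, no chapter</a>\n'
)

# Same shape as the adapter's regex: sid must match, chapter must be digits.
chapter_href = re.compile(r'href="(viewstory\.php\?sid=%s&chapter=\d+)"' % story_id)
chapter_urls = ["http://example.efiction.site/" + href
                for href in chapter_href.findall(page)]
```

Links for other stories (different sid) and the bare story link (no `&chapter=`) both fall through, which is what keeps numChapters honest.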


@@ -1,265 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2018 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
####################################################################################################
### Adapted by GComyn - November 26, 2016
###
####################################################################################################
from __future__ import absolute_import
from __future__ import unicode_literals
import logging
logger = logging.getLogger(__name__)
import re
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
####################################################################################################
def getClass():
return FanficAuthorsNetAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class FanficAuthorsNetAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url
self.story.setMetadata('storyId',self.parsedUrl.path.split('/',)[1])
#Setting the 'Zone' for each "Site"
self.zone = self.parsedUrl.netloc.replace('.fanficauthors.net','')
# site change .nsns to -nsns
self.zone = self.zone.replace('.nsns','-nsns')
# normalized story URL.
self._setURL('https://{0}.{1}/{2}/'.format(
self.zone, self.getBaseDomain(), self.story.getMetadata('storyId')))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ffa')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d %b %y"
################################################################################################
def getBaseDomain(self):
''' Added because fanficauthors.net does send you to www.fanficauthors.net when
you go to it '''
return 'fanficauthors.net'
################################################################################################
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
return 'www.fanficauthors.net'
################################################################################################
@classmethod
def getAcceptDomains(cls):
# need both .nsns(old) and -nsns(new) because it's a domain
# change, not just URL change.
return ['aaran-st-vines.nsns.fanficauthors.net',
'aaran-st-vines-nsns.fanficauthors.net',
'abraxan.fanficauthors.net',
'bobmin.fanficauthors.net',
'canoncansodoff.fanficauthors.net',
'chemprof.fanficauthors.net',
'copperbadge.fanficauthors.net',
'crys.fanficauthors.net',
'deluded-musings.fanficauthors.net',
'draco664.fanficauthors.net',
'fp.fanficauthors.net',
'frenchsession.fanficauthors.net',
'ishtar.fanficauthors.net',
'jbern.fanficauthors.net',
'jeconais.fanficauthors.net',
'kinsfire.fanficauthors.net',
'kokopelli.nsns.fanficauthors.net',
'kokopelli-nsns.fanficauthors.net',
'ladya.nsns.fanficauthors.net',
'ladya-nsns.fanficauthors.net',
'lorddwar.fanficauthors.net',
'mrintel.nsns.fanficauthors.net',
'mrintel-nsns.fanficauthors.net',
'musings-of-apathy.fanficauthors.net',
'ruskbyte.fanficauthors.net',
'seelvor.fanficauthors.net',
'tenhawk.fanficauthors.net',
'viridian.fanficauthors.net',
'whydoyouneedtoknow.fanficauthors.net']
################################################################################################
@classmethod
def getSiteExampleURLs(self):
return ("https://aaran-st-vines-nsns.fanficauthors.net/A_Story_Name/ "
+ "https://abraxan.fanficauthors.net/A_Story_Name/ "
+ "https://bobmin.fanficauthors.net/A_Story_Name/ "
+ "https://canoncansodoff.fanficauthors.net/A_Story_Name/ "
+ "https://chemprof.fanficauthors.net/A_Story_Name/ "
+ "https://copperbadge.fanficauthors.net/A_Story_Name/ "
+ "https://crys.fanficauthors.net/A_Story_Name/ "
+ "https://deluded-musings.fanficauthors.net/A_Story_Name/ "
+ "https://draco664.fanficauthors.net/A_Story_Name/ "
+ "https://fp.fanficauthors.net/A_Story_Name/ "
+ "https://frenchsession.fanficauthors.net/A_Story_Name/ "
+ "https://ishtar.fanficauthors.net/A_Story_Name/ "
+ "https://jbern.fanficauthors.net/A_Story_Name/ "
+ "https://jeconais.fanficauthors.net/A_Story_Name/ "
+ "https://kinsfire.fanficauthors.net/A_Story_Name/ "
+ "https://kokopelli-nsns.fanficauthors.net/A_Story_Name/ "
+ "https://ladya-nsns.fanficauthors.net/A_Story_Name/ "
+ "https://lorddwar.fanficauthors.net/A_Story_Name/ "
+ "https://mrintel-nsns.fanficauthors.net/A_Story_Name/ "
+ "https://musings-of-apathy.fanficauthors.net/A_Story_Name/ "
+ "https://ruskbyte.fanficauthors.net/A_Story_Name/ "
+ "https://seelvor.fanficauthors.net/A_Story_Name/ "
+ "https://tenhawk.fanficauthors.net/A_Story_Name/ "
+ "https://viridian.fanficauthors.net/A_Story_Name/ "
+ "https://whydoyouneedtoknow.fanficauthors.net/A_Story_Name/ ")
################################################################################################
def getSiteURLPattern(self):
## .nsns kept here to match both . and -
return r'https?://(aaran-st-vines.nsns|abraxan|bobmin|canoncansodoff|chemprof|copperbadge|crys|deluded-musings|draco664|fp|frenchsession|ishtar|jbern|jeconais|kinsfire|kokopelli.nsns|ladya.nsns|lorddwar|mrintel.nsns|musings-of-apathy|ruskbyte|seelvor|tenhawk|viridian|whydoyouneedtoknow)\.fanficauthors\.net/([a-zA-Z0-9_]+)/'
@classmethod
def get_section_url(cls,url):
## only changing .nsns to -nsns and only when part of the
## domain.
url = url.replace('.nsns.fanficauthors.net','-nsns.fanficauthors.net')
return url
################################################################################################
def doExtractChapterUrlsAndMetadata(self, get_cover=True):
url = self.url
logger.debug("URL: "+url)
soup = self.make_soup(self.get_request(url+'index/'))
# Find authorid and URL.
# There is no place where the author's name is listed,
# except for in the image at the top of the page. We have to
# work with the url entered to get the Author's Name
a = self.zone.split('.')[0]
self.story.setMetadata('authorId',a)
a = a.replace('-',' ').title()
self.story.setMetadata('author',a)
self.story.setMetadata('authorUrl','https://{0}/'.format(self.parsedUrl.netloc))
## Title
a = soup.find('h2')
self.story.setMetadata('title',stripHTML(a))
# Find the chapters:
# The published and update dates are with the chapter links...
# so we have to get them from there.
chapters = soup.find_all('a', href=re.compile('/'+self.story.getMetadata(
'storyId')+'/([a-zA-Z0-9_]+)/'))
# Here we are getting the published date. It is the date the first chapter was "updated"
updatedate = stripHTML(unicode(chapters[0].parent)).split('Uploaded on:')[1].strip()
updatedate = updatedate.replace('st ',' ').replace('nd ',' ').replace(
'rd ',' ').replace('th ',' ')
self.story.setMetadata('datePublished', makeDate(updatedate, self.dateformat))
# Status: Completed - Rating: Adult Only - Chapters: 19 - Word count: 323,805 - Genre: Post-OotP
# Status: In progress - Rating: Adult Only - Chapters: 42 - Word count: 395,991 - Genre: Action/Adventure, Angst, Drama, Romance, Tragedy
# Status: Completed - Rating: Everyone - Chapters: 1 - Word count: 876 - Genre: Sorrow
# Status: In progress - Rating: Mature - Chapters: 39 - Word count: 314,544 - Genre: Drama - Romance
div = soup.find('div',{'class':'well'})
# logger.debug(div.find_all('p')[1])
metaline = re.sub(r' +',' ',stripHTML(div.find_all('p')[1]).replace('\n',' '))
# logger.debug(metaline)
match = re.match(r"Status: (?P<status>.+?) - Rating: (?P<rating>.+?) - Chapters: [0-9,]+ - Word count: (?P<numWords>[0-9,]+?) - Genre: ?(?P<genre>.*?)$",metaline)
if match:
# logger.debug(match.group('status'))
# logger.debug(match.group('rating'))
# logger.debug(match.group('numWords'))
# logger.debug(match.group('genre'))
if "Completed" in match.group('status'):
self.story.setMetadata('status',"Completed")
else:
self.story.setMetadata('status',"In-Progress")
self.story.setMetadata('rating',match.group('rating'))
self.story.setMetadata('numWords',match.group('numWords'))
self.story.extendList('genre',re.split(r'[;,-]',match.group('genre')))
else:
raise exceptions.FailedToDownload("Error parsing metadata: '{0}'".format(url))
summary = div.find('blockquote').get_text()
self.setDescription(url,summary)
## Raising AdultCheckRequired after the chapters have been
## collected would duplicate the chapter list on retry. Genre
## would duplicate too, but it de-dups automatically.
if( self.story.getMetadataRaw('rating') in ['Mature','Adult Only']
and not (self.is_adult or self.getConfig("is_adult")) ):
raise exceptions.AdultCheckRequired(self.url)
for i, chapter in enumerate(chapters):
if '/reviews/' not in chapter['href']:
# here we get the update date. We will update this for every chapter,
# so we get the last one.
updatedate = stripHTML(unicode(chapters[i].parent)).split(
'Uploaded on:')[1].strip()
updatedate = updatedate.replace('st ',' ').replace('nd ',' ').replace(
'rd ',' ').replace('th ',' ')
self.story.setMetadata('dateUpdated', makeDate(updatedate, self.dateformat))
if '::' in stripHTML(unicode(chapter)):
chapter_title = stripHTML(unicode(chapter).split('::')[1])
else:
chapter_title = stripHTML(unicode(chapter))
chapter_Url = self.story.getMetadata('authorUrl')+chapter['href'][1:]
self.add_chapter(chapter_title, chapter_Url)
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
if( self.story.getMetadataRaw('rating') in ['Mature','Adult Only'] and
(self.is_adult or self.getConfig("is_adult")) ):
addurl = "?bypass=1"
else:
addurl=""
soup = self.make_soup(self.get_request(url+addurl))
story = soup.find('div',{'class':'story'})
if story == None:
raise exceptions.FailedToDownload(
"Error downloading Chapter: '{0}'! Missing required element!".format(url))
# There are a lot of extraneous tags within the story division, so remove them.
for tag in story.find_all('ul',{'class':'pager'}) + story.find_all(
'div',{'class':'alert'}) + story.find_all('div', {'class':'btn-group'}):
tag.extract()
return self.utf8FromSoup(url,story)
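The chained `.replace('st ',' ')` calls above strip English ordinal suffixes before date parsing, but they also mangle any word ending in those letters (e.g. `'August '` contains `'st '`). A regex anchored to a preceding digit is a safer sketch of the same idea; `parse_ordinal_date` and the `%d %B %Y` format are illustrative, not part of the adapter:

```python
import re
from datetime import datetime

def parse_ordinal_date(text, fmt="%d %B %Y"):
    """Parse dates like '1st January 2015' by dropping the ordinal
    suffix only when it directly follows a digit, so month names
    such as 'August' are left untouched."""
    cleaned = re.sub(r'(?<=\d)(st|nd|rd|th)\b', '', text)
    return datetime.strptime(cleaned, fmt)
```

With this, `'21st August 2020'` parses cleanly, whereas the plain replace chain would first turn it into `'21 Augu 2020'` and fail in `strptime`.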


@ -0,0 +1,321 @@
# -*- coding: utf-8 -*-
# Copyright 2014 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
# In general an 'adapter' needs to do these six things:
# - 'Register' correctly with the downloader
# - Site Login (if needed)
# - 'Are you adult?' check (if needed--some do one, some the other, some both)
# - Grab the chapter list
# - Grab the story meta-data (some (non-eFiction) adapters have to get it from the author page)
# - Grab the chapter texts
# Search for XXX comments--that's where things are most likely to need changing.
# This function is called by the downloader in all adapter_*.py files
# in this dir to register the adapter class. So it needs to be
# updated to reflect the class below it. That, plus getSiteDomain()
# take care of 'Registering'.
def getClass():
return FanficCastleTVNetAdapter # XXX
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class FanficCastleTVNetAdapter(BaseSiteAdapter): # XXX
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
# XXX Most sites don't have the /fanfic part. Replace all to remove it usually.
self._setURL('http://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','csltv') # XXX
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%b %d, %Y" # XXX
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Include 'www' here if the site uses it.
return 'fanfic.castletv.net' # XXX
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d:
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&ageconsent=ok&warning=3"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self._fetchUrl(url)
m = re.search(r"'viewstory.php\?sid=\d+((?:&amp;ageconsent=ok)?&amp;warning=\d+)'",data)
if m != None:
if self.is_adult or self.getConfig("is_adult"):
# We tried the default and still got a warning, so
# let's pull the warning number from the 'continue'
# link and reload data.
addurl = m.group(1)
# correct stupid &amp; error in url.
addurl = addurl.replace("&amp;","&")
url = self.url+'&index=1'+addurl
logger.debug("URL 2nd try: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
else:
raise exceptions.AdultCheckRequired(self.url)
# 'adminstrators' (sic) matches the site's own misspelling.
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
pagetitle = soup.find('div',{'id':'pagetitle'})
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
a = pagetitle.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.string)
# Reviews
reviewdata = soup.find('div', {'id' : 'sort'})
a = reviewdata.findAll('a', href=re.compile(r'reviews.php\?type=ST&(amp;)?item='+self.story.getMetadata('storyId')+"$"))[1] # second one.
self.story.setMetadata('reviews',stripHTML(a))
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+r"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their meta data
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
labels = soup.findAll('span',{'class':'label'})
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Summary' in label:
## Everything until the next span class='label'
svalue = ""
while value and 'label' not in defaultGetattr(value,'class'):
svalue += unicode(value)
value = value.nextSibling
self.setDescription(url,svalue)
#self.story.setMetadata('description',stripHTML(svalue))
if 'Rated' in label:
self.story.setMetadata('rating', value)
if 'Word count' in label:
self.story.setMetadata('numWords', value)
if 'Categories' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
catstext = [cat.string for cat in cats]
for cat in catstext:
self.story.addToList('category',cat)
if 'Characters' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
charstext = [char.string for char in chars]
for char in charstext:
self.story.addToList('characters',char)
## Not all sites use Genre, but there's no harm to
## leaving it in. Check to make sure the type_id number
## is correct, though--it's site specific.
if 'Genre' in label:
genres = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2')) # XXX
genrestext = [genre.string for genre in genres]
self.genre = ', '.join(genrestext)
for genre in genrestext:
self.story.addToList('genre',genre)
## Not all sites use Warnings, but there's no harm to
## leaving it in. Check to make sure the type_id number
## is correct, though--it's site specific.
if 'Warnings' in label:
warnings = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=class&type_id=2')) # XXX
warningstext = [warning.string for warning in warnings]
self.warning = ', '.join(warningstext)
for warning in warningstext:
self.story.addToList('warnings',warning)
if 'Completed' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if 'Published' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Updated' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)
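The label walk in `extractChapterUrlsAndMetadata` above (collecting everything after a `<span class="label">` until the next label span) is the standard eFiction scrape. A minimal standalone sketch of that sibling walk with BeautifulSoup, using made-up sample HTML rather than a real story page:

```python
from bs4 import BeautifulSoup

SAMPLE = '''<div>
<span class="label">Rated:</span> NC-17<br/>
<span class="label">Word count:</span> 876<br/>
<span class="label">Summary:</span> A short <i>tale</i>.
<span class="label">Completed:</span> Yes<br/>
</div>'''

def scrape_labels(html):
    """Collect each label's value: every sibling node between one
    <span class="label"> and the next belongs to that label."""
    soup = BeautifulSoup(html, 'html.parser')
    meta = {}
    for span in soup.find_all('span', class_='label'):
        key = span.get_text().strip(': ')
        parts = []
        node = span.next_sibling
        while node is not None and not (
                getattr(node, 'name', None) == 'span'
                and 'label' in node.get('class', [])):
            parts.append(node if isinstance(node, str) else node.get_text())
            node = node.next_sibling
        meta[key] = ''.join(parts).strip()
    return meta
```

The real adapter additionally special-cases `Summary` to keep its HTML; this sketch flattens everything to text.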


@ -0,0 +1,183 @@
# coding=utf-8
import re
import urllib2
import urlparse
from base_adapter import BaseSiteAdapter, makeDate
from .. import exceptions
_SOURCE_CODE_ENCODING = 'utf-8'
def getClass():
return FanficHuAdapter
def _get_query_data(url):
components = urlparse.urlparse(url)
query_data = urlparse.parse_qs(components.query)
return dict((key, data[0]) for key, data in query_data.items())
class FanficHuAdapter(BaseSiteAdapter):
SITE_ABBREVIATION = 'ffh'
SITE_DOMAIN = 'fanfic.hu'
SITE_LANGUAGE = 'Hungarian'
BASE_URL = 'http://' + SITE_DOMAIN + '/merengo/'
VIEW_STORY_URL_TEMPLATE = BASE_URL + 'viewstory.php?sid=%s'
DATE_FORMAT = '%m/%d/%Y'
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
query_data = urlparse.parse_qs(self.parsedUrl.query)
story_id = query_data['sid'][0]
self.story.setMetadata('storyId', story_id)
self._setURL(self.VIEW_STORY_URL_TEMPLATE % story_id)
self.story.setMetadata('siteabbrev', self.SITE_ABBREVIATION)
self.story.setMetadata('language', self.SITE_LANGUAGE)
def _customized_fetch_url(self, url, exception=None, parameters=None):
if exception:
try:
data = self._fetchUrl(url, parameters)
except urllib2.HTTPError:
raise exception(self.url)
# Just let self._fetchUrl throw the exception, don't catch and
# customize it.
else:
data = self._fetchUrl(url, parameters)
return self.make_soup(data)
@staticmethod
def getSiteDomain():
return FanficHuAdapter.SITE_DOMAIN
@classmethod
def getSiteExampleURLs(cls):
return cls.VIEW_STORY_URL_TEMPLATE % 1234
def getSiteURLPattern(self):
return re.escape(self.VIEW_STORY_URL_TEMPLATE[:-2]) + r'\d+$'
def extractChapterUrlsAndMetadata(self):
soup = self._customized_fetch_url(self.url + '&i=1')
if soup.title.string.encode(_SOURCE_CODE_ENCODING).strip(' :') == 'írta':
raise exceptions.StoryDoesNotExist(self.url)
chapter_options = soup.find('form', action='viewstory.php').select('option')
# Remove redundant "Fejezetek" option
chapter_options.pop(0)
# If there is still more than one entry remove chapter overview entry
if len(chapter_options) > 1:
chapter_options.pop(0)
for option in chapter_options:
url = urlparse.urljoin(self.url, option['value'])
self.chapterUrls.append((option.string, url))
author_url = urlparse.urljoin(self.BASE_URL, soup.find('a', href=lambda href: href and href.startswith('viewuser.php?uid='))['href'])
soup = self._customized_fetch_url(author_url)
story_id = self.story.getMetadata('storyId')
for table in soup('table', {'class': 'mainnav'}):
title_anchor = table.find('span', {'class': 'storytitle'}).a
href = title_anchor['href']
if href.startswith('javascript:'):
href = href.rsplit(' ', 1)[1].strip("'")
query_data = _get_query_data(href)
if query_data['sid'] == story_id:
break
else:
# This should never happen, the story must be found on the author's
# page.
raise exceptions.FailedToDownload(self.url)
self.story.setMetadata('title', title_anchor.string)
rows = table('tr')
anchors = rows[0].div('a')
author_anchor = anchors[1]
query_data = _get_query_data(author_anchor['href'])
self.story.setMetadata('author', author_anchor.string)
self.story.setMetadata('authorId', query_data['uid'])
self.story.setMetadata('authorUrl', urlparse.urljoin(self.BASE_URL, author_anchor['href']))
self.story.setMetadata('reviews', anchors[3].string)
if self.getConfig('keep_summary_html'):
self.story.setMetadata('description', self.utf8FromSoup(author_url, rows[1].td))
else:
self.story.setMetadata('description', ''.join(rows[1].td(text=True)))
for row in rows[3:]:
index = 0
cells = row('td')
while index < len(cells):
cell = cells[index]
key = cell.b.string.encode(_SOURCE_CODE_ENCODING).strip(':')
try:
value = cells[index+1].string.encode(_SOURCE_CODE_ENCODING)
except AttributeError:
value = None
if key == 'Kategória':
for anchor in cells[index+1]('a'):
self.story.addToList('category', anchor.string)
elif key == 'Szereplõk':
if cells[index+1].string:
for name in cells[index+1].string.split(', '):
self.story.addToList('character', name)
elif key == 'Korhatár':
if value != 'nem korhatáros':
self.story.setMetadata('rating', value)
elif key == 'Figyelmeztetések':
for b_tag in cells[index+1]('b'):
self.story.addToList('warnings', b_tag.string)
elif key == 'Jellemzõk':
for genre in cells[index+1].string.split(', '):
self.story.addToList('genre', genre)
elif key == 'Fejezetek':
self.story.setMetadata('numChapters', int(value))
elif key == 'Megjelenés':
self.story.setMetadata('datePublished', makeDate(value, self.DATE_FORMAT))
elif key == 'Frissítés':
self.story.setMetadata('dateUpdated', makeDate(value, self.DATE_FORMAT))
elif key == 'Szavak':
self.story.setMetadata('numWords', value)
elif key == 'Befejezett':
self.story.setMetadata('status', 'Completed' if value == 'Nem' else 'In-Progress')
index += 2
if self.story.getMetadata('rating') == '18':
if not (self.is_adult or self.getConfig('is_adult')):
raise exceptions.AdultCheckRequired(self.url)
def getChapterText(self, url):
soup = self._customized_fetch_url(url)
story_cell = soup.find('form', action='viewstory.php').parent.parent
for div in story_cell('div'):
div.extract()
return self.utf8FromSoup(url, story_cell)
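`_get_query_data` above flattens `urlparse.parse_qs` results (lists of values) down to single strings. The same helper in Python 3 spelling, as a quick standalone sketch:

```python
from urllib.parse import urlparse, parse_qs

def get_query_data(url):
    """Return query parameters as a flat dict, keeping only the
    first value for each key (parse_qs returns lists)."""
    query = urlparse(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}

params = get_query_data('http://fanfic.hu/merengo/viewstory.php?sid=1234&i=1')
# params['sid'] is the story id the adapter uses to build the canonical URL.
```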


@ -1,324 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2014 Fanficdownloader team, 2018 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
import re
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
def getClass():
return FanFicsMeAdapter
class FanFicsMeAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
self.full_work_soup = None
self.use_full_work_soup = True
## All Russian as far as I know.
self.story.setMetadata('language','Russian')
# get storyId from url--url validation guarantees query correct
m = re.match(self.getSiteURLPattern(),url)
if m:
self.story.setMetadata('storyId',m.group('id'))
# normalized story URL.
self._setURL('https://' + self.getSiteDomain() + '/fic'+self.story.getMetadata('storyId'))
else:
raise exceptions.InvalidStoryURL(url,
self.getSiteDomain(),
self.getSiteExampleURLs())
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ffme')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d.%m.%Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Include 'www' here if the site uses it.
return 'fanfics.me'
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/fic1234 https://"+cls.getSiteDomain()+"/read.php?id=1234 https://"+cls.getSiteDomain()+"/read.php?id=1234&chapter=2"
def getSiteURLPattern(self):
# https://fanfics.me/fic137282
# https://fanfics.me/read.php?id=137282
# https://fanfics.me/read.php?id=137282&chapter=2
# https://fanfics.me/download.php?fic=137282&format=epub
return r"https?://"+re.escape(self.getSiteDomain())+r"/(fic|read\.php\?id=|download\.php\?fic=)(?P<id>\d+)"
## Login
def needToLoginCheck(self, data):
return '<form name="autent" action="https://fanfics.me/autent.php" method="post">' in data
def performLogin(self, url):
'''
<form name="autent" action="https://fanfics.me/autent.php" method="post">
Имя:<br>
<input class="input_3" type="text" name="name" id="name"><br>
Пароль:<br>
<input class="input_3" type="password" name="pass" id="pass"><br>
<input type="checkbox" name="nocookie" id="nocookie" />&nbsp;<label for="nocookie">Чужой&nbsp;компьютер</label><br>
<input class="modern_button" type="submit" value="Войти">
<div class="lostpass center"><a href="/index.php?section=lostpass">Забыл пароль</a></div>
'''
params = {}
if self.password:
params['name'] = self.username
params['pass'] = self.password
else:
params['name'] = self.getConfig("username")
params['pass'] = self.getConfig("password")
loginUrl = 'https://' + self.getSiteDomain() + '/autent.php'
logger.info("Will now login to URL (%s) as (%s)" % (loginUrl,
params['name']))
## must need a cookie or something.
self.get_request(loginUrl, usecache=False)
d = self.post_request(loginUrl, params, usecache=False)
if self.needToLoginCheck(d):
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['name']))
raise exceptions.FailedToLogin(url,params['name'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
url = self.url
logger.info("url: "+url)
data = self.get_request(url)
soup = self.make_soup(data)
## restrict meta searches to header.
fichead = soup.find('div',class_='FicHead')
def get_meta_content(title):
val_label = fichead.find('div',string=re.compile(u'^'+title+u':'))
if val_label:
return val_label.find_next('div')
## fanfics.me doesn't have separate adult--you have to set
## your age to 18+ in your user account
## Rating
## R, NC-17, PG-13 require login
## doesn't: General
#('Рейтинг', 'rating', False, False)
# val_label = fichead.find('div',string=u'Рейтинг:')
# val = stripHTML(val_label.find_next('div'))
# logger.debug(val)
self.story.setMetadata('rating',stripHTML(get_meta_content(u'Рейтинг')))
## Need to login for any rating higher than General.
if self.story.getMetadataRaw('rating') != 'General' and self.needToLoginCheck(data):
self.performLogin(url)
# reload after login.
data = self.get_request(url,usecache=False)
soup = self.make_soup(data)
fichead = soup.find('div',class_='FicHead')
## Title
## <h1>Третья сторона&nbsp;<span class="small green">(гет)</span></h1>
h = fichead.find('h1')
span = h.find('span')
## fanfics.me has no single term for this field; the values
## translate to Het, Gen, Slash and Femslash.
self.story.addToList('category',stripHTML(span)[1:-1])
span.extract()
self.story.setMetadata('title',stripHTML(h))
## author(s):
content = get_meta_content(u'Авторы?')
if content:
alist = content.find_all('a', class_='user')
for a in alist:
self.story.addToList('authorId',a['href'].split('/user')[-1])
self.story.addToList('authorUrl','https://'+self.host+a['href'])
self.story.addToList('author',stripHTML(a))
# can be deliberately anonymous.
if not alist:
self.story.setMetadata('author','Anonymous')
self.story.setMetadata('authorUrl','https://'+self.host)
self.story.setMetadata('authorId','0')
# translator(s) -- the site's label varies between singular and plural
content = get_meta_content(u'Переводчикк?и?')
if content:
for a in content.find_all('a', class_='user'):
self.story.addToList('translatorsId',a['href'].split('/user')[-1])
self.story.addToList('translatorsUrl','https://'+self.host+a['href'])
self.story.addToList('translators',stripHTML(a))
# If there are translators, but no authors, copy translators to authors.
if self.story.getList('translators') and not self.story.getList('author'):
self.story.extendList('authorId',self.story.getList('translatorsId'))
self.story.extendList('authorUrl',self.story.getList('translatorsUrl'))
self.story.extendList('author',self.story.getList('translators'))
# beta(s)
content = get_meta_content(u'Бета')
if content:
for a in content.find_all('a', class_='user'):
self.story.addToList('betasId',a['href'].split('/user')[-1])
self.story.addToList('betasUrl','https://'+self.host+a['href'])
self.story.addToList('betas',stripHTML(a))
content = get_meta_content(u'Фандом')
self.story.extendList('fandoms', [ stripHTML(a) for a in
fichead.find_all('a',href=re.compile(r'/fandom\d+$')) ] )
## 'Characters' header has both ships and chars lists
content = get_meta_content(u'Персонажи')
if content:
self.story.extendList('ships', [ stripHTML(a) for a in
content.find_all('a',href=re.compile(r'/paring\d+_\d+$')) ] )
for ship in self.story.getList('ships'):
self.story.extendList('characters', ship.split('/'))
self.story.extendList('characters', [ stripHTML(a) for a in
content.find_all('a',href=re.compile(r'/character\d+$')) ] )
self.story.extendList('genre',stripHTML(get_meta_content(u'Жанр')).split(', '))
## fanfics.me includes 'AU' and 'OOC' as warnings...
content = get_meta_content(u'Предупреждение')
if content:
self.story.extendList('warnings',stripHTML(content).split(', '))
content = get_meta_content(u'События')
if content:
self.story.extendList('events', [ stripHTML(a) for a in
content.find_all('a',href=re.compile(r'/find\?keyword=\d+$')) ] )
## Original work block
content = get_meta_content(u'Оригинал')
if content:
# only going to record URL.
titletd = content.find('td',string=u'Ссылка:')
self.story.setMetadata('originUrl',stripHTML(titletd.find_next('td')))
## size block, only saving word count.
content = get_meta_content(u'Размер')
words = stripHTML(content.find('a'))
words = re.sub(r'[^0-9]','',words) # only keep numbers
self.story.setMetadata('numWords',words)
## status by color code
statuscolors = {'red':'In-Progress',
'green':'Completed',
'blue':'Hiatus'}
content = get_meta_content(u'Статус')
self.story.setMetadata('status',statuscolors[content.span['class'][0]])
# desc
self.setDescription(url,soup.find('div',id='summary_'+self.story.getMetadata('storyId')))
# cover
div = fichead.find('div',class_='FicHead_cover')
if div:
# get the larger version.
self.setCoverImage(self.url,div.img['src'].replace('_200_300',''))
# dates
# <span class="DateUpdate" title="Опубликовано 22.04.2020, изменено 22.04.2020">22.04.2020 - 22.04.2020</span>
datespan = soup.find('span',class_='DateUpdate')
dates = stripHTML(datespan).split(" - ")
self.story.setMetadata('datePublished', makeDate(dates[0], self.dateformat))
self.story.setMetadata('dateUpdated', makeDate(dates[1], self.dateformat))
# series
seriesdiv = soup.find('div',id='fic_info_content_serie')
if seriesdiv:
seriesa = seriesdiv.find('a', href=re.compile(r'/serie\d+$'))
i=1
for a in seriesdiv.find_all('a', href=re.compile(r'/fic\d+$')):
if a['href'] == ('/fic'+self.story.getMetadata('storyId')):
self.setSeries(stripHTML(seriesa), i)
self.story.setMetadata('seriesUrl','https://'+self.host+seriesa['href'])
break
i+=1
chapteruls = soup.find_all('ul',class_='FicContents')
if chapteruls:
for ul in chapteruls:
# logger.debug(ul.prettify())
for chapter in ul.find_all('li'):
a = chapter.find('a')
# logger.debug(a.prettify())
if a and a.has_attr('href'):
# logger.debug(chapter.prettify())
self.add_chapter(stripHTML(a),'https://' + self.getSiteDomain() + a['href'])
else:
self.add_chapter(self.story.getMetadata('title'),
'https://' + self.getSiteDomain() +
'/read.php?id='+self.story.getMetadata('storyId')+'&chapter=0')
return
# grab the text for an individual chapter.
def getChapterTextNum(self, url, index):
logger.debug('Getting chapter text for: %s index: %s' % (url,index))
m = re.match(r'.*&chapter=(\d+).*',url)
if m:
index=m.group(1)
logger.debug("Using index(%s) from &chapter="%index)
chapter_div = None
if self.use_full_work_soup and self.getConfig("use_view_full_work",True) and self.num_chapters() > 1:
logger.debug("USE view_full_work")
## Assumed view_adult=true was cookied during metadata
if not self.full_work_soup:
self.full_work_soup = self.make_soup(self.get_request(
'https://' + self.getSiteDomain() + '/read.php?id='+self.story.getMetadata('storyId')))
whole_dl_soup = self.full_work_soup
chapter_div = whole_dl_soup.find('div',{'id':'c%s'%(index)})
if not chapter_div:
self.use_full_work_soup = False
logger.warning("c%s not found in view_full_work--ending use_view_full_work"%(index))
if chapter_div == None:
whole_dl_soup = self.make_soup(self.get_request(url))
chapter_div = whole_dl_soup.find('div',{'id':'c%s'%(index)})
if None == chapter_div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,chapter_div)
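`getChapterTextNum` above caches the site's single "read whole fic" page and falls back to per-chapter fetches once a chapter div is missing from it. The control flow can be sketched abstractly; `ChapterFetcher` and the dict-returning `fetch_page` are illustrative stand-ins, since the real code works on BeautifulSoup trees:

```python
class ChapterFetcher:
    """Cache one combined page; on the first miss, stop trusting it
    and fetch chapters individually from then on."""
    def __init__(self, fetch_page):
        self.fetch_page = fetch_page   # url -> {chapter_id: text}
        self.full_work = None
        self.use_full_work = True

    def chapter_text(self, full_url, chapter_url, chapter_id):
        if self.use_full_work:
            if self.full_work is None:
                self.full_work = self.fetch_page(full_url)
            text = self.full_work.get(chapter_id)
            if text is not None:
                return text
            self.use_full_work = False  # combined page is incomplete
        return self.fetch_page(chapter_url)[chapter_id]
```

The payoff is one network request for a multi-chapter story in the common case, with per-chapter requests only as a repair path.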


@ -1,224 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2020 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
import re
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
def getClass():
return FanfictalkComAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class FanfictalkComAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('https://' + self.getSiteDomain() + '/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ahpfftc')
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d %b %Y"
@classmethod
def getAcceptDomains(cls):
return [cls.getSiteDomain(),'archive.hpfanfictalk.com','fanfictalk.com']
@classmethod
def getConfigSections(cls):
"Only needs to be overriden if has additional ini sections."
return [cls.getConfigSection(),'archive.hpfanfictalk.com','fanfictalk.com']
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'archive.fanfictalk.com'
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return r"https?://("+r"|".join([x.replace('.',r'\.') for x in self.getAcceptDomains()])+r")(/archive)?/viewstory\.php\?sid=\d+$"
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&ageconsent=ok&warning=3"
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
data = self.get_request(url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
## Title and author
soup = self.make_soup(data)
# logger.debug(soup)
pagetitle = soup.select_one('div#pagetitle')
# logger.debug(pagetitle)
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',stripHTML(a))
# Find authorid and URL from... author url.
for a in pagetitle.find_all('a', href=re.compile(r"viewuser.php\?uid=\d+")):
self.story.addToList('authorId',a['href'].split('=')[1])
self.story.addToList('authorUrl','https://'+self.host+'/'+a['href'])
self.story.addToList('author',stripHTML(a))
# Find the chapters:
for chapter in soup.find_all('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+r"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.add_chapter(chapter,'https://'+self.host+'/'+chapter['href'])
# categories
for a in soup.select("div#sort a"):
self.story.addToList('category',stripHTML(a))
# this site has two divs with class=gb-50 and no immediate container.
gb50s = soup.find_all('div', {'class':'gb-50'})
def list_from_urls(parent, regex, metadata):
urls = parent.find_all('a',href=re.compile(regex))
for url in urls:
self.story.addToList(metadata,stripHTML(url))
list_from_urls(gb50s[0],r'browse.php\?type=characters','characters')
list_from_urls(gb50s[0],r'browse.php\?type=class&type_id=11','ships')
list_from_urls(gb50s[0],r'browse.php\?type=class&type_id=10','representation')
list_from_urls(gb50s[0],r'browse.php\?type=class&type_id=7','storytype')
list_from_urls(gb50s[0],r'browse.php\?type=class&type_id=14','house')
list_from_urls(gb50s[1],r'browse.php\?type=class&type_id=8','warnings')
list_from_urls(gb50s[1],r'browse.php\?type=class&type_id=15','contentwarnings')
list_from_urls(gb50s[1],r'browse.php\?type=class&type_id=4','genre')
list_from_urls(gb50s[1],r'browse.php\?type=class&type_id=13','tropes')
bq = soup.find('blockquote2')
if bq:
# blockquote2??? Whatever. But we're changing it to a real tag.
bq.name='div'
self.setDescription(url,bq)
# Usually we'd use something more precise for the label search,
# but this site doesn't group much.
labels = soup.find_all('b')
for labelspan in labels:
# logger.debug(labelspan)
value = labelspan.nextSibling
label = stripHTML(labelspan)
# logger.debug(value)
# logger.debug(label)
if 'Words:' in label:
self.story.setMetadata('numWords', stripHTML(value).replace('·',''))
if 'Published:' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value).replace('·',''), self.dateformat))
if 'Updated:' in label:
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value).replace('·',''), self.dateformat))
# Site allows stories to be in several series at once. FFF
# isn't thrilled with that, so we have series00, series01, etc.
# Example:
# https://archive.fanfictalk.com/viewstory.php?sid=483
if self.getConfig("collect_series"):
seriesspan = soup.find('span',label='Series')
for i, seriesa in enumerate(seriesspan.find_all('a', href=re.compile(r"viewseries\.php\?seriesid=\d+"))):
# logger.debug(seriesa)
series_name = stripHTML(seriesa)
series_url = 'https://'+self.host+'/'+seriesa['href']
seriessoup = self.make_soup(self.get_request(series_url))
storyas = seriessoup.find_all('a', href=re.compile(r'viewstory.php\?sid=\d+'))
# logger.debug(storyas)
j=1
found = False
for storya in storyas:
# logger.debug(storya)
## allow for JS links.
if ('viewstory.php?sid='+self.story.getMetadata('storyId')) in storya['href']:
found = True
break
j+=1
if found:
series_index = j
self.story.setMetadata('series%02d'%i,"%s [%s]"%(series_name,series_index))
self.story.setMetadata('series%02dUrl'%i,series_url)
if i == 0:
self.setSeries(series_name, series_index)
self.story.setMetadata('seriesUrl',series_url)
else:
logger.debug("Story URL not found in series (%s) page, not including."%series_url)
# grab the text for an individual chapter.
def getChapterText(self, url):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&ageconsent=ok&warning=3"
else:
addurl=""
logger.debug('Getting chapter text from: %s' % (url+addurl))
soup = self.make_soup(self.get_request(url+addurl))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)


@ -0,0 +1,217 @@
# coding=utf-8
import re
import urllib2
import urlparse
from bs4.element import Tag
from base_adapter import BaseSiteAdapter, makeDate
from .. import exceptions
_SOURCE_CODE_ENCODING = 'utf-8'
def getClass():
return FanfictionCsodaidokHuAdapter
def _get_query_data(url):
components = urlparse.urlparse(url)
query_data = urlparse.parse_qs(components.query)
return dict((key, data[0]) for key, data in query_data.items())
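A standalone illustration of what `_get_query_data` does — flattening the list-valued result of `parse_qs` into a plain dict — shown here with Python 3's `urllib.parse` (the adapter itself imports the Python 2 `urlparse` module):

```python
# Illustrative only: mirrors _get_query_data using Python 3's urllib.parse.
from urllib.parse import urlparse, parse_qs

def get_query_data(url):
    components = urlparse(url)
    query_data = parse_qs(components.query)
    # parse_qs returns a list per key; keep only the first value of each.
    return dict((key, data[0]) for key, data in query_data.items())

print(get_query_data('http://example.com/viewstory.php?sid=1234&chapter=2'))
# → {'sid': '1234', 'chapter': '2'}
```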
# yields Tag _and_ NavigableString siblings from the given tag. The
# BeautifulSoup findNextSiblings() method for some reasons only returns either
# NavigableStrings _or_ Tag objects, not both.
def _yield_next_siblings(tag):
sibling = tag.nextSibling
while sibling:
yield sibling
sibling = sibling.nextSibling
class FanfictionCsodaidokHuAdapter(BaseSiteAdapter):
_SITE_DOMAIN = 'fanfiction.csodaidok.hu'
_BASE_URL = 'http://' + _SITE_DOMAIN + '/'
_VIEW_STORY_URL_TEMPLATE = _BASE_URL + 'viewstory.php?sid=%s'
_VIEW_CHAPTER_URL_TEMPLATE = _VIEW_STORY_URL_TEMPLATE + '&chapter=%s'
_STORY_DOES_NOT_EXIST_PAGE_TITLE = 'Cím: Szerző:'
_DATE_FORMAT = '%Y.%m.%d'
_SITE_LANGUAGE = 'Hungarian'
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
query_data = urlparse.parse_qs(self.parsedUrl.query)
story_id = query_data['sid'][0]
self.story.setMetadata('storyId', story_id)
self._setURL(self._VIEW_STORY_URL_TEMPLATE % story_id)
self.story.setMetadata('siteabbrev', self._SITE_DOMAIN)
self.story.setMetadata('language', self._SITE_LANGUAGE)
def _customized_fetch_url(self, url, exception=None, parameters=None):
if exception:
try:
data = self._fetchUrl(url, parameters)
except urllib2.HTTPError:
raise exception(self.url)
# Just let self._fetchUrl throw the exception, don't catch and
# customize it.
else:
data = self._fetchUrl(url, parameters)
return self.make_soup(data)
@staticmethod
def getSiteDomain():
return FanfictionCsodaidokHuAdapter._SITE_DOMAIN
@classmethod
def getSiteExampleURLs(cls):
return cls._VIEW_STORY_URL_TEMPLATE % 1234
def getSiteURLPattern(self):
return re.escape(self._VIEW_STORY_URL_TEMPLATE[:-2]) + r'\d+$'
def extractChapterUrlsAndMetadata(self):
soup = self._customized_fetch_url(self.url + '&chapter=1')
element = soup.find('div', id='pagetitle')
page_title = ''.join(element(text=True)).encode(_SOURCE_CODE_ENCODING)
if page_title == self._STORY_DOES_NOT_EXIST_PAGE_TITLE:
raise exceptions.StoryDoesNotExist(self.url)
author_url = urlparse.urljoin(self.url, element.a['href'])
story_id = self.story.getMetadata('storyId')
element = soup.find('select', {'name': 'chapter'})
if element:
for option in element('option'):
title = option.string
url = self._VIEW_CHAPTER_URL_TEMPLATE % (story_id, option['value'])
self.chapterUrls.append((title, url))
soup = self._customized_fetch_url(author_url)
story_id = self.story.getMetadata('storyId')
for listbox_div in soup('div', {'class': lambda klass: klass and 'listbox' in klass}):
a = listbox_div.div.a
if not a['href'].startswith('viewstory.php?sid='):
continue
query_data = _get_query_data(a['href'])
if query_data['sid'] == story_id:
break
else:
raise exceptions.FailedToDownload(self.url)
title = ''.join(a(text=True))
self.story.setMetadata('title', title)
if not self.chapterUrls:
self.chapterUrls.append((title, self.url))
element = a.findNextSibling('a')
self.story.setMetadata('author', element.string)
query_data = _get_query_data(element['href'])
self.story.setMetadata('authorId', query_data['uid'])
self.story.setMetadata('authorUrl', author_url)
element = element.findNextSibling('span')
rating = element.nextSibling.strip(' [')
if rating.encode(_SOURCE_CODE_ENCODING) != 'Korhatár nélkül':
self.story.setMetadata('rating', rating)
if rating == '18':
raise exceptions.AdultCheckRequired(self.url)
element = element.findNextSiblings('a')[1]
self.story.setMetadata('reviews', element.string)
sections = listbox_div('div', {'class': lambda klass: klass and klass in ['content', 'tail']})
for section in sections:
for element in section('span', {'class': 'classification'}):
key = element.string.encode(_SOURCE_CODE_ENCODING).strip(' :')
try:
value = element.nextSibling.string.encode(_SOURCE_CODE_ENCODING).strip()
except AttributeError:
value = None
if key == 'Tartalom':
contents = []
keep_summary_html = self.getConfig('keep_summary_html')
for sibling in _yield_next_siblings(element):
if isinstance(sibling, Tag):
if sibling.name == 'span' and sibling.get('class', None) == 'classification':
break
if keep_summary_html:
contents.append(self.utf8FromSoup(author_url, sibling))
else:
contents.append(''.join(sibling(text=True)))
else:
contents.append(sibling)
self.story.setMetadata('description', ''.join(contents))
elif key == 'Kategória':
for sibling in element.findNextSiblings(['a', 'span']):
if sibling.name == 'span':
break
self.story.addToList('category', sibling.string)
elif key == 'Szereplők':
for name in value.split(', '):
self.story.addToList('characters', name)
elif key == 'Műfaj':
if value != 'Nincs':
self.story.setMetadata('genre', value)
elif key == 'Figyelmeztetés':
if value != 'Nincs':
for warning in value.split(', '):
self.story.addToList('warnings', warning)
elif key == 'Kihívás':
if value != 'Nincs':
self.story.setMetadata('challenge', value)
elif key == 'Sorozat':
if value != 'Nincs':
self.story.setMetadata('series', value)
elif key == 'Fejezetek':
self.story.setMetadata('numChapters', int(value))
elif key == 'Befejezett':
self.story.setMetadata('status', 'Completed' if value == 'Nem' else 'In-Progress')
elif key == 'Szavak száma':
self.story.setMetadata('numWords', value)
elif key == 'Feltöltve':
self.story.setMetadata('datePublished', makeDate(value, self._DATE_FORMAT))
elif key == 'Frissítve':
self.story.setMetadata('dateUpdated', makeDate(value, self._DATE_FORMAT))
def getChapterText(self, url):
soup = self._customized_fetch_url(url)
contents = []
notes_div = soup.find('div', id='notes')
if notes_div:
contents.append(self.utf8FromSoup(url, notes_div))
story_div = notes_div.findNextSibling('div')
else:
element = soup.find('div', {'class': 'jumpmenu'})
story_div = element.findNextSibling('div')
contents.append(self.utf8FromSoup(url, story_div.span))
return ''.join(contents)


@ -0,0 +1,290 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
# By virtue of being recent and requiring both is_adult and user/pass,
# adapter_fanficcastletvnet.py is the best choice for learning to
# write adapters--especially for sites that use the eFiction system.
# Most sites that have ".../viewstory.php?sid=123" in the story URL
# are eFiction.
# For non-eFiction sites, it can be considerably more complex, but
# this is still a good starting point.
# In general an 'adapter' needs to do these five things:
# - 'Register' correctly with the downloader
# - Site Login (if needed)
# - 'Are you adult?' check (if needed--some do one, some the other, some both)
# - Grab the chapter list
# - Grab the story meta-data (some (non-eFiction) adapters have to get it from the author page)
# - Grab the chapter texts
# Search for XXX comments--that's where things are most likely to need changing.
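The checklist above can be sketched as a toy adapter skeleton (illustrative only, not part of the codebase; `ExampleSiteAdapter` and its site domain are hypothetical, and real adapters subclass `BaseSiteAdapter` instead):

```python
# Toy sketch of the adapter responsibilities listed above. Names mirror
# the conventions used in this file; the class itself is hypothetical.
import re

class ExampleSiteAdapter(object):
    SITE = 'example.efiction.site'

    def __init__(self, url):
        # 'Register': validate the URL and extract the story id (sid).
        m = re.match(self.getSiteURLPattern(), url)
        if not m:
            raise ValueError('URL does not match site pattern: %s' % url)
        self.story_id = m.group('sid')
        # Normalized story URL.
        self.url = 'http://%s/viewstory.php?sid=%s' % (self.SITE, self.story_id)

    @classmethod
    def getSiteURLPattern(cls):
        return (r'https?://' + re.escape(cls.SITE)
                + r'/viewstory\.php\?sid=(?P<sid>\d+)$')

    # The remaining responsibilities would be performLogin(),
    # extractChapterUrlsAndMetadata() and getChapterText().

adapter = ExampleSiteAdapter('http://example.efiction.site/viewstory.php?sid=42')
print(adapter.url)
```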
# This function is called by the downloader in all adapter_*.py files
# in this dir to register the adapter class. So it needs to be
# updated to reflect the class below it. That, plus getSiteDomain()
# take care of 'Registering'.
def getClass():
return FanfictionJunkiesDeAdapter # XXX
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class FanfictionJunkiesDeAdapter(BaseSiteAdapter): # XXX
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
# get storyId from url--url validation guarantees query is only sid=1234
self.story.setMetadata('storyId',self.parsedUrl.query.split('=',)[1])
# normalized story URL.
# XXX Most sites don't have the /fanfic part. Replace all to remove it usually.
self._setURL('http://' + self.getSiteDomain() + '/efiction/viewstory.php?sid='+self.story.getMetadata('storyId'))
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ffjde') # XXX
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%d/%m/%y" # XXX
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'fanfiction-junkies.de' # XXX
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/efiction/viewstory.php?sid=1234"
def getSiteURLPattern(self):
return re.escape("http://"+self.getSiteDomain()+"/efiction/viewstory.php?sid=")+r"\d+$"
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Registered Users Only' in data \
or 'There is no such account on our website' in data \
or "That password doesn't match the one in our database" in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['penname'] = self.username
params['password'] = self.password
else:
params['penname'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['cookiecheck'] = '1'
params['submit'] = 'Submit'
loginUrl = 'http://' + self.getSiteDomain() + '/efiction/user.php?action=login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['penname']))
d = self._fetchUrl(loginUrl, params)
if "Member Account" not in d : #Member Account
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['penname']))
raise exceptions.FailedToLogin(url,params['penname'])
return False
else:
return True
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self):
if self.is_adult or self.getConfig("is_adult"):
# Weirdly, different sites use different warning numbers.
# If the title search below fails, there's a good chance
# you need a different number. print data at that point
# and see what the 'click here to continue' url says.
addurl = "&ageconsent=ok&warning=1" # XXX
else:
addurl=""
# index=1 makes sure we see the story chapter index. Some
# sites skip that for one-chapter stories.
url = self.url+'&index=1'+addurl
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self._fetchUrl(url)
# The actual text that is used to announce you need to be an
# adult varies from site to site. Again, print data before
# the title search to troubleshoot.
if "For adults only " in data: # XXX
raise exceptions.AdultCheckRequired(self.url)
if "Access denied. This story has not been validated by the adminstrators of this site." in data:
raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# print data
# Now go hunting for all the meta data and the chapter list.
pagetitle = soup.find('h4')
## Title
a = pagetitle.find('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"$"))
self.story.setMetadata('title',a.string)
# Find authorid and URL from... author url.
a = pagetitle.find('a', href=re.compile(r"viewuser.php\?uid=\d+"))
self.story.setMetadata('authorId',a['href'].split('=')[1])
self.story.setMetadata('authorUrl','http://'+self.host+'/efiction/'+a['href'])
self.story.setMetadata('author',a.string)
# Reviews
reviewdata = soup.find('div', {'id' : 'sort'})
a = reviewdata.findAll('a', href=re.compile(r'reviews.php\?type=ST&(amp;)?item='+self.story.getMetadata('storyId')+"$"))[1] # second one.
self.story.setMetadata('reviews',stripHTML(a))
# Find the chapters:
for chapter in soup.findAll('a', href=re.compile(r'viewstory.php\?sid='+self.story.getMetadata('storyId')+"&chapter=\d+$")):
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/efiction/'+chapter['href']+addurl))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# eFiction sites don't help us out a lot with their metadata
# formatting, so it's a little ugly.
# utility method
def defaultGetattr(d,k):
try:
return d[k]
except:
return ""
# <span class="label">Rated:</span> NC-17<br /> etc
list = soup.find('div', {'class':'listbox'})
labels = list.findAll('b')
for labelspan in labels:
value = labelspan.nextSibling
label = labelspan.string
if 'Zusammenfassung' in label:
self.setDescription(url,value)
if 'Eingestuft' in label:
self.story.setMetadata('rating', value)
if u'Wörter' in label:
self.story.setMetadata('numWords', value)
if 'Kategorie' in label:
cats = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=categories'))
for cat in cats:
self.story.addToList('category',cat.string)
if 'Charaktere' in label:
chars = labelspan.parent.findAll('a',href=re.compile(r'browse.php\?type=characters'))
for char in chars:
self.story.addToList('characters',char.string)
if 'Abgeschlossen' in label:
if 'Yes' in value:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
if u'Veröffentlicht' in label:
self.story.setMetadata('datePublished', makeDate(stripHTML(value), self.dateformat))
if 'Aktualisiert' in label:
# there's a stray [ at the end.
#value = value[0:-1]
self.story.setMetadata('dateUpdated', makeDate(stripHTML(value), self.dateformat))
try:
# Find Series name from series URL.
a = soup.find('a', href=re.compile(r"viewseries.php\?seriesid=\d+"))
series_name = a.string
series_url = 'http://'+self.host+'/efiction/'+a['href']
# use BeautifulSoup HTML parser to make everything easier to find.
seriessoup = self.make_soup(self._fetchUrl(series_url))
storyas = seriessoup.findAll('a', href=re.compile(r'^viewstory.php\?sid=\d+$'))
i=1
for a in storyas:
if a['href'] == ('viewstory.php?sid='+self.story.getMetadata('storyId')):
self.setSeries(series_name, i)
self.story.setMetadata('seriesUrl',series_url)
break
i+=1
except:
# I find it hard to care if the series parsing fails
pass
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'story'})
if None == div:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,div)


@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -15,24 +15,22 @@
# limitations under the License.
#
from __future__ import absolute_import
# Software: eFiction
from .base_efiction_adapter import BaseEfictionAdapter
from base_efiction_adapter import BaseEfictionAdapter
class NarutoFicOrgSiteAdapter(BaseEfictionAdapter):
class FanfictionLucifaelComAdapter(BaseEfictionAdapter):
@staticmethod
def getSiteDomain():
return 'www.narutofic.org'
return 'fanfiction.lucifael.com'
@classmethod
def getSiteAbbrev(self):
return 'nfo'
return 'luci'
@classmethod
def getDateFormat(self):
return "%d/%m/%y"
return "%d/%m/%Y"
def getClass():
return NarutoFicOrgSiteAdapter
return FanfictionLucifaelComAdapter


@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2016 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -15,31 +15,21 @@
# limitations under the License.
#
from __future__ import absolute_import
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
import re
# py2 vs py3 transition
from ..six import text_type as unicode
from ..six.moves.urllib.parse import urlparse
import urllib2
from urllib import unquote_plus
from .. import exceptions as exceptions
from ..htmlcleanup import stripHTML
from .base_adapter import BaseSiteAdapter
from base_adapter import BaseSiteAdapter, makeDate
ffnetgenres=["Adventure", "Angst", "Crime", "Drama", "Family", "Fantasy",
"Friendship", "General", "Horror", "Humor", "Hurt-Comfort",
"Mystery", "Parody", "Poetry", "Romance", "Sci-Fi", "Spiritual",
"Supernatural", "Suspense", "Tragedy", "Western"]
ffnetpluscategories=["+Anima", "Alex + Ada", "Rosario + Vampire", "Blood+",
"+C: Sword and Cornett", "Norn9 - ノルン+ノネット",
"Haré+Guu/ジャングルはいつもハレのちグゥ", "Lost+Brain",
"Wicked + The Divine", "Alex + Ada", "RE: Alistair++",
"Tristan + Isolde"]
ffnetgenres=["Adventure", "Angst", "Crime", "Drama", "Family", "Fantasy", "Friendship", "General",
"Horror", "Humor", "Hurt-Comfort", "Mystery", "Parody", "Poetry", "Romance", "Sci-Fi",
"Spiritual", "Supernatural", "Suspense", "Tragedy", "Western"]
class FanFictionNetSiteAdapter(BaseSiteAdapter):
@ -47,13 +37,27 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','ffnet')
self.set_story_idurl(url)
# get storyId from url--url validation guarantees second part is storyId
self.story.setMetadata('storyId',self.parsedUrl.path.split('/',)[2])
# normalized story URL.
self._setURL("https://"+self.getSiteDomain()\
+"/s/"+self.story.getMetadata('storyId')+"/1/")
# ffnet update emails have the latest chapter URL.
# Frequently, when they arrive, not all the servers have the
# latest chapter yet and going back to chapter 1 to pull the
# chapter list doesn't get the latest. So save and use the
# original URL given to pull chapter list & metadata.
# Not used by plugin because URL gets normalized first for
# eliminating duplicate story urls.
self.origurl = url
if "https://m." in self.origurl:
## accept m(mobile)url, but use www.
self.origurl = self.origurl.replace("https://m.","https://www.")
self.opener.addheaders.append(('Referer',self.origurl))
@staticmethod
def getSiteDomain():
return 'www.fanfiction.net'
@ -66,74 +70,24 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
def getSiteExampleURLs(cls):
return "https://www.fanfiction.net/s/1234/1/ https://www.fanfiction.net/s/1234/12/ http://www.fanfiction.net/s/1234/1/Story_Title http://m.fanfiction.net/s/1234/1/"
def set_story_idurl(self,url):
parsedUrl = urlparse(url)
pathparts = parsedUrl.path.split('/',)
self.story.setMetadata('storyId',pathparts[2])
self.urltitle='' if len(pathparts)<5 else pathparts[4]
# normalized story URL.
self._setURL("https://"+self.getSiteDomain()\
+"/s/"+self.story.getMetadata('storyId')+"/1/"+self.urltitle)
## here so getSiteURLPattern and get_section_url(class method) can
## both use it. Note adapter_fictionpresscom has one too.
@classmethod
def _get_site_url_pattern(cls):
return r"https?://(www|m)?\.fanfiction\.net/s/(?P<id>\d+)(/\d+)?(/(?P<title>[^/]+))?/?$"
@classmethod
def get_section_url(cls,url):
## minimal URL used for section names in INI and reject list
## for comparison
# logger.debug("pre--url:%s"%url)
m = re.match(cls._get_site_url_pattern(),url)
if m:
url = "https://"+cls.getSiteDomain()\
+"/s/"+m.group('id')+"/1/"
# logger.debug("post-url:%s"%url)
return url
@classmethod
def get_url_search(cls,url):
regexp = super(getClass(), cls).get_url_search(url)
regexp = re.sub(r"^(?P<keep>.*net/s/\d+/\d+/)(?P<urltitle>[^\$]*)?",
r"\g<keep>(.*)",regexp)
logger.debug(regexp)
return regexp
def getSiteURLPattern(self):
return self._get_site_url_pattern()
return r"https?://(www|m)?\.fanfiction\.net/s/\d+(/\d+)?(/|/[^/]+)?/?$"
## normalized chapter URLs DO contain the story title now, but
## normalized to current urltitle in case of title changes.
def normalize_chapterurl(self,url):
return re.sub(r"https?://(www|m)\.(?P<keep>fanfiction\.net/s/\d+/\d+/).*",
r"https://www.\g<keep>",url)+self.urltitle
def _fetchUrl(self,url,parameters=None,extrasleep=1.0,usecache=True):
## ffnet (and, I assume, fpcom) tends to fail more if hit too
## fast. This is in addition to whatever the
## slow_down_sleep_time setting is.
return BaseSiteAdapter._fetchUrl(self,url,
parameters=parameters,
extrasleep=extrasleep,
usecache=usecache)
def get_request(self,url,usecache=True):
## use super version if not set or isn't a chapter URL with a
## title.
if( not self.getConfig("try_shortened_title_urls") or
not re.match(r"https?://www\.fanfiction\.net/s/\d+/\d+/(?P<title>[^/]+)$", url) ):
return super(getClass(), self).get_request(url,usecache)
## kludgey way to attempt more than one URL variant by
## removing title one letter at a time. Note that network and
## open_pages_in_browser retries still happen first.
titlelen = len(url.split('/')[-1])
maxcut = min([4,titlelen])
j = 0
while j < maxcut: # should actually leave loop either by
# return or exception raise.
try:
useurl = url
if j: # j==0, full URL, then remove letters.
useurl = url[:-j]
return super(getClass(), self).get_request(useurl,usecache)
except exceptions.HTTPErrorFFF as fffe:
if j >= maxcut or 'Page not found or expired' not in unicode(fffe):
raise
j = j+1
def use_pagecache(self):
'''
adapters that will work with the page cache need to implement
this and change it to True.
'''
return True
def doExtractChapterUrlsAndMetadata(self,get_cover=True):
@ -143,60 +97,52 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
url = self.origurl
logger.debug("URL: "+url)
data = self.get_request(url)
#logger.debug("\n===================\n%s\n===================\n"%data)
soup = self.make_soup(data)
# use BeautifulSoup HTML parser to make everything easier to find.
try:
data = self._fetchUrl(url)
#logger.debug("\n===================\n%s\n===================\n"%data)
soup = self.make_soup(data)
except urllib2.HTTPError as e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(url)
else:
raise e
if "Unable to locate story" in data or "Story Not Found" in data:
if "Unable to locate story" in data:
raise exceptions.StoryDoesNotExist(url)
# some times "Chapter not found...", sometimes "Chapter text
# not found..." or "Story does not have any chapters"
if "Please check to see you are not using an outdated url." in data:
# some times "Chapter not found...", sometimes "Chapter text not found..."
if "not found. Please check to see you are not using an outdated url." in data:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! 'Chapter not found. Please check to see you are not using an outdated url.'" % url)
if "Category for this story has been disabled" in data:
raise exceptions.FailedToDownload("FanFiction.Net has removed the category for this story and will no longer serve it.")
# <link rel="canonical" href="//www.fanfiction.net/s/13551154/100/Haze-Gray">
canonicalurl = soup.select_one('link[rel=canonical]')['href']
self.set_story_idurl(canonicalurl)
## ffnet used to have a tendency to send out update notices in
## email before all their servers were showing the update on
## the first chapter. It generates another server request and
## doesn't seem to be needed lately, so now default it to off.
try:
chapcount = len(soup.find('select', { 'name' : 'chapter' } ).find_all('option'))
# get chapter part of url.
except:
chapcount = 1
have_later_meta = False
if self.getConfig('check_next_chapter'):
try:
tryurl = "https://%s/s/%s/%d/%s"%(self.getSiteDomain(),
self.story.getMetadata('storyId'),
chapcount+1,
self.urltitle)
## ffnet used to have a tendency to send out update
## notices in email before all their servers were
## showing the update on the first chapter. It
## generates another server request and doesn't seem
## to be needed lately, so now default it to off.
try:
chapcount = len(soup.find('select', { 'name' : 'chapter' } ).findAll('option'))
# get chapter part of url.
except:
chapcount = 1
chapter = url.split('/',)[5]
tryurl = "https://%s/s/%s/%d/"%(self.getSiteDomain(),
self.story.getMetadata('storyId'),
chapcount+1)
logger.debug('=Trying newer chapter: %s' % tryurl)
newdata = self.get_request(tryurl)
newdata = self._fetchUrl(tryurl)
if "not found. Please check to see you are not using an outdated url." not in newdata \
and "This request takes too long to process, it is timed out by the server." not in newdata:
logger.debug('=======Found newer chapter: %s' % tryurl)
soup = self.make_soup(newdata)
have_later_meta = True
except Exception as e:
logger.warning("Caught exception in check_next_chapter URL: %s Exception %s."%(unicode(tryurl),unicode(e)))
if self.getConfig('meta_from_last_chapter') and not have_later_meta and chapcount > 1:
tryurl = "https://%s/s/%s/%d/%s"%(self.getSiteDomain(),
self.story.getMetadata('storyId'),
chapcount,
self.urltitle)
logger.debug('=Trying last chapter for meta_from_last_chapter: %s' % tryurl)
newdata = self.get_request(tryurl)
soup = self.make_soup(newdata)
have_later_meta = True
except urllib2.HTTPError as e:
if e.code == 503:
raise e
except e:
logger.warn("Caught an exception reading URL: %s sleeptime(%s) Exception %s."%(unicode(url),sleeptime,unicode(e)))
pass
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"^/u/\d+"))
@@ -211,8 +157,8 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
## 2) cat1_cat2_Crossover
## For 1, use the second link.
## For 2, fetch the crossover page and pull the two categories from there.
pre_links = soup.find('div',{'id':'pre_story_links'})
categories = pre_links.find_all('a',{'class':'xcontrast_txt'})
categories = soup.find('div',{'id':'pre_story_links'}).findAll('a',{'class':'xcontrast_txt'})
#print("xcontrast_txt a:%s"%categories)
if len(categories) > 1:
# Strangely, the ones with *two* links are the
@@ -220,17 +166,20 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
# of Book, Movie, etc.
self.story.addToList('category',stripHTML(categories[1]))
elif 'Crossover' in categories[0]['href']:
## turns out there's only a handful of ffnet category's
## with '+' in. Keep a list and look for them
## specifically instead of looking up the crossover page.
crossover_cat = stripHTML(categories[0]).replace(" Crossover","")
for pluscat in ffnetpluscategories:
if pluscat in crossover_cat:
self.story.addToList('category',pluscat)
crossover_cat = crossover_cat.replace(pluscat,'')
for cat in crossover_cat.split(' + '):
if cat:
self.story.addToList('category',cat)
caturl = "https://%s%s"%(self.getSiteDomain(),categories[0]['href'])
catsoup = self.make_soup(self._fetchUrl(caturl))
found = False
for a in catsoup.findAll('a',href=re.compile(r"^/crossovers/.+?/\d+/")):
self.story.addToList('category',stripHTML(a))
found = True
if not found:
# Fall back. I ran across a story with a Crossover
# category link to a broken page once.
# http://www.fanfiction.net/s/2622060/1/
# Naruto + Harry Potter Crossover
logger.info("Fall back category collection")
for c in stripHTML(categories[0]).replace(" Crossover","").split(' + '):
self.story.addToList('category',c)
a = soup.find('a', href=re.compile(r'https?://www\.fictionratings\.com/'))
rating = a.string
@@ -251,7 +200,7 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
grayspan = gui_table1i.find('span', {'class':'xgray xcontrast_txt'})
# for b in grayspan.find_all('button'):
# for b in grayspan.findAll('button'):
# b.extract()
metatext = stripHTML(grayspan).replace('Hurt/Comfort','Hurt-Comfort')
#logger.debug("metatext:(%s)"%metatext)
@@ -261,8 +210,7 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
else:
self.story.setMetadata('status', 'In-Progress')
## Newer BS libraries are discarding whitespace after tags now. :-/
metalist = re.split(" ?- ",metatext)
metalist = metatext.split(" - ")
#logger.debug("metalist:(%s)"%metalist)
# Rated: Fiction K - English - Words: 158,078 - Published: 02-04-11
@@ -290,7 +238,7 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
# Updated: <span data-xutime='1368059198'>5/8</span> - Published: <span data-xutime='1278984264'>7/12/2010</span>
# Published: <span data-xutime='1384358726'>8m ago</span>
dates = soup.find_all('span',{'data-xutime':re.compile(r'^\d+$')})
dates = soup.findAll('span',{'data-xutime':re.compile(r'^\d+$')})
if len(dates) > 1 :
# updated gets set to the same as published upstream if not found.
self.story.setMetadata('dateUpdated',datetime.fromtimestamp(float(dates[0]['data-xutime'])))
@@ -338,51 +286,42 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
# Try the larger image first.
cover_url = ""
try:
img = soup.select_one('img.lazy.cimage')
cover_url=img['data-original']
img = soup.select('img.lazy.cimage')
cover_url=img[0]['data-original']
except:
## Nov 2023 - src is always "/static/images/d_60_90.jpg" now
## Only take cover if there's data-original
## Primary motivator is to prevent unneeded author page hits.
pass
img = soup.select('img.cimage')
if img:
cover_url=img[0]['src']
logger.debug("cover_url:%s"%cover_url)
authimg_url = ""
if cover_url and self.getConfig('skip_author_cover') and self.getConfig('include_images'):
if cover_url and self.getConfig('skip_author_cover'):
authsoup = self.make_soup(self._fetchUrl(self.story.getMetadata('authorUrl')))
try:
authsoup = self.make_soup(self.get_request(self.story.getMetadata('authorUrl')))
try:
img = authsoup.select_one('img.lazy.cimage')
authimg_url=img['data-original']
except:
img = authsoup.select_one('img.cimage')
if img:
authimg_url=img['src']
img = authsoup.select('img.lazy.cimage')
authimg_url=img[0]['data-original']
except:
img = authsoup.select('img.cimage')
if img:
authimg_url=img[0]['src']
logger.debug("authimg_url:%s"%authimg_url)
logger.debug("authimg_url:%s"%authimg_url)
## ffnet uses different sizes on auth & story pages, but same id.
## Old URLs:
## //ffcdn2012t-fictionpressllc.netdna-ssl.com/image/1936929/150/
## //ffcdn2012t-fictionpressllc.netdna-ssl.com/image/1936929/180/
## After Dec 2020 ffnet changes:
## /image/6472517/180/
## /image/6472517/150/
try:
cover_id = cover_url.split('/')[-3]
except:
cover_id = None
try:
authimg_id = authimg_url.split('/')[-3]
except:
authimg_id = None
## ffnet uses different sizes on auth & story pages, but same id.
## //ffcdn2012t-fictionpressllc.netdna-ssl.com/image/1936929/150/
## //ffcdn2012t-fictionpressllc.netdna-ssl.com/image/1936929/180/
try:
cover_id = cover_url.split('/')[4]
except:
cover_id = None
try:
authimg_id = authimg_url.split('/')[4]
except:
authimg_id = None
## don't use cover if it matches the auth image.
if cover_id and authimg_id and cover_id == authimg_id:
logger.debug("skip_author_cover: cover_url matches authimg_url: don't use")
cover_url = None
except Exception as e:
logger.warning("Caught exception in skip_author_cover: %s."%unicode(e))
## don't use cover if it matches the auth image.
if cover_id and authimg_id and cover_id == authimg_id:
cover_url = None
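The id comparison above relies on both cover URLs ending in a trailing slash, so the image id is always the third-from-last path segment in both the old CDN form and the post-Dec-2020 form. A minimal standalone sketch (the helper name `image_id` is hypothetical, not part of the adapter):

```python
def image_id(url):
    # Both the old CDN form (//ffcdn...netdna-ssl.com/image/1936929/150/)
    # and the post-Dec-2020 form (/image/6472517/180/) end with
    # ".../<id>/<size>/", so the id is the third-from-last segment.
    try:
        return url.split('/')[-3]
    except IndexError:
        # too few segments (e.g. empty URL): no usable id
        return None

# Story and author covers share the same id when the story has no
# dedicated cover, which is exactly the skip_author_cover case.
same = image_id("/image/6472517/180/") == image_id("/image/6472517/150/")
```

When `same` is true the story cover is just the author avatar at another size, so it gets discarded.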
if cover_url:
self.setCoverImage(url,cover_url)
@@ -392,40 +331,35 @@ class FanFictionNetSiteAdapter(BaseSiteAdapter):
select = soup.find('select', { 'name' : 'chapter' } )
if select is None:
# no selector found, so it's a one-chapter story.
self.add_chapter(self.story.getMetadata('title'),url)
# no selector found, so it's a one-chapter story.
self.chapterUrls.append((self.story.getMetadata('title'),url))
else:
allOptions = select.find_all('option')
allOptions = select.findAll('option')
for o in allOptions:
## title URL will be put back on chapter URL during
## normalize_chapterurl() anyway, but also here for
## clarity
url = u'https://%s/s/%s/%s/%s' % ( self.getSiteDomain(),
self.story.getMetadata('storyId'),
o['value'],
self.urltitle)
url = u'https://%s/s/%s/%s/' % ( self.getSiteDomain(),
self.story.getMetadata('storyId'),
o['value'])
# just in case there's tags, like <i> in chapter titles.
title = u"%s" % o
title = re.sub(r'<[^>]+>','',title)
self.add_chapter(title,url)
self.chapterUrls.append((title,url))
self.story.setMetadata('numChapters',len(self.chapterUrls))
return
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % (url))
logger.debug('Getting chapter text from: %s' % url)
## ffnet(and, I assume, fpcom) tends to fail more if hit too
## fast. This is in addition to whatever the
## slow_down_sleep_time setting is.
data = self._fetchUrl(url,extrasleep=4.0)
## title URL was put back on chapter URL during
## normalize_chapterurl()
data = self.get_request(url)
if "Please email this error message in full to <a href='mailto:" in data:
if "Please email this error message in full to <a href='mailto:support@fanfiction.com'>support@fanfiction.com</a>" in data:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! FanFiction.net Site Error!" % url)
soup = self.make_soup(data)
## remove inline ads -- only seen with flaresolverr
for adtag in soup.select("div.google-auto-placed"):
adtag.decompose()
div = soup.find('div', {'id' : 'storytextp'})
if None == div:

@@ -1,157 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2024 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import io
import logging
import re
import zipfile
from bs4 import BeautifulSoup
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
from fanficfare.htmlcleanup import stripHTML
from .. import exceptions as exceptions
logger = logging.getLogger(__name__)
def getClass():
return FanfictionsFrSiteAdapter
class FanfictionsFrSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev', 'fanfictionsfr')
self.story.setMetadata('langcode','fr')
self.story.setMetadata('language','Français')
# get storyId from url--url validation guarantees query correct
match = re.match(self.getSiteURLPattern(), url)
if not match:
raise exceptions.InvalidStoryURL(url, self.getSiteDomain(), self.getSiteExampleURLs())
story_id = match.group('id')
self.story.setMetadata('storyId', story_id)
fandom_name = match.group('fandom')
self._setURL('https://%s/fanfictions/%s/%s/chapters.html' % (self.getSiteDomain(), fandom_name, story_id))
@staticmethod
def getSiteDomain():
return 'www.fanfictions.fr'
@classmethod
def getSiteExampleURLs(cls):
return 'https://%s/fanfictions/fandom/fanfiction-id/chapters.html' % cls.getSiteDomain()
def getSiteURLPattern(self):
return r'https?://(?:www\.)?fanfictions\.fr/fanfictions/(?P<fandom>[^/]+)/(?P<id>[^/]+)(/chapters.html)?'
def extractChapterUrlsAndMetadata(self):
logger.debug('URL: %s', self.url)
data = self.get_request(self.url)
soup = self.make_soup(data)
# detect if the fanfiction is 'suspended' (chapters unavailable)
alert_div = soup.find('div', id='alertInactiveFic')
if alert_div:
raise exceptions.FailedToDownload("Failed to download the fanfiction, most likely because it is suspended.")
title_element = soup.find('h1', itemprop='name')
self.story.setMetadata('title', stripHTML(title_element))
author_div = soup.find('div', itemprop='author')
author_name = stripHTML(author_div.a)
author_id = author_div.a['href'].split('/')[-1].replace('.html', '')
self.story.setMetadata('author', author_name)
self.story.setMetadata('authorId', author_id)
published_date_element = soup.find('span', class_='date-distance')
published_date_text = published_date_element['data-date']
published_date = makeDate(published_date_text, '%Y-%m-%d %H:%M:%S')
if published_date:
self.story.setMetadata('datePublished', published_date)
status_element = soup.find('p', title="Statut de la fanfiction").find('span', class_='badge')
french_status = stripHTML(status_element)
status_translation = {
"En cours": "In-Progress",
"Terminée": "Completed",
"One-shot": "Completed",
}
self.story.setMetadata('status', status_translation.get(french_status, french_status))
genre_elements = soup.find('div', title="Format et genres").find_all('span', class_="highlightable")
self.story.extendList('genre', [ stripHTML(genre) for genre in genre_elements[1:] ])
category_elements = soup.find_all('li', class_="breadcrumb-item")
self.story.extendList('category', [ stripHTML(category) for category in category_elements[-2].find_all('a') ])
first_description = soup.find('p', itemprop='abstract')
self.setDescription(self.url, first_description)
chapter_cards = soup.find_all(class_=['card', 'chapter'])
for chapter_card in chapter_cards:
chapter_title_tag = chapter_card.find('h2')
if chapter_title_tag:
chapter_title = stripHTML(chapter_title_tag)
chapter_link = 'https://'+self.getSiteDomain()+chapter_title_tag.find('a')['href']
# Clean up the chapter title by replacing multiple spaces and newline characters with a single space
chapter_title = re.sub(r'\s+', ' ', chapter_title)
self.add_chapter(chapter_title, chapter_link)
last_chapter_div = chapter_cards[-1]
updated_date_element = last_chapter_div.find('span', class_='date-distance')
last_chapter_update_date = updated_date_element['data-date']
date = makeDate(last_chapter_update_date, '%Y-%m-%d %H:%M:%S')
if date:
self.story.setMetadata('dateUpdated', date)
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
response, redirection_url = self.get_request_redirected(url)
if "telecharger_pdf.html" in redirection_url:
with zipfile.ZipFile(io.BytesIO(response.encode('latin1'))) as z:
# Assuming there's only one text file inside the zip
file_list = z.namelist()
if len(file_list) != 1:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Zip file should contain exactly one text file!" % url)
text_filename = file_list[0]
with z.open(text_filename) as text_file:
# Decode the text file with windows-1252 encoding
text = text_file.read().decode('windows-1252')
return text.replace("\r\n", "<br>\r\n")
else:
soup = self.make_soup(response)
div_content = soup.find('div', id='readarea')
if div_content is None:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url, div_content)
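The PDF-redirect branch above leans on the fact that latin-1 maps code points 0–255 one-to-one onto bytes, so a binary zip payload that arrived as text can be recovered losslessly with `.encode('latin1')`. A self-contained sketch of that round trip (the in-memory zip and file name are illustrative):

```python
import io
import zipfile

# Build a tiny zip in memory and pretend it arrived as latin-1 text,
# the way the adapter receives the telecharger_pdf.html response.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("chapter.txt", "Bonjour".encode("windows-1252"))
text_response = buf.getvalue().decode("latin1")

# Recover the original bytes and read the single text member.
with zipfile.ZipFile(io.BytesIO(text_response.encode("latin1"))) as z:
    name = z.namelist()[0]
    chapter = z.read(name).decode("windows-1252")
```

Any other single-byte encoding would corrupt bytes outside its mapped range; latin-1 is the only one guaranteed to round-trip arbitrary binary data.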

@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2012 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2012 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,18 +15,19 @@
# limitations under the License.
#
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return FanFiktionDeAdapter
@@ -38,6 +39,11 @@ class FanFiktionDeAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["utf8",
"Windows-1252"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
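The decode list added here amounts to a first-match fallback: try each declared encoding and keep the first that decodes cleanly. A hedged sketch of that behavior (the helper name `decode_page` is illustrative; the real logic lives in the base adapter):

```python
def decode_page(raw, encodings=("utf8", "windows-1252")):
    # windows-1252 is a superset of iso-8859-1, so it safely handles
    # pages that mislabel themselves as latin-1 (or even as utf8).
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: never raises, substitutes undecodable bytes.
    return raw.decode("windows-1252", errors="replace")
```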
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@@ -47,7 +53,7 @@ class FanFiktionDeAdapter(BaseSiteAdapter):
# normalized story URL.
self._setURL('https://' + self.getSiteDomain() + '/s/'+self.story.getMetadata('storyId') + '/1')
self._setURL('http://' + self.getSiteDomain() + '/s/'+self.story.getMetadata('storyId') + '/1')
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','ffde')
@@ -63,10 +69,17 @@ class FanFiktionDeAdapter(BaseSiteAdapter):
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/s/46ccbef30000616306614050 https://"+cls.getSiteDomain()+"/s/46ccbef30000616306614050/1 https://"+cls.getSiteDomain()+"/s/46ccbef30000616306614050/1/story-name"
return "http://"+cls.getSiteDomain()+"/s/46ccbef30000616306614050 http://"+cls.getSiteDomain()+"/s/46ccbef30000616306614050/1 http://"+cls.getSiteDomain()+"/s/46ccbef30000616306614050/1/story-name"
def getSiteURLPattern(self):
return r"https?"+re.escape("://"+self.getSiteDomain()+"/s/")+r"\w+(/\d+)?"
return re.escape("http://"+self.getSiteDomain()+"/s/")+r"\w+(/\d+)?"
def use_pagecache(self):
'''
adapters that will work with the page cache need to implement
this and change it to True.
'''
return True
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
@@ -90,10 +103,10 @@ class FanFiktionDeAdapter(BaseSiteAdapter):
params['a'] = 'l'
params['submit'] = 'Login...'
loginUrl = 'https://www.fanfiktion.de/'
loginUrl = 'https://ssl.fanfiktion.de/'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['nickname']))
soup = self.make_soup(self.post_request(loginUrl,params))
soup = self.make_soup(self._postUrl(loginUrl,params))
if not soup.find('a', title='Logout'):
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['nickname']))
@@ -108,19 +121,27 @@ class FanFiktionDeAdapter(BaseSiteAdapter):
url = self.url
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if self.needToLoginCheck(data):
# need to log in for this one.
self.performLogin(url)
data = self.get_request(url,usecache=False)
data = self._fetchUrl(url,usecache=False)
if "Uhr ist diese Geschichte nur nach einer" in data:
raise exceptions.FailedToDownload(self.getSiteDomain() +" says: Auserhalb der Zeit von 23:00 Uhr bis 04:00 Uhr ist diese Geschichte nur nach einer erfolgreichen Altersverifikation zuganglich.")
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
# logger.debug(data)
# print data
# Now go hunting for all the meta data and the chapter list.
## Title
a = soup.find('a', href=re.compile(r'/s/'+self.story.getMetadata('storyId')+"/"))
@@ -130,69 +151,41 @@ class FanFiktionDeAdapter(BaseSiteAdapter):
head = soup.find('div', {'class' : 'story-left'})
a = head.find('a')
self.story.setMetadata('authorId',a['href'].split('/')[2])
self.story.setMetadata('authorUrl','https://'+self.host+'/'+a['href'])
self.story.setMetadata('authorUrl','http://'+self.host+'/'+a['href'])
self.story.setMetadata('author',stripHTML(a))
# Find the chapters:
for chapter in soup.find('select').find_all('option'):
self.add_chapter(chapter,'https://'+self.host+'/s/'+self.story.getMetadata('storyId')+'/'+chapter['value'])
for chapter in soup.find('select').findAll('option'):
self.chapterUrls.append((stripHTML(chapter),'http://'+self.host+'/s/'+self.story.getMetadata('storyId')+'/'+chapter['value']))
## title="Wörter" failed with max_zalgo:1
self.story.setMetadata('numWords',stripHTML(soup.find("span",{'class':"fa-keyboard"}).parent).replace('.','')) # 1.234 = 1,234
self.story.setMetadata('numChapters',len(self.chapterUrls))
self.story.setMetadata('language','German')
self.story.setMetadata('datePublished', makeDate(stripHTML(head.find('span',title='erstellt').parent), self.dateformat))
self.story.setMetadata('dateUpdated', makeDate(stripHTML(head.find('span',title='aktualisiert').parent), self.dateformat))
## Genre now shares a line with rating.
# second colspan=3 td in head.
genres=stripHTML(head.find('span',class_='fa-angle-right').next_sibling)
self.story.extendList('genre',genres[:genres.index(' / ')].split(', '))
self.story.setMetadata('rating', genres[genres.index(' / ')+3:])
self.story.extendList('genre',genres[:genres.index('/')].split(', '))
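The replaced line shows why the new code splits on `' / '` rather than `'/'`: the site renders the genre list and the rating on one line. A minimal sketch with illustrative values (the `P12` rating string is an assumption, not taken from the diff):

```python
def split_genres_rating(text):
    # e.g. "Humor, Freundschaft / P12" -> (["Humor", "Freundschaft"], "P12")
    # partition on the first " / " keeps genre names containing "/" intact
    genres, _, rating = text.partition(' / ')
    return genres.split(', '), rating
```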
# self.story.addToList('category',stripHTML(soup.find('span',id='ffcbox-story-topic-1')).split('/')[2].strip())
for a in soup.find('span',id='ffcbox-story-topic-1').find_all('a',href=re.compile(r'/c/')):
cat = stripHTML(a)
if cat != 'Fanfiction':
self.story.addToList('category',cat)
for span in soup.find_all('span',class_='badge-character'):
self.story.addToList('characters',stripHTML(span))
try:
self.story.setMetadata('native_status', head.find_all('span',{'class':'titled-icon'})[3]['title'])
except e:
logger.debug("Failed to find native status:%s"%e)
if head.find('span',title='fertiggestellt'):
if head.find('span',title='Fertiggestellt'):
self.story.setMetadata('status', 'Completed')
elif head.find('span',title='pausiert'):
self.story.setMetadata('status', 'Paused')
elif head.find('span',title='abgebrochen'):
self.story.setMetadata('status', 'Cancelled')
else:
self.story.setMetadata('status', 'In-Progress')
## Get description
descdiv = soup.select_one('div#story-summary-inline div')
if descdiv:
if 'center' in descdiv['class']:
del descdiv['class']
self.setDescription(url,descdiv)
# #find metadata on the author's page
# asoup = self.make_soup(self.get_request("https://"+self.getSiteDomain()+"?a=q&a1=v&t=nickdetailsstories&lbi=stories&ar=0&nick="+self.story.getMetadata('authorId')))
# tr=asoup.find_all('tr')
# for i in range(1,len(tr)):
# a = tr[i].find('a')
# if '/s/'+self.story.getMetadata('storyId')+'/1/' in a['href']:
# break
# td = tr[i].find_all('td')
# self.story.addToList('category',stripHTML(td[2]))
# self.story.setMetadata('rating', stripHTML(td[5]))
# self.story.setMetadata('numWords', stripHTML(td[6]))
#find metadata on the author's page
asoup = self.make_soup(self._fetchUrl("http://"+self.getSiteDomain()+"?a=q&a1=v&t=nickdetailsstories&lbi=stories&ar=0&nick="+self.story.getMetadata('authorId')))
tr=asoup.findAll('tr')
for i in range(1,len(tr)):
a = tr[i].find('a')
if '/s/'+self.story.getMetadata('storyId')+'/1/' in a['href']:
break
self.setDescription(url,a['onmouseover'].split("', '")[1])
td = tr[i].findAll('td')
self.story.addToList('category',stripHTML(td[2]))
self.story.setMetadata('rating', stripHTML(td[5]))
self.story.setMetadata('numWords', stripHTML(td[6]))
# grab the text for an individual chapter.
@@ -201,10 +194,10 @@ class FanFiktionDeAdapter(BaseSiteAdapter):
logger.debug('Getting chapter text from: %s' % url)
time.sleep(0.5) ## ffde has "floodlock" protection
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
div = soup.find('div', {'id' : 'storytext'})
for a in div.find_all('script'):
for a in div.findAll('script'):
a.extract()
if None == div:

@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2014 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,25 +15,15 @@
# limitations under the License.
#
from __future__ import absolute_import
# Software: eFiction
from .base_efiction_adapter import BaseEfictionAdapter
import re
from base_efiction_adapter import BaseEfictionAdapter
class TheDelphicExpanseComAdapter(BaseEfictionAdapter):
''' This adapter will download stories from the
'Taste of Poison, the Fanfiction of Arsenic Jade' site '''
@classmethod
def getProtocol(self):
"""
Some, but not all site now require https.
"""
return "https"
class FanNationAdapter(BaseEfictionAdapter):
@staticmethod
def getSiteDomain():
return 'www.thedelphicexpanse.com'
return 'fannation.shades-of-moonlight.com'
@classmethod
def getPathToArchive(self):
@@ -41,11 +31,14 @@ class TheDelphicExpanseComAdapter(BaseEfictionAdapter):
@classmethod
def getSiteAbbrev(self):
return 'tdec'
return 'fannation'
@classmethod
def getDateFormat(self):
return "%B %d, %Y"
def handleMetadataPair(self, key, value):
if key == 'Romance':
for val in re.split("\s*,\s*", value):
self.story.addToList('romance', val)
else:
super(FanNationAdapter, self).handleMetadataPair(key, value)
def getClass():
return TheDelphicExpanseComAdapter
return FanNationAdapter

@@ -0,0 +1,57 @@
# -*- coding: utf-8 -*-
# Copyright 2014 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Software: eFiction
import re
from base_efiction_adapter import BaseEfictionAdapter
class FHSArchiveComAdapter(BaseEfictionAdapter):
@staticmethod
def getSiteDomain():
return 'fhsarchive.com'
@classmethod
def getPathToArchive(self):
return '/autoarchive'
@classmethod
def getSiteAbbrev(self):
return 'fhsa'
@classmethod
def getDateFormat(self):
return "%m/%d/%y"
def handleMetadataPair(self, key, value):
if key == 'Warnings':
for val in re.split("\s*,\s*", value):
if value == 'None':
return
else:
# toss numbers only.
self.story.addToList('warnings', filter(lambda x : not x.isdigit() , val))
# elif 'Categories' in key:
# for val in re.split("\s*>\s*", value):
# self.story.addToList('category', val)
else:
super(FHSArchiveComAdapter, self).handleMetadataPair(key, value)
def getClass():
return FHSArchiveComAdapter

@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2020 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,20 +15,19 @@
# limitations under the License.
#
from __future__ import absolute_import,unicode_literals
# import datetime
import time
import datetime
import logging
import json
logger = logging.getLogger(__name__)
import re
# from .. import translit
import urllib2
from .. import translit
from ..htmlcleanup import stripHTML
from .. import exceptions# as exceptions
from .. import exceptions as exceptions
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
@@ -42,6 +41,11 @@ class FicBookNetAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.decode = ["utf8",
"Windows-1252"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.username = "NoneGiven" # if left empty, site doesn't return any message at all.
self.password = ""
self.is_adult=False
@@ -58,42 +62,34 @@ class FicBookNetAdapter(BaseSiteAdapter):
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = u"%d %m %Y г., %H:%M"
self.dateformat = "%d %m %Y"
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'ficbook.net'
return 'www.ficbook.net'
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/readfic/12345 https://"+cls.getSiteDomain()+"/readfic/93626/246417#part_content https://"+cls.getSiteDomain()+"/readfic/578de1cd-a8b4-7ff1-aa49-750426508b82 https://"+cls.getSiteDomain()+"/readfic/578de1cd-a8b4-7ff1-aa49-750426508b82/94793742#part_content"
return "https://"+cls.getSiteDomain()+"/readfic/12345 https://"+cls.getSiteDomain()+"/readfic/93626/246417#part_content"
def getSiteURLPattern(self):
return r"https?://"+re.escape(self.getSiteDomain()+"/readfic/")+r"[\d\-a-zA-Z]+"
def performLogin(self,url,data):
params = {}
if self.password:
params['login'] = self.username
params['password'] = self.password
else:
params['login'] = self.getConfig("username")
params['password'] = self.getConfig("password")
logger.debug("Try to login in as (%s)" % params['login'])
d = self.post_request('https://' + self.getSiteDomain() + '/login_check_static',params,usecache=False)
if 'Войти используя аккаунт на сайте' in d:
raise exceptions.FailedToLogin(url,params['login'])
return True
return r"https?://"+re.escape(self.getSiteDomain()+"/readfic/")+r"\d+"
## Getting the chapter list and the meta data, plus 'is adult' checking.
def extractChapterUrlsAndMetadata(self,get_cover=True):
def extractChapterUrlsAndMetadata(self):
url=self.url
logger.debug("URL: "+url)
data = self.get_request(url)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
adult_div = soup.find('div',id='adultCoverWarning')
@@ -102,12 +98,11 @@ class FicBookNetAdapter(BaseSiteAdapter):
adult_div.extract()
else:
raise exceptions.AdultCheckRequired(self.url)
# Now go hunting for all the meta data and the chapter list.
## Title
try:
a = soup.find('section',{'class':'chapter-info'}).find('h1')
except AttributeError:
raise exceptions.FailedToDownload("Error collecting meta: %s! Missing required element!" % url)
a = soup.find('section',{'class':'chapter-info'}).find('h1')
# kill '+' marks if present.
sup = a.find('sup')
if sup:
@@ -117,12 +112,42 @@ class FicBookNetAdapter(BaseSiteAdapter):
# Find authorid and URL from... author url.
# assume first avatar-nickname -- there can be a second marked 'beta'.
a = soup.find('a',{'class':'creator-username'})
a = soup.find('a',{'class':'avatar-nickname'})
self.story.setMetadata('authorId',a.text) # Author's name is unique
self.story.setMetadata('authorUrl','https://'+self.host+a['href'])
self.story.setMetadata('authorUrl','https://'+self.host+'/'+a['href'])
self.story.setMetadata('author',a.text)
logger.debug("Author: (%s)"%self.story.getMetadata('author'))
# Find the chapters:
chapters = soup.find('ul', {'class' : 'table-of-contents'})
if chapters != None:
chapters=chapters.findAll('a', href=re.compile(r'/readfic/'+self.story.getMetadata('storyId')+"/\d+#part_content$"))
self.story.setMetadata('numChapters',len(chapters))
for x in range(0,len(chapters)):
chapter=chapters[x]
churl='https://'+self.host+chapter['href']
self.chapterUrls.append((stripHTML(chapter),churl))
if x == 0:
pubdate = translit.translit(stripHTML(chapter.parent.find('span')))
# pubdate = translit.translit(stripHTML(self.make_soup(self._fetchUrl(churl)).find('div', {'class' : 'part_added'}).find('span')))
if x == len(chapters)-1:
update = translit.translit(stripHTML(chapter.parent.find('span')))
# update = translit.translit(stripHTML(self.make_soup(self._fetchUrl(churl)).find('div', {'class' : 'part_added'}).find('span')))
else:
self.chapterUrls.append((self.story.getMetadata('title'),url))
self.story.setMetadata('numChapters',1)
pubdate=translit.translit(stripHTML(soup.find('div',{'class':'title-area'}).find('span')))
update=pubdate
logger.debug("numChapters: (%s)"%self.story.getMetadata('numChapters'))
if not ',' in pubdate:
pubdate=datetime.date.today().strftime(self.dateformat)
if not ',' in update:
update=datetime.date.today().strftime(self.dateformat)
pubdate=pubdate.split(',')[0]
update=update.split(',')[0]
fullmon = {"yanvarya":"01", u"января":"01",
"fievralya":"02", u"февраля":"02",
"marta":"03", u"марта":"03",
@@ -136,68 +161,44 @@ class FicBookNetAdapter(BaseSiteAdapter):
"noyabrya":"11", u"ноября":"11",
"diekabrya":"12", u"декабря":"12" }
# Find the chapters:
pubdate = None
chapters = soup.find('ul', {'class' : 'list-of-fanfic-parts'})
if chapters is not None:
for chapdiv in chapters.find_all('li', {'class':'part'}):
chapter=chapdiv.find('a',href=re.compile(r'/readfic/'+self.story.getMetadata('storyId')+r"/\d+#part_content$"))
churl='https://'+self.host+chapter['href']
for (name,num) in fullmon.items():
if name in pubdate:
pubdate = pubdate.replace(name,num)
if name in update:
update = update.replace(name,num)
# Find the chapter dates.
date_str = chapdiv.find('span', {'title': True})['title'].replace(u"\u202fг. в", "")
for month_name, month_num in fullmon.items():
date_str = date_str.replace(month_name, month_num)
chapterdate = makeDate(date_str,self.dateformat)
self.add_chapter(chapter,churl,
{'date':chapterdate.strftime(self.getConfig("datechapter_format",self.getConfig("datePublished_format",self.dateformat)))})
if pubdate is None and chapterdate:
pubdate = chapterdate
update = chapterdate
else:
self.add_chapter(self.story.getMetadata('title'),url)
date_str = soup.find('div', {'class' : 'part-date'}).find('span', {'title': True})['title'].replace(u"\u202fг. в", "")
for month_name, month_num in fullmon.items():
date_str = date_str.replace(month_name, month_num)
pubdate = update = makeDate(date_str,self.dateformat)
logger.debug("numChapters: (%s)"%self.story.getMetadata('numChapters'))
self.story.setMetadata('dateUpdated', update)
self.story.setMetadata('datePublished', pubdate)
self.story.setMetadata('dateUpdated', makeDate(update, self.dateformat))
self.story.setMetadata('datePublished', makeDate(pubdate, self.dateformat))
self.story.setMetadata('language','Russian')
dlinfo = soup.select_one('header.d-flex.flex-column.gap-12.word-break')
## after site change, I don't see word count anywhere.
# pr=soup.find('a', href=re.compile(r'/printfic/\w+'))
# pr='https://'+self.host+pr['href']
# pr = self.make_soup(self._fetchUrl(pr))
# pr=pr.findAll('div', {'class' : 'part_text'})
# i=0
# for part in pr:
# i=i+len(stripHTML(part).split(' '))
# self.story.setMetadata('numWords', unicode(i))
series_label = dlinfo.select_one('div.description.word-break').find('strong', string='Серия:')
logger.debug('Series: %s'%str(series_label))
if series_label:
series_div = series_label.find_next_sibling("div")
# No accurate series number; getting it would require an additional request
self.setSeries(stripHTML(series_div.a), 1)
self.story.setMetadata('seriesUrl','https://' + self.getSiteDomain() + series_div.a.get('href'))
dlinfo = soup.find('dl',{'class':'info'})
i=0
fandoms = dlinfo.select_one('div:not([class])').find_all('a', href=re.compile(r'/fanfiction/\w+'))
fandoms = dlinfo.find('dd').findAll('a', href=re.compile(r'/fanfiction/\w+'))
for fandom in fandoms:
self.story.addToList('category',fandom.string)
i=i+1
if i > 1:
self.story.addToList('genre', u'Кроссовер')
tags = soup.find('div',{'class':'tags'})
if tags:
for genre in tags.find_all('a',href=re.compile(r'/tags/')):
self.story.addToList('genre',stripHTML(genre))
for genre in dlinfo.findAll('a',href=re.compile(r'/genres/')):
self.story.addToList('genre',stripHTML(genre))
logger.debug("category: (%s)"%self.story.getMetadata('category'))
logger.debug("genre: (%s)"%self.story.getMetadata('genre'))
ratingdt = dlinfo.find('div',{'class':re.compile(r'badge-rating-.*')})
self.story.setMetadata('rating', stripHTML(ratingdt.find('span')))
# meta=table.find_all('a', href=re.compile(r'/ratings/'))
ratingdt = dlinfo.find('dt',text='Рейтинг:')
self.story.setMetadata('rating', stripHTML(ratingdt.next_sibling))
# meta=table.findAll('a', href=re.compile(r'/ratings/'))
# i=0
# for m in meta:
# if i == 0:
@@ -208,186 +209,39 @@ class FicBookNetAdapter(BaseSiteAdapter):
# i=2
# self.story.addToList('genre', m.find('b').text)
# elif i == 2:
# self.story.addToList('warnings', m.find('b').text)
# self.story.addToList('warnings', m.find('b').text)
if dlinfo.find('div', {'class':'badge-status-finished'}):
if dlinfo.find('span', {'style' : 'color: green'}):
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
try:
self.story.setMetadata('universe', stripHTML(dlinfo.find('a', href=re.compile('/fandom_universe/'))))
except AttributeError:
pass
paircharsdt = soup.find('strong',string='Пэйринг и персонажи:')
# site keeps both ships and indiv chars in /pairings/ links.
if paircharsdt:
for paira in paircharsdt.find_next('div').find_all('a', href=re.compile(r'/pairings/')):
if 'pairing-highlight' in paira['class']:
self.story.addToList('ships',stripHTML(paira))
chars=stripHTML(paira).split('/')
for char in chars:
self.story.addToList('characters',char)
else:
self.story.addToList('characters',stripHTML(paira))
summary=soup.find('div', itemprop='description')
if summary:
# Fix for the text not displaying properly
summary['class'].append('part_text')
self.setDescription(url,summary)
#self.story.setMetadata('description', summary.text)
stats = soup.find('div', {'class':'hat-actions-container'})
targetdata = stats.find_all('span', {'class' : 'main-info'})
for data in targetdata:
svg_class = data.find('svg')['class'][1] if data.find('svg') else None
value = int(stripHTML(data)) if stripHTML(data).isdigit() else 0
if svg_class == 'ic_thumbs-up' and value > 0:
self.story.setMetadata('likes', value)
#logger.debug("likes: (%s)"%self.story.getMetadata('likes'))
elif svg_class == 'ic_bubble-dark' and value > 0:
self.story.setMetadata('reviews', value)
#logger.debug("reviews: (%s)"%self.story.getMetadata('reviews'))
elif svg_class == 'ic_bookmark' and value > 0:
self.story.setMetadata('numCollections', value)
logger.debug("numCollections: (%s)"%self.story.getMetadata('numCollections'))
# Grab the amount of pages and words
targetpages = soup.find('strong',string='Размер:').find_next('div')
if targetpages:
targetpages_text = re.sub(r"(?<!\,)\s| ", "", targetpages.text, flags=re.UNICODE | re.MULTILINE)
pages_raw = re.search(r'(\d+)(?:страницы|страниц)', targetpages_text, re.UNICODE)
pages = int(pages_raw.group(1))
if pages > 0:
self.story.setMetadata('pages', pages)
logger.debug("pages: (%s)"%self.story.getMetadata('pages'))
numWords_raw = re.search(r"(\d+)(?:слова|слов)", targetpages_text, re.UNICODE)
numWords = int(numWords_raw.group(1))
if numWords > 0:
self.story.setMetadata('numWords', numWords)
logger.debug("numWords: (%s)"%self.story.getMetadata('numWords'))
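The two regexes above pull the page and word counts out of the size text after all whitespace (including non-breaking spaces) has been stripped by the preceding `re.sub`. A stand-alone sketch of the same extraction; the sample string is made up but has the shape the adapter produces:

```python
import re

# size text with spaces already removed, as produced by the re.sub above
targetpages_text = u"12страниц,4567слов"

pages = int(re.search(r'(\d+)(?:страницы|страниц)', targetpages_text, re.UNICODE).group(1))
num_words = int(re.search(r'(\d+)(?:слова|слов)', targetpages_text, re.UNICODE).group(1))
# pages == 12, num_words == 4567
```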
# Grab FBN Category
class_tag = soup.select_one('div[class^="badge-with-icon direction"]').find('span', {'class' : 'badge-text'}).text
if class_tag:
self.story.setMetadata('classification',class_tag)
#logger.debug("classification: (%s)"%self.story.getMetadata('classification'))
# Find dedication.
ded = soup.find('div', {'class' : 'js-public-beta-dedication'})
if ded:
ded['class'].append('part_text')
self.story.setMetadata('dedication',ded)
# Find author comment
comm = soup.find('div', {'class' : 'js-public-beta-author-comment'})
if comm:
comm['class'].append('part_text')
self.story.setMetadata('authorcomment',comm)
follows = stats.find('fanfic-follow-button')[':follow-count']
if int(follows) > 0:
self.story.setMetadata('follows', int(follows))
logger.debug("follows: (%s)"%self.story.getMetadata('follows'))
# Grab the amount of awards
numAwards = 0
try:
awards = soup.find('fanfic-reward-list')[':initial-fic-rewards-list']
award_list = json.loads(awards)
numAwards = int(len(award_list))
# Grab the awards, but if multiple awards have the same name, only one will be kept; only an issue with hundreds of them.
self.story.extendList('awards', [str(award['user_text']) for award in award_list])
#logger.debug("awards (%s)"%self.story.getMetadata('awards'))
except (TypeError, KeyError):
logger.debug("Could not grab the awards")
if numAwards > 0:
self.story.setMetadata('numAwards', numAwards)
logger.debug("Num Awards (%s)"%self.story.getMetadata('numAwards'))
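The awards block above reads a Vue component attribute (`:initial-fic-rewards-list`) whose value is a JSON list of reward objects. A minimal sketch of that parse; the JSON payload here is made up for illustration:

```python
import json

# sample of the ':initial-fic-rewards-list' attribute payload (hypothetical)
awards = '[{"user_text": "Great story!"}, {"user_text": "More please"}]'
award_list = json.loads(awards)
num_awards = len(award_list)
# each reward contributes its user_text; duplicates are possible on the site
award_names = [str(a['user_text']) for a in award_list]
# num_awards == 2, award_names == ['Great story!', 'More please']
```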
if get_cover:
cover = soup.find('fanfic-cover', {'class':"jsVueComponent"})
if cover is not None:
self.setCoverImage(url,cover['src-original'])
def replace_formatting(self,tag):
tname = tag.name
## operating on plain text because BS4 is hard to work on
## text with.
## stripHTML() discards whitespace around other tags, like <i>
txt = tag.get_text()
txt = txt.replace("\n","<br/>")
soup = self.make_soup("<"+tname+">"+txt+"</"+tname+">")
return soup.find(tname)
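The `replace_formatting` helper above works around ficbook's CSS `white-space: pre-wrap` paragraphing by rebuilding the tag from its plain text. The core idea can be shown without BeautifulSoup — a minimal sketch with a hypothetical helper name:

```python
def replace_formatting_text(tname, txt):
    # Rebuild a tag whose paragraphing relied on CSS pre-wrap:
    # newlines become explicit <br/> tags so txt output keeps its breaks.
    return "<%s>%s</%s>" % (tname, txt.replace("\n", "<br/>"), tname)

html = replace_formatting_text("div", "First paragraph\nSecond paragraph")
# html == "<div>First paragraph<br/>Second paragraph</div>"
```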
tags = dlinfo.findAll('dt')
for tag in tags:
label = translit.translit(tag.text)
if 'Piersonazhi:' in label or u'Персонажи:' in label:
chars=stripHTML(tag.next_sibling).split(', ')
for char in chars:
self.story.addToList('characters',char)
break
summary=soup.find('div', {'class' : 'urlize'})
self.setDescription(url,summary)
#self.story.setMetadata('description', summary.text)
# grab the text for an individual chapter.
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
chapter = soup.find('div', {'id' : 'content'})
if chapter is None: ## still needed?
chapter = soup.find('div', {'class' : 'public_beta'})
if chapter == None:
chapter = soup.find('div', {'class' : 'public_beta_disabled'})
if chapter is None:
if None == chapter:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
## ficbook uses weird CSS white-space: pre-wrap; for
## paragraphing. Doesn't work with txt output
if 'part_text' in chapter['class'] and self.getConfig('replace_text_formatting'):
## copy classes, except part_text
divclasses = chapter['class']
divclasses.remove('part_text')
chapter = self.replace_formatting(chapter)
chapter['class'] = divclasses
exclude_notes=self.getConfigList('exclude_notes')
if 'headnotes' not in exclude_notes:
# Find the headnote
head_note = soup.select_one("div.part-comment-top div.js-public-beta-comment-before")
if head_note:
# Create the structure for the headnote
head_notes_div_tag = soup.new_tag('div', attrs={'class': 'fff_chapter_notes fff_head_notes'})
head_b_tag = soup.new_tag('b')
head_b_tag.string = 'Примечания:'
if 'text-preline' in head_note['class'] and self.getConfig('replace_text_formatting'):
head_blockquote_tag = self.replace_formatting(head_note)
head_blockquote_tag.name = 'blockquote'
else:
head_blockquote_tag = soup.new_tag('blockquote')
head_blockquote_tag.string = stripHTML(head_note)
head_notes_div_tag.append(head_b_tag)
head_notes_div_tag.append(head_blockquote_tag)
# Prepend the headnotes to the chapter, <hr> to mimic the site
chapter.insert(0, head_notes_div_tag)
chapter.insert(1, soup.new_tag('hr'))
if 'footnotes' not in exclude_notes:
# Find the endnote
end_note = soup.select_one("div.part-comment-bottom div.js-public-beta-comment-after")
if end_note:
# Create the structure for the footnote
end_notes_div_tag = soup.new_tag('div', attrs={'class': 'fff_chapter_notes fff_foot_notes'})
end_b_tag = soup.new_tag('b')
end_b_tag.string = 'Примечания:'
if 'text-preline' in end_note['class'] and self.getConfig('replace_text_formatting'):
end_blockquote_tag = self.replace_formatting(end_note)
end_blockquote_tag.name = 'blockquote'
else:
end_blockquote_tag = soup.new_tag('blockquote')
end_blockquote_tag.string = stripHTML(end_note)
end_notes_div_tag.append(end_b_tag)
end_notes_div_tag.append(end_blockquote_tag)
# Append the endnotes to the chapter, <hr> to mimic the site
chapter.append(soup.new_tag('hr'))
chapter.append(end_notes_div_tag)
return self.utf8FromSoup(url,chapter)
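The `fullmon` table earlier in this adapter maps both transliterated and Cyrillic Russian month names to two-digit numbers so `makeDate` can parse the string. A self-contained sketch of that replacement loop; the sample date string is illustrative:

```python
# Subset of the adapter's fullmon table; transliterated and Cyrillic
# forms of each month name map to the same two-digit number.
fullmon = {"yanvarya": "01", u"января": "01",
           "fievralya": "02", u"февраля": "02"}

date_str = u"10 января 2016"
for name, num in fullmon.items():
    date_str = date_str.replace(name, num)

# date_str is now "10 01 2016", parseable with a "%d %m %Y" format
```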


@@ -1,225 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2021 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
import re
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from .base_adapter import BaseSiteAdapter, makeDate
class FictionAlleyArchiveOrgSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','fa')
self.is_adult=False
# get storyId from url -- URL validation guarantees the query is correct
m = re.match(self.getSiteURLPattern(),url)
if m:
# normalized story URL.
url = "https://"+self.getSiteDomain()+"/authors/"+m.group('auth')+"/"+m.group('id')+".html"
self._setURL(url)
else:
raise exceptions.InvalidStoryURL(url,
self.getSiteDomain(),
self.getSiteExampleURLs())
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%m/%d/%Y"
def _setURL(self,url):
# logger.debug("set URL:%s"%url)
super(FictionAlleyArchiveOrgSiteAdapter, self)._setURL(url)
m = re.match(self.getSiteURLPattern(),url)
if m:
self.story.setMetadata('authorId',m.group('auth'))
self.story.setMetadata('storyId',m.group('id'))
@staticmethod
def getSiteDomain():
return 'www.fictionalley-archive.org'
@classmethod
def getAcceptDomains(cls):
return ['www.fictionalley-archive.org',
'www.fictionalley.org']
@classmethod
def getSiteExampleURLs(cls):
return "https://"+cls.getSiteDomain()+"/authors/drt/DA.html https://"+cls.getSiteDomain()+"/authors/drt/JOTP01a.html"
@classmethod
def getURLDomain(cls):
return 'https://' + cls.getSiteDomain()
def getSiteURLPattern(self):
# http://www.fictionalley-archive.org/authors/drt/DA.html
# http://www.fictionalley-archive.org/authors/drt/JOTP01a.html
return r"https?://www.fictionalley(-archive)?.org/authors/(?P<auth>[a-zA-Z0-9_]+)/(?P<id>[a-zA-Z0-9_]+)\.html"
def extractChapterUrlsAndMetadata(self):
## could be either chapter list page or one-shot text page.
logger.debug("URL: "+self.url)
(data,rurl) = self.get_request_redirected(self.url)
if rurl != self.url:
self._setURL(rurl)
logger.debug("set to redirected url:%s"%self.url)
soup = self.make_soup(data)
# If chapter list page, get the first chapter to look for adult check
chapterlinklist = soup.select('h5.mb-1 > a')
# logger.debug(chapterlinklist)
if not chapterlinklist:
# no chapter list, it's either a chapter URL or a single chapter story
# <nav aria-label="Chapter Navigation">
# <a class="page-link" href="/authors/mz_xxo/HPATOTFI.html">Index</a>
storya = soup.select_one('nav[aria-label="Chapter Navigation"] a')
# logger.debug(storya)
if storya:
## multi chapter story
self._setURL(self.getURLDomain()+storya['href'])
logger.debug("Normalizing to URL: "+self.url)
# ## title's right there...
# self.story.setMetadata('title',stripHTML(storya))
data = self.get_request(self.url)
soup = self.make_soup(data)
chapterlinklist = soup.select('h5.mb-1 > a')
# logger.debug(chapterlinklist)
else:
## single chapter story.
# logger.debug("Single chapter story")
pass
self.story.setMetadata('title',stripHTML(soup.select_one('h1')))
## authorid already set.
## <h1 class="title" align="center">Just Off The Platform II by <a href="http://www.fictionalley.org/authors/drt/">DrT</a></h1>
authora=soup.select_one('h1 + h3 > a')
self.story.setMetadata('author',stripHTML(authora))
self.story.setMetadata('authorUrl',self.getURLDomain()+authora['href'])
if chapterlinklist:
# Find the chapters:
for chapter in chapterlinklist:
listitem = chapter.parent.parent.parent
# logger.debug(listitem)
# date
date = stripHTML(listitem.select_one('small.text-nowrap'))
chapterDate = makeDate(date,self.dateformat)
wordshits = listitem.select('span.font-weight-normal')
chap_data = {
'date':chapterDate.strftime(self.getConfig("datechapter_format",self.getConfig("datePublished_format","%Y-%m-%d"))),
'words':stripHTML(wordshits[0]),
'hits':stripHTML(wordshits[1]),
'summary':stripHTML(listitem.select_one('p.my-2')),
}
# logger.debug(chap_data)
self.add_chapter(chapter,self.getURLDomain()+chapter['href'], chap_data)
else:
self.add_chapter(self.story.getMetadata('title'),self.url)
cardbody = soup.select_one('div.card-body')
searchs_to_meta = (
# sitetype, ffftype, islist
('Rating', 'rating', False),
('House', 'house', True),
('Character', 'characters', True),
('Genre', 'genre', True),
('Era', 'era', True),
('Spoiler', 'spoilers', True),
('Ship', 'ships', True),
)
for (sitetype,ffftype, islist) in searchs_to_meta:
# logger.debug((sitetype,ffftype, islist))
tags = cardbody.select('a[href^="/stories?Include.%s"]'%sitetype)
# logger.debug(tags)
if tags:
if islist:
self.story.extendList(ffftype, [ stripHTML(a) for a in tags ])
else:
self.story.setMetadata(ffftype, stripHTML(tags[0]))
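The `searchs_to_meta` table above drives one loop that copies each site facet into FFF metadata, as a list or a scalar depending on the `islist` flag. A stand-alone sketch of that dispatch; the facet data dict stands in for the `<a>` tags found on the page and is made up:

```python
searchs_to_meta = (
    # sitetype, ffftype, islist
    ('Rating', 'rating', False),
    ('Character', 'characters', True),
    ('Ship', 'ships', True),
)
# stand-in for the facet link texts scraped from the card body (hypothetical)
found = {'Rating': ['PG-13'], 'Character': ['Harry', 'Ron']}

metadata = {}
for sitetype, ffftype, islist in searchs_to_meta:
    tags = found.get(sitetype, [])
    if tags:
        if islist:
            metadata.setdefault(ffftype, []).extend(tags)
        else:
            metadata[ffftype] = tags[0]
# metadata == {'rating': 'PG-13', 'characters': ['Harry', 'Ron']}
```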
# Published: 09/26/2003 Updated: 04/13/2004 Words: 14,268 Chapters: 5 Hits: 743
badgeinfos = cardbody.select('div.badge-info')
# logger.debug(badgeinfos)
for badge in badgeinfos:
txt = stripHTML(badge)
(key,val)=txt.split(':')
# logger.debug((key,val))
if key in ( 'Published', 'Updated'):
date = makeDate(val,self.dateformat)
self.story.setMetadata('date'+key,date)
elif key in ('Hits'):
self.story.setMetadata(key.lower(),val)
elif key == 'Words':
self.story.setMetadata('numWords',val)
summary = soup.find('dt',string='Story Summary:')
if summary:
summary = summary.find_next_sibling('dd')
summary.name='div'
self.setDescription(self.url,summary)
return
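The badge loop above splits each `Key: value` badge once on the colon and routes the value by key. A minimal sketch of that dispatch, using `datetime.strptime` in place of `makeDate`; the badge texts are samples in the site's format:

```python
import datetime

badges = ["Published: 09/26/2003", "Words: 14,268", "Hits: 743"]
meta = {}
for txt in badges:
    key, val = txt.split(':')
    val = val.strip()
    if key in ('Published', 'Updated'):
        # same "%m/%d/%Y" format the adapter sets as self.dateformat
        meta['date' + key] = datetime.datetime.strptime(val, "%m/%d/%Y")
    elif key == 'Hits':
        meta[key.lower()] = val
    elif key == 'Words':
        meta['numWords'] = val
# meta['datePublished'].year == 2003, meta['numWords'] == '14,268'
```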
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self.get_request(url)
soup = self.make_soup(data)
# this may be a brittle way to get the chapter text.
# Site doesn't give a lot of hints.
chaptext = soup.select_one('main#content div:not([class])')
# not sure how, but we can get html, etc tags still in some
# stories. That breaks later updates because it confuses
# epubutils.py
# Yes, this still applies to fictionalley-archive.
for tag in chaptext.find_all('head') + chaptext.find_all('meta') + chaptext.find_all('script'):
tag.extract()
for tag in chaptext.find_all('body') + chaptext.find_all('html'):
tag.name = 'div'
if self.getConfig('include_author_notes'):
row = chaptext.find_previous_sibling('div',class_='row')
logger.debug(row)
andt = row.find('dt',string="Author's Note:")
logger.debug(andt)
if andt:
chaptext.insert(0,andt.parent.extract())
# post notes aren't as structured(?)
for div in chaptext.find_next_siblings('div',class_='row'):
chaptext.append(div.extract())
# logger.debug(chaptext)
return self.utf8FromSoup(url,chaptext)
def getClass():
return FictionAlleyArchiveOrgSiteAdapter


@@ -0,0 +1,244 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib
import urllib2
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
class FictionAlleyOrgSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','fa')
self.decode = ["Windows-1252",
"utf8"] # 1252 is a superset of iso-8859-1.
# Most sites that claim to be
# iso-8859-1 (and some that claim to be
# utf8) are really windows-1252.
self.is_adult=False
# get storyId from url -- URL validation guarantees the query is correct
m = re.match(self.getSiteURLPattern(),url)
if m:
self.story.setMetadata('authorId',m.group('auth'))
self.story.setMetadata('storyId',m.group('id'))
# normalized story URL.
self._setURL(url)
else:
raise exceptions.InvalidStoryURL(url,
self.getSiteDomain(),
self.getSiteExampleURLs())
@staticmethod
def getSiteDomain():
return 'www.fictionalley.org'
@classmethod
def getSiteExampleURLs(cls):
return "http://"+cls.getSiteDomain()+"/authors/drt/DA.html http://"+cls.getSiteDomain()+"/authors/drt/JOTP01a.html"
def getSiteURLPattern(self):
# http://www.fictionalley.org/authors/drt/DA.html
# http://www.fictionalley.org/authors/drt/JOTP01a.html
return re.escape("http://"+self.getSiteDomain())+"/authors/(?P<auth>[a-zA-Z0-9_]+)/(?P<id>[a-zA-Z0-9_]+)\.html"
def _postFetchWithIAmOld(self,url):
if self.is_adult or self.getConfig("is_adult"):
params={'iamold':'Yes',
'action':'ageanswer'}
logger.info("Attempting to get cookie for %s" % url)
## posting on list doesn't work, but doesn't hurt, either.
data = self._postUrl(url,params)
else:
data = self._fetchUrl(url)
return data
def extractChapterUrlsAndMetadata(self):
## could be either chapter list page or one-shot text page.
url = self.url
logger.debug("URL: "+url)
try:
data = self._postFetchWithIAmOld(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
chapterdata = data
# If chapter list page, get the first chapter to look for adult check
chapterlinklist = soup.findAll('a',{'class':'chapterlink'})
if chapterlinklist:
chapterdata = self._postFetchWithIAmOld(chapterlinklist[0]['href'])
if "Are you over seventeen years old" in chapterdata:
raise exceptions.AdultCheckRequired(self.url)
if not chapterlinklist:
# no chapter list, chapter URL: change to list link.
# second a tag inside div breadcrumbs
storya = soup.find('div',{'class':'breadcrumbs'}).findAll('a')[1]
self._setURL(storya['href'])
url=self.url
logger.debug("Normalizing to URL: "+url)
## title's right there...
self.story.setMetadata('title',stripHTML(storya))
data = self._fetchUrl(url)
soup = self.make_soup(data)
chapterlinklist = soup.findAll('a',{'class':'chapterlink'})
else:
## still need title from somewhere. If chapterlinklist,
## then chapterdata contains a chapter, find title the
## same way.
chapsoup = self.make_soup(chapterdata)
storya = chapsoup.find('div',{'class':'breadcrumbs'}).findAll('a')[1]
self.story.setMetadata('title',stripHTML(storya))
del chapsoup
del chapterdata
## authorid already set.
## <h1 class="title" align="center">Just Off The Platform II by <a href="http://www.fictionalley.org/authors/drt/">DrT</a></h1>
authora=soup.find('h1',{'class':'title'}).find('a')
self.story.setMetadata('author',authora.string)
self.story.setMetadata('authorUrl',authora['href'])
if len(chapterlinklist) == 1:
self.chapterUrls.append((self.story.getMetadata('title'),chapterlinklist[0]['href']))
else:
# Find the chapters:
for chapter in chapterlinklist:
# just in case there's tags, like <i> in chapter titles.
self.chapterUrls.append((stripHTML(chapter),chapter['href']))
self.story.setMetadata('numChapters',len(self.chapterUrls))
## Go scrape the rest of the metadata from the author's page.
data = self._fetchUrl(self.story.getMetadata('authorUrl'))
soup = self.make_soup(data)
# <dl><dt><a class = "Rid story" href = "http://www.fictionalley.org/authors/aafro_man_ziegod/TMH.html">
# [Rid] The Magical Hottiez</a> by <a class = "pen_name" href = "http://www.fictionalley.org/authors/aafro_man_ziegod/">Aafro Man Ziegod</a> </small></dt>
# <dd><small class = "storyinfo"><a href = "http://www.fictionalley.org/ratings.html" target = "_new">Rating:</a> PG-13 - Spoilers: PS/SS, CoS, PoA, GoF, QTTA, FB - 4264 hits - 5060 words<br />
# Genre: Humor, Romance - Main character(s): None - Ships: None - Era: Multiple Eras<br /></small>
# Chaos ensues after Witch Weekly, seeking to increase readers, decides to create a boyband out of five seemingly talentless wizards: Harry Potter, Draco Malfoy, Ron Weasley, Neville Longbottom, and Oliver "Toss Your Knickers Here" Wood.<br />
# <small class = "storyinfo">Published: June 3, 2002 (between Goblet of Fire and Order of Phoenix) - Updated: June 3, 2002</small>
# </dd></dl>
storya = soup.find('a',{'href':self.story.getMetadata('storyUrl')})
storydd = storya.findNext('dd')
# Rating: PG - Spoilers: None - 2525 hits - 736 words
# Genre: Humor - Main character(s): H, R - Ships: None - Era: Multiple Eras
# Harry and Ron are back at it again! They reeeeeeally don't want to be back, because they know what's awaiting them. "VH1 Goes Inside..." is back! Why? 'Cos there are soooo many more couples left to pick on.
# Published: September 25, 2004 (between Order of Phoenix and Half-Blood Prince) - Updated: September 25, 2004
## change to text and regexp find.
metastr = stripHTML(storydd).replace('\n',' ').replace('\t',' ')
m = re.match(r".*?Rating: (.+?) -.*?",metastr)
if m:
self.story.setMetadata('rating', m.group(1))
m = re.match(r".*?Genre: (.+?) -.*?",metastr)
if m:
for g in m.group(1).split(','):
self.story.addToList('genre',g)
m = re.match(r".*?Published: ([a-zA-Z]+ \d\d?, \d\d\d\d).*?",metastr)
if m:
self.story.setMetadata('datePublished',makeDate(m.group(1), "%B %d, %Y"))
m = re.match(r".*?Updated: ([a-zA-Z]+ \d\d?, \d\d\d\d).*?",metastr)
if m:
self.story.setMetadata('dateUpdated',makeDate(m.group(1), "%B %d, %Y"))
m = re.match(r".*? (\d+) words Genre.*?",metastr)
if m:
self.story.setMetadata('numWords', m.group(1))
for small in storydd.findAll('small'):
small.extract() ## removes the <small> tags, leaving only the summary.
storydd.name = 'div' ## change tag name else Calibre treats it oddly.
self.setDescription(url,storydd)
#self.story.setMetadata('description',stripHTML(storydd))
return
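The regexes above pull individual fields out of the flattened story-info text (`metastr`). A stand-alone sketch against a sample string in the same shape as the site's:

```python
import re

# sample metastr, flattened the same way the adapter does
metastr = ("Rating: PG - Spoilers: None - 2525 hits - 736 words "
           "Genre: Humor, Romance - Published: June 3, 2002 - Updated: June 3, 2002")

rating = re.match(r".*?Rating: (.+?) -", metastr).group(1)
genres = [g.strip() for g in
          re.match(r".*?Genre: (.+?) -", metastr).group(1).split(',')]
words = re.match(r".*? (\d+) words Genre", metastr).group(1)
pub = re.match(r".*?Published: ([a-zA-Z]+ \d\d?, \d\d\d\d)", metastr).group(1)
# rating == 'PG', genres == ['Humor', 'Romance'],
# words == '736', pub == 'June 3, 2002'
```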
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self._fetchUrl(url)
# find <!-- headerend --> & <!-- footerstart --> and
# replaced with matching div pair for easier parsing.
# Yes, it's an evil kludge, but what can ya do? Using
# something other than div prevents soup from pairing
# our div with poor html inside the story text.
crazy = "crazytagstringnobodywouldstumbleonaccidently"
data = data.replace('<!-- headerend -->','<'+crazy+' id="storytext">').replace('<!-- footerstart -->','</'+crazy+'>')
# problems with some stories confusing Soup. This is a nasty
# hack, but it works.
data = data[data.index('<'+crazy+''):]
# ditto with extra crap at the end.
data = data[:data.index('</'+crazy+'>')+len('</'+crazy+'>')]
soup = self.make_soup(data)
body = soup.findAll('body') ## some stories use a nested body and body
## tag, in which case we don't
## need crazytagstringnobodywouldstumbleonaccidently
## and use the second one instead.
if len(body)>1:
text = body[1]
text.name='div' # force to be a div to avoid multiple body tags.
else:
text = soup.find(crazy, {'id' : 'storytext'})
text.name='div' # change to div tag.
if not data or not text:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
# not sure how, but we can get html, etc tags still in some
# stories. That breaks later updates because it confuses
# epubutils.py
for tag in text.findAll('head'):
tag.extract()
for tag in text.findAll('body') + text.findAll('html'):
tag.name = 'div'
return self.utf8FromSoup(url,text)
def getClass():
return FictionAlleyOrgSiteAdapter
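The `crazy` sentinel kludge above turns the site's `<!-- headerend -->` / `<!-- footerstart -->` comments into a well-formed tag pair so the soup can isolate the story text. The string surgery can be demonstrated stand-alone; the sample HTML is made up:

```python
crazy = "crazytagstringnobodywouldstumbleonaccidently"
data = ('<html>nav junk<!-- headerend -->'
        '<p>Story text</p>'
        '<!-- footerstart -->footer junk</html>')

# replace the comment markers with a matching sentinel tag pair
data = data.replace('<!-- headerend -->', '<' + crazy + ' id="storytext">')
data = data.replace('<!-- footerstart -->', '</' + crazy + '>')
# slice off everything outside the sentinel pair
data = data[data.index('<' + crazy):]
data = data[:data.index('</' + crazy + '>') + len('</' + crazy + '>')]
# only the story text remains, wrapped in the sentinel tags
```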


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2022 FanFicFare team
# Copyright 2016 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,103 +15,15 @@
# limitations under the License.
#
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
from .. import exceptions as exceptions
from ..htmlcleanup import stripHTML
# py2 vs py3 transition
from .base_adapter import BaseSiteAdapter, makeDate
ampfandoms = ["A Falcone & Driscoll Investigation",
"Alias Smith & Jones",
"Atelier Escha & Logy",
"Austin & Ally",
"Baby & Me/赤ちゃんと僕",
"Barney & Friends",
"Between Love & Goodbye",
"Beyond Good & Evil",
"Bill & Ted's Excellent Adventure/Bogus Journey",
"BLACK & WHITE",
"Bonnie & Clyde",
"Brandy & Mr. Whiskers",
"Brothers & Sisters",
"Bucket & Skinner's Epic Adventures",
"Calvin & Hobbes",
"Cats & Dogs",
"Command & Conquer",
"Devil & Devil",
"Dharma & Greg",
"Dicky & Dawn",
"Drake & Josh",
"Edgar & Ellen",
"Franklin & Bash",
"Gabby Duran & The Unsittables",
"Girls und Panzer/ガールズ&パンツァー",
"Gnomeo & Juliet",
"Grim Adventures of Billy & Mandy",
"Half & Half/ハーフ・アンド・ハーフ",
"Hansel & Gretel",
"Hatfields & McCoys",
"High & Low - The Story of S.W.O.R.D.",
"Home & Away",
"Hudson & Rex",
"Huntik: Secrets & Seekers",
"Imagine Me & You",
"Jekyll & Hyde",
"Jonathan Strange & Mr. Norrell",
"Knight's & Magic/ナイツ&マジック",
"Law & Order: Los Angeles",
"Law & Order: Organized Crime",
"Lilo & Stitch",
"Locke & Key",
"Lockwood & Co.",
"Lost & Found Music Studios",
"Lu & Og",
"Me & My Brothers",
"Melissa & Joey",
"Mickey Mouse & Friends",
"Mike & Molly",
"Mike, Lu & Og",
"Miraculous: Tales of Ladybug & Cat Noir",
"Mork & Mindy",
"Mount&Blade",
"Mr. & Mrs. Smith",
"Mr. Peabody & Sherman",
"Muhyo & Roji",
"Nicky, Ricky, Dicky & Dawn",
"Oliver & Company",
"Ozzy & Drix",
"Panty & Stocking with Garterbelt/パンティストッキングwithガーターベルト",
"Penryn & the End of Days",
"Prep & Landing",
"Prince & Hero/王子とヒーロー",
"Prince & Me",
"Puzzle & Dragons",
"Ren & Stimpy Show",
"Rizzoli & Isles",
"Romeo & Juliet",
"Rosemary & Thyme",
"Sam & Cat",
"Sam & Max",
"Sapphire & Steel",
"Scott & Bailey",
"Shakespeare & Hathaway: Private Investigators",
"Soul Nomad & the World Eaters",
"Superman & Lois",
"Tiger & Bunny/タイガー&バニー",
"Trains & Automobiles",
"Upin & Ipin",
"Wallace & Gromit",
"Witch & Wizard",
"Wolverine & the X-Men",
"Yotsuba&!/よつばと!",
"Young & Hungry",
]
from base_adapter import BaseSiteAdapter, makeDate
class FictionHuntComSiteAdapter(BaseSiteAdapter):
@@ -119,32 +31,16 @@ class FictionHuntComSiteAdapter(BaseSiteAdapter):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','fichunt')
## new types:
## https://fictionhunt.com/stories/7edm248/the-last-of-his-kind/chapters/1
## https://fictionhunt.com/stories/89kzg4z/the-last-of-his-kind-new
## old type:
## http://fictionhunt.com/read/12411643/1
# get storyId from url--url validation guarantees query correct
m = re.match(self.getSiteURLPattern(),url)
if m:
# logger.debug(m.groupdict())
self.story.setMetadata('storyId',m.group('id'))
if m.group('type') == "stories": # newer URL
# normalized story URL.
self._setURL("https://"+self.getSiteDomain()\
+"/stories/"+self.story.getMetadata('storyId')+"/"+ (m.group('title') or ""))
else:
self._setURL("https://"+self.getSiteDomain()\
+"/read/"+self.story.getMetadata('storyId')+"/1")
# logger.debug(self.url)
else:
raise exceptions.InvalidStoryURL(url,
self.getSiteDomain(),
self.getSiteExampleURLs())
# get storyId from url--url validation guarantees second part is storyId
self.story.setMetadata('storyId',self.parsedUrl.path.split('/',)[2])
# normalized story URL.
self._setURL("http://"+self.getSiteDomain()\
+"/read/"+self.story.getMetadata('storyId')+"/1")
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%Y-%m-%d %H:%M:%S"
self.dateformat = "%d-%m-%Y"
@staticmethod
def getSiteDomain():
@@ -152,55 +48,17 @@ class FictionHuntComSiteAdapter(BaseSiteAdapter):
@classmethod
def getSiteExampleURLs(cls):
return "https://fictionhunt.com/stories/1a1a1a/story-title http://fictionhunt.com/read/1234/1"
return "http://fictionhunt.com/read/1234/1"
def getSiteURLPattern(self):
## https://fictionhunt.com/stories/7edm248/the-last-of-his-kind/chapters/1
## https://fictionhunt.com/stories/89kzg4z/the-last-of-his-kind-new
## http://fictionhunt.com/read/12411643/1
return r"https?://(www.)?fictionhunt.com/(?P<type>read|stories)/(?P<id>[0-9a-z]+)(/(?P<title>[^/]+))?(/|/[^/]+)*/?$"
return r"http://(www.)?fictionhunt.com/read/\d+(/\d+)?(/|/[^/]+)?/?$"
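The newer pattern above recognizes both URL shapes with named groups; the adapter then rebuilds a normalized story URL from `type`, `id`, and `title`. For example, against the two example URLs from the comments:

```python
import re

pattern = (r"https?://(www.)?fictionhunt.com/(?P<type>read|stories)"
           r"/(?P<id>[0-9a-z]+)(/(?P<title>[^/]+))?(/|/[^/]+)*/?$")

m = re.match(pattern,
             "https://fictionhunt.com/stories/7edm248/the-last-of-his-kind/chapters/1")
# m.group('type') == 'stories', m.group('id') == '7edm248',
# m.group('title') == 'the-last-of-his-kind'

m2 = re.match(pattern, "http://fictionhunt.com/read/12411643/1")
# m2.group('type') == 'read', m2.group('id') == '12411643'
```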
def needToLoginCheck(self, data):
## FH is apparently reporting "Story has been removed" for all
## chapters when not logged in now.
if 'https://fictionhunt.com/login' in data:
return True
else:
return False
def performLogin(self, url):
params = {}
if self.password:
params['identifier'] = self.username
params['password'] = self.password
else:
params['identifier'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['remember'] = 'on'
loginUrl = 'https://' + self.getSiteDomain() + '/login'
if not params['identifier']:
logger.info("This site requires login.")
raise exceptions.FailedToLogin(url,params['identifier'])
## need to pull empty login page first to get authenticity_token
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['identifier']))
soup = self.make_soup(self.get_request(loginUrl,usecache=False))
params['_token']=soup.find('input', {'name':'_token'})['value']
d = self.post_request(loginUrl, params, usecache=False)
# logger.debug(d)
if self.needToLoginCheck(d):
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['identifier']))
raise exceptions.FailedToLogin(url,params['identifier'])
return False
else:
return True
def use_pagecache(self):
'''
adapters that will work with the page cache need to implement
this and change it to True.
'''
return True
def doExtractChapterUrlsAndMetadata(self,get_cover=True):
@@ -208,132 +66,80 @@ class FictionHuntComSiteAdapter(BaseSiteAdapter):
# metadata and chapter list
url = self.url
data = self.get_request(url)
## As per #784, site isn't requiring login anymore.
## Login check commented since we've seen it toggle before.
# if self.needToLoginCheck(data):
# self.performLogin(url)
# data = self.get_request(url,usecache=False)
try:
data = self._fetchUrl(url)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.meta)
else:
raise e
# use BeautifulSoup HTML parser to make everything easier to find.
soup = self.make_soup(data)
## detect old storyUrl, switch to new storyUrl:
canonlink = soup.find('link',rel='canonical')
if canonlink:
# logger.debug(canonlink)
canonlink = re.sub(r"/chapters/\d+","",canonlink['href'])
# logger.debug(canonlink)
self._setURL(canonlink)
url = self.url
data = self.get_request(url)
soup = self.make_soup(data)
else:
# in case title changed
self._setURL(soup.select_one("div.Story__details a")['href'])
url = self.url
# logger.debug(data)
self.story.setMetadata('title',stripHTML(soup.find('h1',{'class':'Story__title'})))
self.story.setMetadata('title',stripHTML(soup.find('div',{'class':'title'})).strip())
summhead = soup.find('h5',string='Summary')
self.setDescription(url,summhead.find_next('div'))
self.setDescription(url,'<i>(Story descriptions not available on fictionhunt.com)</i>')
## author:
autha = soup.find('div',{'class':'StoryContents__meta'}).find('a') # first a in StoryContents__meta
self.story.setMetadata('authorId',autha['href'].split('/')[4])
self.story.setMetadata('authorUrl',autha['href'])
self.story.setMetadata('author',autha.string)
updlab = soup.find('label',string='Last Updated:')
if updlab:
update = updlab.find_next('time')['datetime']
self.story.setMetadata('dateUpdated', makeDate(update, self.dateformat))
publab = soup.find('label',string='Published:')
if publab:
pubdate = publab.find_next('time')['datetime']
self.story.setMetadata('datePublished', makeDate(pubdate, self.dateformat))
## need author page for some metadata.
authsoup = None
authpagea = autha
authstorya = None
## Rating and exact word count doesn't appear on the summary
## page, try to get from author page.
## find story url, might need to spin through author's pages.
while authpagea and not authstorya:
authsoup = self.make_soup(self.get_request(authpagea['href']))
authpagea = authsoup.find('a',{'rel':'next'})
# CSS selectors don't allow : or / unquoted, which
# BS4 (and dependencies) didn't previously enforce.
authstorya = authsoup.select_one('h4.Story__item-title a[href="%s"]'%self.url)
if not authstorya:
raise exceptions.FailedToDownload("Error finding %s on author page(s)" % self.url)
meta = authstorya.find_parent('li').find('div',class_='Story__meta-info')
meta=meta.text.split()
self.story.setMetadata('numWords',meta[meta.index('words')-1])
self.story.setMetadata('rating',meta[meta.index('Rating:')+1])
# logger.debug(meta)
# Find authorid and URL from... author url.
# fictionhunt doesn't have author pages, use ffnet original author link.
a = soup.find('a', href=re.compile(r"fanfiction.net/u/\d+"))
self.story.setMetadata('authorId',a['href'].split('/')[-1])
self.story.setMetadata('authorUrl','https://www.fanfiction.net/u/'+self.story.getMetadata('authorId'))
self.story.setMetadata('author',a.string)
# Find original ffnet URL
a = soup.find('a', string="Source")
a = soup.find('a', href=re.compile(r"fanfiction.net/s/\d+"))
self.story.setMetadata('origin',stripHTML(a))
self.story.setMetadata('originUrl',a['href'])
datesdiv = soup.find('div',{'class':'dates'})
if stripHTML(datesdiv.find('label')) == 'Completed' : # first label is status.
# Fleur D. & Harry P. & Hermione G. & Susan B. - Words: 42,848 - Rated: M - English - None - Chapters: 9 - Reviews: 248 - Updated: 21-09-2016 - Published: 16-05-2015 - by Elven Sorcerer (FFN)
# None - Words: 13,087 - Rated: M - English - Romance & Supernatural - Chapters: 3 - Reviews: 5 - Updated: 21-09-2016 - Published: 20-09-2016
# Harry P. & OC - Words: 10,910 - Rated: M - English - None - Chapters: 5 - Reviews: 6 - Updated: 21-09-2016 - Published: 11-09-2016
# Dudley D. & Harry P. & Nagini & Vernon D. - Words: 4,328 - Rated: K+ - English - None - Chapters: 2 - Updated: 21-09-2016 - Published: 20-09-2016 -
details = soup.find('div',{'class':'details'})
detail_re = \
r'(?P<characters>.+) - Words: (?P<numWords>[0-9,]+) - Rated: (?P<rating>[a-zA-Z\\+]+) - (?P<language>.+) - (?P<genre>.+)'+ \
r' - Chapters: (?P<numChapters>[0-9,]+)( - Reviews: (?P<reviews>[0-9,]+))? - Updated: (?P<dateUpdated>[0-9-]+)'+ \
r' - Published: (?P<datePublished>[0-9-]+)(?P<completed> - Complete)?'
details_dict = re.match(detail_re,stripHTML(details)).groupdict()
# lists
for meta in ('characters','genre'):
if details_dict[meta] != 'None':
self.story.extendList(meta,details_dict[meta].split(' & '))
# scalars
for meta in ('numWords','numChapters','rating','language','reviews'):
self.story.setMetadata(meta,details_dict[meta])
# dates
for meta in ('datePublished','dateUpdated'):
self.story.setMetadata(meta, makeDate(details_dict[meta], self.dateformat))
# status
if details_dict['completed']:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
for a in soup.select("div.genres a"):
self.story.addToList('genre',stripHTML(a))
for a in soup.select("section.characters li.Tags__item a"):
self.story.addToList('characters',stripHTML(a))
for a in soup.select('a[href*="pairings="]'):
self.story.addToList('ships',stripHTML(a).replace("+","/"))
for a in soup.select('div.Story__type a[href*="fandoms="]'):
# logger.debug(a)
fandomstr=stripHTML(a).replace(' Fanfiction','').strip()
# logger.debug("'%s'"%fandomstr)
## haven't thought of a better way to detect and *not*
## split on fandoms with a '&' in them.
for ampfandom in ampfandoms:
if ampfandom in fandomstr:
self.story.addToList('category',ampfandom)
fandomstr = fandomstr.replace(ampfandom,'')
for fandom in fandomstr.split('&'):
if fandom:
self.story.addToList('category',fandom)
## Currently no 'Original' stories on the site, but does list
## it as a search type. Set extratags: and uncomment this if
## and when.
# if self.story.getList('category'):
# self.story.addToList('category', 'FanFiction')
# else:
# self.story.addToList('category', 'Original')
for chapli in soup.select('ul.StoryContents__chapters li'):
self.add_chapter(stripHTML(chapli.select_one('span.chapter-title')),chapli.select_one('a')['href'])
if self.num_chapters() == 0:
raise exceptions.FailedToDownload("Story at %s has no chapters." % self.url)
# It's assumed that the number of chapters is correct.
# There's no complete list of chapters, so the only
# alternative is to get the number of chapters from the
# last indicated chapter instead.
for i in range(1,1+int(self.story.getMetadata('numChapters'))):
self.chapterUrls.append(("Chapter "+unicode(i),"http://"+self.getSiteDomain()\
+"/read/"+self.story.getMetadata('storyId')+"/%s"%i))
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self.get_request(url)
data = self._fetchUrl(url)
soup = self.make_soup(data)
div = soup.find('div', {'class' : 'StoryChapter__text'})
div = soup.find('div', {'class' : 'text'})
return self.utf8FromSoup(url,div)
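The `detail_re` pattern in the removed branch above packs the old site's one-line details string into named groups that feed the `setMetadata()`/`extendList()` calls. A minimal sketch exercising it against one of the sample lines quoted in the adapter's own comments (pattern and sample copied from the diff):

```python
import re

# detail_re as it appears in the removed code; each named group maps to a
# metadata field in the adapter.
detail_re = (
    r'(?P<characters>.+) - Words: (?P<numWords>[0-9,]+) - Rated: (?P<rating>[a-zA-Z\\+]+) - (?P<language>.+) - (?P<genre>.+)'
    r' - Chapters: (?P<numChapters>[0-9,]+)( - Reviews: (?P<reviews>[0-9,]+))? - Updated: (?P<dateUpdated>[0-9-]+)'
    r' - Published: (?P<datePublished>[0-9-]+)(?P<completed> - Complete)?'
)

# sample details line taken from the comments in the removed code
sample = ('None - Words: 13,087 - Rated: M - English - Romance & Supernatural'
          ' - Chapters: 3 - Reviews: 5 - Updated: 21-09-2016 - Published: 20-09-2016')

details = re.match(detail_re, sample).groupdict()
```

Note that greedy `.+` groups still split correctly here because each literal ` - Label: ` separator occurs only once per line; a fandom or genre containing ` - ` would confuse it.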


@@ -1,594 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2020 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#### Hazel's fiction.live fanficfare adapter
# what an *adventure* this was. fiction.live is an angular web3.0 app that does async background stuff everywhere.
# they're not kidding about it being live.
# can I wrangle its stories into books for offline reading? yes I 98% can!
### won't support, because they aren't part of the text
# chat, threads, chat replies on vote options
### can't support because wtf this is a book
# music / audio embeds
# per-user achievement tracking with fancy achievement-get animations
# story scripting (shows script tags visible in the text, not computed values or input fields)
import re
import json
from datetime import datetime
import itertools
import logging
logger = logging.getLogger(__name__)
# __package__ = 'fanficfare.adapters' # fixes dev issues with unknown package base
from .base_adapter import BaseSiteAdapter
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from ..six import ensure_text
def getClass():
return FictionLiveAdapter
class FictionLiveAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','flive')
self.story_id = self.parsedUrl.path.split('/')[3]
self.story.setMetadata('storyId', self.story_id)
self.chapter_id_to_api = {}
# normalize URL. omits title in the url
self._setURL("https://fiction.live/stories//{s_id}".format(s_id = self.story_id));
@staticmethod
def getSiteDomain():
return "fiction.live"
@classmethod
def getAcceptDomains(cls):
return ["fiction.live", "beta.fiction.live"] # I still remember anonkun, but the domain has now lapsed
def getSiteURLPattern(self):
# I'd like to thank regex101.com for helping me screw this up less
return r"https?://(beta\.)?fiction\.live/[^/]*/[^/]*/([a-zA-Z0-9\-]+)(/(home)?)?$"
@classmethod
def getSiteExampleURLs(cls):
return ("https://fiction.live/stories/Example-Story-Title/17CharacterIDhere/home "
+"https://fiction.live/stories/Example-Story-With-Long-ID/-20CharacterIDisHere "
+"https://fiction.live/Sci-fi/Example-Story-With-URL-Genre/17CharacterIDhere/ "
+"https://fiction.live/stories/Example-Story-With-UUID/00000000-0000-4000-0000-000000000000/")
@classmethod
def get_section_url(cls,url):
## minimal URL used for section names in INI and reject list
## for comparison
# logger.debug("pre--url:%s"%url)
url = re.sub(r"https?://(beta\.)?fiction\.live/[^/]*/[^/]*/(?P<id>[a-zA-Z0-9\-]+)(/(home)?)?$",r'https://fiction.live/stories//\g<id>',url)
# logger.debug("post-url:%s"%url)
return url
def parse_timestamp(self, timestamp):
# fiction.live date format is unix-epoch milliseconds. not a good fit for fanficfare's makeDate.
# doesn't use a timezone object and returns tz-naive datetimes. I *think* I can leave the rest to fanficfare
return datetime.fromtimestamp(timestamp / 1000.0, None)
def img_url_trans(self,imgurl):
"Apparently site changed cdn URLs for images more than once."
# logger.debug("pre--imgurl:%s"%imgurl)
imgurl = re.sub(r'(\w+)\.cloudfront\.net',r'cdn6.fiction.live/file/fictionlive',imgurl)
imgurl = re.sub(r'www\.filepicker\.io/api/file/(\w+)',r'cdn4.fiction.live/fp/\1',imgurl)
imgurl = re.sub(r'cdn[34].fiction.live/(.+)',r'cdn6.fiction.live/file/fictionlive/\1',imgurl)
# logger.debug("post-imgurl:%s"%imgurl)
return imgurl
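The three rewrites in `img_url_trans` chain in order, so a filepicker URL first becomes a cdn4 URL and is then folded into the cdn6 form by the third rule. A standalone sketch of the same substitutions (regexes copied from the method above):

```python
import re

def img_url_trans(imgurl):
    # same three substitutions as the adapter method, in the same order;
    # note the third rule also rewrites the output of the second one
    imgurl = re.sub(r'(\w+)\.cloudfront\.net', r'cdn6.fiction.live/file/fictionlive', imgurl)
    imgurl = re.sub(r'www\.filepicker\.io/api/file/(\w+)', r'cdn4.fiction.live/fp/\1', imgurl)
    imgurl = re.sub(r'cdn[34].fiction.live/(.+)', r'cdn6.fiction.live/file/fictionlive/\1', imgurl)
    return imgurl
```

The example hostnames below are made up for illustration; only the domain patterns come from the adapter.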
def doExtractChapterUrlsAndMetadata(self, get_cover=True):
metadata_url = "https://fiction.live/api/node/{s_id}/"
response = self.get_request(metadata_url.format(s_id = self.story_id))
if not response: # this is how fiction.live responds to nonsense urls -- HTTP200 with empty response
raise exceptions.StoryDoesNotExist("Empty response for " + self.url)
data = json.loads(response)
## get metadata for multi route chapters
if 'multiRoute' in data and data['multiRoute'] == True:
route_metadata_url = "https://fiction.live/api/anonkun/routes/{s_id}/"
response = self.get_request(route_metadata_url.format(s_id = self.story_id))
if not response: # this is how fiction.live responds to nonsense urls -- HTTP200 with empty response
raise exceptions.StoryDoesNotExist("Empty response for " + self.url)
data["route_metadata"] = json.loads(response)
self.extract_metadata(data, get_cover)
self.add_chapters(data)
def extract_metadata(self, data, get_cover):
# on one hand, we've got nicely-formatted JSON and can just index into the thing we want, no parsing needed.
# on the other, nearly *everything* in this api is optional. found that out the hard way.
# not optional
self.story.setMetadata('title', stripHTML(data['t']))
# stories have ut, rt, ct, and cht. fairly sure that ut = update time and rt = release time.
# ct is 'creation time' and everything in the api has it -- you can create stories and edit before publishing
# cht is *chunktime* -- newest story chunk added.
# ut for update time includes other kinds of update -- threads, chat etc
# ct <= rt <= cht <= ut
self.story.setMetadata("dateUpdated", self.parse_timestamp(data['cht']))
self.story.setMetadata("datePublished", self.parse_timestamp(data['rt']))
self.most_recent_chunk = data['cht'] if 'cht' in data else 9999999999999998
# nearly everything optional from here out
if 'storyStatus' in data:
status_translate = {'active': "In-Progress", 'finished': "Completed"} # fiction.live to fanficfare
status = data['storyStatus']
self.story.setMetadata('status', status_translate.get(status, status.title()))
elif 'complete' in data:
if data['complete'] == True:
self.story.setMetadata('status', "Completed")
else:
self.story.setMetadata('status', "In-Progress")
else:
self.story.setMetadata('status', "In-Progress")
if 'contentRating' in data:
self.story.setMetadata('rating', data['contentRating'])
elif 'tAge' in data:
self.story.setMetadata('rating', data['tAge'])
else:
self.story.setMetadata('rating', "teen")
if 'w' in data: self.story.setMetadata('numWords', data['w'])
if 'likeCount' in data: self.story.setMetadata('likes', data['likeCount'])
if 'rInput' in data: self.story.setMetadata('reader_input', data['rInput'].title())
summary = stripHTML(data['d']) if 'd' in data else ""
firstblock = data['b'].strip() if 'b' in data else ""
self.setDescription(self.url, summary if not firstblock else summary + "\n<br />\n" + firstblock)
tags = data['ta'] if 'ta' in data else []
if (self.story.getMetadataRaw('rating') in {"nsfw", "adult"} or 'smut' in tags) and \
not (self.is_adult or self.getConfig("is_adult")):
raise exceptions.AdultCheckRequired(self.url)
show_spoiler_tags = self.getConfig('show_spoiler_tags')
spoiler_tags = data['spoilerTags'] if 'spoilerTags' in data else []
for tag in tags:
if show_spoiler_tags or not tag in spoiler_tags:
self.story.addToList('tags', tag)
authors = data['u'] # non-optional
if len(authors) > 1:
for author in data['u']:
if '_id' in author and author['n']: # some stories have spurious co-authors (may have been fixed?)
self.story.addToList('author', author['n'])
self.story.addToList('authorUrl', "https://fiction.live/user/" + author['n'] + "/")
self.story.addToList('authorId', author['_id'])
else: # TODO: can avoid this?
author = authors[0]
self.story.setMetadata('author', author['n'])
self.story.setMetadata('authorUrl', "https://fiction.live/user/" + author['n'] + "/")
self.story.setMetadata('authorId', author['_id'])
if 'isLive' in data and data['isLive']:
self.story.setMetadata('live', "Now! (at time of download)")
elif 'nextLive' in data and data['nextLive']:
# formatted to match site, not other fanficfare timestamps
next_live_time = self.parse_timestamp(data['nextLive'])
self.story.setMetadata('live', next_live_time)
show_nsfw_cover_images = self.getConfig('show_nsfw_cover_images')
nsfw_cover = data['nsfwCover'] if 'nsfwCover' in data else False
if get_cover and 'i' in data:
if show_nsfw_cover_images or not nsfw_cover:
coverUrl = data['i'][0]
self.setCoverImage(self.url, coverUrl)
# gonna need these later for adding details to achievement-granting links in the text
try:
self.achievements = data['achievements']['achievements']
except KeyError:
self.achievements = []
def add_chapters(self, data):
## chapter urls are for the api. they return json and aren't user-navigable, or the same as on the website
chunkrange_url = "https://fiction.live/api/anonkun/chapters/{s_id}/{start}/{end}/"
## api url to get content of a multi route chapter. requires only the route id and no timestamps
route_chunkrange_url = "https://fiction.live/api/anonkun/route/{c_id}/chapters"
def add_chapter_url(title, bounds):
"Adds a chapter url based on the start/end chunk-range timestamps."
start, end = bounds
end -= 1
chapter_url = chunkrange_url.format(s_id = data['_id'], start = start, end = end)
self.add_chapter(title, chapter_url)
def add_route_chapter_url(title, route_id):
"Adds a route chapter url based on the route id."
chapter_url = route_chunkrange_url.format(c_id = route_id)
self.add_chapter(title, chapter_url)
def pair(iterable):
"[1,2,3,4] -> [(1, 2), (2, 3), (3, 4)]"
a, b = itertools.tee(iterable, 2)
next(b, None)
return list(zip(a, b))
def map_chap_ids_to_api(chapter_ids, route_ids, times):
for index, bounds in enumerate(times):
start, end = bounds
end -= 1
chapter_url = chunkrange_url.format(s_id = data['_id'], start = start, end = end)
self.chapter_id_to_api[chapter_ids[index]] = chapter_url
for route_id in route_ids:
chapter_url = route_chunkrange_url.format(c_id = route_id)
self.chapter_id_to_api[route_id] = chapter_url
## first thing to do is separate out the appendices
appendices, maintext, routes = [], [], []
chapters = data['bm'] if 'bm' in data else []
## not all stories use multiple routes. Those that do have a route id and a title for each route
if 'route_metadata' in data and data['route_metadata']:
for r in data['route_metadata']:
# checking if route title even exists or is None, since most things in the api are optional
if 't' in r and r['t'] is not None:
title = r['t']
else:
title = ""
routes.append({"id": r['_id'], "title": title})
for c in chapters:
appendices.append(c) if c['title'].startswith('#special') else maintext.append(c)
## main-text chapter extraction processing. *should* now handle all the edge cases.
## relies on fanficfare ignoring empty chapters!
titles = ["Home"] + [c['title'] for c in maintext]
chapter_ids = ['home'] + [c['id'] for c in maintext]
times = [data['ct']] + [c['ct'] for c in maintext] + [self.most_recent_chunk + 2] # need to be 1 over, and add_url etc does -1
times = pair(times)
if self.getConfig('include_appendices', True): # Add appendices after main text if desired
titles = titles + ["Appendix: " + a['title'][9:] for a in appendices]
chapter_ids = chapter_ids + [a['id'] for a in appendices]
times = times + [(a['ct'], a['ct'] + 2) for a in appendices]
route_ids = [r['id'] for r in routes]
map_chap_ids_to_api(chapter_ids, route_ids, times) # Map chapter ids to API URLs for use when comparing the two
# doesn't actually run without the call to list.
list(map(add_chapter_url, titles, times))
for r in routes: # add route at the end, after appendices
route_id = r['id'] # to get route chapter content, the route id is needed, not the timestamp
chapter_title = "Route: " + r['title'] # 'Route: ' at beginning of name, since it's a multiroute chapter
add_route_chapter_url(chapter_title, route_id)
def getChapterText(self, url):
chunk_handler = {
"choice" : self.format_choice,
"readerPost" : self.format_readerposts,
"chapter" : self.format_chapter
}
response = self.get_request(url)
data = json.loads(response)
if data == []:
return ""
# and *now* we can assume there's at least one chunk in the data -- chapters can be totally empty.
# are we trying to read an appendix? check the first chunk to find out.
getting_appendix = len(data) == 1 and 't' in data[0] and data[0]['t'].startswith("#special")
text = ""
for count, chunk in enumerate(data):
# logger.debug(count) # pollutes the debug log, shows which chunk crashed the handler
text += "<div>" # chapter chunks aren't always well-delimited in their contents
# appendix chunks are mixed in with other things
if not getting_appendix and 't' in chunk and chunk['t'].startswith("#special"): # t = title = bookmark
continue
handler = chunk_handler.get(chunk['nt'], self.format_unknown) # nt = node type
text += handler(chunk)
show_timestamps = self.getConfig('show_timestamps')
if show_timestamps and 'ct' in chunk:
#logger.debug("Adding timestamp for chunk...")
timestamp = ensure_text(self.parse_timestamp(chunk['ct']).strftime("%x -- %X"))
text += '<div class="ut">' + timestamp + '</div>'
text += "</div><br />\n"
## soup to repair the most egregious HTML errors.
return self.utf8FromSoup(url,self.make_soup(text))
### everything from here out is chunk data handling.
def format_chapter(self, chunk):
"""Handles any formatting in the chapter body text for text chapters.
In the 'default case' where we're getting boring chapter-chunk body text, just calls utf8fromSoup
and returns the text as is on the website."""
soup = self.make_soup(chunk['b'] if 'b' in chunk else "")
if self.getConfig('legend_spoilers',True):
soup = self.add_spoiler_legends(soup)
if self.achievements:
soup = self.append_achievments(soup)
return str(soup)
def add_spoiler_legends(self, soup):
# find spoiler links and change link-anchor block to legend block
spoilers = soup.find_all('a', class_="tydai-spoiler")
for link_tag in spoilers:
link_tag.name = 'fieldset'
legend = soup.new_tag('legend')
legend.string = "Spoiler"
link_tag.insert(0, legend)
return soup
def fictionlive_normalize(self, string):
# might be able to use this to preserve titles in normalized urls, if the scheme is the same
# BUG: in achievement ids these are all replaced, but I *don't* know that the list is complete.
# should be rare, thankfully. *most* authors don't use any funny characters in the achievement's *ID*
special_chars = "\"\\,.!?+=/[](){}<>_'@#$%^&*~`;:|" # not the hyphen, which is used to represent spaces
return string.lower().replace(" ", "-").translate({ord(x) : None for x in special_chars})
def append_achievments(self, soup):
# achievements are present in the text as a kind of link, and you get the shiny popup by clicking them.
achievement_links = soup.find_all('a', class_="tydai-achievement")
achieved_ids = []
for link_tag in achievement_links:
# these are not only prepended by a unicode lightning-bolt, but also format clearly as a link
# should use .u css selector -- part of output_css defaults? or just let replace_tags_with_spans do it?
new_u = soup.new_tag('u')
new_u.string = link_tag.text # copy out the link text into a new element
# html entities for improved compatibility with AZW3 conversion
link_tag.string = "&#x26A1;" # then overwrite
link_tag.insert(1, new_u)
## while we've got the achievement links, get the ids from the link
a_id = link_tag['data-id']
a_id = self.fictionlive_normalize(a_id)
achieved_ids.append(a_id)
if achieved_ids:
logger.debug("achievements (this chunk): " + ", ".join(achieved_ids))
# can't replicate the animated shiny announcement popup, so have an end-of-chunk announcement instead
# TODO: achievement images -- does anyone use them?
a_source = "<br />\n<fieldset><legend>&#x26A1; Achievement obtained!</legend>\n<h4>{}</h4>\n{}</fieldset>\n"
for a_id in achieved_ids:
if a_id in self.achievements:
a_title = self.achievements[a_id]['t'] if 't' in self.achievements[a_id] else a_id.title()
a_text = self.achievements[a_id]['d'] if 'd' in self.achievements[a_id] else ""
soup.append(self.make_soup(a_source.format(a_title, a_text)))
else:
a_title = a_id.title()
error = "<br />\n<fieldset><legend>Error: Achievement not found.</legend>Couldn't find '{}'. Ask the story author to check if the achievment exists."
soup.append(self.make_soup(error.format(a_title)))
return soup
def count_votes(self, chunk):
"""So, fiction.live's api doesn't return the counted votes you see on the website.
After all, it needs to allow for things like revoking a vote,
with the count live and updated in realtime on your client.
So instead we get the raw vote-data, but have to count it ourselves."""
# optional.
choices = chunk['choices'] if 'choices' in chunk else []
def counter(votes):
output = [0] * len(choices)
for vote in votes.values():
## votes are either a single option-index or a list of option-indices, depending on the choice type
if 'multiple' in chunk and chunk['multiple'] == False:
vote = [vote] # normalize to list
for v in vote:
# v should only be int, but there is at least one story where some unrelated string was returned,
# so let's just ignore non-int values here
if not isinstance(v, int):
continue
if 0 <= v < len(choices):
output[v] += 1
return output
# I believe that verified is always a subset of all votes, but that's not enforced here
total_votes = counter(chunk['votes'] if 'votes' in chunk else {})
verified_votes = counter(chunk['userVotes'] if 'userVotes' in chunk else {})
# Choices can link to route chapters, where the index of the choice in list 'choices' is a key in the
# 'routes' dict and the dict value is the route id.
# That route id is needed for the url to create the internal link from the choice to the route chapter.
routes = chunk['routes'] if 'routes' in chunk else {}
if choices and len(routes) > 0:
altered_choices = []
for i, choice in enumerate(choices):
choice_index = str(i)
if choice_index in routes.keys():
route_chunkrange_url = "https://fiction.live/api/anonkun/route/{c_id}/chapters"
route_url = route_chunkrange_url.format(c_id=routes[choice_index])
choice_link = "<a data-orighref='" + route_url + "' >" + choice + "</a>"
altered_choices.append(choice_link)
else:
altered_choices.append(choice)
choices = altered_choices
return zip(choices, verified_votes, total_votes)
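The tallying logic in `counter` above is easy to check on its own. This sketch simplifies one detail: the adapter keys the wrap-to-list step off the chunk's `'multiple'` flag, while here non-list votes are wrapped by type (an assumption made purely for this standalone version):

```python
def count_votes(raw_votes, num_choices):
    # raw_votes maps user-ids to either one option index or a list of indices
    output = [0] * num_choices
    for vote in raw_votes.values():
        if not isinstance(vote, list):
            vote = [vote]  # normalize single-choice votes to a list
        for v in vote:
            # ignore non-int junk and out-of-range indices, as the adapter does
            if isinstance(v, int) and 0 <= v < num_choices:
                output[v] += 1
    return output
```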
def format_choice(self, chunk):
options = self.count_votes(chunk)
# crossed-out writeins. authors can censor user-written choices, and (optionally) offer a reason.
x_outs = [int(x) for x in chunk['xOut']] if 'xOut' in chunk else []
x_reasons = chunk['xOutReasons'] if 'xOutReasons' in chunk else {}
closed = "closed" if 'closed' in chunk else "open" # BUG: check on reopened votes
num_voters = len(chunk['votes']) if 'votes' in chunk else 0
vote_title = chunk['b'] if 'b' in chunk else "Choices"
output = ""
# start with the header
output += u"<h4><span>" + vote_title + " — <small>Voting " + closed
output += u"" + str(num_voters) + " voters</small></span></h4>\n"
# we've got everything needed to build the html for our vote table.
output += "<table class=\"voteblock\">\n"
# filter out the crossed-out options, which display last
crossed = []
for index, (choice_text, verified_votes, total_votes) in enumerate(options):
if index in x_outs:
crossed.append((index, choice_text, verified_votes, total_votes))
else:
output += "<tr class=\"choiceitem\"><td>" + str(choice_text) + "</td><td class=\"votecount\">"
if verified_votes > 0:
output += "" + str(verified_votes) + "/"
output += str(total_votes)+ " </td></tr>\n"
# crossed out options are: displayed last, struckthrough, smaller, with the reason below, and no vote count.
# also greyed out, but that's a bit much.
for index, choice_text, _, _ in crossed:
if choice_text == "permanentlyRemoved":
continue
else:
x_reason = x_reasons[str(index)] if str(index) in x_reasons else ""
output += "<tr class=\"choiceitem\"><td colspan=\"2\"><small><strike>" \
+ str(choice_text) + "</strike><br>" + str(x_reason) + "</small></td></tr>"
output += "</table>\n"
return output
def format_readerposts(self, chunk):
closed = "Closed" if 'closed' in chunk else "Open"
posts = chunk['votes'] if 'votes' in chunk else {}
dice = chunk['dice'] if 'dice' in chunk else {}
# now matches the site and does *not* include dicerolls as posts!
num_votes = str(len(posts)) + " posts" if len(posts) != 0 else "be the first to post."
posts_title = chunk['b'] if 'b' in chunk else "Reader Posts"
output = ""
output += u"<h4><span>" + posts_title + " — <small> Posting " + closed
output += u"" + num_votes + "</small></span></h4>\n"
## so. a voter can roll with their post. these rolls are in a separate dict, but have the **same uid**.
## they're then formatted with the roll above the writein for that user.
## I *think* that formatting roll-only before writein-only posts is correct, but tbh, it's hard to tell.
## writeins are usually opened by the author for posts or rolls, not both at once.
## people tend to only mix the two by accident.
if dice != {}:
for uid, roll in dice.items():
output += '<div class="choiceitem">'
if roll: # optional. just because there's a list entry for it doesn't mean it has a value!
output += '<div class="dice">' + str(roll) + '</div>\n'
if uid in posts:
post = posts[uid]
if post:
output += str(post)
del posts[uid] # it's handled here with the roll instead of later
output += '</div>'
for post in posts.values():
if post:
output += '<div class="choiceitem">' + str(post) + '</div>\n'
return output
def normalize_chapterurl(self, url):
if url.startswith(r'https://fiction.live/api/anonkun/chapters'):
return url
pattern = None
if url.startswith(r'https://fiction.live/api/anonkun/route'):
pattern = r"https?://(?:beta\.)?fiction\.live/[^/]*/[^/]*/[a-zA-Z0-9]+/routes/([a-zA-Z0-9]+)"
elif url.startswith(r'https://fiction.live/'):
pattern = r"https?://(?:beta\.)?fiction\.live/[^/]*/[^/]*/[a-zA-Z0-9]+/[^/]*(/[a-zA-Z0-9]+|home)"
# regex101 rocks
if not pattern:
return url
match = re.match(pattern, url)
if not match:
return url
chapter_id = match.group(1)
if chapter_id.startswith('/'):
chapter_id = chapter_id[1:]
if chapter_id and chapter_id in self.chapter_id_to_api:
return self.chapter_id_to_api[chapter_id]
return url
def format_unknown(self, chunk):
raise NotImplementedError("Unknown chunk type ({}) in fiction.live story.".format(chunk))
# in future, I'd like to handle audio embeds somehow. but they're not available to add to stories right now.
# pretty sure they'll just format as a link (with a special tydai-audio class) and should be easier than achievements
# TODO: verify that show_timestamps is working, check times!
# TODO: find a story that uses achievement images and implement them?
### known bugs:
# TODO: support chapter urls for single-chapter / chapter-range downloads
# complicated -- urls for getChapterText are API urls generated by add_chapters, not the public/website ones
# in particular, may need more API reversing to figure out how to get the *end* of the chunk range
# find in 'bm' in the metadata?
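Two of the small helpers in the deleted fiction.live adapter are easy to sanity-check in isolation: `pair()` turns the sorted chunk timestamps into consecutive (start, end) ranges for the chapter API URLs, and `fictionlive_normalize()` mirrors how achievement ids are slugged. Both reproduced verbatim from the code above:

```python
import itertools

def pair(iterable):
    # [1, 2, 3, 4] -> [(1, 2), (2, 3), (3, 4)], used for chunk-range bounds
    a, b = itertools.tee(iterable, 2)
    next(b, None)
    return list(zip(a, b))

def fictionlive_normalize(string):
    # the hyphen is deliberately kept: it stands in for spaces in the site's ids
    special_chars = "\"\\,.!?+=/[](){}<>_'@#$%^&*~`;:|"
    return string.lower().replace(" ", "-").translate({ord(x): None for x in special_chars})
```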


@@ -1,12 +1,8 @@
from __future__ import absolute_import
import re
import logging
logger = logging.getLogger(__name__)
# py2 vs py3 transition
from ..six import text_type as unicode
from ..six.moves.urllib import parse as urlparse
import urllib2
import urlparse
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
@@ -23,7 +19,7 @@ class FictionManiaTVAdapter(BaseSiteAdapter):
SITE_ABBREVIATION = 'fmt'
SITE_DOMAIN = 'fictionmania.tv'
BASE_URL = 'https://' + SITE_DOMAIN + '/stories/'
BASE_URL = 'http://' + SITE_DOMAIN + '/stories/'
READ_TEXT_STORY_URL_TEMPLATE = BASE_URL + 'readtextstory.html?storyID=%s'
DETAILS_URL_TEMPLATE = BASE_URL + 'details.html?storyID=%s'
@@ -40,6 +36,23 @@ class FictionManiaTVAdapter(BaseSiteAdapter):
self._setURL(self.READ_TEXT_STORY_URL_TEMPLATE % story_id)
self.story.setMetadata('siteabbrev', self.SITE_ABBREVIATION)
# Always single chapters, probably should use the Anthology feature to
# merge chapters of a story
self.story.setMetadata('numChapters', 1)
def _customized_fetch_url(self, url, exception=None, parameters=None):
if exception:
try:
data = self._fetchUrl(url, parameters)
except urllib2.HTTPError:
raise exception(self.url)
# Just let self._fetchUrl throw the exception, don't catch and
# customize it.
else:
data = self._fetchUrl(url, parameters)
return self.make_soup(data)
@staticmethod
def getSiteDomain():
return FictionManiaTVAdapter.SITE_DOMAIN
@@ -49,11 +62,11 @@ class FictionManiaTVAdapter(BaseSiteAdapter):
return cls.READ_TEXT_STORY_URL_TEMPLATE % 1234
def getSiteURLPattern(self):
return r'https?' + re.escape(self.BASE_URL[len('https'):]) + r'(readtextstory|readhtmlstory|readxstory|details)\.html\?storyID=\d+$'
return 'https?' + re.escape(self.BASE_URL[len('http'):]) + '(readtextstory|readxstory|details)\.html\?storyID=\d+$'
def extractChapterUrlsAndMetadata(self):
url = self.DETAILS_URL_TEMPLATE % self.story.getMetadata('storyId')
soup = self.make_soup(self.get_request(url))
soup = self._customized_fetch_url(url)
keep_summary_html = self.getConfig('keep_summary_html')
for row in soup.find('table')('tr'):
@@ -66,7 +79,7 @@ class FictionManiaTVAdapter(BaseSiteAdapter):
if key == 'Title':
self.story.setMetadata('title', value)
self.add_chapter(value, self.url)
self.chapterUrls.append((value, self.url))
elif key == 'File Name':
self.story.setMetadata('fileName', value)
@@ -94,9 +107,7 @@ class FictionManiaTVAdapter(BaseSiteAdapter):
elif key == 'New Name':
self.story.setMetadata('newName', value)
## I've encountered a few stories that have None as the
## value for Other Names [GComyn]
elif key == 'Other Names' and value != None:
elif key == 'Other Names':
for name in value.split(', '):
self.story.addToList('characters', name)
@@ -106,7 +117,7 @@ class FictionManiaTVAdapter(BaseSiteAdapter):
self.story.setMetadata('rating', value)
elif key == 'Complete':
self.story.setMetadata('status', 'Completed' if value == 'yes' else 'In-Progress')
self.story.setMetadata('status', 'Completed' if value == 'Complete' else 'In-Progress')
elif key == 'Categories':
for element in cells[1]('a'):
@@ -136,78 +147,20 @@ class FictionManiaTVAdapter(BaseSiteAdapter):
self.story.setMetadata('readings', value)
def getChapterText(self, url):
if self.getConfig("download_text_version",False):
soup = self.make_soup(self.get_request(url))
element = soup.find('pre')
element.name = 'div'
soup = self._customized_fetch_url(url)
element = soup.find('pre')
element.name = 'div'
# The story's content is contained in a <pre> tag, probably taken 1:1
# from the source text file. A simple replacement of all newline
# characters with a break line tag should take care of formatting.
# The story's content is contained in a <pre> tag, probably taken 1:1
# from the source text file. A simple replacement of all newline
# characters with a break line tag should take care of formatting.
# While wrapping in paragraphs would be possible, it's too much work,
# I'd rather display the story 1:1 like it was found in the pre tag.
content = unicode(element)
content = content.replace('\n', '<br/>')
# While wrapping in paragraphs would be possible, it's too much work,
# I'd rather display the story 1:1 like it was found in the pre tag.
content = unicode(element)
content = content.replace('\n', '<br/>')
if self.getConfig('non_breaking_spaces'):
return content.replace(' ', '&nbsp;')
if self.getConfig('non_breaking_spaces'):
return content.replace(' ', '&nbsp;')
## Normally, getChapterText should use self.utf8FromSoup(),
## but this is converting from plain(ish) text. -- JM
return content
else:
# try SWI (story with images) version first
# <div style="margin-left:10ex;margin-right:10ex">
## fetching SWI version now instead of text.
htmlurl = url.replace('readtextstory','readhtmlstory')
## Used to find by style, but it's inconsistent now. we've seen:
## margin-left:10ex;margin-right:10ex
## margin-right: 5%; margin-left: 5%
## margin-left:5%; margin-right:5%
## margin-left:5%; margin-right:5%; background: white
## And there's some without a <div> tag (or an unclosed div)
## Only the comments appear to be consistent.
beginmarker='<!--Read or display the file-->'
endmarker='''<hr size=1 noshade>
<!--review add read, top and bottom-->
'''
data = self.get_request(htmlurl)
try:
## if both markers are found, assume whatever is in between
## is the chapter text.
soup = self.make_soup(data[data.index(beginmarker):data.index(endmarker)])
return self.utf8FromSoup(htmlurl,soup)
except Exception as e:
# logger.debug(e)
# logger.debug(soup)
logger.debug("Story With Images(SWI) not found, falling back to HTML.")
## fetching html version now instead of text.
## Note that html and SWI pages are *not* formatted the same.
soup = self.make_soup(self.get_request(url.replace('readtextstory','readxstory')))
# logger.debug(soup)
# remove first hr and everything before
remove = soup.find('hr')
# logger.debug(remove)
for tag in remove.find_previous_siblings():
tag.extract()
remove.extract()
# remove trailing hr, parent tags and everything after.
remove = soup.find('hr',size='1') # <center><hr size=1>
if remove.parent.name == 'center':
## can also be directly in body without <center>
remove = remove.parent
# logger.debug(remove)
for tag in remove.find_next_siblings():
tag.extract()
remove.extract()
content = soup.find('body')
content.name='div'
return self.utf8FromSoup(url,content)
return content
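The `getChapterText` logic in the diff above relies on two small string tricks: slicing the page HTML between two known comment markers, and turning a `<pre>` text dump into line-broken HTML. A minimal standalone sketch of both (the marker strings are the ones quoted in the diff; the sample page is invented):

```python
def slice_between(data, begin, end):
    """Return the substring between two marker strings.

    Raises ValueError (from str.index) if either marker is missing,
    which is what triggers the fallback path in getChapterText above.
    """
    return data[data.index(begin) + len(begin):data.index(end)]

def pre_text_to_html(text, non_breaking_spaces=False):
    # Replace newlines with <br/> so the plain-text layout survives
    # in HTML, as the adapter does for the <pre> story body.
    html = text.replace('\n', '<br/>')
    if non_breaking_spaces:
        html = html.replace(' ', '&nbsp;')
    return html

page = '<html><!--begin-->story body<!--end--></html>'
body = slice_between(page, '<!--begin-->', '<!--end-->')
```

Letting `str.index` raise and catching the exception at the call site keeps the happy path free of marker-position bookkeeping, matching the try/except structure in the adapter.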


@@ -0,0 +1,194 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
import time
import json
#from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
from base_adapter import BaseSiteAdapter, makeDate
class FictionPadSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','fpad')
self.dateformat = "%Y-%m-%dT%H:%M:%SZ"
self.is_adult=False
self.username = None
self.password = None
# get storyId from url--url validation guarantees query correct
m = re.match(self.getSiteURLPattern(),url)
if m:
self.story.setMetadata('storyId',m.group('id'))
# normalized story URL.
self._setURL("https://"+self.getSiteDomain()
+"/author/"+m.group('author')
+"/stories/"+self.story.getMetadata('storyId'))
else:
raise exceptions.InvalidStoryURL(url,
self.getSiteDomain(),
self.getSiteExampleURLs())
@staticmethod
def getSiteDomain():
return 'fictionpad.com'
@classmethod
def getSiteExampleURLs(cls):
return "https://fictionpad.com/author/Author/stories/1234/Some-Title"
def getSiteURLPattern(self):
# http://fictionpad.com/author/Serdd/stories/4275
return r"http(s)?://(www\.)?fictionpad\.com/author/(?P<author>[^/]+)/stories/(?P<id>\d+)"
# <form method="post" action="/signin">
# <input name="authenticity_token" type="hidden" value="u+cfdXh46dRnwVnSlmE2B2BFmHgu760paqgBG6KQeos=" />
# <input type="hidden" name="remember" value="1">
# <strong class="help-start text-center">or with FictionPad</strong>
# <label class="control-label hidden-placeholder">Pseudonym or Email Address</label>
# <input name="login" class="input-block-level" type="text" placeholder="Pseudonym or Email Address" maxlength="50" required autofocus>
# <label class="control-label hidden-placeholder">Password</label>
# <input name="password" class="input-block-level" type="password" placeholder="Password" minlength="6" required>
# <button type="submit" class="btn btn-primary btn-block">Sign In</button>
# <p class="help-end">
# <a href="/passwordreset">Forgot your password?</a>
# </p>
# </form>
def performLogin(self):
params = {}
if self.password:
params['login'] = self.username
params['password'] = self.password
else:
params['login'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['remember'] = '1'
loginUrl = 'http://' + self.getSiteDomain() + '/signin'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['login']))
## need to pull empty login page first to get authenticity_token
soup = self.make_soup(self._fetchUrl(loginUrl))
params['authenticity_token']=soup.find('input', {'name':'authenticity_token'})['value']
data = self._postUrl(loginUrl, params)
if "Invalid email/pseudonym and password combination." in data:
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['login']))
raise exceptions.FailedToLogin(loginUrl,params['login'])
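The `performLogin` method above follows a common CSRF-protected login pattern: fetch the empty sign-in form, pull the hidden `authenticity_token`, then POST it back with the credentials. A dependency-free sketch of just the token/params step, using a regex in place of BeautifulSoup (the field names match the form HTML quoted in the comment above; the sample form value is invented):

```python
import re

def extract_csrf_token(form_html):
    # Grab the value of the hidden authenticity_token input.
    m = re.search(r'name="authenticity_token"[^>]*value="([^"]+)"', form_html)
    if not m:
        raise ValueError("no authenticity_token field in form")
    return m.group(1)

def build_login_params(form_html, username, password):
    # Mirror the params dict performLogin builds before _postUrl.
    return {
        'login': username,
        'password': password,
        'remember': '1',
        'authenticity_token': extract_csrf_token(form_html),
    }

form = '<input name="authenticity_token" type="hidden" value="u+cfdX" />'
params = build_login_params(form, 'reader', 'hunter2')
```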
def extractChapterUrlsAndMetadata(self):
# fetch the chapter. From that we will get almost all the
# metadata and chapter list
url=self.url
logger.debug("URL: "+url)
try:
data = self._fetchUrl(url)
if "This is a mature story. Please sign in to read it." in data:
self.performLogin()
data = self._fetchUrl(url)
find = "wordyarn.config.page = "
data = data[data.index(find)+len(find):]
data = data[:data.index("</script>")]
data = data[:data.rindex(";")]
data = data.replace('tables:','"tables":')
tables = json.loads(data)['tables']
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(url)
else:
raise e
# looks like only one author per story allowed.
author = tables['users'][0]
story = tables['stories'][0]
story_ver = tables['story_versions'][0]
logger.debug("story:%s"%story)
self.story.setMetadata('authorId',author['id'])
self.story.setMetadata('author',author['display_name'])
self.story.setMetadata('authorUrl','https://'+self.host+'/author/'+author['display_name']+'/stories')
self.story.setMetadata('title',story_ver['title'])
self.setDescription(url,story_ver['description'])
if not ('assets/story_versions/covers' in story_ver['profile_image_url@2x']):
self.setCoverImage(url,story_ver['profile_image_url@2x'])
self.story.setMetadata('datePublished',makeDate(story['published_at'], self.dateformat))
self.story.setMetadata('dateUpdated',makeDate(story['published_at'], self.dateformat))
self.story.setMetadata('followers',story['followers_count'])
self.story.setMetadata('comments',story['comments_count'])
self.story.setMetadata('views',story['views_count'])
self.story.setMetadata('likes',int(story['likes'])) # no idea why they floated these.
if 'dislikes' in story:
self.story.setMetadata('dislikes',int(story['dislikes']))
if story_ver['is_complete']:
self.story.setMetadata('status', 'Completed')
else:
self.story.setMetadata('status', 'In-Progress')
self.story.setMetadata('rating', story_ver['maturity_level'])
self.story.setMetadata('numWords', unicode(story_ver['word_count']))
for i in tables['fandoms']:
self.story.addToList('category',i['name'])
for i in tables['genres']:
self.story.addToList('genre',i['name'])
for i in tables['characters']:
self.story.addToList('characters',i['name'])
for c in tables['chapters']:
chtitle = "Chapter %d"%c['number']
if c['title']:
chtitle += " - %s"%c['title']
self.chapterUrls.append((chtitle,c['body_url']))
self.story.setMetadata('numChapters',len(self.chapterUrls))
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
if not url:
data = u"<em>This chapter has no text.</em>"
else:
data = self._fetchUrl(url)
soup = self.make_soup(u"<div id='story'>"+data+u"</div>")
return self.utf8FromSoup(url,soup)
def getClass():
return FictionPadSiteAdapter
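The scraping trick in `extractChapterUrlsAndMetadata` above is worth isolating: the story data lives in an inline `<script>` as a JavaScript assignment, so the adapter slices out the object literal, patches the one bare JS key, and hands the result to `json.loads`. A minimal sketch (the marker string is the one from the diff; the sample payload is invented):

```python
import json

def extract_page_config(html, find="wordyarn.config.page = "):
    # Slice from the assignment to the end of the script tag.
    data = html[html.index(find) + len(find):]
    data = data[:data.index("</script>")]
    # Drop the trailing semicolon of the JS statement.
    data = data[:data.rindex(";")]
    # 'tables' is an unquoted JS key; quote it so it is valid JSON.
    data = data.replace('tables:', '"tables":')
    return json.loads(data)

html = ('<script>wordyarn.config.page = '
        '{tables: {"stories": [{"id": 1}]}};</script>')
config = extract_page_config(html)
```

The blanket `replace('tables:', ...)` is safe only because that key name never appears inside the JSON string values on this site; a more general page would need a real JS parser.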


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,15 +15,15 @@
# limitations under the License.
#
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
# py2 vs py3 transition
import urllib2
import time
## They're from the same people and pretty much identical.
from .adapter_fanfictionnet import FanFictionNetSiteAdapter
from adapter_fanfictionnet import FanFictionNetSiteAdapter
class FictionPressComSiteAdapter(FanFictionNetSiteAdapter):
@@ -43,15 +43,8 @@ class FictionPressComSiteAdapter(FanFictionNetSiteAdapter):
def getSiteExampleURLs(cls):
return "https://www.fictionpress.com/s/1234/1/ https://www.fictionpress.com/s/1234/12/ http://www.fictionpress.com/s/1234/1/Story_Title http://m.fictionpress.com/s/1234/1/"
@classmethod
def _get_site_url_pattern(cls):
return r"https?://(www|m)?\.fictionpress\.com/s/(?P<id>\d+)(/\d+)?(/(?P<title>[^/]+))?/?$"
## normalized chapter URLs DO contain the story title now, but
## normalized to current urltitle in case of title changes.
def normalize_chapterurl(self,url):
return re.sub(r"https?://(www|m)\.(?P<keep>fictionpress\.com/s/\d+/\d+/).*",
r"https://www.\g<keep>",url)+self.urltitle
def getSiteURLPattern(self):
return r"https?://(www|m)?\.fictionpress\.com/s/\d+(/\d+)?(/|/[a-zA-Z0-9_-]+)?/?$"
def getClass():
return FictionPressComSiteAdapter
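The `normalize_chapterurl` override added above collapses any scheme and mobile-vs-www variant to a canonical `https://www.` chapter URL, then re-appends the story's current title slug so updates still match after a title change. A standalone sketch (here `urltitle` is passed as a parameter rather than read from the adapter instance; the slug value is an invented example):

```python
import re

def normalize_chapterurl(url, urltitle):
    # Keep only the site/story/chapter path via the named 'keep' group,
    # force https://www., and append the current title slug.
    return re.sub(r"https?://(www|m)\.(?P<keep>fictionpress\.com/s/\d+/\d+/).*",
                  r"https://www.\g<keep>", url) + urltitle

u = normalize_chapterurl("http://m.fictionpress.com/s/1234/2/Old_Title",
                         "New-Title")
```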


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,18 +15,18 @@
# limitations under the License.
#
from __future__ import absolute_import
import time
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
import time
import httplib, urllib
from .. import exceptions as exceptions
from ..htmlcleanup import stripHTML
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
class FicwadComSiteAdapter(BaseSiteAdapter):
@@ -46,10 +46,10 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
@classmethod
def getSiteExampleURLs(cls):
return "https://ficwad.com/story/1234"
return "http://ficwad.com/story/1234"
def getSiteURLPattern(self):
return r"https?:"+re.escape(r"//"+self.getSiteDomain())+r"/story/\d+?$"
return re.escape(r"http://"+self.getSiteDomain())+"/story/\d+?$"
def performLogin(self,url):
params = {}
@@ -61,13 +61,12 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
params['username'] = self.getConfig("username")
params['password'] = self.getConfig("password")
loginUrl = 'https://' + self.getSiteDomain() + '/account/login'
loginUrl = 'http://' + self.getSiteDomain() + '/account/login'
logger.debug("Will now login to URL (%s) as (%s)" % (loginUrl,
params['username']))
d = self.post_request(loginUrl,params,usecache=False)
d = self._postUrl(loginUrl,params,usecache=False)
if "Login attempt failed..." in d or \
'<div id="error">Please enter your username and password.</div>' in d:
if "Login attempt failed..." in d:
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['username']))
raise exceptions.FailedToLogin(url,params['username'])
@@ -75,6 +74,13 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
else:
return True
def use_pagecache(self):
'''
adapters that will work with the page cache need to implement
this and change it to True.
'''
return True
def extractChapterUrlsAndMetadata(self):
# fetch the chapter. From that we will get almost all the
@@ -83,45 +89,55 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
url = self.url
logger.debug("URL: "+url)
data = self.get_request(url)
# non-existent/removed story urls get thrown to the front page.
if "<h4>Featured Story</h4>" in data:
raise exceptions.StoryDoesNotExist(self.url)
soup = self.make_soup(data)
# use BeautifulSoup HTML parser to make everything easier to find.
try:
data = self._fetchUrl(url)
# non-existent/removed story urls get thrown to the front page.
if "<h4>Featured Story</h4>" in data:
raise exceptions.StoryDoesNotExist(self.url)
soup = self.make_soup(data)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# if blocked, attempt login.
if soup.find("div",{"class":"blocked"}) or soup.find("li",{"class":"blocked"}):
if self.performLogin(url): # performLogin raises
# FailedToLogin if it fails.
soup = self.make_soup(self.get_request(url,usecache=False))
soup = self.make_soup(self._fetchUrl(url,usecache=False))
divstory = soup.find('div',id='story')
storya = divstory.find('a',href=re.compile(r"^/story/\d+$"))
storya = divstory.find('a',href=re.compile("^/story/\d+$"))
if storya : # if there's a story link in the divstory header, this is a chapter page.
# normalize story URL on chapter list.
self.story.setMetadata('storyId',storya['href'].split('/',)[2])
url = "https://"+self.getSiteDomain()+storya['href']
url = "http://"+self.getSiteDomain()+storya['href']
logger.debug("Normalizing to URL: "+url)
self._setURL(url)
soup = self.make_soup(self.get_request(url))
try:
soup = self.make_soup(self._fetchUrl(url))
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
# if blocked, attempt login.
if soup.find("div",{"class":"blocked"}) or soup.find("li",{"class":"blocked"}):
if self.performLogin(url): # performLogin raises
# FailedToLogin if it fails.
soup = self.make_soup(self.get_request(url,usecache=False))
soup = self.make_soup(self._fetchUrl(url,usecache=False))
# title - first h4 tag will be title.
titleh4 = soup.find('div',{'class':'storylist'}).find('h4')
self.story.setMetadata('title', stripHTML(titleh4.a))
if 'Deleted story' in self.story.getMetadataRaw('title'):
raise exceptions.StoryDoesNotExist("This story was deleted. %s"%self.url)
# Find authorid and URL from... author url.
a = soup.find('span',{'class':'author'}).find('a', href=re.compile(r"^/a/"))
self.story.setMetadata('authorId',a['href'].split('/')[2])
self.story.setMetadata('authorUrl','https://'+self.host+a['href'])
self.story.setMetadata('authorUrl','http://'+self.host+a['href'])
self.story.setMetadata('author',a.string)
# description
@@ -130,14 +146,14 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
#self.story.setMetadata('description', storydiv.find("blockquote",{'class':'summary'}).p.string)
# most of the meta data is here:
metap = storydiv.find("div",{"class":"meta"})
metap = storydiv.find("p",{"class":"meta"})
self.story.addToList('category',metap.find("a",href=re.compile(r"^/category/\d+")).string)
# warnings
# <span class="req"><a href="/help/38" title="Medium Spoilers">[!!] </a> <a href="/help/38" title="Rape/Sexual Violence">[R] </a> <a href="/help/38" title="Violence">[V] </a> <a href="/help/38" title="Child/Underage Sex">[Y] </a></span>
spanreq = metap.find("span",{"class":"story-warnings"})
if spanreq: # can be no warnings.
for a in spanreq.find_all("a"):
for a in spanreq.findAll("a"):
self.story.addToList('warnings',a['title'])
## perhaps not the most efficient way to parse this, using
@@ -149,9 +165,7 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
if m:
self.story.setMetadata('rating', m.group(1))
## Genre appears even if list is empty. But there are a
## limited number of genres allowed by the site.
m = re.match(r".*?Genres: ((?:(?:Angst|Crossover|Drama|Erotica|Fantasy|Horror|Humor|Parody|Romance|Sci-fi)(?:,)?)+) -.*?",metastr)
m = re.match(r".*?Genres: (.+?) -.*?",metastr)
if m:
for g in m.group(1).split(','):
self.story.addToList('genre',g)
@@ -185,24 +199,27 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
storylistul = soup.find('ul',{'class':'storylist'})
if not storylistul:
# no list found, so it's a one-chapter story.
self.add_chapter(self.story.getMetadata('title'),url)
self.chapterUrls.append((self.story.getMetadata('title'),url))
else:
chapterlistlis = storylistul.find_all('li')
chapterlistlis = storylistul.findAll('li')
for chapterli in chapterlistlis:
if "blocked" in chapterli['class']:
# paranoia check. We should already be logged in by now.
raise exceptions.FailedToLogin(url,self.username)
else:
#print "chapterli.h4.a (%s)"%chapterli.h4.a
self.add_chapter(chapterli.h4.a.string,
u'https://%s%s'%(self.getSiteDomain(),
chapterli.h4.a['href']))
self.chapterUrls.append((chapterli.h4.a.string,
u'http://%s%s'%(self.getSiteDomain(),
chapterli.h4.a['href'])))
#print "self.chapterUrls:%s"%self.chapterUrls
self.story.setMetadata('numChapters',len(self.chapterUrls))
return
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
soup = self.make_soup(self.get_request(url))
soup = self.make_soup(self._fetchUrl(url))
span = soup.find('div', {'id' : 'storytext'})
@@ -213,3 +230,4 @@ class FicwadComSiteAdapter(BaseSiteAdapter):
def getClass():
return FicwadComSiteAdapter
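The genre change in the FicWad diff above swaps a greedy `(.+?)` capture for an explicit alternation: because the site only allows a fixed set of genres, spelling them out keeps the regex from swallowing unrelated parts of the flattened meta line. A sketch of that parsing, using the alternation from the diff (the sample meta string is invented):

```python
import re

# Fixed genre list allowed by the site, as used in the diff above.
GENRE_ALTERNATION = ("Angst|Crossover|Drama|Erotica|Fantasy|Horror|"
                     "Humor|Parody|Romance|Sci-fi")

def parse_genres(metastr):
    # Match a comma-run of known genres between 'Genres: ' and ' -'.
    m = re.match(r".*?Genres: ((?:(?:%s)(?:,)?)+) -.*?" % GENRE_ALTERNATION,
                 metastr)
    if not m:
        return []
    return m.group(1).split(',')

genres = parse_genres("Category: Original - Rated: R - Genres: Drama,Romance - ...")
```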


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2020 FanFicFare team
# Copyright 2011 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,22 +15,20 @@
# limitations under the License.
#
from __future__ import absolute_import
import time
from datetime import date, datetime
from datetime import date
from datetime import timedelta
import logging
logger = logging.getLogger(__name__)
import re
import urllib2
import cookielib as cl
import json
from ..htmlcleanup import stripHTML
from .. import exceptions as exceptions
# py2 vs py3 transition
from ..six import text_type as unicode
from ..six.moves import http_cookiejar as cl
from .base_adapter import BaseSiteAdapter, makeDate
from base_adapter import BaseSiteAdapter, makeDate
def getClass():
return FimFictionNetSiteAdapter
@@ -41,12 +39,11 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev','fimficnet')
self.story.setMetadata('storyId', self.parsedUrl.path.split('/',)[2])
self._setURL("https://"+self.getSiteDomain()+"/story/"+self.story.getMetadata('storyId')+"/")
self._setURL("http://"+self.getSiteDomain()+"/story/"+self.story.getMetadata('storyId')+"/")
self.is_adult = False
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
# FYI, not the only format used in this file.
self.dateformat = "%d %b %Y"
@staticmethod
@@ -60,11 +57,18 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
@classmethod
def getSiteExampleURLs(cls):
return "https://www.fimfiction.net/story/1234/story-title-here https://www.fimfiction.net/story/1234/ https://www.fimfiction.com/story/1234/1/ https://mobile.fimfiction.net/story/1234/1/story-title-here/chapter-title-here"
return "http://www.fimfiction.net/story/1234/story-title-here http://www.fimfiction.net/story/1234/ http://www.fimfiction.com/story/1234/1/ http://mobile.fimfiction.net/story/1234/1/story-title-here/chapter-title-here"
def getSiteURLPattern(self):
return r"https?://(www|mobile)\.fimfiction\.(net|com)/story/\d+/?.*"
def use_pagecache(self):
'''
adapters that will work with the page cache need to implement
this and change it to True.
'''
return True
def set_adult_cookie(self):
cookie = cl.Cookie(version=0, name='view_mature', value='true',
port=None, port_specified=False,
@@ -77,57 +81,27 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
comment_url=None,
rest={'HttpOnly': None},
rfc2109=False)
self.get_configuration().get_cookiejar().set_cookie(cookie)
def performLogin(self, url):
params = {}
if self.password:
params['username'] = self.username
params['password'] = self.password
else:
params['username'] = self.getConfig("username")
params['password'] = self.getConfig("password")
params['keep_logged_in'] = '1'
if params['username'] and params['password']:
loginUrl = 'https://' + self.getSiteDomain() + '/ajax/login'
logger.info("Will now login to URL (%s) as (%s)" % (loginUrl,
params['username']))
d = self.post_request(loginUrl, params)
if "signing_key" not in d :
logger.info("Failed to login to URL %s as %s" % (loginUrl,
params['username']))
raise exceptions.FailedToLogin(url,params['username'])
def make_soup(self,data):
soup = super(FimFictionNetSiteAdapter, self).make_soup(data)
for img in soup.select('img.lazy-img, img.user_image'):
## FimF has started a 'camo' mechanism for images that
## gets block by CF. attr data-source is original source.
if img.has_attr('data-source'):
img['src'] = img['data-source']
elif img.has_attr('data-src'):
img['src'] = img['data-src']
return soup
self.get_cookiejar().set_cookie(cookie)
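The `set_adult_cookie` method above shows that fimfiction's mature-content gate is just a `view_mature=true` cookie, which the adapter fabricates and plants in its jar before fetching. A standalone sketch using the py3 stdlib module (the adapter's six shim resolves to the same module on py3); the constructor arguments not visible between the two hunks above are filled with the stdlib `Cookie` signature's usual defaults, so treat those as assumptions:

```python
import http.cookiejar as cl

def make_adult_cookie(domain='www.fimfiction.net'):
    # Fabricate the mature-content cookie the site normally sets
    # when a logged-in user confirms their age.
    return cl.Cookie(version=0, name='view_mature', value='true',
                     port=None, port_specified=False,
                     domain=domain, domain_specified=False,
                     domain_initial_dot=False,
                     path='/', path_specified=True,
                     secure=False, expires=None, discard=False,
                     comment=None, comment_url=None,
                     rest={'HttpOnly': None}, rfc2109=False)

jar = cl.CookieJar()
jar.set_cookie(make_adult_cookie())
```

Any urllib opener built with an `HTTPCookieProcessor(jar)` would then send the cookie on matching requests, which is exactly how the adapter's shared cookiejar is consumed.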
def doExtractChapterUrlsAndMetadata(self,get_cover=True):
if self.is_adult or self.getConfig("is_adult"):
self.set_adult_cookie()
## Only needed with password protected stories, which you have
## to have logged into in the website using this account.
if self.getConfig("always_login"):
self.performLogin(self.url)
##---------------------------------------------------------------------------------------------------
## Get the story's title page. Check if it exists.
# don't use cache if manual is_adult--should only happen
# if it's an adult story and they don't have is_adult in ini.
data = self.do_fix_blockquotes(self.get_request(self.url,
usecache=(not self.is_adult)))
soup = self.make_soup(data)
try:
# don't use cache if manual is_adult--should only happen
# if it's an adult story and they don't have is_adult in ini.
data = self.do_fix_blockquotes(self._fetchUrl(self.url,
usecache=(not self.is_adult)))
soup = self.make_soup(data)
except urllib2.HTTPError, e:
if e.code == 404:
raise exceptions.StoryDoesNotExist(self.url)
else:
raise e
if "Warning: mysql_fetch_array(): supplied argument is not a valid MySQL result resource" in data:
raise exceptions.StoryDoesNotExist(self.url)
@@ -135,6 +109,18 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
if "This story has been marked as having adult content. Please click below to confirm you are of legal age to view adult material in your country." in data:
raise exceptions.AdultCheckRequired(self.url)
if self.password:
params = {}
params['password'] = self.password
data = self._postUrl(self.url, params)
soup = self.make_soup(data)
if not (soup.find('form', {'id' : 'password_form'}) == None):
if self.getConfig('fail_on_password'):
raise exceptions.FailedToDownload("%s requires story password and fail_on_password is true."%self.url)
else:
raise exceptions.FailedToLogin(self.url,"Story requires individual password",passwdonly=True)
##----------------------------------------------------------------------------------------------------
## Extract metadata
@@ -145,14 +131,11 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
self.story.setMetadata('title',stripHTML(title))
# Author
author = soup.find('div', {'class':'info-container'}).find('a')
author = storyContentBox.find('div', {'class':'author'}).find('a')
self.story.setMetadata("author", stripHTML(author))
# /user/288866/Stryker-Shadowpony-Blade
self.story.setMetadata("authorId", author['href'].split('/')[2])
self.story.setMetadata("authorUrl", "https://%s/user/%s/%s" % (self.getSiteDomain(),
self.story.getMetadata('authorId'),
# meta entry author can be changed by the user.
stripHTML(author)))
#No longer seems to be a way to access Fimfiction's internal author ID
self.story.setMetadata("authorId", self.story.getMetadata("author"))
self.story.setMetadata("authorUrl", "http://%s/user/%s" % (self.getSiteDomain(), stripHTML(author)))
#Rating text is replaced with full words for historical compatibility after the site changed
#on 2014-10-27
@@ -161,9 +144,10 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
self.story.setMetadata("rating", rating)
# Chapters
for chapter in soup.find('ul',{'class':'chapters'}).find_all('a',{'class':'chapter-title'}):
self.add_chapter(chapter, 'https://'+self.host+chapter['href'])
for chapter in storyContentBox.find_all('a',{'class':'chapter_link'}):
self.chapterUrls.append((stripHTML(chapter), 'http://'+self.host+chapter['href']))
self.story.setMetadata('numChapters',len(self.chapterUrls))
# Status
# In the case of Fimfiction, possible statuses are 'Completed', 'Incomplete', 'On Hiatus' and 'Cancelled'
@@ -174,53 +158,51 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
status = status.replace("Incomplete", "In-Progress").replace("Complete", "Completed")
self.story.setMetadata("status", status)
# Genres and Warnings
# warnings were folded into general categories in the 2014-10-27 site update
categories = storyContentBox.find_all('a', {'class':re.compile(r'.*\bstory_category\b.*')})
for category in categories:
category = stripHTML(category)
if category == "Gore" or category == "Sex":
self.story.addToList('warnings', category)
else:
self.story.addToList('genre', category)
# Word count
wordCountText = stripHTML(storyContentBox.find('div', {'class':'chapters-footer'}).find('div', {'class':'word_count'}))
wordCountText = stripHTML(storyContentBox.find('li', {'class':'bottom'}).find('div', {'class':'word_count'}))
self.story.setMetadata("numWords", re.sub(r'[^0-9]', '', wordCountText))
# Cover image
if get_cover:
storyImage = soup.select_one('div.story_container__story_image img')
if storyImage:
coverurl = storyImage['data-fullsize']
# try setting from data-fullsize, if fails, try using data-src
cover_set = self.setCoverImage(self.url,coverurl)[0]
if not cover_set or cover_set.startswith("failedtoload"):
coverurl = storyImage['src']
self.setCoverImage(self.url,coverurl)
storyImage = storyContentBox.find('div', {'class':'story_image'})
if storyImage:
coverurl = storyImage.find('a')['href']
if coverurl.startswith('//'): # fix for img urls missing 'http:'
coverurl = "http:"+coverurl
if get_cover:
# try setting from href, if fails, try using the img src
if self.setCoverImage(self.url,coverurl)[0] == "failedtoload":
img = storyImage.find('img')
# try src, then data-src, then leave None.
coverurl = img.get('src',img.get('data-src',None))
if coverurl:
self.setCoverImage(self.url,coverurl)
coverSource = storyImage.parent.find('a', {'class':'source'})
if coverSource:
self.story.setMetadata('coverSourceUrl', coverSource['href'])
# There's no text associated with the cover source
# link, so just reuse the URL. Makes it clear it's
# an external link leading outside of the fanfic
# site, at least.
self.story.setMetadata('coverSource', coverSource['href'])
coverSource = storyImage.find('a', {'class':'source'})
if coverSource:
self.story.setMetadata('coverSourceUrl', coverSource['href'])
#There's no text associated with the cover source link, so just
#reuse the URL. Makes it clear it's an external link leading
#outside of the fanfic site, at least.
self.story.setMetadata('coverSource', coverSource['href'])
# fimf has started including extra stuff inside the description div.
# specifically, the prequel link
description = storyContentBox.find("span", {"class":"description-text"})
description.name='div' # change to div, technically, spans
# aren't supposed to contain <p>'s.
descdivstr = u"%s"%description # string, but not stripHTML'ed
#The link to the prequel is embedded in the description text, so erring
#on the side of caution and wrapping this whole thing in a try block.
#If anything goes wrong this probably wasn't a valid prequel link.
try:
if "This story is a sequel to" in stripHTML(description):
link = description.find('a') # assume first link.
self.story.setMetadata("prequelUrl", 'https://'+self.host+link["href"])
self.story.setMetadata("prequel", stripHTML(link))
if not self.getConfig('keep_prequel_in_description',False):
hrstr=u"<hr/>"
descdivstr = u'<div class="description">'+descdivstr[descdivstr.index(hrstr)+len(hrstr):]
except:
logger.info("Prequel parsing failed...")
descdivstr = u"%s"%storyContentBox.find("div", {"class":"description"})
hrstr=u"<hr/>"
descdivstr = u'<div class="description">'+descdivstr[descdivstr.index(hrstr)+len(hrstr):]
self.setDescription(self.url,descdivstr)
# Find the newest and oldest chapter dates
storyData = storyContentBox.find('ul', {'class':'chapters'})
storyData = storyContentBox.find('div', {'class':'story_data'})
oldestChapter = None
newestChapter = None
self.newestChapterNum = None # save for comparing during update.
@@ -248,7 +230,7 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
# Date published
# falls back to oldest chapter date for stories that haven't been officially published yet
pubdatetag = storyContentBox.find('span', {'class':'approved-date'})
pubdatetag = storyContentBox.find('span', {'class':'date_approved'})
if pubdatetag is None:
if oldestChapter is None:
#this will only be true when updating metadata for stories that have 0 chapters
@@ -258,25 +240,16 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
else:
self.story.setMetadata("datePublished", oldestChapter)
else:
pubDate = self.date_span_tag_to_date(pubdatetag)
pubDate = self.ordinal_date_string_to_date(pubdatetag('span')[1].text)
self.story.setMetadata("datePublished", pubDate)
# Characters
tags = storyContentBox.find("ul", {"class":"story-tags"})
for character in tags.find_all("a", {"class":"tag-character"}):
self.story.addToList("characters", stripHTML(character))
for genre in tags.find_all("a", {"class":"tag-genre"}):
self.story.addToList("genre", stripHTML(genre))
for series in tags.find_all("a", {"class":"tag-series"}):
#using 'fandoms' as the identifier to standardize with archiveofourown.org
self.story.addToList("fandoms", stripHTML(series))
for warning in tags.find_all("a", {"class":"tag-warning"}):
self.story.addToList("warnings", stripHTML(warning))
for content in tags.find_all("a", {"class":"tag-content"}):
self.story.addToList("content", stripHTML(content))
chars = storyContentBox.find("div", {"class":"extra_story_data"})
for character in chars.find_all("a", {"class":"character_icon"}):
self.story.addToList("characters", character['title'])
# Likes and dislikes
storyToolbar = soup.find('div', {'class':'story-top-toolbar'})
storyToolbar = soup.find('div', {'class':'story-toolbar'})
likes = storyToolbar.find('span', {'class':'likes'})
if not likes is None:
self.story.setMetadata("likes", stripHTML(likes))
@@ -286,9 +259,8 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
# Highest view for a chapter and total views
viewSpan = storyToolbar.find('span', {'title':re.compile(r'.*\btotal views\b.*')})
viewResults = re.search(r'([0-9]*) views \/ ([0-9]*)', viewSpan['title'].replace(',',''))
self.story.setMetadata("views", viewResults.group(1))
self.story.setMetadata("total_views", viewResults.group(2))
self.story.setMetadata("views", re.sub(r'[^0-9]', '', stripHTML(viewSpan)))
self.story.setMetadata("total_views", re.sub(r'[^0-9]', '', viewSpan['title']))
# Comment count
commentSpan = storyToolbar.find('span', {'title':re.compile(r'.*\bcomments\b.*')})
@@ -298,68 +270,59 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
descriptionMeta = soup.find('meta', {'property':'og:description'})
self.story.setMetadata("short_description", stripHTML(descriptionMeta['content']))
# groups.
# If there are more than X groups, there's a 'Show all' button
# that calls for a JSON containing HTML with the full list.
# But it doesn't work reliably with FlareSolverr.
groupList = None
groupButton = soup.find('button', {'data-click':'showAll'})
if groupButton != None and groupButton.find('i', {'class':'fa-search-plus'}):
try:
groupResponse = self.get_request("https://www.fimfiction.net/ajax/stories/%s/groups" % (self.story.getMetadata("storyId")))
groupData = json.loads(groupResponse)
groupList = self.make_soup(groupData["content"])
except Exception as e:
logger.warning("Collecting 'groups' (AKA 'Featured In') from JSON failed:%s"%e)
logger.warning("Only 'groups' initially shown on the page will be collected.")
logger.warning("This is a known issue with JSON and FlareSolverr. See #1122")
if not groupList:
#groups
if soup.find('button', {'id':'button-view-all-groups'}):
groupResponse = self._fetchUrl("https://www.fimfiction.net/ajax/stories/%s/groups" % (self.story.getMetadata("storyId")))
groupData = json.loads(groupResponse)
groupList = self.make_soup(groupData["content"])
else:
groupList = soup.find('ul', {'id':'story-groups-list'})
if groupList:
for groupContent in groupList.find_all('a'):
self.story.addToList("groupsUrl", 'https://'+self.host+groupContent["href"])
groupName = groupContent.find('span', {"class":"group-name"})
if groupName != None:
self.story.addToList("groups",stripHTML(groupName).replace(',', ';'))
else:
self.story.addToList("groups",stripHTML(groupContent).replace(',', ';'))
if not (groupList == None):
for groupName in groupList.find_all('a'):
self.story.addToList("groupsUrl", 'http://'+self.host+groupName["href"])
self.story.addToList("groups",stripHTML(groupName).replace(',', ';'))
#sequels
for header in soup.find_all('h1', {'class':'header-stories'}):
# I don't know why using string=re.compile with find() wouldn't work, but it didn't.
# I don't know why using text=re.compile with find() wouldn't work, but it didn't.
if header.text.startswith('Sequels'):
sequelContainer = header.parent
for sequel in sequelContainer.find_all('a', {'class':'story_link'}):
self.story.addToList("sequelsUrl", 'https://'+self.host+sequel["href"])
self.story.addToList("sequelsUrl", 'http://'+self.host+sequel["href"])
self.story.addToList("sequels", stripHTML(sequel).replace(',', ';'))
#author last login
userPageHeader = soup.find('div', {'class':'user-page-header'})
userPageHeader = soup.find('div', {'class':re.compile(r'\buser-page-header\b')})
if not userPageHeader == None:
infoContainer = userPageHeader.find('ul', {'class':'mini-info-box'})
infoContainer = userPageHeader.find('div', {'class':re.compile(r'\binfo-container\b')})
listItems = infoContainer.find_all('li')
lastLoginString = stripHTML(listItems[1])
lastLogin = None
if "online" in lastLoginString:
lastLogin = date.today()
elif "offline" in lastLoginString:
lastLogin = self.date_span_tag_to_date(listItems[1])
#this regex extracts the number of weeks and the number of days from the last login string.
#durations under a day are ignored.
#group 1 is weeks, group 2 is days
durationGroups = re.match(r"(?:[^0-9]*(\d+?)w)?[^0-9]*(?:(\d+?)d)?", lastLoginString)
lastLogin = date.today() - timedelta(days=int(durationGroups.group(2) or 0), weeks=int(durationGroups.group(1) or 0))
self.story.setMetadata("authorLastLogin", lastLogin)
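The duration regex in the old branch above can be checked on its own. This sketch uses an invented "last seen" string (the real page text may differ); group 1 is weeks, group 2 is days, and sub-day durations fall through as zero:

```python
import re
from datetime import date, timedelta

# Hypothetical author-page string for illustration only.
lastLoginString = "offline, last seen 2w, 3d ago"

# Optional weeks ("2w") and days ("3d") segments; either may be absent.
durationGroups = re.match(r"(?:[^0-9]*(\d+?)w)?[^0-9]*(?:(\d+?)d)?", lastLoginString)
lastLogin = date.today() - timedelta(
    days=int(durationGroups.group(2) or 0),
    weeks=int(durationGroups.group(1) or 0))
```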
def date_span_tag_to_date(self, containingtag):
## <span data-time="1435421997" title="Saturday 27th of June 2015 @4:19pm">Jun 27th, 2015</span>
## No timezone adjustment is done.
span = containingtag.find('span',{'data-time':re.compile(r'^\d+$')})
if span != None:
return datetime.fromtimestamp(float(span['data-time']))
## Sometimes, for reasons that are unclear, data-time is not present. Parse the date out of the title instead.
else:
span = containingtag.find('span', title=True)
dateRegex = re.search('([a-zA-Z ]+)([0-9]+)(st of|th of|nd of|rd of)([a-zA-Z ]+[0-9]+)', span['title'])
dateString = dateRegex.group(2) + dateRegex.group(4)
return makeDate(dateString, "%d %B %Y")
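The fallback branch of `date_span_tag_to_date` can be sketched against the example title quoted in the method's own comment ("Saturday 27th of June 2015 @4:19pm"); `datetime.strptime` stands in for the adapter's `makeDate` helper here:

```python
import re
from datetime import datetime

# Example title text taken from the doc comment above.
title = "Saturday 27th of June 2015 @4:19pm"

# Split the title into weekday, day-of-month, ordinal suffix, and "Month Year".
dateRegex = re.search(r'([a-zA-Z ]+)([0-9]+)(st of|th of|nd of|rd of)([a-zA-Z ]+[0-9]+)', title)
dateString = dateRegex.group(2) + dateRegex.group(4)  # day + " Month Year"
parsed = datetime.strptime(dateString, "%d %B %Y")
```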
#The link to the prequel is embedded in the description text, so erring
#on the side of caution and wrapping this whole thing in a try block.
#If anything goes wrong this probably wasn't a valid prequel link.
try:
description = soup.find('div', {'class':'description'})
firstHR = description.find("hr")
nextSib = firstHR.nextSibling
if "This story is a sequel to" in nextSib.string:
link = nextSib.nextSibling
if link.name == "a":
self.story.setMetadata("prequelUrl", 'http://'+self.host+link["href"])
self.story.setMetadata("prequel", stripHTML(link))
except:
pass
def ordinal_date_string_to_date(self, datestring):
datestripped=re.sub(r"(\d+)(st|nd|rd|th)", r"\1", datestring.strip())
@@ -383,58 +346,21 @@ class FimFictionNetSiteAdapter(BaseSiteAdapter):
def getChapterText(self, url):
logger.debug('Getting chapter text from: %s' % url)
data = self.get_request(url)
data = self._fetchUrl(url)
soup = self.make_soup(data)
if not (soup.find('form', {'id' : 'password_form'}) == None):
if self.password:
params = {}
params['password'] = self.password
data = self._postUrl(url, params)
else:
logger.error("Chapter %s needed password but no password was present" % url)
data = self.do_fix_blockquotes(data)
if self.getConfig("include_author_notes",True):
soup = self.make_soup(data).find_all('div', {'class':re.compile(r'(.*\bauthors-note\b.*|.*\bchapter-body\b.*)')})
if soup == None:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
chapter_divs = [unicode(div) for div in soup]
soup = self.make_soup(" ".join(chapter_divs))
else:
soup = self.make_soup(data).find('div', {'id' : 'chapter-body'})
if soup == None:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
soup = self.make_soup(data).find('div', {'class' : 'chapter_content'})
if soup == None:
raise exceptions.FailedToDownload("Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url,soup)
def before_get_urls_from_page(self,url,normalize):
## Unlike most that show the links to 'adult' stories, but protect
## them, FimF doesn't even show them if not logged in.
# data = self.get_request(url)
if self.getConfig("is_adult"):
self.set_adult_cookie()
def get_urls_from_page(self,url,normalize):
iterate = self.getConfig('scrape_bookshelf', default=False)
if not re.search(r'fimfiction\.net/bookshelf/(?P<listid>.+?)/',url) or iterate == 'legacy':
return super().get_urls_from_page(url,normalize)
self.before_get_urls_from_page(url,normalize)
final_urls = list()
while True:
data = self.get_request(url,usecache=True)
soup = self.make_soup(data)
paginator = soup.select_one('div.paginator-container > div.page_list > ul').find_all('li')
logger.debug("Paginator: " + str(len(paginator)))
stories_container = soup.select_one('div.content > div.two-columns > div.left').find_all('article', recursive=False)
x = 0
logger.debug("Container "+str(len(stories_container)))
for story_raw in stories_container:
x += 1
story_url = story_raw.select_one('div.story_content_box > header.title > div > a.story_name').get('href')
url_story = ('https://' + self.getSiteDomain() + story_url)
#logger.debug(url_story)
final_urls.append(url_story)
logger.debug("Discovered %s new stories."%str(x))
next_button = paginator[-1].select_one('a')
logger.debug("Next button: " + next_button.get_text())
if next_button.get_text() or not iterate:
return {'urllist': final_urls}
url = ('https://' + self.getSiteDomain() + next_button.get('href'))


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2013 Fanficdownloader team, 2020 FanFicFare team
# Copyright 2013 Fanficdownloader team, 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,26 +15,32 @@
# limitations under the License.
#
from __future__ import absolute_import
import logging
logger = logging.getLogger(__name__)
# py2 vs py3 transition
from .adapter_storiesonlinenet import StoriesOnlineNetAdapter
from adapter_storiesonlinenet import StoriesOnlineNetAdapter
def getClass():
return SciFiStoriesComAdapter
return FineStoriesComAdapter
# Class name has to be unique. Our convention is camel case the
# sitename with Adapter at the end. www is skipped.
class SciFiStoriesComAdapter(StoriesOnlineNetAdapter):
class FineStoriesComAdapter(StoriesOnlineNetAdapter):
@classmethod
def getSiteAbbrev(cls):
return 'sfst'
def getSiteAbbrev(self):
return 'fnst'
@staticmethod # must be @staticmethod, don't remove it.
def getSiteDomain():
# The site domain. Does have www here, if it uses it.
return 'scifistories.com'
return 'finestories.com'
## Login seems to be reasonably standard across eFiction sites.
def needToLoginCheck(self, data):
if 'Free Registration' in data \
or "Log In" in data \
or "Invalid Password!" in data \
or "Invalid User Name!" in data:
return True
else:
return False


@@ -1,172 +0,0 @@
# -*- coding: utf-8 -*-
# Copyright 2011 Fanficdownloader team, 2018 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
####################################################################################################
# Adapted by GComyn - December 10, 2016
####################################################################################################
from __future__ import absolute_import
''' This adapter will download the stories from the www.fireflyfans.net forum pages '''
import logging
import re
# py2 vs py3 transition
from ..six import text_type as unicode
from .base_adapter import BaseSiteAdapter, makeDate
from .. import exceptions as exceptions
from ..htmlcleanup import stripHTML
logger = logging.getLogger(__name__)
####################################################################################################
def getClass():
return FireFlyFansNetSiteAdapter
####################################################################################################
class FireFlyFansNetSiteAdapter(BaseSiteAdapter):
def __init__(self, config, url):
BaseSiteAdapter.__init__(self, config, url)
self.story.setMetadata('siteabbrev', 'fffans')
self.is_adult = False
# get storyId from url--url validation guarantees query is only
# sid=1234
self.story.setMetadata('storyId', self.parsedUrl.query.split('=',)[1])
# normalized story URL.
self._setURL('http://' + self.getSiteDomain() +
'/bluesun.aspx?bid=' + self.story.getMetadata('storyId'))
# The date format will vary from site to site.
# http://docs.python.org/library/datetime.html#strftime-strptime-behavior
self.dateformat = "%B %d, %Y"
################################################################################################
@staticmethod
def getSiteDomain():
return 'www.fireflyfans.net'
################################################################################################
@classmethod
def getSiteExampleURLs(cls):
return "http://" + cls.getSiteDomain() + "/bluesun.aspx?bid=1234"
################################################################################################
def getSiteURLPattern(self):
return re.escape("http://" + self.getSiteDomain() + "/bluesun.aspx?bid=") + r"\d+$"
################################################################################################
def extractChapterUrlsAndMetadata(self):
url = self.url
logger.debug("URL: " + url)
data = self.get_request(url)
if 'Something bad happened, but hell if I know what it is.' in data:
raise exceptions.StoryDoesNotExist(
'{0} says: GORAMIT!!! SOMETHING WENT WRONG! Something bad happened, but hell if I know what it is.'.format(self.url))
soup = self.make_soup(data)
# Title
a = soup.find('span', {'id': 'MainContent_txtItemName'})
self.story.setMetadata('title', stripHTML(a))
# Find authorid and URL from... author url.
a = soup.find('a', href=re.compile(r"profileshow.aspx\?u="))
self.story.setMetadata('authorId', a['href'].split('=')[1])
if not self.story.getMetadata('authorId'):
logger.warning("Site authorUrl missing authorId, using SiteMissingAuthorId")
self.story.setMetadata('authorId', 'SiteMissingAuthorId')
self.story.setMetadata('authorUrl', 'http://' +
self.host + '/' + a['href'])
self.story.setMetadata('author', a.string)
# This site has all "chapters" on one page. Also, there is no easy systematic
# way to determine if there are other chapters to the same story, so you have
# to download them one at a time yourself. I'm also setting the status to
# complete
self.add_chapter(self.story.getMetadata('title'), self.url)
self.story.setMetadata('status', 'Completed')
## some stories do not have a summary listed, so I'm setting it here.
summary = soup.find('span', {'id': 'MainContent_txtItemDescription'})
summary = stripHTML(summary)
if not summary:
self.setDescription(url, '>>>>>>>>>> No Summary Given <<<<<<<<<<')
else:
self.setDescription(url, summary)
# There is not a lot of Metadata with this site, so we get what we can.
pubdate = soup.find('span', {'id': 'MainContent_txtItemInfo'})
pubdate = stripHTML(pubdate)
pubdate = pubdate[pubdate.find(', ') + 1:]
self.story.setMetadata('datePublished', makeDate(
pubdate.strip(), self.dateformat))
# The only Metadata that I can find is the Category (usually Fiction) and the series
# which is usually Firefly on this site, but I'm going to get them
# anyway.
category = soup.find('span', {'id': 'MainContent_txtItemDetails'})
category = stripHTML(unicode(category).replace(u"\xa0", u' '))
metad = category.split(' ')
for meta in metad:
if ":" in meta:
label = meta.split(':')[0].strip()
value = meta.split(':')[1].strip()
if label == 'CATEGORY':
self.story.setMetadata('category', value)
elif label == 'SERIES':
# There is no easy way to determine which number the current 'story' is
# in the total story, so I'm just going to set the series
# name here
self.story.setMetadata('series', value)
else:
# This catches the elements I am not interested
# in, such as Times Read and Rating (which is a
# Number, not a determination on the content)
zzzzzzz = 0
# The genre is contained in a tag that has 'BLUE SUN ROOM FAN FICTION - ' as part of
# the text, so we get it, then remove that text
genre = soup.find('span', {'id': 'MainContent_txtBlueSunHeader'})
genre = stripHTML(genre).replace('BLUE SUN ROOM FAN FICTION - ', '')
self.story.setMetadata('genre', genre.title())
# since the 'story' is one page, I am going to save the soup here, so we can use iter
# to get the story text in the getChapterText function, instead of having to retrieve
# it again.
self.html = soup
################################################################################################
def getChapterText(self, url):
logger.debug('Using the html retrieved previously from: %s' % url)
soup = self.html
span = soup.find('div', {'class': 'fanfic'})
if not span:
raise exceptions.FailedToDownload(
"Error downloading Chapter: %s! Missing required element!" % url)
return self.utf8FromSoup(url, span)


@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
# Copyright 2024 FanFicFare team
# Copyright 2015 FanFicFare team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -15,18 +15,15 @@
# limitations under the License.
#
from __future__ import absolute_import
import re
from .base_xenforo2forum_adapter import BaseXenForo2ForumAdapter
from base_xenforoforum_adapter import BaseXenForoForumAdapter
def getClass():
return QuestionablequestingComAdapter
class QuestionablequestingComAdapter(BaseXenForo2ForumAdapter):
class QuestionablequestingComAdapter(BaseXenForoForumAdapter):
def __init__(self, config, url):
BaseXenForo2ForumAdapter.__init__(self, config, url)
BaseXenForoForumAdapter.__init__(self, config, url)
# Each adapter needs to have a unique site abbreviation.
self.story.setMetadata('siteabbrev','qq')
@@ -36,12 +33,3 @@ class QuestionablequestingComAdapter(BaseXenForo2ForumAdapter):
# The site domain. Does have www here, if it uses it.
return 'forum.questionablequesting.com'
@classmethod
def getAcceptDomains(cls):
return [cls.getSiteDomain(),
cls.getSiteDomain().replace('forum.','')]
def getSiteURLPattern(self):
## QQ accepts forum.questionablequesting.com and questionablequesting.com
## We will use forum. as canonical for all
return super(QuestionablequestingComAdapter, self).getSiteURLPattern().replace(re.escape("forum."),r"(forum\.)?")

Some files were not shown because too many files have changed in this diff.