This required introducing a track_distance method on plugins. We'll also need
to add an album_distance method and a mechanism for extending the search
routine (so we can search for albums in MusicBrainz even when they have no
tags). This commit also adds the '-v' flag for printing debug logs (something
we should do more of).
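A rough sketch of how a '-v' flag can switch on debug logging, using the standard argparse and logging modules. The logger name "tagger" and the parse_verbosity helper are hypothetical and are not taken from this commit.

```python
import argparse
import logging

# Hypothetical logger name, used only for this illustration.
log = logging.getLogger("tagger")


def parse_verbosity(argv):
    """Parse argv and return the log level selected by '-v'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="print debug logs")
    args = parser.parse_args(argv)
    return logging.DEBUG if args.verbose else logging.INFO


# Example: enable debug output as if '-v' had been passed.
logging.basicConfig()
log.setLevel(parse_verbosity(["-v"]))
log.debug("this message is only emitted at debug level")
```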
(I'm not sure why, but the weight for track index mismatches had been set to
0.0. With a nonzero weight, the tagger will be slightly more reluctant to
frivolously reorder tracks.)
When computing track destination paths, we now look for album-level values when
they're available. As a result, an album goes into a single directory even when
its tracks have heterogeneous metadata. We will need to revisit this once we
start explicitly supporting non-album tracks.
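To illustrate the idea, here is a minimal sketch in which album-level values override per-track values when building the destination. The destination function, the dictionary keys, and the artist/album/title layout are assumptions made for the example, not the actual path template.

```python
import os


def destination(item, album=None):
    """Build 'artist/album/title' for a track.

    Album-level values, when present, take priority over the track's own
    so that every track of an album lands in the same directory.
    """
    values = dict(item)
    if album is not None:
        for key in ("artist", "album"):
            if album.get(key):
                values[key] = album[key]
    return os.path.join(values["artist"], values["album"], values["title"])


# Two tracks of the same album with different track-level artists.
album_info = {"artist": "Various", "album": "Mix"}
track_one = {"artist": "A", "album": "Mix", "title": "One"}
track_two = {"artist": "B", "album": "Mix", "title": "Two"}
same_dir = (os.path.dirname(destination(track_one, album_info))
            == os.path.dirname(destination(track_two, album_info)))
```

Without the album argument the two tracks would end up under different artist directories, which is the heterogeneous-metadata case described above.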
In the end, it turns out that we need to give up on dealing with unicode paths
altogether. The POSIX filesystem API has no notion of unicode; it is a
bytes-only interface, so undecodable pathnames are a reality we must deal
with. This new approach stores all paths as buffers (blobs) in SQLite and, as
transparently as possible, presents them as str objects to the Python code.
Paths in legacy databases are automatically encoded into str objects, and the
unicode values stored in the database are lazily replaced with buffers.
Decoding a path as latin1 when it appears undecodable is a non-solution
because, the next time we want to actually *use* the path, it will be encoded
differently and the file won't be found. Death to undecodable paths!
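A small sketch of both points, written for Python 3, where bytes plays the role that str and buffer play in the text above. The table name and the sample path are made up for the example.

```python
import sqlite3

# A latin1-encoded filename; these bytes are not valid UTF-8.
raw_path = b"music/caf\xe9.mp3"

# Storing the path as a BLOB round-trips the exact bytes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (path BLOB)")
conn.execute("INSERT INTO items (path) VALUES (?)", (raw_path,))
stored = conn.execute("SELECT path FROM items").fetchone()[0]
conn.close()

# Guessing latin1 and later re-encoding as UTF-8 produces different bytes,
# so the re-encoded path no longer names the original file.
guessed = raw_path.decode("latin1")
reencoded = guessed.encode("utf-8")
```

Here stored equals raw_path byte for byte, while reencoded contains the two-byte UTF-8 sequence for the accented character and so differs from the name on disk.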
In the multithreaded version, the "directory done" state was written before
other progress states, so it was overwritten. This was because I had stupidly
put the "done" message in the initial generator, which of course finishes
before the rest of the pipeline does. This manifested as two problems: the
tagger would always want to "resume" even when it had finished the last time,
and "aBort"ing the process would not cause the next run to resume.
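The ordering problem can be imitated without threads. In this simplified, sequential sketch the first stage runs to completion before the later stage, which is the worst case of what the threaded pipeline allows. The function names and the state dictionary are hypothetical.

```python
import queue


def first_stage(dirs, out_queue, state):
    """Enqueue every directory, then (too early) mark the run as done."""
    for d in dirs:
        out_queue.put(d)
    state["progress"] = "done"  # written before later stages have run


def later_stage(in_queue, state):
    """Process queued directories, recording progress for each one."""
    while not in_queue.empty():
        state["progress"] = in_queue.get()


def run(dirs):
    """Run both stages in order and return the final progress value."""
    state = {}
    q = queue.Queue()
    first_stage(dirs, q, state)
    later_stage(q, state)
    return state["progress"]


final_progress = run(["a", "b"])
```

The final value is the last directory rather than "done", so a later run would believe there is something to resume.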
So. Apparently, os.listdir() will *try* to give you Unicode when you give it
Unicode, but will occasionally give you bytestrings when it can't decode a
filename. Also, I've now had two separate reports from users whose filesystems
report a UTF-8 filesystem encoding but whose files contain latin1 characters.
The choices were to (a) switch over to bytestrings entirely for filenames or
(b) just deal with the badly-encoded filenames. Option (a) is very unattractive
because it requires me to store bytestrings in SQLite (which is not only
complicated but would require more code to deal with legacy databases) and
complicates the construction of pathnames from (Unicode) metadata. Therefore,
I've implemented a static fallback to latin1 if the default pathname decode
fails. Furthermore, if that also fails, the _sorted_walk function just ignores
the badly-encoded file (and logs an error).
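A minimal sketch of the fallback, written for Python 3. The decode_filename helper and its signature are hypothetical; this is not the code of _sorted_walk itself.

```python
import logging
import sys

log = logging.getLogger(__name__)


def decode_filename(raw, encoding=None):
    """Decode a bytes filename, falling back to latin1.

    Returns None (and logs an error) if no candidate encoding works.
    """
    encoding = encoding or sys.getfilesystemencoding()
    for candidate in (encoding, "latin1"):
        try:
            return raw.decode(candidate)
        except UnicodeDecodeError:
            continue
    # In Python 3 the latin1 codec accepts every byte value, so this
    # branch is purely defensive.
    log.error("skipping badly-encoded filename: %r", raw)
    return None


utf8_name = decode_filename(b"caf\xc3\xa9", "utf-8")
latin1_name = decode_filename(b"caf\xe9", "utf-8")
```

Both calls produce the same text: the first decodes as UTF-8 directly, and the second only succeeds through the latin1 fallback.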
Previously, we tried to shut down everything very nicely by sending along a
channel poison message when an exception occurred. That, of course, was
disastrous because some of the pipeline was no longer running and the poison
was unlikely to get all the way through. Now we just abort every thread and
clear every queue (to force the abort even when blocking on enqueues). This
problem manifested as a deadlock when an exception occurred in the final
stage.
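A sketch of the abort-and-clear idea using the standard threading and queue modules. It only shows a single producer blocked on a full queue; the function names and the use of an Event as the abort flag are assumptions, not the pipeline's actual classes.

```python
import queue
import threading


def producer(out_queue, abort_flag):
    """First stage: enqueue values until finished or told to abort."""
    for value in range(1000):
        if abort_flag.is_set():
            return
        out_queue.put(value)  # blocks while the queue is full


def clear_queue(q):
    """Discard everything in q so that a blocked put() can return."""
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            return


def demo_abort():
    """Return True if the producer stops after the abort."""
    q = queue.Queue(maxsize=1)
    abort_flag = threading.Event()
    thread = threading.Thread(target=producer, args=(q, abort_flag),
                              daemon=True)
    thread.start()
    # There is no consumer, so the producer soon blocks on put().
    abort_flag.set()   # ask the thread to stop
    clear_queue(q)     # unblock a pending put() so it can see the flag
    thread.join(timeout=5)
    return not thread.is_alive()


producer_stopped = demo_abort()
```

Setting the flag alone would not be enough here: a thread stuck in put() never gets to check it, which is why the queue is emptied as well.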
Previously, the producer thread (i.e., the first stage) would continue running
to completion even when an exception was raised! And, depending on the size of
the queue, deadlock was even possible if the next stage was no longer consuming
the produced values.
This makes the apply_choices coroutine run even for albums that are skipped or
still in the library. This (along with making things more predictable) lets the
apply_choices stage write the progress value as albums are retired even if they
are skipped.
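Roughly, the stage can be pictured as a generator-based coroutine that records progress for every album it receives, whether or not anything is applied. The message shape (path, choice) and the progress list are assumptions made for this sketch.

```python
def apply_choices(progress_log):
    """Coroutine: receive (path, choice) pairs and record each path."""
    while True:
        path, choice = yield
        if choice == "apply":
            pass  # the real stage would write tags and move files here
        # Progress is recorded for every album, including skipped ones.
        progress_log.append(path)


def run(choices):
    """Feed all choices through the coroutine and return the progress log."""
    progress = []
    stage = apply_choices(progress)
    next(stage)  # advance to the first yield
    for message in choices:
        stage.send(message)
    stage.close()
    return progress


recorded = run([("album_a", "apply"), ("album_b", "skip")])
```

Both albums appear in the progress log, so the skipped one still counts as handled.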