deal with invalid pathname encodings

So. Apparently, os.listdir() will *try* to give you Unicode when you give it Unicode, but will occasionally give you bytestrings when it can't decode a filename. Also, I've now had two separate reports from users whose filesystems report a UTF-8 filesystem encoding but whose files contain latin1 characters. The choices were to (a) switch over to bytestrings entirely for filenames or (b) just deal with the badly-encoded filenames. Option (a) is very unattractive because it requires me to store bytestrings in sqlite (which is not only complicated but would require more code to deal with legacy databases) and complicates the construction of pathnames from (Unicode) metadata. Therefore, I've implemented a static fallback to latin1 if the default pathname decode fails. Furthermore, if that also fails, the _sorted_walk function just ignores the badly-encoded file (and logs an error).
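
A minimal sketch of the fallback described above, written for Python 3 rather than the Python 2 code this commit touches. The helper name `decode_path` and its optional `encoding` parameter are hypothetical and exist only for illustration; the real change is to `_unicode_path` in the diff. Note that Latin-1 maps every byte value to a code point, so the second decode cannot itself raise.

```python
import sys


def decode_path(raw, encoding=None):
    """Return a text path, falling back to Latin-1 (hypothetical helper)."""
    if isinstance(raw, str):
        # Already text: nothing to decode.
        return raw
    # Assumption: an explicit encoding may be passed in, e.g. for testing.
    encoding = encoding or sys.getfilesystemencoding() or sys.getdefaultencoding()
    try:
        return raw.decode(encoding)
    except UnicodeError:
        # Static fallback: Latin-1 accepts any byte sequence.
        return raw.decode('latin1')
```

With `encoding='utf-8'`, both `b'caf\xc3\xa9'` (valid UTF-8) and `b'caf\xe9'` (Latin-1 bytes that are invalid as UTF-8) decode to the same text.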
2025-12-15 21:14:19 +01:00 · 2010-08-04 11:06:28 -07:00
commit 0c87e2470a
parent 959c6e55c3
3 changed files with 25 additions and 3 deletions
--- a/2
+++ b/2
@@ -35,6 +35,8 @@
   Windows users can now just type "beet" at the prompt to run beets.
 * Fixed an occasional bug where Mutagen would complain that a tag was
   already present.
+* Fixed some errors with filenames that have badly encoded special
+  characters.

 1.0b3
 -----
--- a/beets/autotag/__init__.py
+++ b/beets/autotag/__init__.py
@@ -87,7 +87,16 @@ def _sorted_walk(path):
     dirs = []
     files = []
     for base in os.listdir(path):
-        base = library._unicode_path(base)
+        # While os.listdir() will try to give us unicode output (as
+        # we gave it unicode input), it may fail to decode some
+        # filenames.
+        try:
+            base = library._unicode_path(base)
+        except UnicodeError:
+            # Log and ignore undecodeable filenames.
+            log.error(u'invalid filename in %s' % path)
+            continue
+
         cur = os.path.join(path, base)
         if os.path.isdir(cur):
             dirs.append(base)
@@ -101,7 +110,6 @@ def _sorted_walk(path):

     # Recurse into directories.
     for base in dirs:
-        base = library._unicode_path(base)
         cur = os.path.join(path, base)
         # yield from _sorted_walk(cur)
         for res in _sorted_walk(cur):
--- a/beets/library.py
+++ b/beets/library.py
@@ -164,7 +164,19 @@ def _unicode_path(path):
     """Ensures that a path string is in Unicode."""
     if isinstance(path, unicode):
         return path
-    return path.decode(sys.getfilesystemencoding())
+    encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
+    try:
+        out = path.decode(encoding)
+    except UnicodeError:
+        # This is of course extremely hacky, but I've received several
+        # reports of filesystems misrepresenting their encoding as
+        # UTF-8 and actually providing Latin-1 strings. This helps
+        # handle those cases. All this is the cost of dealing
+        # exclusively with Unicode pathnames internally (which
+        # simplifies their construction from metadata and storage in
+        # SQLite).
+        out = path.decode('latin1')
+    return out

 # Note: POSIX actually supports \ and : -- I just think they're
 # a pain. And ? has caused problems for some.
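
The log-and-skip behaviour added to `_sorted_walk` above can be sketched in isolation as follows. This is an illustration in Python 3, not the code from this commit: the function name `decodable_names`, the logger name, and the strict decode (with no Latin-1 fallback, so that the skip path is actually reachable) are all assumptions made for the example.

```python
import logging

log = logging.getLogger('walk_sketch')  # hypothetical logger name


def decodable_names(raw_names, encoding='utf-8'):
    """Yield text names, logging and skipping any that fail to decode."""
    for raw in raw_names:
        if isinstance(raw, str):
            # Already text: pass through unchanged.
            yield raw
            continue
        try:
            name = raw.decode(encoding)
        except UnicodeError:
            # Log and ignore the undecodable entry, then move on.
            log.error('invalid filename: %r', raw)
            continue
        yield name
```

Feeding it the entries of a directory listing (for instance, the result of `os.listdir()` called with a bytes path) gives a walk that tolerates individual bad names instead of aborting on the first one.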