AP

Find Non-UTF8 Filenames on Linux File System

I have a number of files hiding in my LANG=en_US.UTF-8 filesystem that were uploaded with unrecognisable characters in their filenames.

I need to search the filesystem and return all filenames that have at least one character outside the standard range (a-zA-Z0-9 and .-_ etc.).

I have been trying the following, but with no luck.

find . | egrep [^a-zA-Z0-9_\.\/\-\s]

I'm using Fedora Core 9.

Upvotes: 9

Views: 14247

Answers (4)

Fedir RYKHTIK

Reputation: 9974

find . | perl -ne 'print if /[^[:ascii:]]/'

Upvotes: 8

asoundmove

Reputation: 1322

I had a similar problem to the OP, for which I was given a solution on Superuser (see also further comments) that I found more satisfactory than the "convmv solution", although I appreciate having discovered convmv too.

Upvotes: -1

bobince

Reputation: 536379

find . | egrep [^a-zA-Z0-9_\.\/\-\s]

Danger, shell escaping!

bash will be interpreting that last parameter, removing one level of backslash-escaping. Try putting double quotes around the "[^group]" expression.
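To see what the shell does to that argument, here is a small sketch using Python's shlex module, which approximates POSIX word splitting and quote removal. It is not bash itself (it does no glob expansion, for example), and the variable names are only illustrative.

```python
import shlex

# The command from the question, first unquoted, then with the
# pattern wrapped in double quotes.
unquoted = r"egrep [^a-zA-Z0-9_\.\/\-\s]"
quoted = r'egrep "[^a-zA-Z0-9_\.\/\-\s]"'

# shlex.split mimics how a POSIX shell splits words and strips quoting.
unquoted_pattern = shlex.split(unquoted)[1]
quoted_pattern = shlex.split(quoted)[1]

# Unquoted: every backslash is consumed before egrep sees the pattern.
print(unquoted_pattern)  # [^a-zA-Z0-9_./-s]
# Double-quoted: the backslashes survive and reach egrep unchanged.
print(quoted_pattern)    # [^a-zA-Z0-9_\.\/\-\s]
```

In the unquoted case egrep receives "\s" as a plain "s", so the pattern it actually runs is not the one that was typed.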

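Also, of course, this pattern rejects far more than just invalid UTF-8: any filename containing a perfectly valid non-ASCII character is flagged too. It is possible to construct a regex that matches only valid UTF-8 strings, but it's rather ugly. If you have Python 2.x available you could take advantage of that: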

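import os.path

def walk(dir):
    # Recursively yield every path below dir (byte strings on Python 2).
    for child in os.listdir(dir):
        child = os.path.join(dir, child)
        if os.path.isdir(child):
            for descendant in walk(child):
                yield descendant
        yield child

for path in walk('.'):
    try:
        # Decoding fails if the path contains bytes that are not valid UTF-8.
        u = unicode(path, 'utf-8')
    except UnicodeError:
        # Print the offending path, or attempt to rename the file here.
        print path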
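For anyone on Python 3, the same idea might look roughly like the sketch below. This is an adaptation, not part of the original answer: the helper names is_valid_utf8 and find_non_utf8 are made up for illustration. It relies on os.walk returning raw byte names when it is given a bytes path.

```python
import os

def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def find_non_utf8(root: bytes = b"."):
    """Yield the raw path of every entry below root whose name is not valid UTF-8."""
    # Passing bytes to os.walk makes it return undecoded byte names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Iterating over find_non_utf8(b".") gives the offending paths as bytes; writing them to sys.stdout.buffer rather than print avoids a second decoding problem when displaying them.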

Upvotes: 2

Joachim Sauer

Reputation: 308031

convmv might be interesting to you. It doesn't just find those files, but also supports renaming them to correct file names (if it can guess what went wrong).
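To illustrate the kind of guess involved, here is a minimal Python sketch. It is not how convmv is implemented; it simply assumes the broken names were written in Latin-1, and propose_utf8_name is a hypothetical helper. It only computes a candidate name and renames nothing.

```python
def propose_utf8_name(name: bytes, assumed_encoding: str = "latin-1") -> bytes:
    """Return a UTF-8 candidate for a filename given as raw bytes.

    assumed_encoding is a guess about how the name was originally
    written; Latin-1 is only an assumption for this sketch.
    """
    try:
        # Already valid UTF-8: leave the name alone.
        name.decode("utf-8")
        return name
    except UnicodeDecodeError:
        # Reinterpret the bytes in the assumed encoding, then re-encode.
        return name.decode(assumed_encoding).encode("utf-8")

# A Latin-1 "é" (0xE9) becomes the two-byte UTF-8 sequence 0xC3 0xA9.
candidate = propose_utf8_name(b"caf\xe9.txt")
```

If the guess about the source encoding is wrong, the result will be valid UTF-8 but still the wrong characters, so any proposed name is worth checking by eye before renaming.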

Upvotes: 17
