AYR

Reputation: 1179

Extracting a set of variables from a character array in C [linux]

I have a program which, in short, needs to take a directory of files and write the metadata and content of each file into a single file. The second step is to recover the directory from that file.

I am unable to think of a way to separate the metadata in the file so that it can be extracted reliably under any circumstances. This is mainly because Linux allows almost every character in a directory or file name (everything except / and \0). Therefore any candidate separator character could just as well be part of a file name or part of a file's content.

EXAMPLE of shortened file entry:

dir_name/sub_directory/file_name[separator]9999[separator]1234[separator]content

Any ideas would be greatly appreciated.

Upvotes: 0

Views: 93

Answers (3)

Nominal Animal

Reputation: 39366

There are at least four basic approaches:

  1. Encoding the file names

    You can encode the file names so that the encoded version contains only portable, widely accepted characters.

    Directory entries in Linux are basically just non-empty sequences of 8-bit bytes, terminated with a zero (\0), that may not contain a forward slash (/). The name . is reserved for the current directory, and .. for the parent directory.

    There are many possible encodings. Wikipedia's Binary-to-text category and its Binary-to-text encoding page list some of the more common ones you might wish to check out.

  2. Escaping

    Similar to how C uses backslash escapes for control characters (such as \n referring to ASCII LF, or newline in Unix/Linux environments), you can use a special character to escape the characters you use as separators or that are otherwise treated specially. (Note that for portability, you should then treat these files as binary data in which specific bytes have specific meanings, and not as text in some encoding such as UTF-8.)

    Although you are basically unlimited in how to do the escaping, one of the easiest schemes to implement uses a single escape character, say %, followed by two hexadecimal digits that give the byte value of the escaped character. A space then becomes %20, and % itself becomes %25.

  3. Structured text

    You can use a minimal markup language, or even something like XML, to describe each directory entry.

    Although the markup will increase the length per directory entry, it is trivial to extend. For example, you might wish to add support for extended attributes at some point; these would be trivial to add in a backwards-compatible fashion.

    Of course, instead of a full markup language, you can treat each directory entry logically as an associative array, and have your file be an array of those associative arrays. One of the keys would specify the directory entry name, another the data part, and so on.

    A minimal implementation of such an array of associative arrays is to use fixed-width keys at the start of each field. In fact, this is quite common: file formats such as JFIF (the most common JPEG file format), TIFF, and PNG are built this way. Indeed, the EXIF data that cameras add to JPEG images relies on exactly this extensibility.

  4. Binary data structures

    Instead of relying on specific bytes to be separators, you can use binary data structures. Similar to the aforementioned JFIF et al. file formats, the archive file consists of one or more segments. Each segment contains a length (specifying the length of the segment in bytes) and a type identifier. The contents of the segment are further structured based on the type.

    File names would similarly be described using a segment (inside a "file segment"). Therefore, the file name could consist of any byte values, including \0 and /, although of course your application should verify that the file name is acceptable for the current operating system, and perhaps apply suitable conversions if necessary. (Similar tools, such as tar, do this.)

    There are two additional wrinkles you should be aware of. One is byte order: you cannot just say that four bytes encode a word, you must also say in which order, that is, which byte is most and which is least significant. The other is file lengths. Many old utilities assumed file lengths would never exceed 2^32 bytes, so all lengths could be encoded in four bytes. This is no longer true. Fortunately, you can assume that in the foreseeable future, file lengths will not exceed 2^64 bytes; i.e. that using eight bytes to encode lengths should suffice. (Not because a larger file is inconceivable, because it isn't, but simply because everyone else makes the same assumption.)
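To make the byte-order and length points of approach 4 concrete, here is a minimal sketch of a segment header. The layout (a 4-byte type tag followed by an 8-byte big-endian payload length) and the names encode_header and decode_header are assumptions chosen for illustration; they are not taken from any existing archive format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEADER_SIZE 12  /* hypothetical layout: 4-byte type tag + 8-byte length */

/* Store v in buf[0..7], most significant byte first (big-endian),
   independent of the byte order of the machine running the code. */
static void put_u64be(unsigned char *buf, uint64_t v)
{
    int i;
    for (i = 7; i >= 0; i--) {
        buf[i] = (unsigned char)(v & 0xFF);
        v >>= 8;
    }
}

/* Read back a value stored by put_u64be(). */
static uint64_t get_u64be(const unsigned char *buf)
{
    uint64_t v = 0;
    int i;
    for (i = 0; i < 8; i++)
        v = (v << 8) | buf[i];
    return v;
}

/* Fill hdr with the type tag and the payload length. */
void encode_header(unsigned char hdr[HEADER_SIZE], const char type[4], uint64_t len)
{
    assert(hdr != NULL && type != NULL);
    memcpy(hdr, type, 4);
    put_u64be(hdr + 4, len);
}

/* Extract the type tag and the payload length from hdr. */
void decode_header(const unsigned char hdr[HEADER_SIZE], char type[4], uint64_t *len)
{
    assert(hdr != NULL && type != NULL && len != NULL);
    memcpy(type, hdr, 4);
    *len = get_u64be(hdr + 4);
}
```

A writer would fwrite() the 12 header bytes followed by the payload. A reader that is not interested in a segment can read only the header and then seek past the payload (for example with fseeko() on Linux), since the length says exactly how many bytes to skip. No byte of the payload is ever interpreted as a separator.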

There are practical effects based on which approach you use. Mainly, binary data structures are thought to be less robust against data corruption, but they do allow faster scanning (as things like file data segments can be skipped, not retrieved at all from storage). Also, humans can parse escaped and structured text, but rarely (fully) encoded or binary data; special tools are often needed for encoded and binary data.
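As an illustration of the escaping approach (approach 2), the following is a rough sketch of %XX escaping for file names. The function names escape_name and unescape_name are hypothetical, and the set of bytes left unescaped (ASCII letters, digits, '.', '_', '-' and the path separator '/') is an arbitrary choice made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Bytes copied through unchanged; everything else becomes %XX. */
static int is_plain(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '.' || c == '_' || c == '-' || c == '/';
}

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Return a malloc()ed escaped copy of name, or NULL if out of memory. */
char *escape_name(const char *name)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t len, i, o = 0;
    char *out;

    assert(name != NULL);
    len = strlen(name);
    out = malloc(3 * len + 1);      /* worst case: every byte escaped */
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (is_plain(c)) {
            out[o++] = (char)c;
        } else {
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 15];
        }
    }
    out[o] = '\0';
    return out;
}

/* Decode s in place. Returns the decoded length,
   or (size_t)-1 if a % is not followed by two hex digits. */
size_t unescape_name(char *s)
{
    size_t i = 0, o = 0;

    assert(s != NULL);
    while (s[i] != '\0') {
        if (s[i] == '%') {
            int hi = hexval((unsigned char)s[i + 1]);
            int lo = (hi < 0) ? -1 : hexval((unsigned char)s[i + 2]);
            if (hi < 0 || lo < 0)
                return (size_t)-1;
            s[o++] = (char)(hi * 16 + lo);
            i += 3;
        } else {
            s[o++] = s[i++];
        }
    }
    s[o] = '\0';
    return o;
}
```

Because the escaped name can no longer contain a space, a newline, or any other byte outside the plain set, any of those bytes is safe to use as the field separator between the name, the numeric metadata fields, and the content length.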

Personally, I do prefer a binary approach, but I have used structured text especially for cases where human examination of the stored data has been useful.

Questions?

Upvotes: 1

aisbaa

Reputation: 10633

I would suggest structuring your file into two sections, a header and a body. The header would contain the file names and metadata (including the start and end positions of each file's content within the body section). The body would contain only the file contents.
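One possible way to picture this, as a sketch only: keep one index record per file and assign each file a consecutive range in the body. The struct index_entry and the function layout_body below are hypothetical names introduced for this example; how the header itself is serialized is left open here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One header record per file (hypothetical layout). */
struct index_entry {
    const char *name;    /* path of the file */
    uint64_t    size;    /* length of the content in bytes */
    uint64_t    offset;  /* start of the content within the body section */
};

/* Assign consecutive body offsets to all entries.
   Returns the total size of the body section. */
uint64_t layout_body(struct index_entry *entries, size_t count)
{
    uint64_t pos = 0;
    size_t i;

    assert(entries != NULL || count == 0);
    for (i = 0; i < count; i++) {
        entries[i].offset = pos;
        pos += entries[i].size;
    }
    return pos;
}
```

The end position of each file is offset + size, so the body needs no separators at all. The file names in the header still need a length prefix or some form of escaping, since they can contain almost any byte.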

Upvotes: 0

user3159253

Reputation: 17455

You could encode the file names with, e.g., Base64.
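For reference, a minimal sketch of a Base64 encoder for a file name, using the standard alphabet with = padding. The function name base64_name is made up for this example; a real program might prefer an existing library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc()ed Base64 encoding of name, or NULL if out of memory. */
char *base64_name(const char *name)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const unsigned char *src = (const unsigned char *)name;
    size_t len, i, o = 0;
    char *out;

    assert(name != NULL);
    len = strlen(name);
    out = malloc(4 * ((len + 2) / 3) + 1);
    if (out == NULL)
        return NULL;

    /* Full groups: 3 input bytes become 4 output characters. */
    for (i = 0; i + 2 < len; i += 3) {
        out[o++] = tbl[src[i] >> 2];
        out[o++] = tbl[((src[i] & 3) << 4) | (src[i + 1] >> 4)];
        out[o++] = tbl[((src[i + 1] & 15) << 2) | (src[i + 2] >> 6)];
        out[o++] = tbl[src[i + 2] & 63];
    }

    /* Trailing 1 or 2 bytes, padded with '=' to 4 characters. */
    if (i < len) {
        out[o++] = tbl[src[i] >> 2];
        if (i + 1 < len) {
            out[o++] = tbl[((src[i] & 3) << 4) | (src[i + 1] >> 4)];
            out[o++] = tbl[(src[i + 1] & 15) << 2];
        } else {
            out[o++] = tbl[(src[i] & 3) << 4];
            out[o++] = '=';
        }
        out[o++] = '=';
    }
    out[o] = '\0';
    return out;
}
```

The encoded name contains only letters, digits, +, / and =, so any other byte (a space or a newline, for instance) can serve as the separator. The cost is that the names are no longer readable in the archive file and grow by about a third.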

Upvotes: 0
