Reputation: 963
I have a script to pull in a variety of gzip and bz2 compressed files. After I pull them in, I am looking to write a script to write the file and add an extension based on the file type contained within.
The file formats I am concern about include xml, csv, and txt files, although I am not really concerned about delineating between csv and txt files (adding the txt extension is alright for both).
I have been using the python-magic library to determine which decompression library to use (bz2 vs gzip) but want to know what the easiest way to determine the file type. Using python-magic I got:
>>> ftype = m.from_file("xml_test.xml")
>>> ftype
'ASCII text'
>>> ftype = m.from_file("csv_test.csv")
>>> ftype
'ASCII text'
My current plan is to read in the first line of each file and make the determination based on that. Is there an easier way?
In response to @phihag's answer showing me how poorly I originally worded this question: What I want is something that will check first if a file is valid XML, if not then check if it is valid CSV, and finally if it is not valid CSV but is valid plain text, return that as a response
Note: There was a partial answer here but this solution only describes csv check, not an xml, txt, etc.
Upvotes: 1
Views: 3099
Reputation: 287815
You cannot reliably distinguish XML and csv, as the following file is both a valid XML as well as a valid CSV document:
<r>,</r>
Therefore, all you can do is apply a heuristic, for example return xml if the first character is <
, and csv otherwise.
Similarily, all CSV and XML files are also valid plain text files.
To check whether a file forms a valid XML or CSV document, you can simply parse it. If you're out for performance, simply skip the construction of an actual document tree, for example with sax or by ignoring the items of csv.reader:
import xml.sax,csv
def getType(filename):
with open(filename, 'rb') as fh:
try:
xml.sax.parse(fh, xml.sax.ContentHandler())
return 'xml'
except: # SAX' exceptions are not public
pass
fh.seek(0)
try:
for line in csv.reader(fh):
pass
return 'csv'
except csv.Error:
pass
return 'txt'
Upvotes: 5