Caustic

Reputation: 969

read mixed data types in text file Python

I have been given some 'reports' from another piece of software that contain data I need to use. The file format is quite simple: a description line that starts with a # and gives the variable name and description, followed by comma-separated data on the next line.

For example:

    #wavelength,'<a comment describing the data>'
    400.0,410.0,420.0, <and so on>
    #reflectance,'<a comment describing the data>'
    0.001,0.002,0.002, <and so on>
    #date,'time file was written'
    2012-03-06 13:12:36.694597  < this is the bit that stuffs me up!! >

When I first wrote the code I expected all the data to be read as floats, but I have since discovered some dates and strings. For my purposes, all I care about is the data that should be arrays of floats. Everything else I read in (such as dates) can be treated as strings, even if it is technically a date.

My first attempt, which worked until I found non-floats, basically skips the # and grabs the characters following it, making a dictionary keyed by the characters it just read. I then made the entry for each key an array by splitting on the commas and stacking rows for 2-D data, similar to the following code.

    data = f.read()
    dataLines = data.split('\n')

    for i in range(0,len(dataLines)-1):
        if dataLines[i][0] == '#':
            key,comment = dataLines[i].split(',')
            keyList.append(key[1:])
            k+=1
        else: # it must be data
            d+=1
            dataList.append(dataLines[i])

        for j in range(0,len(dataList)):
            tmp = dataList[j]

            x = map(float,tmp.split(','))
            tempData = vstack((tempData,asarray(x)))

    self.__report[keyList[k]] = tempData  

When I find a non-float in my file, the line "x = map(float,tmp.split(','))" fails (there are no commas in that line of data). I thought I would test whether it is a string using isinstance, but the file reader treats all of the data coming in from the file as strings (of course). I then tried converting the line to a float array, on the assumption that if the conversion fails I can just treat it as an array of strings, like this:

     try:
         scipy.array(tmp,dtype=float64)  #try to convert
         x = map(float,tmp.split(','))

     except:# ValueError: # must be a string
         x = zeros((1,1))
         x = asarray([tmp])
         #tempData = vstack((tempData,asarray(x)),dtype=str)
         if 'tempData' in locals():
             pass
         else:
             tempData = zeros((len(x)))

         tempData = vstack((tempData,asarray(x)))

However, this results in EVERYTHING being read in as a character array, so I cannot index the data as a numeric numpy array. All of the data is there in the dictionary, but the dtype is |S8, for example. It seems the try block is going straight to the exception handler.

I would appreciate any advice on getting this to work so I can discriminate between floats and strings. I don't know the order of the data before I get the report.

Also, the big files can take quite a long time to load into memory; any advice on how to make this more efficient would also be appreciated.

Thanks

Upvotes: 3

Views: 7783

Answers (3)

VISHAKHA AGARWAL

Reputation: 1

Write a Python program to create a file of elements of any data type (mixed data types, i.e. some elements may be of type int, some of type float and some of type string). Split this file into three files, each containing elements of a single data type (i.e. the first file holds integers only, the second floats only and the third strings only). Take input from the user to create the file.

    # Collect user input, one element per line, until 'end' is entered.
    with open('MixedFile.txt', 'w') as f:
        while True:
            user = input("Enter Any Data Type Element :: ")
            if user == 'end':
                print('!!!!!!!! EXIT !!!!!!!!!!!!')
                break
            f.write(user + '\n')

    # Read every whitespace-separated element back in.
    with open('MixedFile.txt') as f:
        a = f.read().split()

    # Route each element to the file for the narrowest type that parses it.
    with open('StringFile.txt', 'w') as fs, \
         open('FloatFile.txt', 'w') as ff, \
         open('NumberFile.txt', 'w') as fn:
        for i in a:
            try:
                int(i)
                fn.write(i + '\n')
            except ValueError:
                try:
                    float(i)
                    ff.write(i + '\n')
                except ValueError:
                    fs.write(i + '\n')

    # Print the three output files, closing each one after reading.
    print("reading................")
    for name in ('StringFile.txt', 'NumberFile.txt', 'FloatFile.txt'):
        with open(name) as f:
            print(f.read())
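
The same nested try/except idea can be wrapped in a small helper that reports which type a token parses as. This is only a minimal sketch: the name `classify` and the sample tokens are made up for illustration and do not come from the question.

```python
def classify(token):
    """Return 'int', 'float' or 'str', trying the narrowest type first."""
    for name, convert in (("int", int), ("float", float)):
        try:
            convert(token)
            return name
        except ValueError:
            continue  # this type does not fit, try the next one
    return "str"

# Hypothetical sample tokens, including a date fragment like the one in the question.
tokens = ["42", "400.0", "2012-03-06", "1e-3"]
kinds = [classify(t) for t in tokens]
print(kinds)  # ['int', 'float', 'str', 'float']
```

Note that `float()` also accepts text such as "nan" and "inf", so a report containing those words would be classified as float by this sketch.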

Upvotes: 0

ronakg

Reputation: 4212

I'm assuming that ultimately you are interested in x, which should look like [400.0, 410.0, 420.0].

One way to handle this is to separate the splitting on commas and the conversion to float into two different statements, so that you can catch the ValueError raised when you get string elements instead of floats or ints.

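    keyList = []
    dataList = []
    with open('sample_data', 'r') as f:
        for line in f:  # iterate over lines; f.readline() would yield single characters
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith("#"):
                # split on the first comma only, so commas in the comment are kept
                key, comment = line.split(',', 1)
                keyList.append(key[1:])
            else:  # it must be data
                dataList.append(line)

    for data in dataList:
        data_list = data.split(',')
        try:
            # a list comprehension converts eagerly, so ValueError is raised here
            # (in Python 3, map() is lazy and would not raise inside the try)
            x = [float(v) for v in data_list]
        except ValueError:
            pass  # non-numeric line (e.g. a date); treat it as a string here if needed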

Also notice the other minor changes I've made to your code to make it more Pythonic.
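
Putting the two steps together, one possible shape for the whole reader is sketched below. It is an illustration under assumptions, not the asker's actual code: `parse_report` is a hypothetical name, the sample text is modelled on the example in the question, and rows are kept as plain Python lists rather than numpy arrays.

```python
import io

def parse_report(lines):
    """Map each '#key' header to its rows: lists of floats, or raw strings."""
    report = {}
    key = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("#"):
            # header: the key is the text between '#' and the first comma
            key = line[1:].split(",", 1)[0].strip()
            report[key] = []
        elif key is not None:
            try:
                row = [float(v) for v in line.split(",")]
            except ValueError:
                row = line  # not numeric (e.g. a date): keep the raw string
            report[key].append(row)
    return report

# Made-up sample in the question's format; a real file object works the same way.
sample = io.StringIO(
    "#wavelength,'a comment'\n"
    "400.0,410.0,420.0\n"
    "#date,'time file was written'\n"
    "2012-03-06 13:12:36.694597\n"
)
report = parse_report(sample)
print(report["wavelength"])  # [[400.0, 410.0, 420.0]]
print(report["date"])        # ['2012-03-06 13:12:36.694597']
```

Collecting rows in a list and converting once at the end (for example with `numpy.asarray(report["wavelength"])`) avoids the repeated copying that calling vstack inside the loop incurs, which may help with the load time of large files.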

Upvotes: 3

Cameron

Reputation: 1725

This might be a stupid suggestion, but could you just do an additional check

    if ',' in dataLines[i]:

before adding the line to your data list? Or, if not, write a regular expression to check for a comma-separated list of floating point numbers?

    ^\d+(\.\d+)?(,\d+(\.\d+)?)*$

might do the trick (allows integers too).
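
As a rough sketch of the regular-expression check, the snippet below uses a pattern along the lines of the one above, with `\d+` so that multi-digit values match and `fullmatch` so that the whole line has to fit. The name `is_float_list` is hypothetical, and the pattern is an assumption about the data: it does not accept signs, exponents or spaces after commas.

```python
import re

# Comma-separated unsigned numbers, each with an optional decimal part.
FLOAT_LIST = re.compile(r"\d+(\.\d+)?(,\d+(\.\d+)?)*")

def is_float_list(line):
    """True if the whole line is a comma-separated list of plain numbers."""
    return FLOAT_LIST.fullmatch(line.strip()) is not None

print(is_float_list("400.0,410.0,420.0"))           # True
print(is_float_list("0.001,0.002,0.002"))           # True
print(is_float_list("2012-03-06 13:12:36.694597"))  # False
```

Unlike the plain comma test, this also accepts a data line that holds a single number.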

Upvotes: 0
