Reputation: 189

Extracting information from unconventional text files? (Python)

I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains Python code that defines a sequence of arrays. They look something like this:

#PHASE = 0
x = np.array([1,2,...])
y = np.array([3,4,...])
z = np.array([5,6,...])

#PHASE = 30
x = np.array([1,4,...])
y = np.array([2,5,...])
z = np.array([3,6,...])

#PHASE = 40
...

And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file, which can then be read by ascii.read() as a Table object for manipulation in a different section of code.

My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function

from astropy.table import Table

def makeTable(a, b, c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output

Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code

fileName_phase = makeTable(a,b,c)

Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.

Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.

This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?

Upvotes: 1

Answers (3)

Julien

Reputation: 15226

To avoid the safety issue of using exec as suggested by @Ajay Brahmakshatriya, but keeping his first processing step, you can create your own minimal 'phase parser', something like:

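import numpy as np
from astropy.table import Table

VARS = 'xyz'
def makeTable(phase):
    # phase is the text of one phase; split it into its assignment lines
    lines = phase.strip().splitlines()
    assert len(lines) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in lines[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        assert arr[:10] == 'np.array([' and arr[-2:] == '])'
        # parse the comma-separated numbers between the brackets
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output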

and then call

table = makeTable(phase)

instead of

exec(phase)
table = makeTable(x, y, z)

You could also skip all these assert statements without compromising safety. If a file is corrupted or not formatted as expected, an error will still be thrown; it might just be harder to understand.
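
As an alternative to slicing the string by hand, the bracketed list could be handed to the standard library's ast.literal_eval, which only evaluates literals and never runs code. This is a sketch of a different technique from the one above, assuming every line looks like name = np.array([...]). parse_phase is a hypothetical helper, and it returns a plain dict of lists rather than a Table:

```python
import ast
import re

# One assignment per line, e.g. "x = np.array([1, 2, 3])"
LINE_RE = re.compile(r"^\s*(\w+)\s*=\s*np\.array\((\[.*\])\)\s*$")

def parse_phase(phase):
    """Parse the text of one phase into {variable name: list of numbers}."""
    columns = {}
    for line in phase.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unexpected line: {line!r}")
        name, literal = match.groups()
        # literal_eval accepts only Python literals, so no code is executed
        columns[name] = ast.literal_eval(literal)
    return columns

# Made-up input in the layout described in the question
columns = parse_phase("x = np.array([1, 2])\ny = np.array([3.5, 4])\n")
```

The returned dict could then be passed to Table(columns), since astropy's Table can be built from a dict of columns.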

Upvotes: 1

Ajay Brahmakshatriya

Reputation: 9213

I will suggest a way which will be scorned by many but will get your work done.

So apologies to everyone.

The prerequisite for this method is that you absolutely trust the correctness of the input files, which I guess you do. (After all, he is your collaborator.)

So the key point here is that the text in the file is code which means it can be executed.

So you can do something like this

import re
import numpy as np  # needed by the code in the files; install numpy if it is missing
with open("xyz.txt") as file:
    content = file.read()

Now that you have all the content, you have to separate it by phase. For this we will use the re.split function.

phase_data = re.split(r"#PHASE = .*\n", content)

Now we have the content of each phase in a list.
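
Note that splitting this way drops the phase number, which is needed later to name the output files. Here is a minimal sketch of one way to keep it, assuming every header has the form "#PHASE = <number>". The helper name split_phases and the sample string are illustrative only:

```python
import re

def split_phases(content):
    """Return a list of (phase_label, phase_source) pairs."""
    # A capture group makes re.split keep the matched phase value, so the
    # result alternates: [preamble, label, body, label, body, ...]
    parts = re.split(r"#PHASE = (.*)\n", content)
    labels = parts[1::2]
    bodies = parts[2::2]
    return [(label.strip(), body.strip()) for label, body in zip(labels, bodies)]

# Made-up sample in the same layout as the files in the question
sample = (
    "#PHASE = 0\n"
    "x = np.array([1,2])\n"
    "y = np.array([3,4])\n"
    "\n"
    "#PHASE = 30\n"
    "x = np.array([1,4])\n"
    "y = np.array([2,5])\n"
)

pairs = split_phases(sample)
```

Each pair holds the phase label as a string together with the source text of that phase, so the label can go straight into an output file name.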

Now comes the part of executing it.

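# Run this at module level (e.g. in a notebook cell), not inside a function:
# only there does exec define x, y and z as names the next line can see.
for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z)  # x, y and z are defined by the exec
    # do whatever you want with the table.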

I will reiterate that you have to absolutely trust the contents of the file, since you are executing it as code.

But your work seems to be a one-off scripting job, and I believe this will get it done.

PS : The other "safer" alternative to exec is to have a sandboxing library which takes the string and executes it without affecting the parent scope.
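
Short of a real sandbox, exec also accepts an explicit dictionary to use as its namespace, which at least keeps the names it defines out of the calling scope. This is not a security boundary, so the trust requirement above still applies. A sketch, where run_phase is a hypothetical helper and the caller supplies the numpy module:

```python
def run_phase(phase, np_module):
    """Execute one phase's source in its own namespace and return x, y, z."""
    namespace = {"np": np_module}  # the file's code refers to numpy as np
    exec(phase, namespace)         # still runs arbitrary code: trusted input only
    return namespace["x"], namespace["y"], namespace["z"]
```

With numpy imported as np, this would be called as x, y, z = run_phase(phase, np), followed by makeTable(x, y, z). Unlike a bare exec(phase), it also works inside a function.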

Upvotes: 1

Mikael

Reputation: 554

If the goal is efficiency and code reuse instead of copy-pasting, I think classes might provide a good way. I'm going to sleep now, but I'll edit later. Here are my thoughts: create a class called FileWithArrays and use a parser to read the lines and store them in a FileWithArrays object. Once that's done, you can add a method that transforms the object into a table.

P.S. A good idea for the parser is to store all the lines in a list and parse them one by one, using list.pop() to shrink the list as you go. I hope this helps; I'll look into it more tomorrow if it doesn't. Try to rewrite/reformat the question if I misunderstood anything, as it's not very easy to read.
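
A rough sketch of what such a class might look like, assuming headers of the form "#PHASE = <number>" and lines of the form name = np.array([...]). The method name to_columns and the use of ast.literal_eval are illustrative choices, not something the description above prescribes. It uses pop(0) rather than pop() so that lines are consumed in file order:

```python
import ast

class FileWithArrays:
    """Holds the arrays of one file, grouped by phase label."""

    def __init__(self, text):
        self.phases = {}  # phase label -> {variable name: list of numbers}
        lines = text.splitlines()
        current = None
        while lines:
            line = lines.pop(0).strip()  # pop from the front to keep file order
            if not line:
                continue
            if line.startswith("#PHASE"):
                current = line.split("=", 1)[1].strip()
                self.phases[current] = {}
                continue
            name, value = (s.strip() for s in line.split("=", 1))
            if current is None or not (value.startswith("np.array(") and value.endswith(")")):
                raise ValueError(f"unexpected line: {line!r}")
            # keep only the bracketed list and evaluate it as a literal
            self.phases[current][name] = ast.literal_eval(value[len("np.array("):-1])

    def to_columns(self, phase):
        """Return the columns of one phase, ready to build a table from."""
        return self.phases[phase]
```

The dict returned by to_columns could be passed to astropy's Table to build the table for that phase.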

Upvotes: 1
