Reputation: 39
This is the first line of my .txt file:
0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00
There should be 8 columns, sometimes separated with '-', sometimes with '.'. It's very confusing, I just have to work with the file, I didn't generate it.
And second question: How can I work with the different columns? There is no header, so maybe:
df.iloc[:,0]
.. ?
Upvotes: 3
Views: 1215
Reputation: 5425
As stated in the comments, this is likely a list of numbers in scientific notation that are not separated by anything, just glued together. It could be interpreted as:
0.112296E+02
-.121994E-010
.158164E-030
.158164E-030
.000000E+000
.340000E+030
.328301E-010
.000000E+00
or as
0.112296E+02
-.121994E-01
0.158164E-03
0.158164E-03
0.000000E+00
0.340000E+03
0.328301E-01
0.000000E+00
Assuming the second interpretation is better, the trick is to split evenly every 12 characters.
data = [line[i:i+12] for i in range(0, len(line), 12)]
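One caveat with the slice above: a line read from a file still carries its trailing newline, which would show up as an extra ninth chunk. A minimal sketch that strips it first and converts each field to float; the helper name split_fixed_width and the length check are illustrative assumptions, not part of the original one-liner:

```python
def split_fixed_width(line, width=12):
    """Split one record into floats, assuming every field is exactly `width` characters."""
    record = line.rstrip("\r\n")  # drop the line ending so it is not treated as a field
    if len(record) % width != 0:
        raise ValueError(f"line length {len(record)} is not a multiple of {width}")
    return [float(record[i:i + width]) for i in range(0, len(record), width)]


# The sample line from the question, with the newline a file read would leave on it.
line = "0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00\n"
values = split_fixed_width(line)
print(values)
```

The length check is there only to fail loudly on a truncated or padded line instead of silently producing a short row.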
If the first interpretation really is the right one, then I'd use a regex:
import re
line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
pattern = r'[+-]?\d??\.\d+E[+-]\d+'  # raw string, so \d and \. reach the regex engine unchanged
data = re.findall(pattern, line)
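To see how much the two readings differ numerically, the sketch below runs the pattern above next to a variant that fixes the exponent at two digits. The two-digit variant is my own assumption, derived from the second interpretation; it is not something the file format is known to guarantee.

```python
import re

line = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'

# Greedy exponent: \d+ swallows the leading 0 of the next number (first interpretation).
greedy = re.findall(r'[+-]?\d??\.\d+E[+-]\d+', line)

# Exponent fixed at two digits (assumed): reproduces the second interpretation.
two_digit = re.findall(r'[+-]?\d?\.\d+E[+-]\d{2}', line)

# Print both readings side by side with their numeric values.
for g, t in zip(greedy, two_digit):
    print(f'{g:>14} -> {float(g):<14g} {t:>14} -> {float(t):g}')
```

Both patterns find 8 numbers, but the values can differ by many orders of magnitude, so it is worth checking against what the data is supposed to represent.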
Edit
Obviously, you'd need to iterate over each line in the file and add it to your DataFrame, which is a rather inefficient thing to do in Pandas. Therefore, if your preferred interpretation is the fixed-width one, I'd go with @Ev. Kounis's answer: df = pd.read_fwf(myfile, widths=[12]*8, header=None) (header=None keeps the first line as data, since the file has no header row).
Otherwise, the inefficient way is:
df = pd.DataFrame(columns=range(8))
with open(myfile, 'r') as f_in:
    for i, line in enumerate(f_in):
        data = re.findall(pattern, line)
        df.loc[i] = [float(d) for d in data]
The two things to notice here are that the DataFrame must be initialized with column names (here 0 through 7, but perhaps you know of better identifiers), and that the regex gives us strings that must be cast to floats.
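If the regex route is needed, one way to avoid the row-by-row df.loc assignment is to collect the parsed rows in a plain list and build the DataFrame once at the end. A sketch under that assumption; io.StringIO stands in for the real file, and the sample line from the question is simply repeated twice:

```python
import io
import re

import pandas as pd

pattern = re.compile(r'[+-]?\d??\.\d+E[+-]\d+')
sample = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00\n'

# StringIO stands in for open(myfile); the sample line is repeated twice.
with io.StringIO(sample * 2) as f_in:
    rows = [[float(d) for d in pattern.findall(line)] for line in f_in]

# Build the DataFrame in a single step; columns default to 0..7.
df = pd.DataFrame(rows)
print(df.shape)
```

Growing a list and converting once avoids reallocating the DataFrame on every line, which is where the loop version spends most of its time.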
Upvotes: 4
Reputation: 46
A possible solution is the following:
row = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00'
chunk_len = 12
for i in range(0, len(row), chunk_len):
    print(row[i:i + chunk_len])
You can easily extend the code to handle more general cases.
Upvotes: 1
Reputation: 15204
As I said in the comments, this is not a case of multiple separators; it is simply a fixed-width format. Pandas has a method to read such files. Try this:
df = pd.read_fwf(myfile, widths=[12]*8, header=None)
print(df)
For the widths, you have to provide the width of each cell, which looks like it is 12, and the number of columns, which as you say must be 8.
The header=None argument matters because the file has no header row. Without it, pandas uses the first line as column names and de-duplicates repeated names by appending .1, so the columns come out as [0.112296E+02, -.121994E-01, 0.158164E-03, 0.158164E-03.1, 0.000000E+00, 0.340000E+03, 0.328301E-01, 0.000000E+00.1] and the first data row is lost.
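On the second part of the question (working with the individual columns), here is a sketch that assumes the file has no header row, so header=None is passed and the first line is kept as data. The column names c0 to c7 are placeholders I made up, and io.StringIO stands in for myfile:

```python
import io

import pandas as pd

sample = '0.112296E+02-.121994E-010.158164E-030.158164E-030.000000E+000.340000E+030.328301E-010.000000E+00\n'

# Placeholder column names (hypothetical); replace with meaningful ones if known.
names = [f'c{i}' for i in range(8)]

# header=None: treat the first line as data, not as column names.
df = pd.read_fwf(io.StringIO(sample), widths=[12] * 8, header=None, names=names)

first_by_position = df.iloc[:, 0]  # positional access, as suggested in the question
first_by_name = df['c0']           # label-based access to the same column
print(df.dtypes)
```

So df.iloc[:, 0] from the question does work; naming the columns just makes later code easier to read.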
Alternatively, you can do it "manually" like so:
myfile = r'C:\Users\user\Desktop\PythonScripts\a_file.csv'
width = 12
my_content = []
with open(myfile, 'r') as f_in:
    for line in f_in:
        data = [float(line[i * width:(i + 1) * width]) for i in range(len(line) // width)]
        my_content.append(data)
print(my_content) # prints -> [[11.2296, -0.0121994, 0.000158164, 0.000158164, 0.0, 340.0, 0.0328301, 0.0]]
Each row of the file ends up as one inner list.
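With nested lists, columns can still be pulled out by transposing the rows with zip. A small sketch; my_content is hard-coded here to the values printed above rather than read from the file:

```python
# Rows as produced by the manual parser above (one inner list per line of the file).
my_content = [[11.2296, -0.0121994, 0.000158164, 0.000158164, 0.0, 340.0, 0.0328301, 0.0]]

# zip(*rows) transposes: each resulting tuple holds one column across all rows.
columns = list(zip(*my_content))

first_column = columns[0]
print(first_column)
```

This assumes every row has the same number of fields; zip silently truncates to the shortest row otherwise.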
Upvotes: 3