Reputation: 25
I have been struggling to convert a text file to a pandas DataFrame so that I can do calculations on the values and plot the coordinates.
The text file has a long header followed by many rows; part of the header and a few rows are shown below. I wrote a small script that finds the first and last lines of the table section I'm interested in.
starfile_name:
# version 30001
data_particles
loop_
_rlnTomoParticleName #1
_rlnTomoName #2
_rlnNormCorrection #21
_rlnLogLikeliContribution #22
_rlnMaxValueProbDistribution #23
_rlnNrOfSignificantSamples #24
TS_002/1 TS_002 1 2 1 1733.000000 3485.000000 938.000000 -1.08872 -1.08872 0.411277 131.760000 89.920000 97.200000 PseudoSubtomo/job052/Subtomograms/TS_002/1_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/1_weights.mrc 1 92.905599 28.438417 57.199867 1.000000 1.128367e+06 0.017733 224
TS_002/2 TS_002 1 1 1 1124.000000 693.000000 1096.000000 0.411277 -1.08872 -1.08872 79.270000 86.780000 100.730000 PseudoSubtomo/job052/Subtomograms/TS_002/2_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/2_weights.mrc 1 159.849821 4.120413 101.904501 1.000000 1.126854e+06 0.183934 37
TS_002/3 TS_002 1 2 1 1694.000000 2329.000000 1378.000000 5.955277 -6.63272 -1.08872 -140.62000 88.860000 99.000000 PseudoSubtomo/job052/Subtomograms/TS_002/3_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/3_weights.mrc 1 127.794678 4.085294 168.730698 1.000000 1.124178e+06 0.184649 18
I used the following lines to turn this into a DataFrame
#skip is the line number where the header and irrelevant part of the table ends
#foot is the number of rows at the end of the table that I'm not interested in
pandas_table = pd.read_csv(starfile_name, engine='python', index_col=False, header=None, skiprows=int(skip), skipfooter=int(foot), sep="\t")
print(pandas_table)
df = pd.DataFrame(data=pandas_table)
df
It appears that the whole table is read as if it is just one column. I tried providing column tags, but they don't line up with the actual data. I also played around with the str.split() and squeeze() options, but I keep getting errors.
output:
0
0 TS_002/1 TS_002 1 2 ...
1 TS_002/2 TS_002 1 1 ...
2 TS_002/3 TS_002 1 2 ...
3 TS_002/4 TS_002 1 1 ...
4 TS_002/5 TS_002 1 2 ...
... ...
1423 TS_002/1424 TS_002 1 ...
1424 TS_002/1425 TS_002 1 ...
1425 TS_002/1426 TS_002 1 ...
1426 TS_002/1427 TS_002 1 ...
1427 TS_002/1428 TS_002 1 ...
[1428 rows x 1 columns]
0
0 TS_002/1 TS_002 1 2 ...
1 TS_002/2 TS_002 1 1 ...
2 TS_002/3 TS_002 1 2 ...
3 TS_002/4 TS_002 1 1 ...
4 TS_002/5 TS_002 1 2 ...
... ...
1423 TS_002/1424 TS_002 1 ...
1424 TS_002/1425 TS_002 1 ...
1425 TS_002/1426 TS_002 1 ...
1426 TS_002/1427 TS_002 1 ...
1427 TS_002/1428 TS_002 1 ...
1428 rows × 1 columns
Upvotes: 0
Views: 561
Reputation: 1116
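I think the columns in your file are separated by a variable number of spaces rather than tabs, so sep="\t" never matches and each row ends up in a single column. Use sep=r'\s+' (a raw string, so the backslash is not treated as an escape) to split on any run of whitespace:
df = pd.read_csv(starfile_name, ...., sep=r'\s+')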
print(df)
>>>
0 1 2 3 4 5 6 7 8 9 \
0 TS_002/1 TS_002 1 2 1 1733.0 3485.0 938.0 -1.088720 -1.08872
1 TS_002/2 TS_002 1 1 1 1124.0 693.0 1096.0 0.411277 -1.08872
2 TS_002/3 TS_002 1 2 1 1694.0 2329.0 1378.0 5.955277 -6.63272
... 14 \
0 ... PseudoSubtomo/job052/Subtomograms/TS_002/1_dat...
1 ... PseudoSubtomo/job052/Subtomograms/TS_002/2_dat...
2 ... PseudoSubtomo/job052/Subtomograms/TS_002/3_dat...
15 16 17 \
0 PseudoSubtomo/job052/Subtomograms/TS_002/1_wei... 1 92.905599
1 PseudoSubtomo/job052/Subtomograms/TS_002/2_wei... 1 159.849821
2 PseudoSubtomo/job052/Subtomograms/TS_002/3_wei... 1 127.794678
18 19 20 21 22 23
0 28.438417 57.199867 1.0 1128367.0 0.017733 224
1 4.120413 101.904501 1.0 1126854.0 0.183934 37
2 4.085294 168.730698 1.0 1124178.0 0.184649 18
[3 rows x 24 columns]
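If you also want the column labels instead of 0 to 23, one option is to collect the `_rln...` lines that follow `loop_` and pass them as `names`. The following is only a rough sketch under some assumptions: the helper `read_star_loop` is a hypothetical name, the sample text is a made-up miniature of your file (four columns, with `_rlnCoordinateX`/`_rlnCoordinateY` chosen purely for illustration), and it assumes a single `loop_` table with no comment lines inside the data rows.

```python
import io
import pandas as pd

# Made-up miniature STAR file for illustration; the real file has more
# columns and rows, and its labels may differ from the ones used here.
sample = """\
# version 30001

data_particles

loop_
_rlnTomoParticleName #1
_rlnTomoName #2
_rlnCoordinateX #3
_rlnCoordinateY #4
TS_002/1   TS_002  1733.000000  3485.000000
TS_002/2   TS_002  1124.000000   693.000000
"""


def read_star_loop(handle):
    """Read the first loop_ table, using its _rln labels as column names."""
    names, rows = [], []
    in_loop = False
    for line in handle:
        stripped = line.strip()
        if stripped == "loop_":
            in_loop = True
        elif in_loop and stripped.startswith("_"):
            # "_rlnTomoName #2" -> "rlnTomoName"
            names.append(stripped.split()[0].lstrip("_"))
        elif in_loop and stripped:
            rows.append(stripped)
        elif in_loop and rows:
            break  # a blank line after the data rows ends the table
    # Split the collected rows on any run of whitespace.
    return pd.read_csv(io.StringIO("\n".join(rows)), sep=r"\s+",
                       header=None, names=names)


df = read_star_loop(io.StringIO(sample))
print(df.dtypes)
```

With a real file you would pass an open file object, e.g. `with open(starfile_name) as fh: df = read_star_loop(fh)`, which also avoids having to work out `skiprows` and `skipfooter` by hand as long as the table ends at a blank line or at the end of the file.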
Upvotes: 1