Reputation: 355
I have a data file called cleaned_coords.txt which I am extracting into a pandas dataframe. This is how cleaned_coords.txt looks like:
1 1 8.40871 3.96667 1.62792 -0.0242897 0.0217222 -0.0202916
2 2 8.01328 4.85028 1.64181 0.00616535 -2.50788e-06 0.00506225
3 1 7.38052 4.83557 0.908616 0.00891202 -0.00286839 -0.0224226
4 1 4.80607 4.99439 8.90799 0.02527 0.0136878 0.00715371
5 2 4.7497 4.77019 7.96784 -0.00495311 -0.0033607 0.00586672
6 1 5.59201 5.07857 7.60231 0.00295874 -0.0148703 -0.0234462
7 1 7.20491 1.66118 13.9242 -0.0267906 -0.0263551 -0.0270223
8 2 8.00486 1.60311 14.4665 -0.00277453 0.00193951 0.00670056
9 1 8.718 1.48953 13.8209 0.00832425 -0.0201998 -0.00663991
10 1 1.28567 5.5535 11.6424 0.00428372 -0.00289222 -0.0182445
11 2 1.56894 4.74096 11.1987 0.00614187 0.00526615 0.000498092
12 1 2.40718 4.98336 10.7782 -0.0155901 -0.0217939 0.00378399
13 1 15.325 7.97006 8.07627 0.010599 0.0244346 -0.0146411
It is just columns of data, going up to line 22875.
This is the python script I am running:
import numpy as np
import pandas as pd
#data = pd.read_csv('thermo_.txt', sep=" ", header=None)
data = pd.read_csv('cleaned_coords.txt', sep=" ", header=None)
df = pd.DataFrame(data)
print(df)
And this is the result I am seeing:
~/Desktop/Work/Simulations/water_tip3p $ python3 data_parser.py
0 1 2 3 4 5 6 7 8
0 1 1 8.40871 3.96667 1.627920 -0.024290 0.021722 -0.020292 NaN
1 2 2 8.01328 4.85028 1.641810 0.006165 -0.000003 0.005062 NaN
2 3 1 7.38052 4.83557 0.908616 0.008912 -0.002868 -0.022423 NaN
3 4 1 4.80607 4.99439 8.907990 0.025270 0.013688 0.007154 NaN
4 5 2 4.74970 4.77019 7.967840 -0.004953 -0.003361 0.005867 NaN
... ... .. ... ... ... ... ... ... ..
22870 371 2 5.99702 15.28940 8.358520 -0.002494 -0.001633 -0.005420 NaN
22871 372 1 6.44791 14.62470 7.770960 0.009041 0.005236 0.000318 NaN
22872 373 1 1.75756 1.61822 3.808440 0.011774 -0.018397 -0.000799 NaN
22873 374 2 2.29898 1.02413 4.403190 -0.003222 -0.002237 0.003255 NaN
22874 375 1 1.82499 15.54560 4.115290 -0.001409 0.018354 -0.001956 NaN
[22875 rows x 9 columns]
My question is, why am I seeing column index 8 with NaN
entries? Why is it even seeing a column index of 8?
Upvotes: 1
Views: 396
Reputation: 153460
One way to hand this is to use "positive lookahead" in regex. This indicates separate on any space that is followed by any character. Thus, the spaces at the end of the lines aren't followed by a character, so don't separate avoiding the creation of the empty column.
df = pd.read_csv(stxt, sep="\s(?=.)", header=None)
MVCE:
from io import StringIO
import pandas as pd
stxt = StringIO("""1 1 8.40871 3.96667 1.62792 -0.0242897 0.0217222 -0.0202916
2 2 8.01328 4.85028 1.64181 0.00616535 -2.50788e-06 0.00506225
3 1 7.38052 4.83557 0.908616 0.00891202 -0.00286839 -0.0224226
4 1 4.80607 4.99439 8.90799 0.02527 0.0136878 0.00715371
5 2 4.7497 4.77019 7.96784 -0.00495311 -0.0033607 0.00586672
6 1 5.59201 5.07857 7.60231 0.00295874 -0.0148703 -0.0234462
7 1 7.20491 1.66118 13.9242 -0.0267906 -0.0263551 -0.0270223
8 2 8.00486 1.60311 14.4665 -0.00277453 0.00193951 0.00670056
9 1 8.718 1.48953 13.8209 0.00832425 -0.0201998 -0.00663991
10 1 1.28567 5.5535 11.6424 0.00428372 -0.00289222 -0.0182445
11 2 1.56894 4.74096 11.1987 0.00614187 0.00526615 0.000498092
12 1 2.40718 4.98336 10.7782 -0.0155901 -0.0217939 0.00378399
13 1 15.325 7.97006 8.07627 0.010599 0.0244346 -0.0146411 """)
df = pd.read_csv(stxt, sep="\s(?=.)", header=None, engine='python')
# Note if you just did sep=" ", you get the 9th column 8 with NaNs.
df
Output:
0 1 2 3 4 5 6 7
0 1 1 8.40871 3.96667 1.627920 -0.024290 0.021722 -0.020292
1 2 2 8.01328 4.85028 1.641810 0.006165 -0.000003 0.005062
2 3 1 7.38052 4.83557 0.908616 0.008912 -0.002868 -0.022423
3 4 1 4.80607 4.99439 8.907990 0.025270 0.013688 0.007154
4 5 2 4.74970 4.77019 7.967840 -0.004953 -0.003361 0.005867
5 6 1 5.59201 5.07857 7.602310 0.002959 -0.014870 -0.023446
6 7 1 7.20491 1.66118 13.924200 -0.026791 -0.026355 -0.027022
7 8 2 8.00486 1.60311 14.466500 -0.002775 0.001940 0.006701
8 9 1 8.71800 1.48953 13.820900 0.008324 -0.020200 -0.006640
9 10 1 1.28567 5.55350 11.642400 0.004284 -0.002892 -0.018245
10 11 2 1.56894 4.74096 11.198700 0.006142 0.005266 0.000498
11 12 1 2.40718 4.98336 10.778200 -0.015590 -0.021794 0.003784
12 13 1 15.32500 7.97006 8.076270 0.010599 0.024435 -0.014641
Upvotes: 1
Reputation: 619
For straight up simplicity I would go with @ALollz answer (just drop the NaN cols after reading as csv). But if you wanted to clean the text file first for some reason, you could do this:
with open("thermo_.txt", "r") as f:
a = [i.rstrip() for i in f] # removes extra ' ' spaces from right hand side
a = [i.split(' ') for i in a] # splits list into list of lists by line
df = pd.DataFrame(a).set_index(0)
Upvotes: 1
Reputation: 59519
Your delimiter is a space (' '
) and every line has an additional space at the end. pandas will split on that final space and, as there is nothing left after that space, create a column of all NaN
values.
Ideally you should fix the how the text file is generated and remove the extra delimiter at the end of each line. In case that is too much work, or not possible, a simple solution is to read in everything and ignore the last column:
df = pd.read_csv('cleaned_coords.txt', sep=' ', header=None).iloc[:, :-1]
If you are unsure whether there is a problem beforehand you can read everything and drop completely useless NaN
columns:
df = pd.read_csv('cleaned_coords.txt', sep=' ', header=None).dropna(how='all', axis=1)
*pd.read_csv
already returns a DataFrame
. No need to call pd.DataFrame on it again.
Upvotes: 3