megamence

Reputation: 355

Non-existent column showing up in pandas dataframe

I have a data file called cleaned_coords.txt which I am extracting into a pandas dataframe. This is what cleaned_coords.txt looks like:

1 1 8.40871 3.96667 1.62792 -0.0242897 0.0217222 -0.0202916 
2 2 8.01328 4.85028 1.64181 0.00616535 -2.50788e-06 0.00506225 
3 1 7.38052 4.83557 0.908616 0.00891202 -0.00286839 -0.0224226 
4 1 4.80607 4.99439 8.90799 0.02527 0.0136878 0.00715371 
5 2 4.7497 4.77019 7.96784 -0.00495311 -0.0033607 0.00586672 
6 1 5.59201 5.07857 7.60231 0.00295874 -0.0148703 -0.0234462 
7 1 7.20491 1.66118 13.9242 -0.0267906 -0.0263551 -0.0270223 
8 2 8.00486 1.60311 14.4665 -0.00277453 0.00193951 0.00670056 
9 1 8.718 1.48953 13.8209 0.00832425 -0.0201998 -0.00663991 
10 1 1.28567 5.5535 11.6424 0.00428372 -0.00289222 -0.0182445 
11 2 1.56894 4.74096 11.1987 0.00614187 0.00526615 0.000498092 
12 1 2.40718 4.98336 10.7782 -0.0155901 -0.0217939 0.00378399 
13 1 15.325 7.97006 8.07627 0.010599 0.0244346 -0.0146411 

It is just columns of data, going up to line 22875.

This is the python script I am running:

import numpy as np
import pandas as pd

#data = pd.read_csv('thermo_.txt', sep=" ", header=None)
data = pd.read_csv('cleaned_coords.txt', sep=" ", header=None)
df = pd.DataFrame(data)

print(df)

And this is the result I am seeing:

~/Desktop/Work/Simulations/water_tip3p $ python3 data_parser.py
         0  1        2         3         4         5         6         7   8
0        1  1  8.40871   3.96667  1.627920 -0.024290  0.021722 -0.020292 NaN
1        2  2  8.01328   4.85028  1.641810  0.006165 -0.000003  0.005062 NaN
2        3  1  7.38052   4.83557  0.908616  0.008912 -0.002868 -0.022423 NaN
3        4  1  4.80607   4.99439  8.907990  0.025270  0.013688  0.007154 NaN
4        5  2  4.74970   4.77019  7.967840 -0.004953 -0.003361  0.005867 NaN
...    ... ..      ...       ...       ...       ...       ...       ...  ..
22870  371  2  5.99702  15.28940  8.358520 -0.002494 -0.001633 -0.005420 NaN
22871  372  1  6.44791  14.62470  7.770960  0.009041  0.005236  0.000318 NaN
22872  373  1  1.75756   1.61822  3.808440  0.011774 -0.018397 -0.000799 NaN
22873  374  2  2.29898   1.02413  4.403190 -0.003222 -0.002237  0.003255 NaN
22874  375  1  1.82499  15.54560  4.115290 -0.001409  0.018354 -0.001956 NaN

[22875 rows x 9 columns]

My question is, why am I seeing column index 8 with NaN entries? Why is it even seeing a column index of 8?

Upvotes: 1

Views: 396

Answers (3)

Scott Boston

Reputation: 153460

One way to handle this is to use a positive lookahead in a regex separator. This says: split on any whitespace character that is followed by another character. The space at the end of each line is not followed by anything, so it is not treated as a separator, and the empty column is never created.

df = pd.read_csv(stxt, sep=r"\s(?=.)", header=None, engine='python')

MVCE:

from io import StringIO
import pandas as pd

stxt = StringIO("""1 1 8.40871 3.96667 1.62792 -0.0242897 0.0217222 -0.0202916 
2 2 8.01328 4.85028 1.64181 0.00616535 -2.50788e-06 0.00506225 
3 1 7.38052 4.83557 0.908616 0.00891202 -0.00286839 -0.0224226 
4 1 4.80607 4.99439 8.90799 0.02527 0.0136878 0.00715371 
5 2 4.7497 4.77019 7.96784 -0.00495311 -0.0033607 0.00586672 
6 1 5.59201 5.07857 7.60231 0.00295874 -0.0148703 -0.0234462 
7 1 7.20491 1.66118 13.9242 -0.0267906 -0.0263551 -0.0270223 
8 2 8.00486 1.60311 14.4665 -0.00277453 0.00193951 0.00670056 
9 1 8.718 1.48953 13.8209 0.00832425 -0.0201998 -0.00663991 
10 1 1.28567 5.5535 11.6424 0.00428372 -0.00289222 -0.0182445 
11 2 1.56894 4.74096 11.1987 0.00614187 0.00526615 0.000498092 
12 1 2.40718 4.98336 10.7782 -0.0155901 -0.0217939 0.00378399 
13 1 15.325 7.97006 8.07627 0.010599 0.0244346 -0.0146411 """)

df = pd.read_csv(stxt, sep=r"\s(?=.)", header=None, engine='python')
# Note: regex separators require the Python parser engine.
# With plain sep=" ", you get a 9th column (index 8) full of NaNs.

df

Output:

     0  1         2        3          4         5         6         7
0    1  1   8.40871  3.96667   1.627920 -0.024290  0.021722 -0.020292
1    2  2   8.01328  4.85028   1.641810  0.006165 -0.000003  0.005062
2    3  1   7.38052  4.83557   0.908616  0.008912 -0.002868 -0.022423
3    4  1   4.80607  4.99439   8.907990  0.025270  0.013688  0.007154
4    5  2   4.74970  4.77019   7.967840 -0.004953 -0.003361  0.005867
5    6  1   5.59201  5.07857   7.602310  0.002959 -0.014870 -0.023446
6    7  1   7.20491  1.66118  13.924200 -0.026791 -0.026355 -0.027022
7    8  2   8.00486  1.60311  14.466500 -0.002775  0.001940  0.006701
8    9  1   8.71800  1.48953  13.820900  0.008324 -0.020200 -0.006640
9   10  1   1.28567  5.55350  11.642400  0.004284 -0.002892 -0.018245
10  11  2   1.56894  4.74096  11.198700  0.006142  0.005266  0.000498
11  12  1   2.40718  4.98336  10.778200 -0.015590 -0.021794  0.003784
12  13  1  15.32500  7.97006   8.076270  0.010599  0.024435 -0.014641

Upvotes: 1

footfalcon

Reputation: 619

For straight-up simplicity I would go with @ALollz's answer (just drop the NaN column after reading the csv). But if you want to clean the text file first for some reason, you could do this:

import pandas as pd

with open("cleaned_coords.txt", "r") as f:
    a = [i.rstrip() for i in f]   # strip the trailing space from each line
a = [i.split(' ') for i in a]     # split each line into a list of fields

df = pd.DataFrame(a).set_index(0)

Upvotes: 1

ALollz

Reputation: 59519

Your delimiter is a space (' ') and every line has an additional space at the end. pandas will split on that final space and, as there is nothing left after that space, create a column of all NaN values.
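You can reproduce the effect with a couple of lines (using StringIO here so it runs without the data file; the two short sample rows are made up for illustration):

```python
from io import StringIO
import pandas as pd

# Each line ends with a trailing space, just like cleaned_coords.txt
sample = StringIO("1 1 8.40871 \n2 2 8.01328 \n")
df = pd.read_csv(sample, sep=" ", header=None)

print(df.shape)          # (2, 4) -- one more column than there are fields
print(df[3].isna().all())  # True: the extra column is entirely NaN
```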

Ideally you should fix how the text file is generated and remove the extra delimiter at the end of each line. In case that is too much work, or not possible, a simple solution is to read in everything and ignore the last column:

df = pd.read_csv('cleaned_coords.txt', sep=' ', header=None).iloc[:, :-1]

If you are unsure whether there is a problem beforehand you can read everything and drop completely useless NaN columns:

df = pd.read_csv('cleaned_coords.txt', sep=' ', header=None).dropna(how='all', axis=1)
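On the same trailing-space sample as above, the all-NaN column disappears after `dropna` (again using StringIO with made-up rows so the snippet is self-contained):

```python
from io import StringIO
import pandas as pd

sample = StringIO("1 1 8.40871 \n2 2 8.01328 \n")
df = pd.read_csv(sample, sep=" ", header=None).dropna(how="all", axis=1)

print(df.shape)  # (2, 3) -- the all-NaN column is gone
```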

*pd.read_csv already returns a DataFrame. No need to call pd.DataFrame on it again.

Upvotes: 3
