Mrowkacala
Mrowkacala

Reputation: 153

.dat file import in pandas

I want to import this publicly available file using pandas. Simply as csv (I have renamed simply .dat to .csv):

clinton = pd.read_csv("C:/Users/Mateusz/Downloads/ML_DS-20180523T193457Z-001/ML_DS/clinton1.csv")

However in some cases country name is composed of two words, not just one. In those cases shifts my data frame to the right. This looks like (name hot springs is in two columns): enter image description here How to fix it for the entire dataset at once?

Upvotes: 2

Views: 17336

Answers (2)

Tim Nixon
Tim Nixon

Reputation: 45

You can use a regular expression as a separator. In your specific case, all the delimiters are more than one space whereas the spaces in the names are just single spaces.

import pandas as pd

clinton = pd.read_csv("clinton1.csv", sep='\s{2,}', header=None, engine='python')

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

No need to rename the .dat to .csv. Instead you can use a regex that matches two or more spaces as a column separator.

Try use sep parameter:

pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep='\s\s+', engine='python')

Output:

            0      1     2      3      4     5      6      7     8     9    10
0  Autauga, AL  30.92  31.7  57623  15768  15.2  10.74  51.41  60.4  2.36  457
1  Baldwin, AL  26.24  35.5  84935  16954  13.6   9.73  51.34  66.5  5.40  282
2  Barbour, AL  46.36  32.8  83656  15532  25.0   8.82  53.03  28.8  7.02   47
3   Blount, AL  32.92  34.5  61249  14820  15.0   9.67  51.15  62.4  2.36  185
4  Bullock, AL  67.67  31.7  75725  11120  33.0   7.08  50.76  17.6  2.91  141

If you want your state as a seperate column you can use this sep='\s\s+|,' which means seperate columns on two spaces or more OR a comma.

pd.read_csv('http://users.stat.ufl.edu/~winner/data/clinton1.dat',
            header=None, sep='\s\s+|,', engine='python')

Output:

        0    1      2     3      4        5     6      7      8     9     10     11
0  Autauga   AL  30.92  31.7  57623  15768.0  15.2  10.74  51.41  60.4  2.36  457.0
1  Baldwin   AL  26.24  35.5  84935  16954.0  13.6   9.73  51.34  66.5  5.40  282.0
2  Barbour   AL  46.36  32.8  83656  15532.0  25.0   8.82  53.03  28.8  7.02   47.0
3   Blount   AL  32.92  34.5  61249  14820.0  15.0   9.67  51.15  62.4  2.36  185.0
4  Bullock   AL  67.67  31.7  75725  11120.0  33.0   7.08  50.76  17.6  2.91  141.0

Upvotes: 9

Related Questions