heracho

Reputation: 610

pandas read_csv. How to ignore delimiter before line break

I'm reading a file with numerical values.

data = pd.read_csv('data.dat', sep=' ', header=None)

In the text file, each row ends with a space, so pandas expects a value that is not there and adds a NaN at the end of each row. For example:

2.343 4.234

is read as: [2.343, 4.234, nan]

I can avoid it using usecols=[0, 1], but I would prefer a more general solution.

Upvotes: 4

Views: 5319

Answers (4)

Algorithmatic

Reputation: 1892

You can simply use:

data = pd.read_csv('data.dat', sep=' ', header=None,
                   index_col=False  # < fixes file with delimiters at the end of each line
)

From the pandas documentation:

Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

This should work regardless of the delimiter type (e.g., spaces, tabs, or commas).

Upvotes: 1

Harry

Reputation: 131

You can use regular expressions in your sep argument.

Instead of specifying the separator as exactly one space, you can tell pandas to treat any run of whitespace as a single separator. The regular expression \s+ matches one or more whitespace characters (spaces or tabs). Write it as a raw string so that Python does not try to interpret \s as an escape sequence:

data = pd.read_csv('data.dat', sep=r'\s+', header=None)
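
To see the difference between the two separators, here is a minimal sketch. It assumes an in-memory string (the name raw is hypothetical) in place of data.dat, with every line ending in a trailing space as in the question:

```python
import io

import pandas as pd

# Stand-in for data.dat (hypothetical sample): every line ends with a trailing space.
raw = "2.343 4.234 \n1.000 2.000 \n"

# A single literal space as separator: the trailing space opens an empty third field.
single_space = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# A whitespace regex as separator: trailing whitespace does not create an extra field.
whitespace = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(single_space.shape)  # (2, 3), the last column is all NaN
print(whitespace.shape)    # (2, 2)
```

Note that \s+ also collapses repeated spaces between values, so it only suits files where an empty field between two separators is not meaningful.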

Upvotes: 9

Renate van Kempen

Reputation: 134

Can you change the separator in the CSV file to something other than a space? The trailing space separator might be the reason each row ends with a NaN. If the file used commas, for example, you could read it with:

    data = pd.read_csv('data.dat', sep=',', header=None)

That might solve the problem without having to clean the data.

Upvotes: 0

secretive

Reputation: 2112

Specifying which columns to read with usecols is a cleaner approach. Alternatively, you can drop the column after reading the data, but this comes with the overhead of reading data that you don't need. The generic approach will require you to create a regex parser, which will be more time-consuming and messier.
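
Both options might look like the sketch below. It assumes an in-memory string (the name raw is hypothetical) in place of data.dat, and it uses dropna with how="all" as one possible way to drop the empty column without knowing the column count in advance:

```python
import io

import pandas as pd

# Stand-in for data.dat (hypothetical sample): every line ends with a trailing space.
raw = "2.343 4.234 \n1.000 2.000 \n"

# Option 1: read only the columns you want (requires knowing their positions).
selected = pd.read_csv(io.StringIO(raw), sep=" ", header=None, usecols=[0, 1])

# Option 2: read everything, then drop any column in which every value is missing.
data = pd.read_csv(io.StringIO(raw), sep=" ", header=None)
cleaned = data.dropna(axis=1, how="all")

print(selected.shape)  # (2, 2)
print(cleaned.shape)   # (2, 2)
```

With how="all", a column that contains at least one real value is kept, so genuine data columns with occasional missing entries are not removed.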

Upvotes: 0
