Moritz
Moritz

Reputation: 5408

Ignore character while importing with pandas

I could not find such an option in the documentation. A measuring device spits out everything in Excel:

    <>    A    B    C
 1
 2
 3

When I delete the "<>" characters manually everything works fine. Is there a way to circumvent that (without conversion to csv)?

I do:

import pandas as pd 
df = pd.read_excel(filename,sheetname,skiprows=0,header=0,index_col=0)

skiprow = 1 does not do the trick since pandas uses the first row as names. If I supply names = list(range(1, 4)) the first data row is lost.

Upvotes: 3

Views: 10737

Answers (5)

Aritra
Aritra

Reputation: 430

Expanding on Peruz's answer:-

For your case, using regex

df = pd.read_csv(filename, sep="(?<!<>)\s+", engine='python')

This should read in the columns properly, except that the first column would be named <> A

To change this, simply alter the first column name

df.columns = pd.Series(df.columns.str.replace("<>\s", ""))

In the regex expression, \s+ matches any number of space characters except when preceded by whatever is mentioned in the negative lookaround denoted by (?<!charceters_to_ignore)

Upvotes: 4

Peruz
Peruz

Reputation: 433

Pandas read_csv() supports regex. You can avoid matching the white space if it is preceded by something (in your case #). Just as an example, avoiding "!":

sep='(?<!\\!)\s+'

if you want you could rename the column to remove the initial character and white space.

cheers

Upvotes: 1

Jeffrey Chan
Jeffrey Chan

Reputation: 43

I have the same problem. My first line is

# id x y ...

So pandas header keyword doesn't work. I did the following by reading it twice:

cos_phot_header = pd.read_csv(table, delim_whitespace=True, header=None, engine='python', nrows=1)
cos_plot_text_header = cos_phot_header.drop(0, axis=1).values.tolist()
cos_phot_data = pd.read_csv(table, skip_blank_lines=True, comment='#', 
               delim_whitespace=True, header=None, engine='python', names=cos_plot_text_header[0])

I don't understand why there is no such option in pandas to do this, it is a very common problem that everyone encounters. You can also read the table with no lines (nrows=0) and use .columns, but honestly I think it is an equally ugly solution to the problem.

Upvotes: 1

Gandalf Saxe
Gandalf Saxe

Reputation: 391

I have the same problem. My first line is:

# id ra dec ...

Where # is the commenting-character in Python. import_csv thinks that # is a column header, but it's not. The workaround I used was to define the headers manually:

headerlist = ['id', 'ra', 'dec', ...]  
df = pd.read_csv('data.txt', index_col=False, header=0, names=headerlist)

Note that index_col is optional in regards to this problem.

If there is any option to ignore a certain character in header line, I haven't found it. Hope this solution can be improved upon.

Upvotes: 0

wander95
wander95

Reputation: 1366

Another option would be:

f = open(fname, 'r')
line1 = f.readline()
data1 = pd.read_csv(f, sep='\s+', names=line1.replace(' #', '').split(), dtype=np.float)

You might have a different separator though.

Upvotes: 1

Related Questions