Nobody

Reputation: 77

Pandas reading scientific data

I have a csv-file with many columns containing something like

"4.2515014131285567e-001"

Pandas reads it as an object, so calculations don't work.

For example, multiplying by 2 gives me:

"4.2515014131285567e-0014.2515014131285567e-001"

How can I treat it as a number and do some math with it?

I tried setting "dtype=str", "dtype=float" and similar options, but nothing worked.
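For reference, a minimal sketch reproducing the symptom: when the column arrives as strings (object dtype), * repeats the text instead of multiplying:

import pandas as pd

# a string column: multiplication concatenates the text instead of doing math
s = pd.Series(["4.2515014131285567e-001"])
print(s * 2)  # -> "4.2515014131285567e-0014.2515014131285567e-001"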

Upvotes: 0

Views: 127

Answers (2)

Valdi_Bo

Reputation: 31011

Try the following test:

Create the following DataFrame, using read_csv, but from a text buffer:

import pandas as pd
from io import StringIO

txt = '''c1,c2,c3
Xxxxx,4.2515014131285567e-001,4.2515014131285555e-001
Yyyyy,4.2515014131284444e-001,4.2515014131283333e-001
Zzzzz,4.2515014131282222e-001,4.2515014131281111e-001'''

# read the CSV from an in-memory text buffer
df = pd.read_csv(StringIO(txt))

Then check the column types with df.info(). For both c2 and c3 you should get float64.

If you execute df.c2 * 2, you should get the doubled values. Don't worry about the smaller number of decimal digits displayed; that is just a matter of Pandas display options.
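For instance, continuing with the DataFrame above, you can raise the display precision with a standard Pandas option:

# show up to 16 significant digits instead of the default 6
pd.set_option('display.precision', 16)
print(df.c2 * 2)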

You can display an individual number with almost full precision, using df.loc[0, 'c2'] (I got 0.4251501413128557).

You should get the same results even if the numbers are surrounded by, e.g., double quotes.
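A quick sketch checking that claim (same kind of buffer, values quoted):

import pandas as pd
from io import StringIO

txt2 = '''c1,c2
Xxxxx,"4.2515014131285567e-001"
Yyyyy,"4.2515014131284444e-001"'''
df2 = pd.read_csv(StringIO(txt2))
print(df2.dtypes)  # c2 is still float64: the CSV parser strips the quotes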

Up to now everything was OK, but now try a second test:

In row 3, column c2, remove the e in front of -001, so the value becomes 4.2515014131282222-001, then run read_csv again.

The changed value is no longer a properly formatted float, so read_csv falls back to the object dtype (actually strings) for column c2 (you can confirm this with df.info()).

My assumption is that somewhere in your text file the format of one number is similarly "corrupted", and just this prevents read_csv from reading that column as float.

To locate the source of this error, run:

df.c2 = pd.to_numeric(df.c2, errors='coerce')

(replacing c2 with the proper column name) and then look in this column for NaN values.

Then look at the corresponding row in the input file and correct the error.
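Put together, a sketch of locating the offending rows (assuming your DataFrame is df and the column is named c2):

import pandas as pd

# unparseable entries become NaN instead of raising an error
df.c2 = pd.to_numeric(df.c2, errors='coerce')

# list the rows whose c2 value failed to parse
print(df[df.c2.isna()])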

Alternative: df.dropna(inplace=True) removes each row containing NaN in any column. You may also add a subset=['column_name'] parameter to drop only rows with NaN in that one column.
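For example, to drop only the rows where c2 failed to parse (same hypothetical column name as above):

# drop rows with NaN in the c2 column only, keeping the rest
df.dropna(subset=['c2'], inplace=True)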

Upvotes: 1

krewsayder

Reputation: 446

With some pre-processing, you can convert the data on import and remove non-float records before importing, if any exist.

Initial dataset in test.txt:

Math
4.2515014131285567e-001
asdas
123123
asdasd124
123
125423414asd

The helper below tests whether a value parses as a float, returning True/False; the loop uses it to build a list of line numbers to skip.

import pandas as pd

def isFloat(val):
    # return True if val parses as a float, False otherwise
    try:
        float(val)
        return True
    except ValueError:
        return False

with open('test.txt', 'r') as f:
    skiplines = []
    for i, v in enumerate(f.readlines()):
        # record line numbers whose first field is not a valid float
        if not isFloat(v.split(',')[0]):
            skiplines.append(i)

# the header line is not a float either, but we want to keep it
del skiplines[0]

# the converter key must match the column header ('Math' in test.txt)
converter = {'Math': lambda x: float(x)}
df = pd.read_csv('test.txt', converters=converter, skiprows=skiplines)
The lambda could also just be a data type such as float. I like demonstrating converters because you can easily round values or apply other logic here if you need to.
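For instance, a sketch of a converter that also rounds during import (using the 'Math' column and the skiplines list from above):

# round each parsed value to 4 decimal places while reading
converter = {'Math': lambda x: round(float(x), 4)}
df = pd.read_csv('test.txt', converters=converter, skiprows=skiplines)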

The final DataFrame looks as expected (note the trailing zeros, because I have not set a display format):

print(df)
           Math
0       0.42515
1  123123.00000
2     123.00000

Upvotes: 1
