ARJ
ARJ

Reputation: 2080

Convert the string into a float value

I have copied a table with three columns from a pdf file. I am attaching the screenshot from the PDF here:

enter image description here

The values in the column padj are exponential values, however, when you copy from the pdf to an excel and then open it with pandas, these are strings or object data types. Hence, these values cannot be parsed as floats or numeric values. I need these values as floats, not as strings. Can someone help me with some suggestions? So far this is what I have tried.

The excel or the csv file is then opened in python using the escape_unicode encoding in order to circumvent the UnicodeDecodeError

## open the file
df       = pd.read_csv("S2_GSE184956.csv",header=0,sep=',',encoding='unicode_escape')[["DEGs","LFC","padj"]]

   df.head()
        DEGs   padj          LFC
    0   JUNB    1.5 ×10-8   -1.273329
    1   HOOK2   2.39×10-7   -1.109320
    2   EGR1    3.17×10-6   -4.187828
    3   DUSP1   3.95×10-6   -3.251030
    4   IL6     3.95×10-6   -3.415500
    5   ARL4C   5.06×10-6   -2.147519
    6   NR4A2   2.94×10-4   -3.001167
    7   CCL3L1  4.026×10-4  -5.293694


# Convert the string to float by replacing the x10- with exponential sign 
    df['padj'] = df['padj'].apply(lambda x: (unidecode(x).replace('x10-','x10-e'))).astype(float)
That threw an error,

    ValueError: could not convert string to float: '1.5 x10-e8'

Any suggestions would be appreciated. Thanks

Upvotes: 1

Views: 361

Answers (4)

mozway
mozway

Reputation: 262484

If you want a numerical vectorial solution, you can use:

df['float'] = (df['padj'].str.extract(r'(\d+(?:\.\d+))\s*×10(.?\d+)')
 .apply(pd.to_numeric).pipe(lambda d: d[0].mul(10.**d[1]))
)

output:

     DEGs        padj       LFC         float
0    JUNB   1.5 ×10-8 -1.273329  1.500000e-08
1   HOOK2   2.39×10-7 -1.109320  2.390000e-07
2    EGR1   3.17×10-6 -4.187828  3.170000e-06
3   DUSP1   3.95×10-6 -3.251030  3.950000e-06
4     IL6   3.95×10-6 -3.415500  3.950000e-06
5   ARL4C   5.06×10-6 -2.147519  5.060000e-06
6   NR4A2   2.94×10-4 -3.001167  2.940000e-04
7  CCL3L1  4.026×10-4 -5.293694  4.026000e-04

Intermediate:

df['padj'].str.extract('(\d+(?:\.\d+))\s*×10(.?\d+)')

       0   1
0    1.5  -8
1   2.39  -7
2   3.17  -6
3   3.95  -6
4   3.95  -6
5   5.06  -6
6   2.94  -4
7  4.026  -4

Upvotes: 1

Daniel Schneider
Daniel Schneider

Reputation: 2046

Updated based on the data you provided above. The most significant thing being that the x is actually a times symbol:

import pandas as pd
DEGs = ["JUNB", "HOOK2", "EGR1", "DUSP1", "IL6", "ARL4C", "NR4A2", "CCL3L1"]
padj = ["1.5 ×10-8", "2.39×10-7", "3.17×10-6", "3.95×10-6", "3.95×10-6", "5.06×10-6", "2.94×10-4", "4.026×10-4"]
LFC = ["-1.273329", "-1.109320", "-4.187828", "-3.251030", "-3.415500", "-2.147519", "-3.001167", "-5.293694"]
df = pd.DataFrame({'DEGs': DEGs, 'padj': padj, 'LFC': LFC})
# change to python-friendly float format
df['padj'] = df['padj'].str.replace(' ×10-', 'e-', regex=False)
df['padj'] = df['padj'].str.replace('×10-', 'e-', regex=False)
# convert padj from string to float
df['padj'] = df['padj'].astype(float)

will give you this dataframe: enter image description here

Upvotes: 1

Gonçalo Peres
Gonçalo Peres

Reputation: 13622

With the dataframe shared in the question on this last edit, the following using pandas.Series.str.replace and pandas.Series.astype will do the work:

df['padj'] = df['padj'].str.replace('×10','e').str.replace(' ', '').astype(float)

The goal is to get the cells to look like the following 1.560000e-08.


Notes:

  • Depending on the rest of the dataframe, additional adjustments might still be required, such as, removing the spaces ' that might exist in one of the cells. For that one can use pandas.Series.str.replace as follows

    df['padj'] = df['padj'].str.replace("'", '')
    

Upvotes: 1

Adrien Derouene
Adrien Derouene

Reputation: 79

Considering your sample (column padj), the code below should work:

f_value = eval(str_float.replace('x10', 'e').replace(' ', ''))

Upvotes: 1

Related Questions