Reputation: 93
I want to remove all rows that have at least one 'NaN'. The dataframe looks like the one below, but the actual one has around 1000004 rows.
The full CSV file: file
Time (µs) | ChannelA (mV) | ChannelB (mV) | ChannelC (mV) | ChannelD (mV) |
---|---|---|---|---|
1 | 0.1 | 0.2 | 0.3 | 0.4 |
2 | 0.5 | NaN | 0.4 | 0.5 |
3 | 0.6 | NaN | 0.3 | 0.11 |
4 | 0.8 | 0.6 | NaN | 0.6 |
5 | 0.7 | 0.1 | 0.4 | NaN |
The code I have written is as follows:
import pandas as pd

def fft(x, Plot_ShareY=True):
    # load the csv file
    dfs = pd.read_csv(x, delimiter=";", skiprows=(1, 2), decimal=",", na_values='NaN')
    # remove spaces from column names
    dfs.columns = dfs.columns.str.replace(' ', '')
    # convert non-numeric values (e.g. the infinity symbol) to NaN
    for c in dfs:
        dfs[c] = pd.to_numeric(dfs[c], errors='coerce')
    # drop rows containing any NaN
    dfs = dfs.dropna(axis=0, how='any')
    print(dfs)
I expect to have at least 300000 rows, but I get only around 200000. When I check the actual CSV file, the first NaN does not occur until at least the 380000th row. So why is dropna() removing extra rows?
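A quick diagnostic that might help narrow this down is counting the NaNs per column (and per row) before dropping anything. A minimal sketch on made-up toy data mimicking the table above:

```python
import numpy as np
import pandas as pd

# toy frame with the same shape of problem (values are illustrative)
df = pd.DataFrame({'ChannelA(mV)': [0.1, 0.5, 0.6],
                   'ChannelB(mV)': [0.2, np.nan, np.nan]})

print(df.isna().sum())              # NaN count per column
print(df.isna().any(axis=1).sum())  # number of rows dropna would remove
```

If the per-row count is higher than expected, some values are being coerced to NaN during parsing rather than being NaN in the file.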
Upvotes: 0
Views: 668
Reputation: 8826
Updated code:

You can use stack and unstack with the vectorised str.replace to convert , to . and you are done:
import numpy as np
import pandas as pd

dfs = pd.read_csv("Pipe03_2.csv", delimiter=";", skiprows=(1,2), decimal=",", na_values=['(NA)'], low_memory=False)
dfs = dfs.rename(columns=lambda x: x.strip())
dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
dfs = dfs.dropna(axis=0, how='any')
dfs = dfs[['Channel A', 'Channel B', 'Channel C', 'Channel D']].stack().str.replace(',','.').unstack()
#dfs = dfs[['Channel A', 'Channel B', 'Channel C', 'Channel D']].stack().str.replace(',','.').unstack().astype(float)  # <-- this changes the dtype to float
Or, to be a little more handy, you can define the column names first and then run the replace:
cols = ['Channel A', 'Channel B', 'Channel C', 'Channel D']
dfs = dfs[cols].stack().str.replace(',','.').unstack()
OR
dfs = dfs.loc[:, dfs.columns != 'Time'].stack().str.replace(',','.').unstack()
OR
dfs = dfs[dfs.columns.difference(['Time'])].stack().str.replace(',','.').unstack()
OR
dfs = dfs.loc[:, ~dfs.columns.isin(['Time'])].stack().str.replace(',','.').unstack()
OR
dfs = dfs[dfs.columns.drop('Time')].stack().str.replace(',','.').unstack()
OR
dfs = dfs.drop('Time', axis=1).stack().str.replace(',','.').unstack()
# dfs = dfs.drop(columns='Time').stack().str.replace(',','.').unstack()
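As a minimal, self-contained sketch of the stack/str.replace/unstack pattern above on toy data (the values and column names here are made up for illustration):

```python
import pandas as pd

# toy frame of comma-decimal strings, as read from a European-style CSV
df = pd.DataFrame({'Channel A': ['0,5', '1,2'],
                   'Channel B': ['-0,3', '2,0']})

# stack to one Series of strings, swap the decimal comma, unstack back
out = df.stack().str.replace(',', '.', regex=False).unstack().astype(float)
print(out)
```

stack() turns the whole frame into a single Series, so one vectorised str.replace covers every column at once.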
Result:
dfs.head():
Time Channel A Channel B Channel C Channel D
0 -20.000096 -0.366256 -1.098767 0.000000 -0.366256
1 -20.000046 -0.366256 -1.465022 -0.366256 0.305213
2 -19.999996 0.305213 -0.732511 0.000000 -0.366256
3 -19.999946 -0.366256 -1.098767 0.000000 0.000000
4 -19.999896 0.671469 -0.366256 0.305213 0.000000
dfs.tail()
Time Channel A Channel B Channel C Channel D
944234 27.211601 -358.7474 499.8168 -169.2101 -236.1738
944237 27.211751 -358.7474 499.8168 -169.5764 -235.4413
944239 27.211851 -358.7474 499.8168 -169.9426 -235.8076
944240 27.211901 -358.7474 499.8168 -170.3089 -235.8076
944241 27.211951 -359.1136 499.8168 -170.3089 -236.1738
Update

After thoroughly reading your CSV file and looking at the data, it needs cleaning at the onset itself. You literally have five columns underneath in the file, headed (ms);(mV);(mV);(mV);(mV), apparently for some kind of conductivity measurements.
New code will be:
dfs = pd.read_csv("Pipe03_2.csv", skiprows=(0,1,2), sep=',',names=['Numbers','Time','Channel A','Channel B','Channel C','Channel D'], dtype='unicode')
dfs = dfs.dropna(axis=0, how='any')
print(dfs.head(200))
Results with modified solution:
Numbers Time Channel A Channel B Channel C Channel D
0 -20 00009598;-0 36625560;-1 09876700;0 00000000;-0 36625560
1 -20 00004598;-0 36625560;-1 46502200;-0 36625560;0 30521300
2 -19 99999598;0 30521300;-0 73251120;0 00000000;-0 36625560
3 -19 99994598;-0 36625560;-1 09876700;0 00000000;0 00000000
4 -19 99989598;0 67146860;-0 36625560;0 30521300;0 00000000
5 -19 99984598;-0 73251120;-0 73251120;0 00000000;0 00000000
6 -19 99979598;0 00000000;-0 73251120;0 00000000;0 00000000
7 -19 99974598;-0 36625560;-1 09876700;0 00000000;-0 36625560
8 -19 99969598;0 00000000;-1 09876700;-0 36625560;-0 36625560
9 -19 99964598;-0 36625560;-0 73251120;-0 36625560;-0 36625560
10 -19 99959598;-0 73251120;-0 73251120;-0 36625560;-0 36625560
11 -19 99954598;0 30521300;-0 73251120;0 00000000;0 00000000
12 -19 99949598;0 00000000;-0 73251120;0 00000000;0 30521300
13 -19 99944598;0 30521300;-0 36625560;0 30521300;0 30521300
14 -19 99939598;0 00000000;-1 09876700;0 00000000;0 00000000
...
Refined solution:

In case you want to get rid of anything after the first ; (including the ; itself), you can use the below:
dfs = pd.read_csv("Pipe03_2.csv", skiprows=(0,1,2), sep=',',names=['Numbers','Time','Channel A','Channel B','Channel C','Channel D'], dtype='unicode')
dfs = dfs.dropna(axis=0, how='any')
dfs['Time'] = dfs['Time'].str.split(";", expand=True, n=1)[0]
dfs['Channel A'] = dfs['Channel A'].str.split(";", expand=True, n=1)[0]
dfs['Channel B'] = dfs['Channel B'].str.split(";", expand=True, n=1)[0]
dfs['Channel C'] = dfs['Channel C'].str.split(";", expand=True, n=1)[0]
dfs['Channel D'] = dfs['Channel D'].str.split(";", expand=True, n=1)[0]
# or in single lambda function
# dfs = dfs.apply(lambda x: x.str.split(";", expand=True, n=1)[0])
Refined results:
Numbers Time Channel A Channel B Channel C Channel D
0 -20 00009598 36625560 09876700 00000000 36625560
1 -20 00004598 36625560 46502200 36625560 30521300
2 -19 99999598 30521300 73251120 00000000 36625560
3 -19 99994598 36625560 09876700 00000000 00000000
4 -19 99989598 67146860 36625560 30521300 00000000
5 -19 99984598 73251120 73251120 00000000 00000000
6 -19 99979598 00000000 73251120 00000000 00000000
7 -19 99974598 36625560 09876700 00000000 36625560
8 -19 99969598 00000000 09876700 36625560 36625560
9 -19 99964598 36625560 73251120 36625560 36625560
10 -19 99959598 73251120 73251120 36625560 36625560
11 -19 99954598 30521300 73251120 00000000 00000000
12 -19 99949598 00000000 73251120 00000000 30521300
13 -19 99944598 30521300 36625560 30521300 30521300
14 -19 99939598 00000000 09876700 00000000 00000000
15 -19 99934598 30521300 73251120 00000000 00000000
16 -19 99929598 00000000 09876700 00000000 36625560
17 -19 99924598 30521300 09876700 00000000 36625560
18 -19 99919598 30521300 73251120 36625560 36625560
19 -19 99914598 00000000 36625560 30521300 30521300
20 -19 99909598 30521300 09876700 00000000 30521300
To show the simplified and extracted version, tested on the given dataset: just clean the data as much as possible while feeding it in.
Solution
dfs = pd.read_csv("Pipe03_2.csv", skiprows=(0,1,2), sep=',',names=['Numbers','Time','Channel A','Channel B','Channel C','Channel D'], dtype='unicode')
dfs = dfs.dropna(axis=0, how='any')
dfs = dfs.apply(lambda x: x.str.split(";", expand=True, n=1)[0])
Result:
Numbers Time Channel A Channel B Channel C Channel D
... (identical to the refined results above)
Upvotes: 0
Reputation: 8826
I took a different approach, because in my testing pd.to_numeric(dfs[c], errors='coerce') turned everything into NaN and returned an empty DataFrame after dfs.dropna(axis=0, how='any'). So the best bet is to replace the symbols (∞ and -∞) with np.nan and then remove the NaN rows.

Rather than trimming all whitespace, it is better to use strip to remove the spaces at both ends of all the column names.

You will be left with 387060 rows.
import numpy as np
import pandas as pd

dfs = pd.read_csv("Pipe03_2.csv", delimiter=";", skiprows=(1,2), decimal=",", na_values=['(NA)'], low_memory=False)
dfs = dfs.rename(columns=lambda x: x.strip())
dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
dfs = dfs.dropna(axis=0, how='any')
print(dfs)
dfs.head()
Time Channel A Channel B Channel C Channel D
0 -20.000096 -0,36625560 -1,09876700 0,00000000 -0,36625560
1 -20.000046 -0,36625560 -1,46502200 -0,36625560 0,30521300
2 -19.999996 0,30521300 -0,73251120 0,00000000 -0,36625560
3 -19.999946 -0,36625560 -1,09876700 0,00000000 0,00000000
4 -19.999896 0,67146860 -0,36625560 0,30521300 0,00000000
dfs.tail()
Time Channel A Channel B Channel C Channel D
944234 27.211601 -358,74740000 499,81680000 -169,21010000 -236,17380000
944237 27.211751 -358,74740000 499,81680000 -169,57640000 -235,44130000
944239 27.211851 -358,74740000 499,81680000 -169,94260000 -235,80760000
944240 27.211901 -358,74740000 499,81680000 -170,30890000 -235,80760000
944241 27.211951 -359,11360000 499,81680000 -170,30890000 -236,17380000
Upvotes: 1
Reputation: 1490
You have improper cell values, so you have to clean your data before converting it to numeric:
dfs = dfs.loc[0:380_000].copy()
for c in dfs:
    dfs[c] = dfs[c].astype(float)
# drop rows containing any NaN
dfs = dfs.dropna(axis=0, how='any')
print(dfs)
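Note that astype(float) will raise on comma decimals or infinity symbols, so the cleanup step matters. A hedged sketch of what that cleanup might look like (assuming pandas/numpy; the column name is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Channel A': ['0,5', '∞', '1,2']})

cleaned = (df.replace({'∞': np.nan, '-∞': np.nan})                   # infinity symbols -> NaN
             .apply(lambda s: s.str.replace(',', '.', regex=False))  # comma decimals -> dots
             .dropna(axis=0, how='any')                              # drop the NaN rows
             .astype(float))
print(cleaned)
```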
Upvotes: 0