Reputation: 93
I want to remove all rows that have at least one 'NaN'. The dataframe looks like the one below, but the actual one has around 1000004 rows.
The full CSV file: file
Time (µs) | ChannelA (mV) | ChannelB (mV) | ChannelC (mV) | ChannelD (mV) |
---|---|---|---|---|
1 | 0.1 | 0.2 | 0.3 | 0.4 |
2 | 0.5 | NaN | 0.4 | 0.5 |
3 | 0.6 | NaN | 0.3 | 0.11 |
4 | 0.8 | 0.6 | NaN | 0.6 |
5 | 0.7 | 0.1 | 0.4 | NaN |
The code I have written is as follows:
import pandas as pd

def fft(x, Plot_ShareY=True):
    # load the csv file
    dfs = pd.read_csv(x, delimiter=";", skiprows=(1, 2), decimal=",", na_values='NaN')
    # remove spaces from column names
    dfs.columns = dfs.columns.str.replace(' ', '')
    # convert non-numeric values (e.g. the infinity symbol) to NaN
    for c in dfs:
        dfs[c] = pd.to_numeric(dfs[c], errors='coerce')
    # drop rows containing any NaN
    dfs = dfs.dropna(axis=0, how='any')
    print(dfs)
I expect to have at least 300000 rows, but I get only around 200000. When I check the actual CSV file, the first NaN does not occur until at least the 380000th row. So why is dropna() removing extra rows?
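A quick diagnostic that might help narrow this down is counting the NaNs per column (and per row) before dropping anything. A minimal sketch on made-up toy data mimicking the table above:

```python
import numpy as np
import pandas as pd

# toy frame with the same shape of problem (values are illustrative)
df = pd.DataFrame({'ChannelA(mV)': [0.1, 0.5, 0.6],
                   'ChannelB(mV)': [0.2, np.nan, np.nan]})

print(df.isna().sum())              # NaN count per column
print(df.isna().any(axis=1).sum())  # number of rows dropna would remove
```

If the per-row count is higher than expected, some values are being coerced to NaN during parsing rather than being NaN in the file.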
Upvotes: 0
Views: 668
Reputation: 8826
Updated code:

You can use stack and unstack with the vectorised str.replace to convert , to . and you are done:
import numpy as np
import pandas as pd

dfs = pd.read_csv("Pipe03_2.csv", delimiter=";", skiprows=(1,2), decimal=",", na_values=['(NA)'], low_memory=False)
dfs = dfs.rename(columns=lambda x: x.strip())
dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
dfs = dfs.dropna(axis=0, how='any')
dfs = dfs[['Channel A', 'Channel B', 'Channel C', 'Channel D']].stack().str.replace(',','.').unstack()
#dfs = dfs[['Channel A', 'Channel B', 'Channel C', 'Channel D']].stack().str.replace(',','.').unstack().astype(float)  # <-- this changes the dtype to float
Or, to be a little more handy, you can define the column names first and then run the replace:
cols = ['Channel A', 'Channel B', 'Channel C', 'Channel D']
dfs = dfs[cols].stack().str.replace(',','.').unstack()
OR
dfs = dfs.loc[:, dfs.columns != 'Time'].stack().str.replace(',','.').unstack()
OR
dfs = dfs[dfs.columns.difference(['Time'])].stack().str.replace(',','.').unstack()
OR
dfs = dfs.loc[:, ~dfs.columns.isin(['Time'])].stack().str.replace(',','.').unstack()
OR
dfs = dfs[dfs.columns.drop('Time')].stack().str.replace(',','.').unstack()
OR
dfs = dfs.drop('Time', axis=1).stack().str.replace(',','.').unstack()
# dfs = dfs.drop(columns='Time').stack().str.replace(',','.').unstack()
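As a minimal, self-contained sketch of the stack/str.replace/unstack pattern above on toy data (the values and column names here are made up for illustration):

```python
import pandas as pd

# toy frame of comma-decimal strings, as read from a European-style CSV
df = pd.DataFrame({'Channel A': ['0,5', '1,2'],
                   'Channel B': ['-0,3', '2,0']})

# stack to one Series of strings, swap the decimal comma, unstack back
out = df.stack().str.replace(',', '.', regex=False).unstack().astype(float)
print(out)
```

stack() turns the whole frame into a single Series, so one vectorised str.replace covers every column at once.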
Result:
dfs.head():
Time Channel A Channel B Channel C Channel D
0 -20.000096 -0.366256 -1.098767 0.000000 -0.366256
1 -20.000046 -0.366256 -1.465022 -0.366256 0.305213
2 -19.999996 0.305213 -0.732511 0.000000 -0.366256
3 -19.999946 -0.366256 -1.098767 0.000000 0.000000
4 -19.999896 0.671469 -0.366256 0.305213 0.000000
dfs.tail()
Time Channel A Channel B Channel C Channel D
944234 27.211601 -358.7474 499.8168 -169.2101 -236.1738
944237 27.211751 -358.7474 499.8168 -169.5764 -235.4413
944239 27.211851 -358.7474 499.8168 -169.9426 -235.8076
944240 27.211901 -358.7474 499.8168 -170.3089 -235.8076
944241 27.211951 -359.1136 499.8168 -170.3089 -236.1738
Update

After thoroughly reading your CSV file and looking at the data, it needs cleaning at the onset itself. You literally have five columns underneath in the file, headed (ms);(mV);(mV);(mV);(mV), apparently for some kind of conductivity measurements.
New code will be:
dfs = pd.read_csv("Pipe03_2.csv", skiprows=(0,1,2), sep=',',names=['Numbers','Time','Channel A','Channel B','Channel C','Channel D'], dtype='unicode')
dfs = dfs.dropna(axis=0, how='any')
print(dfs.head(200))
Results with modified solution:
Numbers Time Channel A Channel B Channel C Channel D
0 -20 00009598;-0 36625560;-1 09876700;0 00000000;-0 36625560
1 -20 00004598;-0 36625560;-1 46502200;-0 36625560;0 30521300
2 -19 99999598;0 30521300;-0 73251120;0 00000000;-0 36625560
3 -19 99994598;-0 36625560;-1 09876700;0 00000000;0 00000000
4 -19 99989598;0 67146860;-0 36625560;0 30521300;0 00000000
5 -19 99984598;-0 73251120;-0 73251120;0 00000000;0 00000000
6 -19 99979598;0 00000000;-0 73251120;0 00000000;0 00000000
7 -19 99974598;-0 36625560;-1 09876700;0 00000000;-0 36625560
8 -19 99969598;0 00000000;-1 09876700;-0 36625560;-0 36625560
9 -19 99964598;-0 36625560;-0 73251120;-0 36625560;-0 36625560
10 -19 99959598;-0 73251120;-0 73251120;-0 36625560;-0 36625560
11 -19 99954598;0 30521300;-0 73251120;0 00000000;0 00000000
12 -19 99949598;0 00000000;-0 73251120;0 00000000;0 30521300
13 -19 99944598;0 30521300;-0 36625560;0 30521300;0 30521300
14 -19 99939598;0 00000000;-1 09876700;0 00000000;0 00000000
...
Refined solution:

In case you want to get rid of anything after the first ; (including the ; itself), you can use the below:
dfs = pd.read_csv("Pipe03_2.csv", skiprows=(0,1,2), sep=',',names=['Numbers','Time','Channel A','Channel B','Channel C','Channel D'], dtype='unicode')
dfs = dfs.dropna(axis=0, how='any')
dfs['Time'] = dfs['Time'].str.split(";", expand=True, n=1)[0]
dfs['Channel A'] = dfs['Channel A'].str.split(";", expand=True, n=1)[0]
dfs['Channel B'] = dfs['Channel B'].str.split(";", expand=True, n=1)[0]
dfs['Channel C'] = dfs['Channel C'].str.split(";", expand=True, n=1)[0]
dfs['Channel D'] = dfs['Channel D'].str.split(";", expand=True, n=1)[0]
# or in single lambda function
# dfs = dfs.apply(lambda x: x.str.split(";", expand=True, n=1)[0])
Refined results:
Numbers Time Channel A Channel B Channel C Channel D
0 -20 00009598 36625560 09876700 00000000 36625560
1 -20 00004598 36625560 46502200 36625560 30521300
2 -19 99999598 30521300 73251120 00000000 36625560
3 -19 99994598 36625560 09876700 00000000 00000000
4 -19 99989598 67146860 36625560 30521300 00000000
5 -19 99984598 73251120 73251120 00000000 00000000
6 -19 99979598 00000000 73251120 00000000 00000000
7 -19 99974598 36625560 09876700 00000000 36625560
8 -19 99969598 00000000 09876700 36625560 36625560
9 -19 99964598 36625560 73251120 36625560 36625560
10 -19 99959598 73251120 73251120 36625560 36625560
11 -19 99954598 30521300 73251120 00000000 00000000
12 -19 99949598 00000000 73251120 00000000 30521300
13 -19 99944598 30521300 36625560 30521300 30521300
14 -19 99939598 00000000 09876700 00000000 00000000
15 -19 99934598 30521300 73251120 00000000 00000000
16 -19 99929598 00000000 09876700 00000000 36625560
17 -19 99924598 30521300 09876700 00000000 36625560
18 -19 99919598 30521300 73251120 36625560 36625560
19 -19 99914598 00000000 36625560 30521300 30521300
20 -19 99909598 30521300 09876700 00000000 30521300
To show the simplified and extracted version, tested on the given dataset: just clean the data as much as possible while feeding it in.
Solution
dfs = pd.read_csv("Pipe03_2.csv", skiprows=(0,1,2), sep=',',names=['Numbers','Time','Channel A','Channel B','Channel C','Channel D'], dtype='unicode')
dfs = dfs.dropna(axis=0, how='any')
dfs = dfs.apply(lambda x: x.str.split(";", expand=True, n=1)[0])
Result:
Numbers Time Channel A Channel B Channel C Channel D
... (identical to the refined results above)
Upvotes: 0
Reputation: 8826
I took a different approach, because in my testing pd.to_numeric(dfs[c], errors='coerce') turned everything into NaN and returned an empty DataFrame after dfs.dropna(axis=0, how='any'). So the best bet is to replace the symbols (∞ and -∞) with np.nan and then remove the NaN rows.

Rather than trimming all whitespace, it is better to use strip to remove the spaces at both ends of all the column names.

You will be left with 387060 rows.
import numpy as np
import pandas as pd

dfs = pd.read_csv("Pipe03_2.csv", delimiter=";", skiprows=(1,2), decimal=",", na_values=['(NA)'], low_memory=False)
dfs = dfs.rename(columns=lambda x: x.strip())
dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
dfs = dfs.dropna(axis=0, how='any')
print(dfs)
dfs.head()
Time Channel A Channel B Channel C Channel D
0 -20.000096 -0,36625560 -1,09876700 0,00000000 -0,36625560
1 -20.000046 -0,36625560 -1,46502200 -0,36625560 0,30521300
2 -19.999996 0,30521300 -0,73251120 0,00000000 -0,36625560
3 -19.999946 -0,36625560 -1,09876700 0,00000000 0,00000000
4 -19.999896 0,67146860 -0,36625560 0,30521300 0,00000000
dfs.tail()
Time Channel A Channel B Channel C Channel D
944234 27.211601 -358,74740000 499,81680000 -169,21010000 -236,17380000
944237 27.211751 -358,74740000 499,81680000 -169,57640000 -235,44130000
944239 27.211851 -358,74740000 499,81680000 -169,94260000 -235,80760000
944240 27.211901 -358,74740000 499,81680000 -170,30890000 -235,80760000
944241 27.211951 -359,11360000 499,81680000 -170,30890000 -236,17380000
Upvotes: 1
Reputation: 1490
You have improper cell values, so you have to clean your data before converting it to numeric:
dfs = dfs.loc[0:380_000].copy()
for c in dfs:
    dfs[c] = dfs[c].astype(float)
# drop rows containing any NaN
dfs = dfs.dropna(axis=0, how='any')
print(dfs)
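Note that astype(float) will raise on comma decimals or infinity symbols, so the cleanup step matters. A hedged sketch of what that cleanup might look like (assuming pandas/numpy; the column name is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Channel A': ['0,5', '∞', '1,2']})

cleaned = (df.replace({'∞': np.nan, '-∞': np.nan})                   # infinity symbols -> NaN
             .apply(lambda s: s.str.replace(',', '.', regex=False))  # comma decimals -> dots
             .dropna(axis=0, how='any')                              # drop the NaN rows
             .astype(float))
print(cleaned)
```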
Upvotes: 0