Reputation: 93
run the following code:
df.info()
df.to_csv(file,index=False,sep=',',encoding='utf_8_sig')
df=pd.DataFrame(pd.read_csv(file,encoding='utf_8_sig'))
df.info()
And the output is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6723 entries, 0 to 6722
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmt_id 6723 non-null int64
1 info_id 6723 non-null int64
2 cmt_user_id 6723 non-null int64
3 publish_time 6723 non-null object
4 cmt_time 6723 non-null object
5 diff_time 6723 non-null float64
6 platform 6723 non-null object
7 post 6723 non-null object
8 cmt_content 6723 non-null object
9 post_view 4098 non-null object
10 post_like 6723 non-null object
11 post_dislike 6723 non-null object
12 post_cmt 6559 non-null object
13 post_repost 1965 non-null object
14 cmt_like 5630 non-null object
15 cmt_dislike 5630 non-null object
16 cmt_reply 4336 non-null object
17 cmt_repost 4460 non-null object
18 user_gender 3091 non-null object
19 user_score 4572 non-null object
20 user_post_star 4048 non-null object
21 user_reply 993 non-null object
22 user_post 4572 non-null object
23 user_friend 993 non-null object
dtypes: float64(1), int64(3), object(20)
memory usage: 1.2+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6723 entries, 0 to 6722
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmt_id 6723 non-null int64
1 info_id 6723 non-null int64
2 cmt_user_id 6723 non-null int64
3 publish_time 6723 non-null object
4 cmt_time 6723 non-null object
5 diff_time 6723 non-null float64
6 platform 6723 non-null object
7 post 6419 non-null object
8 cmt_content 6723 non-null object
9 post_view 4098 non-null float64
10 post_like 2081 non-null float64
11 post_dislike 1767 non-null float64
12 post_cmt 6544 non-null float64
13 post_repost 1446 non-null float64
14 cmt_like 0 non-null float64
15 cmt_dislike 0 non-null float64
16 cmt_reply 0 non-null float64
17 cmt_repost 124 non-null float64
18 user_gender 2622 non-null object
19 user_score 4065 non-null float64
20 user_post_star 3579 non-null float64
21 user_reply 519 non-null float64
22 user_post 4103 non-null float64
23 user_friend 148 non-null float64
dtypes: float64(15), int64(3), object(6)
memory usage: 1.2+ MB
It’s obvious to see some data missing in a few columns when I use Pandas to save my dataframe. There might be some bugs regards to the df.to_csv() function, cuz data is also missing in the csv file. (some online references describe issues about encoding, but that don't work in my settings) Why would that happen? How can I fix it?
Upvotes: 1
Views: 1973
Reputation: 26211
As I said in the comments, CSV is a non-typed and sometimes pretty lousy format. IMHO, Pandas' read_csv()
is one of the best (and fastest) CSV readers out there, and does a great job at inferring data types from sniffing a few lines of the input file. But sometimes, the task is just too hard or impossible.
For example, a column full of empty strings (dtype: object
in the first df
) might become interpreted as full of NaN
when reading back (dtype: float64
).
Other examples include saving hexadecimal values as strings. This has happened to us: in some rare cases, the values all looked like int
(no 'a'..'f'
digits), and were then read and interpreted as such (which of course was completely wrong and cost us hours of debugging).
Tips
To get an idea of precisely what values are not loaded back the way they were initially, try:
df2 = pd.read_csv(...) # re-read into a different dataframe
mask = ((df != df2) & ~df.isna()).any(1)
df.loc[mask].head()
This will show some of the rows where the values are different (in at least one column).
Save your data into a typed format, for example parquet. This will avoid many of these CSV pitfalls and be generally safer.
Upvotes: 1