Haorui He
Haorui He

Reputation: 93

Pandas df.to_csv but data missing?

run the following code:

df.info()
df.to_csv(file,index=False,sep=',',encoding='utf_8_sig')
df=pd.DataFrame(pd.read_csv(file,encoding='utf_8_sig'))
df.info()

And the output is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6723 entries, 0 to 6722
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   cmt_id          6723 non-null   int64  
 1   info_id         6723 non-null   int64  
 2   cmt_user_id     6723 non-null   int64  
 3   publish_time    6723 non-null   object 
 4   cmt_time        6723 non-null   object 
 5   diff_time       6723 non-null   float64
 6   platform        6723 non-null   object 
 7   post            6723 non-null   object 
 8   cmt_content     6723 non-null   object 
 9   post_view       4098 non-null   object 
 10  post_like       6723 non-null   object 
 11  post_dislike    6723 non-null   object 
 12  post_cmt        6559 non-null   object 
 13  post_repost     1965 non-null   object 
 14  cmt_like        5630 non-null   object 
 15  cmt_dislike     5630 non-null   object 
 16  cmt_reply       4336 non-null   object 
 17  cmt_repost      4460 non-null   object 
 18  user_gender     3091 non-null   object 
 19  user_score      4572 non-null   object 
 20  user_post_star  4048 non-null   object 
 21  user_reply      993 non-null    object 
 22  user_post       4572 non-null   object 
 23  user_friend     993 non-null    object 
dtypes: float64(1), int64(3), object(20)
memory usage: 1.2+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6723 entries, 0 to 6722
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   cmt_id          6723 non-null   int64  
 1   info_id         6723 non-null   int64  
 2   cmt_user_id     6723 non-null   int64  
 3   publish_time    6723 non-null   object 
 4   cmt_time        6723 non-null   object 
 5   diff_time       6723 non-null   float64
 6   platform        6723 non-null   object 
 7   post            6419 non-null   object 
 8   cmt_content     6723 non-null   object 
 9   post_view       4098 non-null   float64
 10  post_like       2081 non-null   float64
 11  post_dislike    1767 non-null   float64
 12  post_cmt        6544 non-null   float64
 13  post_repost     1446 non-null   float64
 14  cmt_like        0 non-null      float64
 15  cmt_dislike     0 non-null      float64
 16  cmt_reply       0 non-null      float64
 17  cmt_repost      124 non-null    float64
 18  user_gender     2622 non-null   object 
 19  user_score      4065 non-null   float64
 20  user_post_star  3579 non-null   float64
 21  user_reply      519 non-null    float64
 22  user_post       4103 non-null   float64
 23  user_friend     148 non-null    float64
dtypes: float64(15), int64(3), object(6)
memory usage: 1.2+ MB

It’s obvious to see some data missing in a few columns when I use Pandas to save my dataframe. There might be some bugs regards to the df.to_csv() function, cuz data is also missing in the csv file. (some online references describe issues about encoding, but that don't work in my settings) Why would that happen? How can I fix it?

Upvotes: 1

Views: 1973

Answers (1)

Pierre D
Pierre D

Reputation: 26211

As I said in the comments, CSV is a non-typed and sometimes pretty lousy format. IMHO, Pandas' read_csv() is one of the best (and fastest) CSV readers out there, and does a great job at inferring data types from sniffing a few lines of the input file. But sometimes, the task is just too hard or impossible.

For example, a column full of empty strings (dtype: object in the first df) might become interpreted as full of NaN when reading back (dtype: float64).

Other examples include saving hexadecimal values as strings. This has happened to us: in some rare cases, the values all looked like int (no 'a'..'f' digits), and were then read and interpreted as such (which of course was completely wrong and cost us hours of debugging).

Tips

  1. To get an idea of precisely what values are not loaded back the way they were initially, try:

    df2 = pd.read_csv(...)  # re-read into a different dataframe
    mask = ((df != df2) & ~df.isna()).any(1)
    
    df.loc[mask].head()
    

    This will show some of the rows where the values are different (in at least one column).

  2. Save your data into a typed format, for example parquet. This will avoid many of these CSV pitfalls and be generally safer.

Upvotes: 1

Related Questions