TTT

Reputation: 4434

Strings are truncated when creating a NumPy array from a pandas DataFrame

I am sorry if this is too basic. Essentially, I am using pandas to load a huge CSV file and then converting it to a NumPy array for post-processing. I appreciate any help!

The issue is that some of the strings are cut short during the conversion from pandas DataFrame to NumPy array. For example, the string in the "abstract" column is complete in the DataFrame (see the output of print datafile["abstract"][0] below). However, once I convert the DataFrame to a NumPy array, only the beginning of the string is left (see the output of print df_all[0,3] below).

import pandas as pd
import csv
import numpy as np

datafile = pd.read_csv(path, header=0)
df_all = pd.np.array(datafile, dtype='string')
header_t = list(datafile.columns.values)

The string is complete in the pandas DataFrame:

print datafile["abstract"][0]
 In order to test the widely held assumption that homeopathic medicines contain negligible quantities of their major ingredients, six such medicines labeled in Latin as containing arsenic were purchased over the counter and by mail order and their arsenic contents measured. Values determined were similar to those expected from label information in only two of six and were markedly at variance in the remaining four. Arsenic was present in notable quantities in two preparations. Most sales personnel interviewed could not identify arsenic as being an ingredient in these preparations and were therefore incapable of warning the general public of possible dangers from ingestion. No such warnings appeared on the labels.

The string is incomplete in the NumPy array:

print df_all[0,3]
In order to test the widely held assumption that homeopathic me

Upvotes: 2

Views: 840

Answers (1)

CT Zhu

Reputation: 54380

I think that when you specify dtype='string', you are essentially requesting the default S64 type, which truncates each string to 64 characters. Just drop the dtype='string' argument and you should be good to go (the dtype will then be object).
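To illustrate the difference between a fixed-width string dtype and object dtype, here is a minimal sketch using NumPy alone. The sample text is a hypothetical stand-in for one cell of the "abstract" column, and 'S64' is written out explicitly rather than relying on what dtype='string' resolves to in any particular NumPy version.

```python
import numpy as np

# Hypothetical sample text; any ASCII string longer than 64 characters will do.
text = ("In order to test the widely held assumption that homeopathic "
        "medicines contain negligible quantities")

# Fixed-width byte strings: each element keeps at most 64 bytes,
# and anything beyond that is silently dropped.
fixed = np.array([text], dtype="S64")

# Object dtype: each element refers to the full Python string.
flexible = np.array([text], dtype=object)

print(fixed.dtype, len(fixed[0]))
print(flexible.dtype, len(flexible[0]))
```

The fixed-width array gives no warning when it cuts the text, which is why the loss only shows up when the values are printed.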

Better yet, don't convert the DataFrame to an array yourself; use the built-in df.values.
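A rough sketch of that approach follows. It assumes a small hand-built DataFrame in place of the CSV from the question; the "pmid" column and the sample text are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the loaded CSV.
abstract = "Arsenic was present in notable quantities in two preparations. " * 3
datafile = pd.DataFrame({"pmid": [1], "abstract": [abstract]})

# Let pandas choose the array dtype instead of forcing a string dtype.
# In pandas 0.24 and later, datafile.to_numpy() does the same job.
df_all = datafile.values
header_t = list(datafile.columns)

print(df_all.dtype)
print(df_all[0, 1] == abstract)  # the full string survives the conversion
```

Because the columns mix numbers and text, the resulting array holds Python objects, so no per-element length limit applies.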

Upvotes: 3
