Reputation: 4434
I am sorry if this is too basic... Essentially, I am using pandas
to load a huge CSV
file and then convert it to a numpy
array for post-processing. I appreciate any help!
The issue is that some of the strings were cut short during the transformation (from a pandas dataframe
to a numpy array
). For example, the strings in the column "abstract" were complete; see the output of print datafile["abstract"][0]
below. However, once I converted the dataframe to a numpy array
, only part of each string was left; see the output of print df_all[0,3]
below.
import pandas as pd
import numpy as np

datafile = pd.read_csv(path, header=0)
df_all = np.array(datafile, dtype='string')
header_t = list(datafile.columns.values)
print datafile["abstract"][0]
In order to test the widely held assumption that homeopathic medicines contain negligible quantities of their major ingredients, six such medicines labeled in Latin as containing arsenic were purchased over the counter and by mail order and their arsenic contents measured. Values determined were similar to those expected from label information in only two of six and were markedly at variance in the remaining four. Arsenic was present in notable quantities in two preparations. Most sales personnel interviewed could not identify arsenic as being an ingredient in these preparations and were therefore incapable of warning the general public of possible dangers from ingestion. No such warnings appeared on the labels.
print df_all[0,3]
In order to test the widely held assumption that homeopathic me
Upvotes: 2
Views: 840
Reputation: 54380
I think when you specify dtype='string'
, you are essentially specifying the default S64
type, which truncates your strings to 64 characters. Just skip the dtype='string'
part and you should be good to go (the dtype
will become object
).
Better yet, don't convert the DataFrame
by calling np.array
yourself; use the built-in df.values
attribute.
Upvotes: 3