Simd

Reputation: 21223

Questions about read_csv and str dtype

I have a large text file where each line has the following form:

1255 32627 some random stuff which might have numbers 1245

1. I would like to use read_csv to give me a data frame with three columns. The first two columns should be dtype uint32 and the third should hold everything afterwards as a single string. That is, the line above should be split into 1255, 32627 and some random stuff which might have numbers 1245. The following, for example, does not do it but at least shows the dtypes:

    pd.read_csv("foo.txt", sep=' ', header=None, dtype={0: np.uint32, 1: np.uint32, 2: str})

2. My second question is about the str dtype. How much RAM does it use, and if I know the maximum length of a string, can I reduce that?

Upvotes: 1

Views: 231

Answers (2)

A Magoon

Reputation: 1210

  1. Is there a reason you need to use pd.read_csv()? The code below is straightforward and easily modifies your column values to your requirements.

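    from csv import reader
    from numpy import uint32
    from pandas import DataFrame

    file = 'path/to/file.csv'
    rows = []
    with open(file, 'r', newline='') as f:
        # The file is space-delimited, so override the default comma delimiter.
        for row in reader(f, delimiter=' '):
            # Rejoin everything after the second field into one string.
            rows.append((row[0], row[1], ' '.join(row[2:])))

    # Build the frame from all collected rows, one row per input line.
    frame = DataFrame(rows)
    # The csv module yields strings, so cast the first two columns to uint32.
    frame = frame.astype({0: uint32, 1: uint32})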
    
  2. I don't understand the question. Do you expect your strings to be extremely long? On a 32-bit Python installation a single string is limited to roughly 2-3 GB. On a 64-bit installation the limit is far larger and, in practice, is bounded only by the amount of RAM you can fit into your system.
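If the concern in point 2 is the total memory of the string column rather than the length of any one string, a rough way to measure it is to compare pandas' shallow and deep memory reports. This is only a sketch: the sample text is made up, the column is forced to object dtype (an assumption, since that is how pandas has traditionally stored strings), and the exact byte counts depend on the Python and pandas versions.

```python
import pandas as pd

# Made-up sample; a real column would come from the parsed file.
text = pd.Series(
    ["some random stuff which might have numbers 1245"] * 1000,
    dtype=object,
)

# Shallow: counts only the array of object pointers, not the strings.
shallow = text.memory_usage(index=False, deep=False)
# Deep: also adds the size of each Python string object.
deep = text.memory_usage(index=False, deep=True)

print(shallow, deep)
```

The gap between the two numbers is the per-string cost, which includes a fixed overhead per Python string object on top of the characters themselves.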

Upvotes: 1

Gaurav Dhama

Reputation: 1336

You can use the Series.str.cat method:

    import pandas as pd

    df = pd.read_csv("foo.txt", sep=' ', header=None)

    # Create a new column which concatenates all columns from the third onward
    df['new'] = df.apply(lambda row: row.iloc[2:].apply(str).str.cat(sep=' '), axis=1)
    df = df[[0, 1, 'new']]
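A possible variation, in case the lines contain differing numbers of words: read each line whole and split on the first two spaces only, using Series.str.split with n=2. This is a minimal sketch, not a tested solution for the real file; the in-memory sample stands in for foo.txt, and its second line is invented for illustration.

```python
import io
import pandas as pd

# Stand-in for foo.txt; the second line is a made-up example.
sample = io.StringIO(
    "1255 32627 some random stuff which might have numbers 1245\n"
    "7 42 short\n"
)

# Read each line whole, then split on the first two spaces only,
# so the remainder of the line stays together in the third column.
lines = pd.Series(sample.read().splitlines())
df = lines.str.split(" ", n=2, expand=True)

# The split yields strings, so cast the two leading columns explicitly.
df = df.astype({0: "uint32", 1: "uint32"})

print(df.dtypes)
```

For a file on disk, the lines could come from open("foo.txt") instead of the StringIO object. Reading everything into a list first costs extra memory, which may matter for a very large file.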

Not sure exactly what you mean by your second question, but if you want to check the size of a string in memory you can use:

    import sys
    print(sys.getsizeof('some string'))

Sorry, I have no idea how knowing the maximum length would help you save memory, or whether that is even possible.
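One possibility, offered as a sketch rather than a confirmed answer: outside pandas, NumPy has fixed-width byte-string dtypes such as S20, where every element occupies exactly the stated number of bytes. The sample strings and the max_len value below are assumptions made for illustration, and the approach assumes ASCII-only text.

```python
import numpy as np

# Made-up sample strings; max_len is an assumed, known upper bound.
strings = ["some random stuff", "numbers 1245", "x"]
max_len = 20

# Fixed-width byte strings: every element takes exactly max_len bytes.
# Assumes ASCII-only text; longer strings would be silently truncated.
fixed = np.array([s.encode("ascii") for s in strings], dtype=f"S{max_len}")

print(fixed.nbytes)  # max_len bytes per element
```

Note that pandas generally converts such arrays back to object dtype when they are placed in a DataFrame, so any saving applies only while the data stays in a NumPy array.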

Upvotes: 1
