V22
V22

Reputation: 157

Reading certain columns from multiple data files into one file in python

I have about 30 data files and I need to extract the 4th, 5th, and 6th columns. Then skip 14 columns and grab the next 3 columns and so on till the end of the file. Each data file is about 400 rows and 17000 columns. So far I have this:

file_list = glob.glob('*.dat')

with open("result.dat", "wb") as outfile:
    for f in file_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())

data = np.loadtxt('result.dat')

arr = np.array(data)
a = arr[:, 4:-1:17]
b = arr[:, 5:-1:17]
c = arr[:, 6:-1:17]

This is writing a file called result.dat that contains all of the data from the multiple files and then I extract the columns I need. However, this is taking a long time to create the array because it is writing all of the information that I do not need as well. Is there a way to only read in the specific columns I am interested instead into the result.dat file? This should cut down the time significantly.

Upvotes: 3

Views: 2217

Answers (3)

hpaulj
hpaulj

Reputation: 231385

loadtxt takes any iterable, including a generator. You could iterate through the files, but feed them directly to loadtxt rather than write an intermediate file. No guarantee that it will save much time, but it may be worth the experiment.

Here's my test:

def foo(filelist):
    for name in filelist:
        with open(name) as f:
            for line in f:
                yield line

A simple test file

In [71]: cat urls.txt
one.com
two.url
three.four

Read it 2 times with foo:

In [72]: list(foo(['urls.txt','urls.txt']))
Out[72]: 
['one.com\n',
 'two.url\n',
 'three.four\n',
 'one.com\n',
 'two.url\n',
 'three.four\n']

use it in loadtxt:

In [73]: np.loadtxt(foo(['urls.txt','urls.txt']),dtype=str,delimiter='.',usecols=[1])
Out[73]: 
array(['com', 'url', 'four', 'com', 'url', 'four'], 
      dtype='|S4')

The other approach is to load each file, collect the arrays in a list, and concatenate them.

Your 'usecols' with 'use 3 skip 14' is a little awkward to construct. Using 3 slices as you do is a nice idea, but you don't want to do that with loadtxt.

np.r_ might make the task easier:

In [81]: np.r_[4:100:17, 5:100:17, 6:100:17]
Out[81]: 
array([ 4, 21, 38, 55, 72, 89,  5, 22, 39, 56, 73, 90,  6, 23, 40, 57, 74,
       91])

In [82]: np.sort(np.r_[4:100:17, 5:100:17, 6:100:17])
Out[82]: 
array([ 4,  5,  6, 21, 22, 23, 38, 39, 40, 55, 56, 57, 72, 73, 74, 89, 90,
       91])

usecols does not have to be sorted, so you could use either.

Upvotes: 0

MaxNoe
MaxNoe

Reputation: 14997

numpy.loadtxt is a pure python implementation, which makes is somehow slow. Using pandas.read_csv() is way faster. You also do not need to write another file with the complete content (if you do not need this file for other purposes).

This is equivalent code using pandas.read_csv:

import glob
import pandas as pd

file_list = glob.glob('*.dat')
cols = [4, 21, 38] # add more columns here

df = pd.DataFrame()

for f in file_list:
    df = df.append(
        pd.read_csv(f, delimiter='\s+', header=None, usecols=cols),
        ignore_index=True,    
    )

arr = df.values

The equivalent numpy code would be:

import glob
import numpy as np

file_list = glob.glob('*.dat')
cols = [0, 1, 2]  # add more columns here

data = []
for f in file_list:
    data.append(np.loadtxt(f, usecols=cols))

arr = np.vstack(data)

If timed both with 10 files of random numbers with shape (10000, 10).

pandas solution: 0.95 s

numpy solution: 2.6 s

Upvotes: 1

mrucci
mrucci

Reputation: 4470

The numpy.loadtxt function accepts an optional usecols parameter.

You can generate column indices via:

usecols=set(xrange(4, num_col, 17)) | set(xrange(5, num_col, 17)) | set(xrange(6, num_col, 17))

Upvotes: 0

Related Questions