Reputation: 157
I have about 30 data files and I need to extract the 4th, 5th, and 6th columns. Then skip 14 columns and grab the next 3 columns and so on till the end of the file. Each data file is about 400 rows and 17000 columns. So far I have this:
import glob
import numpy as np

file_list = glob.glob('*.dat')
with open("result.dat", "wb") as outfile:
    for f in file_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
data = np.loadtxt('result.dat')
arr = np.array(data)
a = arr[:, 4:-1:17]
b = arr[:, 5:-1:17]
c = arr[:, 6:-1:17]
This writes a file called result.dat containing all of the data from the multiple files, and then I extract the columns I need. However, it takes a long time to create the array because it writes all of the information I do not need as well. Is there a way to read only the specific columns I am interested in, instead of writing everything into result.dat first? That should cut down the time significantly.
Upvotes: 3
Views: 2217
Reputation: 231385
loadtxt takes any iterable, including a generator. You could iterate through the files but feed them directly to loadtxt rather than writing an intermediate file. No guarantee that it will save much time, but it may be worth the experiment.
Here's my test:
def foo(filelist):
    for name in filelist:
        with open(name) as f:
            for line in f:
                yield line
A simple test file
In [71]: cat urls.txt
one.com
two.url
three.four
Read it 2 times with foo:
In [72]: list(foo(['urls.txt','urls.txt']))
Out[72]:
['one.com\n',
'two.url\n',
'three.four\n',
'one.com\n',
'two.url\n',
'three.four\n']
Use it in loadtxt:
In [73]: np.loadtxt(foo(['urls.txt','urls.txt']),dtype=str,delimiter='.',usecols=[1])
Out[73]:
array(['com', 'url', 'four', 'com', 'url', 'four'],
dtype='|S4')
The other approach is to load each file, collect the arrays in a list, and concatenate them.
Your usecols with 'use 3, skip 14' is a little awkward to construct. Using 3 slices as you do is a nice idea, but you don't want to do that with loadtxt. np.r_ might make the task easier:
In [81]: np.r_[4:100:17, 5:100:17, 6:100:17]
Out[81]:
array([ 4, 21, 38, 55, 72, 89, 5, 22, 39, 56, 73, 90, 6, 23, 40, 57, 74,
91])
In [82]: np.sort(np.r_[4:100:17, 5:100:17, 6:100:17])
Out[82]:
array([ 4, 5, 6, 21, 22, 23, 38, 39, 40, 55, 56, 57, 72, 73, 74, 89, 90,
91])
usecols does not have to be sorted, so you could use either.
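Putting those pieces together, here's a minimal sketch of how the generator and np.r_ could be combined (my assumptions: the files are whitespace-delimited and have 17000 columns, as described in the question):

import glob
import numpy as np

def foo(filelist):
    # yield the lines of every file in turn, so loadtxt sees one long stream
    for name in filelist:
        with open(name) as f:
            for line in f:
                yield line

file_list = glob.glob('*.dat')
num_col = 17000   # total number of columns, per the question
cols = np.sort(np.r_[4:num_col:17, 5:num_col:17, 6:num_col:17])
arr = np.loadtxt(foo(file_list), usecols=cols)

Whether this actually beats writing result.dat first is worth timing, but it at least avoids the intermediate copy.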
Upvotes: 0
Reputation: 14997
numpy.loadtxt is a pure Python implementation, which makes it somewhat slow. Using pandas.read_csv() is way faster. You also do not need to write another file with the complete content (unless you need that file for other purposes).
This is equivalent code using pandas.read_csv:
import glob
import pandas as pd
file_list = glob.glob('*.dat')
cols = [4, 21, 38] # add more columns here
df = pd.DataFrame()
for f in file_list:
    df = df.append(
        pd.read_csv(f, delimiter='\s+', header=None, usecols=cols),
        ignore_index=True,
    )
arr = df.values
The equivalent numpy code would be:
import glob
import numpy as np
file_list = glob.glob('*.dat')
cols = [0, 1, 2] # add more columns here
data = []
for f in file_list:
    data.append(np.loadtxt(f, usecols=cols))
arr = np.vstack(data)
I timed both with 10 files of random numbers with shape (10000, 10):
pandas solution: 0.95 s
numpy solution: 2.6 s
Upvotes: 1
Reputation: 4470
The numpy.loadtxt function accepts an optional usecols parameter.
You can generate column indices via:
usecols=set(xrange(4, num_col, 17)) | set(xrange(5, num_col, 17)) | set(xrange(6, num_col, 17))
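A short sketch of what that could look like (assumptions on my part: num_col holds the total column count from the question, range replaces xrange on Python 3, and 'somefile.dat' is just a placeholder name):

import numpy as np

num_col = 17000   # total number of columns, per the question
usecols = sorted(set(range(4, num_col, 17)) |
                 set(range(5, num_col, 17)) |
                 set(range(6, num_col, 17)))
arr = np.loadtxt('somefile.dat', usecols=usecols)   # placeholder filename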
Upvotes: 0