victoryNap

Reputation: 133

Combining files in python using

I am attempting to combine a collection of 600 text files, each of which looks like

Measurement title Measurement #1

ebv-miR-BART1-3p 4.60618701
....
evb-miR-BART1-200 12.8327289

with 250 or so rows in each file. Every file is formatted that way, with the same data headers. What I would like to do is combine the files so that the result looks like this

Measurement title Measurement #1 Measurement #2

ebv-miR-BART1-3p 4.60618701 4.110878867
....
evb-miR-BART1-200 12.8327289 6.813287556

I was wondering if there is an easy way in Python to strip out the second column of each file and append it to a master file. I was planning on pulling each line out, using regular expressions to find the second column, and appending it to the corresponding line in the master file. Is there something more efficient?

Upvotes: 2

Views: 176

Answers (3)

Lisa

Reputation: 3526

I don't have comment privileges yet, therefore a separate answer.

jsbueno's answer works really well as long as you're sure that the same measurement IDs occur in every file (order is not important, but the sets should be equal!).

In the following situation:

file1:
measID,meas1
a,1
b,2

file2:
measID,meas1
a,3
b,4
c,5

you would get:

outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,5

instead of the desired:

outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,,5        # measurement c was missing in file1!

I'm using commas instead of spaces as delimiters for better visibility.
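
One way to get the blank-filled output is to remember which file each value came from and pad the gaps when writing. This is only a minimal sketch, assuming comma-delimited two-column files with one header row, as in the example above; the function name merge_with_gaps and the generated meas1..measN column labels are made up for illustration.

```python
import csv


def merge_with_gaps(in_paths, out_path):
    """Merge two-column CSV files, leaving blanks for missing IDs."""
    # table[meas_id][file_index] = value read from that file
    table = {}
    for index, path in enumerate(in_paths):
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            for row in reader:
                if len(row) < 2:
                    continue  # ignore blank or malformed lines
                table.setdefault(row[0], {})[index] = row[1]

    n_files = len(in_paths)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # hypothetical header: one numbered column per input file
        writer.writerow(["measID"] + ["meas%d" % (i + 1) for i in range(n_files)])
        for meas_id in sorted(table):
            values = table[meas_id]
            # an ID absent from file i gets an empty cell in column i
            writer.writerow([meas_id] + [values.get(i, "") for i in range(n_files)])
```

Because every value is stored under its file index, a measurement that first appears in a later file still lands in the right column.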

Upvotes: 0

jsbueno

Reputation: 110476

This is a small amount of data for today's desktop computers (around 150,000 measurements), so keeping everything in memory and dumping it to a single file will be easier than any other strategy. If it did not fit in RAM, SQL might be a nice approach, but as it is you can create a single default dictionary where each element is a list, read all your files, collect the measurements into this dictionary, and dump it to disk:

# create default list dictionary:
>>> from collections import defaultdict
>>> data = defaultdict(list)
# Read your data into it:
>>> from glob import glob
>>> import csv
>>> for filename in glob("my_directory/*csv"):
...    # pass delimiter=" " to csv.reader if the columns are space-separated
...    reader = csv.reader(open(filename))
...    # throw away header row:
...    next(reader)
...    for name, value in reader:
...       data[name].append(value)
... 
>>> # and record everything down in another file:
... 
>>> mydata = open("mydata.csv", "wt")
>>> writer = csv.writer(mydata)
>>> for name, values in sorted(data.items()):
...    writer.writerow([name] + values)
... 
>>> mydata.close()
>>> 

Upvotes: 3

Tom Dalton

Reputation: 6190

Use the csv module to read the files in, create a dictionary keyed by measurement name, and make each value in the dictionary a list of that measurement's values from the files.
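
A rough sketch of that idea for the space-separated layout shown in the question. It assumes a single space between the two columns, one header line per file, and measurement names that contain no spaces; the function name collect_columns is hypothetical.

```python
import csv


def collect_columns(paths, delimiter=" "):
    """Return {measurement name: [value from each file, in order]}."""
    columns = {}
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter=delimiter)
            next(reader, None)  # skip the header line
            for row in reader:
                if len(row) != 2:
                    continue  # skip blank lines and anything unexpected
                name, value = row
                columns.setdefault(name, []).append(value)
    return columns
```

Each list can then be written out as one row with csv.writer, name first and values after it.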

Upvotes: 0
