Reputation: 65
I have the following two types of txt files:
File1
Sample1012, Male, 36, Stinky, Bad Hair
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me
Sample23905, Female, 42, Cougar, Long Hair, Chub
Sample123, Male, 32, Party Guy
File2
DEAD, Sample123, Car Accident, Drunk, Dumb
ALIVE, Sample1012, Alone
ALIVE, Sample23905, STD
DEAD, Sample1043, Too Hot, Exploded
I just want to write a simple Python script to join these files on the sample field, but I keep running into a problem with the varying number of data columns. For instance, I end up with:
Sample1012, Male, 36, Stinky, Bad Hair, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, DEAD, Sample123, Car Accident, Drunk, Dumb
When what I want is:
Sample1012, Male, 36, Stinky, Bad Hair, EMPTY COLUMN, EMPTY COLUMN, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, EMPTY COLUMN, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, EMPTY COLUMN, EMPTY COLUMN, EMPTY COLUMN, DEAD, Sample123, Car Accident, Drunk, Dumb
I'm basically just reading in both files with .readlines() and then comparing the sample ID column of each with a simple "==". If they match, I print the line from the first file followed by the line from the second.
I'm not sure how to use len() to determine the maximum number of columns in file1, so that I can pad each shorter line out to that maximum before appending the line from the other file (provided the "==" is true).
Any help greatly appreciated.
UPDATE:
This is what I got now:
import sys
import csv
usage = "usage: python Integrator.py <table_file> <project_file> <outfile>"
if len(sys.argv) != 4:
    print usage
    sys.exit(0)
project = open(sys.argv[1], "rb")
table = open(sys.argv[2], "rb").readlines()
outfile = open(sys.argv[3], "w")
table[0] = "Total Table Output \n"
newtablefile = open(sys.argv[2], "w")
for line in table:
    newtablefile.write(line)
projectfile = csv.reader(project, delimiter="\t")
newtablefile = csv.reader(table, delimiter="\t")
result = []
for p in projectfile:
    print p
    for t in newtablefile:
        #print t
        if p[1].strip() == t[0].strip():
            del t[0]
            load = p + t
            result.append(load)
for line in result:
    outfile.write(line)
outfile.close()
I can't get the for loops to work together. Don't mind the dumb stuff at the top; one of the files has a blank first line.
Upvotes: 2
Views: 2736
Reputation: 9977
Not sure where the "empty columns" come from in your suggested output... if the columns are supposed to match up to a defined pattern, then you have to have blank spots in the input files. Otherwise, this will work...
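import csv
f1 = open("test1.txt", 'rb')
reader1 = csv.reader(f1)
f2 = open("test2.txt", 'rb')
# Read file 2 into a list: a csv.reader can only be iterated once,
# so looping over it directly would only work for the first entry.
rows2 = list(csv.reader(f2))
result = []
for entry in reader1:
    for row in rows2:
        if entry[0].strip() == row[1].strip():
            # Drop the duplicate sample ID without modifying the stored row.
            load = entry + row[:1] + row[2:]
            result.append(load)
f1.close()
f2.close()
for line in result:
    print line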
EDIT -
If you need to skip a line in one of the files, you can just call reader1.next(), which advances the reader to the next row of input.
In your example you create an output file, write data to it, and then try to read it without ever closing and reopening it, or opening it as both readable and writable. I couldn't swear to it, but I think that is likely your problem. Fortunately, you don't need to do any of that anyway with the .next() method.
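As a minimal sketch of skipping a blank or header first line: this is written for Python 3, where the method call reader1.next() is spelled as the built-in next(reader1). The function name skip_first_row and the sample data are made up for illustration.

```python
import csv
import io

def skip_first_row(lines):
    """Return the parsed rows of `lines`, ignoring the first row."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(lines, skipinitialspace=True)
    # next() advances the reader by one row; the None default avoids
    # a StopIteration error if the input is completely empty.
    next(reader, None)
    return list(reader)

# Hypothetical input: a blank first line followed by two data rows.
sample = io.StringIO("\nSample1012, Male, 36\nSample123, Male, 32\n")
rows = skip_first_row(sample)
```

Because the first row is consumed from the reader itself, there is no need to rewrite the input file just to get rid of that line.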
Upvotes: 1
Reputation: 4069
You can get the whole file into a list of lists, then find the maximum number of fields using:
file1 = open("file1.txt")
# Strip the trailing newline so it doesn't end up inside the last field.
list1 = [s.rstrip("\n").split(",") for s in file1]
file1.close()
maxlen1 = max(len(x) for x in list1)
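A dictionary is the best structure for lookups on the second file:
file2 = open("file2.txt")
dict2 = {}
for line2 in file2:
    cols2 = line2.rstrip("\n").split(",")
    # Strip the key: after split(",") it has a leading space (" Sample123"),
    # which would never match the unpadded IDs from the first file.
    dict2[cols2[1].strip()] = cols2
file2.close()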
Now, if cols1 is any list of columns from list1, then you can use:
cols3 = cols1 + (maxlen1 - len(cols1))*[" EMPTY COLUMN"] + dict2[cols1[0]]
...to create a list padded with " EMPTY COLUMN" values as needed. Now you can convert that back to a single string with:
",".join(cols3)
I didn't attempt to trim the strings, so you'll end up with the same spaces after commas as before. There's also a small problem: there is no space before "DEAD", "ALIVE", etc., so the output has no space where the two halves meet. You can fix that when creating dict2, or when extracting its values to form cols3.
No file I/O error handling, either. Snippets is snippets.
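Putting those pieces together, here is a rough sketch (Python 3) that also trims every field, so the spacing problem goes away. The function name join_padded and its filler parameter are my own, and it assumes the sample ID is column 0 in the first file and column 1 in the second, as in the question's data.

```python
def join_padded(lines1, lines2, filler="EMPTY COLUMN"):
    """Join rows of lines1 to rows of lines2 on the sample ID."""
    # Split each non-blank line into trimmed columns.
    rows1 = [[c.strip() for c in line.split(",")]
             for line in lines1 if line.strip()]
    # Index the second file by sample ID (its second column).
    by_id = {}
    for line in lines2:
        if line.strip():
            cols = [c.strip() for c in line.split(",")]
            by_id[cols[1]] = cols
    # Widest row in the first file; 0 if the file has no rows.
    width = max((len(r) for r in rows1), default=0)
    joined = []
    for row in rows1:
        padding = [filler] * (width - len(row))
        # IDs with no match in the second file get nothing appended.
        joined.append(", ".join(row + padding + by_id.get(row[0], [])))
    return joined

file1 = [
    "Sample1012, Male, 36, Stinky, Bad Hair\n",
    "Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me\n",
    "Sample123, Male, 32, Party Guy\n",
]
file2 = [
    "DEAD, Sample123, Car Accident, Drunk, Dumb\n",
    "ALIVE, Sample1012, Alone\n",
    "DEAD, Sample1043, Too Hot, Exploded\n",
]
for line in join_padded(file1, file2):
    print(line)
```

Passing open file objects instead of the lists works the same way, since the function only iterates over lines.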
Upvotes: 0
Reputation: 250981
with open('file1') as f1, open('file2') as f2:
    dic = {}
    # Store the data from file2 in a dictionary, with the second column as key
    for line in f2:
        data = line.strip().split(', ')
        key = data[1]
        dic[key] = data
    # Now iterate over each line in file1
    for line in f1:
        data = line.strip().split(', ')
        # Number of empty columns = (7 - len(data))
        data = data + ['EMPTY COLUMN']*(7-len(data))
        print '{}, {}'.format(", ".join(data), ', '.join(dic[data[0]]))
output:
Sample1012, Male, 36, Stinky, Bad Hair, EMPTY COLUMN, EMPTY COLUMN, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, EMPTY COLUMN, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, EMPTY COLUMN, EMPTY COLUMN, EMPTY COLUMN, DEAD, Sample123, Car Accident, Drunk, Dumb
Upvotes: 0
Reputation: 6978
Hmm, you should probably use an RDBMS for efficiency, but you can also do this better with dictionaries.
When you use readline() on the first file, split off everything before the first comma and use that as the key, with the list of columns as the value.
So something like:
{'Sample1012': ['Sample1012', 'Male', '36', 'Stinky', 'Bad Hair']}
Now you can do the same with the other file (keyed on its second column, where the sample ID is), and then simply:
for key in dict1:
    dict1[key] += dict2.get(key, [])
This appends all the corresponding fields from the second file onto the entries in the first dictionary.
This just makes life easier for you.
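A small sketch of that idea in Python 3, in case it helps. The helper name index_by is hypothetical, and Sample999 is an invented row added only to show what happens when an ID is missing from the second file.

```python
def index_by(lines, key_col):
    """Map the value in column key_col to the full list of columns."""
    index = {}
    for line in lines:
        cols = [c.strip() for c in line.split(",")]
        index[cols[key_col]] = cols
    return index

file1_lines = [
    "Sample1012, Male, 36, Stinky, Bad Hair",
    "Sample999, Female, 50",  # invented row with no match in file 2
]
file2_lines = ["ALIVE, Sample1012, Alone"]

dict1 = index_by(file1_lines, 0)  # sample ID is the first column
dict2 = index_by(file2_lines, 1)  # sample ID is the second column
for key in dict1:
    # .get returns [] for IDs missing from file 2, so nothing is appended.
    dict1[key] += dict2.get(key, [])
```

Note that this merge does not pad short rows; it only shows the lookup, which stays a single dictionary access per sample instead of a scan over the whole second file.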
Upvotes: 0