user1265669

Reputation: 65

Writing empty columns in python

I have the following two types of txt files:

File1

Sample1012, Male, 36, Stinky, Bad Hair
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me
Sample23905, Female, 42, Cougar, Long Hair, Chub
Sample123, Male, 32, Party Guy

File2

DEAD, Sample123, Car Accident, Drunk, Dumb
ALIVE, Sample1012, Alone
ALIVE, Sample23905, STD
DEAD, Sample1043, Too Hot, Exploded

I just want to write a simple Python script to join these files on the sample field, but I keep running into a problem with the varying number of data columns. For instance, I end up with:

Sample1012, Male, 36, Stinky, Bad Hair, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, DEAD, Sample123, Car Accident, Drunk, Dumb

When what I want is:

Sample1012, Male, 36, Stinky, Bad Hair, EMPTY COLUMN, EMPTY COLUMN, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, EMPTY COLUMN, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, EMPTY COLUMN, EMPTY COLUMN, EMPTY COLUMN, DEAD, Sample123, Car Accident, Drunk, Dumb

I'm basically just reading in both files with .readlines() and then comparing the sample ID column of each with a simple "==". If they match, it prints the line from the first file followed by the line from the second.

I'm not sure how to use len() to determine the maximum number of columns in file1, so that I can pad each shorter line up to that maximum before appending the line from the other file (provided the "==" is true).

Any help greatly appreciated.

UPDATE:

This is what I got now:

import sys
import csv

usage = "usage: python Integrator.py <table_file> <project_file> <outfile>"
if len(sys.argv) != 4:
    print usage
    sys.exit(0)

project = open(sys.argv[1], "rb")
table = open(sys.argv[2], "rb").readlines()
outfile = open(sys.argv[3], "w")

table[0] = "Total Table Output \n"

newtablefile = open(sys.argv[2], "w")
for line in table:
    newtablefile.write(line)

projectfile = csv.reader(project, delimiter="\t")
newtablefile = csv.reader(table, delimiter="\t")

result = []

for p in projectfile:
    print p
    for t in newtablefile:
        #print t
        if p[1].strip() == t[0].strip():
            del t[0]
            load = p + t
            result.append(load)


for line in result:
    outfile.write(line)

outfile.close()

I can't get the for loops to work together. Don't mind the dumb stuff at the top; one of the files has a blank first line.

Upvotes: 2

Views: 2736

Answers (4)

Paul Becotte

Reputation: 9977

Not sure where the "empty columns" come from in your suggested output... if the columns are supposed to match up to a defined pattern, then you have to have blank spots in the input files. Otherwise, this will work...

import csv

f1 = open("test1.txt", 'rb')
reader1 = csv.reader(f1)
f2 = open("test2.txt", 'rb')
# A csv.reader can only be iterated once, so keep the rows of file 2 in a list
rows2 = list(csv.reader(f2))
result = []

for entry in reader1:
    for row in rows2:
        if entry[0].strip() == row[1].strip():
            # join the rows, leaving out the duplicate sample ID from file 2
            load = entry + row[:1] + row[2:]
            result.append(load)

f1.close()
f2.close()

for line in result:
    print line

EDIT -

If you need to skip a line in one of the files, you can just call reader1.next() (or next(reader1), which also works in Python 3). It moves the reader on to the next row of input.

In your example, you create an output file, write data to it, and then try to read it without ever closing and reopening it, or opening it as both readable and writeable. I couldn't swear to it, but I think that is likely your problem. Fortunately, you don't need to do any of that anyway if you use the .next() method.
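To make the skipping step concrete, here is a minimal sketch. It assumes Python 3 (where the reader no longer has a .next() method, so the built-in next() is used) and uses io.StringIO objects with made-up contents as stand-ins for the real files. It also keeps the second file's rows in a list, because a csv.reader can only be consumed once.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two files; the first starts with a blank line.
file1 = io.StringIO("\nSample1012, Male, 36\nSample123, Male, 32\n")
file2 = io.StringIO("DEAD, Sample123, Car Accident\nALIVE, Sample1012, Alone\n")

reader1 = csv.reader(file1, skipinitialspace=True)
next(reader1, None)  # skip the blank first line; the default avoids StopIteration on an empty file

# A csv.reader is exhausted after one pass, so store the inner rows in a list
# if they have to be scanned again for every outer row.
rows2 = list(csv.reader(file2, skipinitialspace=True))

result = []
for entry in reader1:
    for row in rows2:
        if entry[0] == row[1]:
            # drop the duplicate sample ID from the second file's row
            result.append(entry + row[:1] + row[2:])

for line in result:
    print(", ".join(line))
```

With real files, the two StringIO objects would be replaced by open(..., newline="") handles.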

Upvotes: 1

Mike Housky

Reputation: 4069

You can get the whole file into a list of lists, then find the maximum number of fields using:

file1 = open("file1.txt")
# strip the trailing newline so it does not end up in the middle of a joined row
list1 = [s.rstrip("\n").split(",") for s in file1]
file1.close()
maxlen1 = max(len(x) for x in list1)

A dictionary is the best structure for lookups on the second file:

file2 = open("file2.txt")
dict2 = { }
for line2 in file2:
    cols2 = line2.rstrip("\n").split(",")
    # strip the key so that " Sample123" matches "Sample123" from the first file
    dict2[cols2[1].strip()] = cols2
file2.close()

Now, if cols1 is any list of columns from list1, then you can use:

cols3 = cols1 + (maxlen1 - len(cols1))*[" EMPTY COLUMN"] + dict2[cols1[0]]

...to create a list padded with " EMPTY COLUMN" values as needed. Now you can convert that back to a single string with:

",".join(cols3)

I didn't attempt to trim the strings, so you'll end up with the same spaces after commas as before. One small problem is that there is no space before "DEAD,", "ALIVE,", etc. You can add one when creating dict2, or when extracting to form cols3.

No file I/O error handling, either. Snippets is snippets.
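Putting the snippets together, a minimal end-to-end sketch might look like the following. It assumes Python 3; join_padded and its filler parameter are hypothetical names, the lines are passed in as lists of strings rather than read from files, and, unlike the snippets above, every field is trimmed.

```python
def join_padded(lines1, lines2, filler="EMPTY COLUMN"):
    """Join rows of lines1 to rows of lines2 on the sample ID, padding lines1 rows."""
    # Split each non-blank line on commas and trim whitespace from every field.
    rows1 = [[f.strip() for f in line.split(",")] for line in lines1 if line.strip()]
    rows2 = [[f.strip() for f in line.split(",")] for line in lines2 if line.strip()]

    maxlen1 = max(len(row) for row in rows1)    # widest row in the first file
    by_sample = {row[1]: row for row in rows2}  # second column is the sample ID

    joined = []
    for cols1 in rows1:
        padding = [filler] * (maxlen1 - len(cols1))
        # Assumption: a sample missing from the second file adds no extra columns.
        joined.append(", ".join(cols1 + padding + by_sample.get(cols1[0], [])))
    return joined


file1_lines = [
    "Sample1012, Male, 36, Stinky, Bad Hair",
    "Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me",
]
file2_lines = [
    "ALIVE, Sample1012, Alone",
    "DEAD, Sample1043, Too Hot, Exploded",
]
for line in join_padded(file1_lines, file2_lines):
    print(line)
```

Note that max() raises ValueError if the first file has no non-blank lines, so real code would want to check for that case.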

Upvotes: 0

Ashwini Chaudhary

Reputation: 250981

with open('file1') as f1, open('file2') as f2:
    dic = {}
    #Store the data from file2 in a dictionary, with second column as key
    for line in f2:
        data = line.strip().split(', ')
        key = data[1]
        dic[key] = data
    #now iterate over each line in file1
    for line in f1:
        data = line.strip().split(', ')
        #number of empty columns = `(7-len(data))`
        data = data + ['EMPTY COLUMN']*(7-len(data))
        print '{}, {}'.format(", ".join(data), ', '.join(dic[data[0]]))

output:

Sample1012, Male, 36, Stinky, Bad Hair, EMPTY COLUMN, EMPTY COLUMN, ALIVE, Sample1012, Alone
Sample1043, Female, 28, Hot, Short Hair, Hot Body, Hates Me, DEAD, Sample1043, Too Hot, Exploded
Sample23905, Female, 42, Cougar, Long Hair, Chub, EMPTY COLUMN, ALIVE, Sample23905, STD
Sample123, Male, 32, Party Guy, EMPTY COLUMN, EMPTY COLUMN, EMPTY COLUMN, DEAD, Sample123, Car Accident, Drunk, Dumb
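The code above hard-codes 7 as the widest row. A variation, sketched here under the assumption of Python 3 and the standard csv module, computes the width from the data and writes genuinely empty fields instead of the text "EMPTY COLUMN". The io.StringIO objects and their contents are stand-ins for the real input and output files.

```python
import csv
import io

# Hypothetical in-memory stand-ins for file1 and file2.
src1 = io.StringIO(
    "Sample23905, Female, 42, Cougar, Long Hair, Chub\n"
    "Sample123, Male, 32, Party Guy\n"
)
src2 = io.StringIO(
    "DEAD, Sample123, Car Accident, Drunk, Dumb\n"
    "ALIVE, Sample23905, STD\n"
)

# skipinitialspace=True drops the space that follows each comma.
rows1 = list(csv.reader(src1, skipinitialspace=True))
lookup = {row[1]: row for row in csv.reader(src2, skipinitialspace=True)}

width = max(len(row) for row in rows1)  # replaces the hard-coded 7

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in rows1:
    padded = row + [""] * (width - len(row))  # "" is written as an empty field
    writer.writerow(padded + lookup[row[0]])  # raises KeyError if a sample is missing

print(out.getvalue(), end="")
```

Empty fields tend to be easier for spreadsheets and other CSV consumers to handle than a placeholder string, but which one is wanted depends on what reads the output.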

Upvotes: 0

adarsh

Reputation: 6978

Hmm, you should probably use an RDBMS for efficiency, but you can also do this more cleanly with dictionaries.

When you read each line of the first file with readline(), split it on commas, use the first field (everything before the first comma) as the key, and use the full list of fields as the value.

So you get something like:

{'Sample1012': ['Sample1012', 'Male', '36', 'Stinky', 'Bad Hair']}

Now do the same with the other file (keyed on its sample ID column), and then simply:

for key in dict1:
    dict1[key] += dict2.get(key, [])

This appends all the corresponding fields to the lists in the first dictionary.

It just makes life easier for you.
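As a rough illustration of the dictionary approach, assuming Python 3: index_by is a hypothetical helper, and the input lines are taken from the question's sample data rather than read from files.

```python
def index_by(lines, key_column):
    """Map the field in key_column to the full list of trimmed fields."""
    table = {}
    for line in lines:
        cols = [c.strip() for c in line.split(",")]
        table[cols[key_column]] = cols
    return table


# The sample ID is column 0 in the first file and column 1 in the second.
dict1 = index_by(
    ["Sample1012, Male, 36, Stinky, Bad Hair", "Sample123, Male, 32, Party Guy"], 0
)
dict2 = index_by(["ALIVE, Sample1012, Alone"], 1)

for key in dict1:
    # .get() with a default keeps rows that have no match in the second file
    dict1[key] += dict2.get(key, [])

print(dict1["Sample1012"])
print(dict1["Sample123"])
```

Because dict.get() falls back to an empty list, a sample that appears only in the first file is kept unchanged instead of raising KeyError.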

Upvotes: 0
