jacky

Reputation: 524

Filter the lines of a file by an ID list without adding extra empty lines

I am reading a file, taking the first element of each line, and comparing it to my list. If it is found, I append the line to a new output file, which is supposed to have exactly the same structure as the input file.

my_id_list = [
    '4985439',
    '5605471',
    '6144703',
]

Input file:

4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733

My attempt:

import numpy as np

output_file = []
input_file = open('input_file', 'r')

for line in input_file:

    my_line = np.array(line.split())

    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')

Question is:

It currently adds an extra empty line after each line written to the output file. How can I fix this? Or is there a more efficient way to do it?

Update:

For this example, the output file should be:

4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507

Upvotes: 0

Views: 294

Answers (2)

thebjorn

Reputation: 27311

I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:

# The IDs are stored as strings because line.split() returns strings;
# a set is faster than a list for membership testing.
my_id_list = {'4985439', '5605471', '6144703'}

with open('input_file') as input_file:
    # Your problem is most likely related to line endings, so here
    # we read the input file into a list of lines with intact line endings.
    # To preserve the input exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)

with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)

Getting the first word of each line this way is wasteful, since this:

first_word = line.split()[0]

creates a list of all "words" in the line when we just need the first one.

If you know that the columns are separated by single spaces, you can make it more efficient by splitting only on the first space:

first_word = line.split(' ', 1)[0]
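Putting the pieces above together, here is a minimal sketch that streams the lines instead of reading the whole file into memory first. The function name `filter_lines_by_id` and the in-memory sample data are made up for illustration. It uses `split(None, 1)` (split on any run of whitespace, at most once) rather than `split(' ', 1)`; that is a substitution on my part so that tabs and blank lines are handled too:

```python
import io

def filter_lines_by_id(lines, wanted_ids):
    """Yield each line whose first whitespace-separated field is in wanted_ids.

    Lines are yielded unchanged, so their original line endings are kept.
    """
    for line in lines:
        # Split at most once; only the first field is needed.
        fields = line.split(None, 1)
        # Blank lines produce an empty list and are skipped.
        if fields and fields[0] in wanted_ids:
            yield line

# Hypothetical sample data standing in for 'input_file'.
sample = io.StringIO(
    "4985439 16:0.0719814\n"
    "5303698 6:0.09407 19:0.132581\n"
    "\n"
    "5605471 5:0.0486076\n"
)
wanted = {"4985439", "5605471"}  # strings, because split() returns strings

kept = list(filter_lines_by_id(sample, wanted))
```

With real files, you would pass the open input file as `lines` and hand the result to `output_file.writelines(...)`.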

Upvotes: 1

BioBot

Reputation: 31

Try something like this:

# read lines and strip the trailing newline from each
with open('input_file', 'r') as f:
    input_lines = [line.rstrip('\n') for line in f]

# collect all the lines whose first field is in your id list
# (my_id_list must hold strings; blank lines are skipped)
output_file = [line for line in input_lines
               if line.strip() and line.split()[0] in my_id_list]

# write to output file, one newline per line
with open('output_file', 'w') as f:
    f.writelines(line + '\n' for line in output_file)
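For what it's worth, the reason stripping helps: iterating over a file yields each line with its trailing `'\n'` still attached, and `np.savetxt` writes its own newline after every row, so each kept line ends up followed by two newlines. The sketch below mimics that with plain string concatenation instead of calling numpy; the sample text and variable names are made up for illustration:

```python
import io

# Hypothetical two-line input standing in for the real file.
source = io.StringIO("4985439 16:0.0719814\n5605471 5:0.0486076\n")
lines = list(source)  # each element still ends with '\n'

# Mimics a writer that adds its own newline after every row.
doubled = "".join(line + "\n" for line in lines)

# Stripping the existing newline first gives exactly one newline per line.
fixed = "".join(line.rstrip("\n") + "\n" for line in lines)
```

Here `doubled` has a blank line after every record, while `fixed` is identical to the input text.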

Upvotes: 2
