n1k31t4
n1k31t4

Reputation: 2874

Read and export a single column from a tab-separated file in Python

I have a many large tab-separated files saved as .txt, which each have seven columns with the following headers:

#column_titles = ["col1", "col2", "col3", "col4", "col5", "col6", "text"]    

I would like to simply extract the final column named text and save it into a new file with each row being a row from the original file, while are all strings.

EDIT: This is not a duplicate of a similar problem, as splitlines() was not necessary in my case. Only the order of things needed to be improved

Based on -several - other - posts, here is my current attempt:

import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    reader = csv.reader(to_read, delimiter = "\t")

    desired_column = [6]        # text column

    for row in reader:
    myColumn = list(row[i] for i in desired_column)

with open(output_file, "wb") as tmp_file:
    writer = csv.writer(tmp_file)

for row in myColumn:
    writer.writerow(row)

What I am getting, is simply the text field from the 2624th row form my input file, with each of the letters in that string being separated out:

H,o,w, ,t,h,e, ,t.e.a.m, ,d,i,d, ,T,h,u,r,s,d,a,y, ,-, ,s,e,e , ,h,e,r,e

I know very little in the world of programming is random, but this is definitely strange!

This post is pretty similar to my needs, but misses the writing and saving parts, which I am also not sure about.

I have looked into using the pandas toolbox (as per one of those links above), but I am unable to due my Python installation, so please only solutions using csv or other built in modules!

Upvotes: 2

Views: 7046

Answers (2)

Serge Ballesta
Serge Ballesta

Reputation: 148965

You must process the file one row at a time: read, parse and write.

import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##
with open(input_file) as to_read:
    with open(output_file, "wb") as tmp_file:
        reader = csv.reader(to_read, delimiter = "\t")
        writer = csv.writer(tmp_file)

        desired_column = [6]        # text column

        for row in reader:     # read one row at a time
            myColumn = list(row[i] for i in desired_column)   # build the output row (process)
            writer.writerow(myColumn) # write it

Upvotes: 2

pausag
pausag

Reputation: 146

I would go for this simple solution:

    text_strings = [] # empty array to store the last column text
    with open('my_file') as ff:
        ss = ff.readlines() # read all strings in a string array 

    for s in ss:
        text_strings.append(s.split('\t')[-1]) # last column to the text array



    with open('out_file') as outf:
        outf.write('\n'.join(text_strings)) # write everything to output file

Using the list comprehension, you can translate the last columns of ss strings to text_strings faster and in one line:

    text_strings = [k.split("\t")[-1] for k in ss]

There are other simplifications possible, you get the idea)

The problem in your code appears at these two lines:

        for row in reader:
        myColumn = list(row[i] for i in desired_column)

First, there is no indentation, so there is nothing happening. Actually, on my computer, it throws an error, so there is a possibility that it is a typo. But in this case, at each step of the for-loop, you overwrite the myColumn value with that coming from the new row, hence in the end you have a string from the last row of the file. Second, list applied to a string (as in your code), converts the string to the list of chars:

    In [5]: s = 'AAAA'

    In [6]: list(s)
    Out[6]: ['A', 'A', 'A', 'A']

which is exactly what you see in the output.

Upvotes: 1

Related Questions