Reputation: 2874
I have many large tab-separated files saved as .txt, each of which has seven columns with the following headers:
#column_titles = ["col1", "col2", "col3", "col4", "col5", "col6", "text"]
I would like to simply extract the final column, named text, and save it into a new file, with each row being a row from the original file; the values are all strings.
EDIT: This is not a duplicate of a similar problem, as splitlines() was not necessary in my case. Only the order of the steps needed to be improved.
Based on several other posts, here is my current attempt:
import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##

with open(input_file) as to_read:
    reader = csv.reader(to_read, delimiter="\t")
    desired_column = [6]    # text column
    for row in reader:
        myColumn = list(row[i] for i in desired_column)

with open(output_file, "wb") as tmp_file:
    writer = csv.writer(tmp_file)
    for row in myColumn:
        writer.writerow(row)
What I am getting is simply the text field from the 2624th row of my input file, with each of the letters in that string separated out:
H,o,w, ,t,h,e, ,t,e,a,m, ,d,i,d, ,T,h,u,r,s,d,a,y, ,-, ,s,e,e, ,h,e,r,e
I know very little in the world of programming is random, but this is definitely strange!
This post is pretty similar to my needs, but misses the writing and saving parts, which I am also not sure about.
I have looked into using the pandas toolbox (as per one of those links above), but I am unable to because of my Python installation, so please only suggest solutions using csv or other built-in modules!
Upvotes: 2
Views: 7046
Reputation: 148965
You must process the file one row at a time: read, parse and write.
import csv

# File names: to read in from and read out to
input_file = "tester_2014-10-30_til_2014-08-01.txt"
output_file = input_file + "-SA_input.txt"

## ==================== ##
##  Using module 'csv'  ##
## ==================== ##

with open(input_file) as to_read:
    with open(output_file, "wb") as tmp_file:
        reader = csv.reader(to_read, delimiter="\t")
        writer = csv.writer(tmp_file)
        desired_column = [6]    # text column
        for row in reader:      # read one row at a time
            myColumn = list(row[i] for i in desired_column)  # build the output row (process)
            writer.writerow(myColumn)                        # write it
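If you happen to be on Python 3 rather than Python 2, the csv module expects text-mode file objects opened with newline=""; a minimal sketch of the same idea, assuming the same file names, would be:
with open(input_file, newline="") as to_read, open(output_file, "w", newline="") as tmp_file:
    reader = csv.reader(to_read, delimiter="\t")
    writer = csv.writer(tmp_file)
    for row in reader:
        writer.writerow([row[6]])   # the "text" column, written as a single field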
Upvotes: 2
Reputation: 146
I would go for this simple solution:
text_strings = []                                          # empty list to store the last-column text
with open('my_file') as ff:
    ss = ff.readlines()                                    # read all lines into a list of strings
for s in ss:
    text_strings.append(s.rstrip('\n').split('\t')[-1])    # last column (newline stripped) into the list
with open('out_file', 'w') as outf:                        # open the output file for writing
    outf.write('\n'.join(text_strings))                    # write everything to the output file
Using a list comprehension, you can translate the last column of the ss strings into text_strings faster, and in one line:
text_strings = [k.rstrip("\n").split("\t")[-1] for k in ss]
There are other simplifications possible; you get the idea.
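One possible condensed sketch along those lines, reusing the same hypothetical my_file/out_file names: read, extract and write in a single pass, without keeping all the input lines in memory:
with open('my_file') as ff, open('out_file', 'w') as outf:
    outf.write('\n'.join(line.rstrip('\n').split('\t')[-1] for line in ff))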
The problem in your code appears at these two lines:
for row in reader:
    myColumn = list(row[i] for i in desired_column)
First, as originally posted there was no indentation under the for-loop, so nothing happened inside it; in fact, on my machine it throws an IndentationError, so that is probably just a typo. Assuming the line is indented, then at each step of the for-loop you overwrite the myColumn value with the one built from the new row, so in the end you are left with only the string from the last row of the file.
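For that first point, a minimal sketch of the fix (assuming the reader and desired_column set up as in the question) is to collect the value from every row instead of overwriting a single variable:
myColumn = []                    # one entry per input row
for row in reader:
    myColumn.append(row[6])      # keep only the "text" field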
Second, list applied to a string (as in your code) converts the string into a list of chars:
In [5]: s = 'AAAA'
In [6]: list(s)
Out[6]: ['A', 'A', 'A', 'A']
which is exactly what you see in the output.
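As a follow-up to this second point: csv.writer.writerow iterates over whatever sequence it is given, so passing a bare string produces one character per field, while wrapping the string in a one-element list writes it as a single field:
writer.writerow('see here')      # -> s,e,e, ,h,e,r,e   (one field per character)
writer.writerow(['see here'])    # -> see here          (one field)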
Upvotes: 1