Reputation: 3024
Input:
A B C
D E F
This file is NOT exclusively tab-delimited: some entries are space-delimited so that they merely look tab-delimited (which is annoying). I tried reading the file with the csv module using the canonical tab delimiter, hoping it wouldn't mind a few spaces. Needless to say, my output came out botched with this code:
import csv

with open('file.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        print(row)
I then tried replacing the second line with csv.reader('\t'.join(f.split())) to take advantage of "Remove whitespace in Python using string.whitespace", but I got this error: AttributeError: 'file' object has no attribute 'split'.
I also looked at "Can I import a CSV file and automatically infer the delimiter?", but there the OP imported files that were either semicolon-delimited or comma-delimited, not a file with a random mixture of both kinds of delimiters.
I was wondering whether the csv module can handle reading files with a mix of delimiters, or whether I should try a different approach (e.g., not use the csv module).
I am hoping that there exists a way to read in a file with a mixture of delimiters and automatically turn this file into a tab-delimited file.
Upvotes: 5
Views: 6360
Reputation: 127
.split() is a nice, easy solution when consecutive, arbitrarily mixed tabs and blanks act as a single delimiter. However, it does not work when a value containing a blank (enclosed in quotation marks) appears.
First, we can replace each tab in the text file with one blank ' '. This reduces the problem to "an arbitrary number of consecutive blanks acts as a single delimiter".
There is a good example of replacing a pattern throughout a file:
https://www.safaribooksonline.com/library/view/python-cookbook/0596001673/ch04s04.html
Note 1: DO NOT replace tabs with '' (the empty string), because a delimiter may consist ONLY of tabs.
Note 2: This approach DOES NOT work when a tab character (\t) appears inside a value enclosed in quotation marks.
Then we can use Python's csv module with ' ' (one blank) as the delimiter and skipinitialspace=True to ignore consecutive blanks.
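The two steps above might look roughly like this. It is a minimal sketch, not a tested recipe: the function name read_blank_delimited and the sample lines are made up for illustration, and the strip() call is my own addition, on the assumption that blanks at the very start or end of a line should not count as fields.

```python
import csv

def read_blank_delimited(lines):
    """Parse lines whose fields are separated by runs of tabs and/or blanks."""
    # Step 1: turn every tab into one blank and drop blanks at the line ends
    # (the strip() is an assumption, not part of the original description).
    spaced = (line.replace('\t', ' ').strip() for line in lines)
    # Step 2: one blank is the delimiter; skipinitialspace=True makes the
    # reader ignore the extra blanks that follow it.
    reader = csv.reader(spaced, delimiter=' ', skipinitialspace=True)
    return [row for row in reader]

# Hypothetical sample: mixed tabs and blanks, plus one quoted value with a blank.
sample = ['A\tB\tC\n', 'D E F\n', '"G H"  I\t\tJ\n']
rows = read_blank_delimited(sample)
print(rows)
```

The quoted value "G H" should survive as a single field here, which is the case plain .split() cannot handle.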
Upvotes: 0
Reputation: 103744
Just use .split():
text = '''\
A\tB\tC
D E F
'''
data = []
for line in text.splitlines():
    data.append(line.split())
print(data)
# [['A', 'B', 'C'], ['D', 'E', 'F']]
Or, more succinctly:
>>> [line.split() for line in text.splitlines()]
[['A', 'B', 'C'], ['D', 'E', 'F']]
For a file, something like:
with open(fn, 'r') as fin:
    data = [line.split() for line in fin]
It works because str.split() with no arguments splits on any run of whitespace between data elements, even if the run is longer than one character or mixes tabs and spaces:
>>> '1\t\t\t2 3\t \t \t4'.split()
['1', '2', '3', '4']
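Since the question also asks for the file to be turned into a genuinely tab-delimited one, the split rows can be handed to csv.writer. The following is only a sketch: normalize_to_tabs is a made-up name, the in-memory io.StringIO objects stand in for real files, and it assumes no field contains whitespace of its own.

```python
import csv
import io

def normalize_to_tabs(src, dst):
    """Copy src to dst, rewriting each line as tab-delimited fields."""
    writer = csv.writer(dst, delimiter='\t', lineterminator='\n')
    for line in src:
        fields = line.split()  # splits on any run of tabs and/or spaces
        if fields:             # skip lines that are empty or only whitespace
            writer.writerow(fields)

# In-memory stand-ins for the input and output files.
src = io.StringIO('A\tB\tC\nD E F\n')
dst = io.StringIO()
normalize_to_tabs(src, dst)
print(repr(dst.getvalue()))
```

With real files, the same function can be called with two handles from open(); the output file is best opened with newline='' as the csv documentation recommends.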
Upvotes: 6
Reputation: 5373
Why not just roll your own splitter rather than using the csv module?
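delimiters = [',', ' ', '\t']
unique = '[**This is a unique delimiter**]'
with open(fileName) as f:
    for l in f:
        l = l.rstrip('\n')  # drop the newline so it doesn't end up in the last field
        # Replace every known delimiter with the one unique marker...
        for d in delimiters:
            l = unique.join(l.split(d))
        # ...then split on that marker (consecutive delimiters give empty strings).
        row = l.split(unique)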
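If a regular expression is acceptable, the same idea can probably be written with re.split and no placeholder string. This is a sketch under my own assumptions: the delimiter set (commas plus any whitespace) and the name split_mixed are illustrative, and unlike the loop above it treats a run of consecutive delimiters as one.

```python
import re

# Assumed delimiter set: commas, blanks and tabs (any whitespace).
DELIMITERS = re.compile(r'[,\s]+')

def split_mixed(line):
    """Split a line on any run of commas and/or whitespace."""
    # strip() removes the trailing newline and any surrounding whitespace first.
    return DELIMITERS.split(line.strip())

print(split_mixed('A,B\tC  D\n'))
```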
Upvotes: 1