Rajeev

Reputation: 46919

Reading a tab separated file using delimiter

I am using the following code to read a tab-separated file. There are three columns in the file, but the first column is missing when I print just the header row. How can I include the first column, too?

f = open("/tmp/data.txt")
for l in f.readlines():
    print l.strip().split("\t")
    break
    f.close()

Output:

['session_id\t', '\tevent_id_concat'] 

The first column name is id, but it is not printed in the above array.

print l yields the following:

'id\tsession_id\tevent_id_concat\r\n'

Output:

['id\t', '\tevent_id_concat'] 

Upvotes: 9

Views: 130813

Answers (3)

NeilG

Reputation: 4160

The question is a simple old Python 2 scenario, so I hope the following is a more up-to-date and complete alternative to the other answers here.

I had been using the csv module to read a CSV file generated from an Excel document, but when that changed to a tab-delimited file from a similar source I couldn't see why the csv module was still necessary.

def read_rows(filename: str) -> list[dict[str, str]]:
    """Read TAB delimited file with header row and return rows."""
    with open(filename, newline="", encoding="utf-8") as tabfile:
        fieldnames = [field.strip() for field in next(tabfile).split("\t")]
        return [
            dict(zip(fieldnames, (field.strip() for field in line.split("\t"))))
            for line in tabfile
        ]


rows = read_rows("/home/user/in.txt")
# rows is now a list of dict keyed on the field names from the first row
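
As a quick demonstration (not part of the original answer; only the header comes from the question and the data line is made up for illustration), writing a two-line file and slurping it back gives:

import tempfile

# Header from the question plus one made-up data line for illustration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("id\tsession_id\tevent_id_concat\n")
    tmp.write("1\tabc123\tevent_a,event_b\n")

print(read_rows(tmp.name))
# [{'id': '1', 'session_id': 'abc123', 'event_id_concat': 'event_a,event_b'}]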

I'm interested to know why anyone would import the csv module just for this task.

To clarify the pre-conditions under which this approach is reliable: if the data has the following two characteristics, which is a common scenario, then the simple "dict zip split slurp" above should work:

  1. The tab character appears in the file only as a delimiter; tabs never appear in the field contents themselves. (This holds for many data sources where a tab control character cannot enter the data, or where tabs are stripped from values when the file is created because tab is invalid as a content character.)
  2. The same number of tab characters appears on every line; in other words, every line has the same number of fields. (Again, this is a common guarantee in many data sets, a consequence of an orderly data creation step, and can therefore be relied upon.)

Given these two pre-conditions, I can't see any reason not to just slurp up the file like this and save an import of the csv module.
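
If you want to confirm the pre-conditions before trusting the slurp, the second one can be checked mechanically; the first is a property of the data source and can't be verified from the file alone. A small sketch (not part of the original answer, the function name is mine):

def has_uniform_tab_count(filename: str) -> bool:
    """Return True if every line contains the same number of tab characters."""
    with open(filename, encoding="utf-8") as tabfile:
        tab_counts = {line.count("\t") for line in tabfile}
    # Exactly one distinct count means every line has the same number of fields.
    return len(tab_counts) == 1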

With CSV files, especially MS Excel CSV files, there are a number of gotchas and special cases where it's sensible to use the csv module for protection. But in general usage tab characters are rare in content, especially web content, where the tab key moves between fields rather than entering a character. It's quite common to be in a scenario where the pre-conditions above are guaranteed, and it seems a waste of effort, and extra lines of code, to use the csv module when a reliable tab delimiter is in place.
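
To make those gotchas concrete, here is a small illustration (not from the original answer) of the classic failure case: an Excel-style quoted field that contains the delimiter. The naive split breaks it apart, while csv.reader honours the quoting:

import csv
import io

line = 'id\tsession_id\t"notes\twith an embedded tab"\r\n'

# Naive split produces four fields because it cannot see the quoting.
print(line.strip().split("\t"))
# ['id', 'session_id', '"notes', 'with an embedded tab"']

# csv.reader respects the quotes and returns three fields.
print(next(csv.reader(io.StringIO(line), delimiter="\t")))
# ['id', 'session_id', 'notes\twith an embedded tab']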

Refer to the Python open documentation to understand the keyword arguments to the open call.

Upvotes: 0

wagnerpeer

Reputation: 947

I would suggest using the csv module. It is easy to use and fits best if you want to read in table-like structures stored in a CSV-like format (tab/space/something-else delimited).

The module documentation gives good examples; the simplest usage is stated to be:

import csv
with open('/tmp/data.txt', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        print row

Every row is a list, which is very useful if you want to do index-based manipulations.

If you want to change the delimiter, there is a keyword argument for that, but I am often fine with the predefined dialects, which can also be selected via a keyword.

import csv
with open('/tmp/data.txt', 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        print row
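
Not part of the original answer, but to build on the two points above (rows as lists, and the delimiter keyword), here is a short sketch in Python 3 syntax, using the column names from the question; csv.DictReader is a convenient variant that keys each row on the header row for you:

import csv

with open('/tmp/data.txt', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)             # ['id', 'session_id', 'event_id_concat']
    ids = [row[0] for row in reader]  # index-based access to the first column

with open('/tmp/data.txt', newline='') as f:
    for row in csv.DictReader(f, delimiter='\t'):
        print(row['id'], row['session_id'])  # access by header name instead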

I am not sure if this will fix your problem, but using a well-established module ensures that if the error remains, the issue is with your file rather than your code.

Upvotes: 19

elyase

Reputation: 40963

It should work, but it is better to use 'with':

with open('/tmp/data.txt') as f:
    for l in f:
        print l.strip().split("\t")

If it doesn't, then your file probably doesn't have the required format.
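
One quick way to check the format (a small addition, not in the original answer) is to print the repr of the first line, which makes hidden characters such as tabs and carriage returns visible:

with open('/tmp/data.txt') as f:
    first_line = next(f)
print(repr(first_line))   # e.g. 'id\tsession_id\tevent_id_concat\r\n'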

Upvotes: 8
