Martijn

Reputation: 3

Comma- and tab-delimited TSV file

This question is about Python 2.7 with the pandas library. I downloaded this file: http://language.media.mit.edu/data/public/wikipedia_userlang_iso639-3.zip. It uses both tabs and commas as delimiters. I've searched extensively without finding a solution.

I want to split it into columns using pandas, but the following raises errors:

import pandas as pd

df = pd.read_table('wikipedia_userlang_iso639-3.tsv', sep='\t')

print df[:10]

because the file also has commas.

Any help is much appreciated!

Upvotes: 0

Views: 1434

Answers (1)

Blender

Reputation: 298196

That file can't be parsed as a CSV file because each row doesn't have a fixed number of fields (it ranges from 2 to 241). You'll have to parse it yourself and decide how you want to handle the variable number of languages for each user:

import codecs

with codecs.open('wikipedia_userlang_iso639-3.tsv', 'r', 'utf-8') as handle:
    for line in handle:
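        # Fields are tab-separated; the first one is the username
        chunks = line.strip().split('\t')

        username = chunks[0]
        # Each remaining field is itself comma-separated
        languages = [c.split(',') for c in chunks[1:]]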

        # Do something with the above variables
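
One way to "do something" with those variables is to flatten each variable-width line into fixed-width rows, one row per tab-separated field. The following is only a sketch: the helper names `parse_line` and `to_long_rows` are hypothetical, the sample lines are made up rather than taken from the real file, and what the comma-separated parts mean is left to you to check against the data.

```python
def parse_line(line):
    """Split one raw line into (username, list of comma-split fields)."""
    chunks = line.strip().split('\t')
    username = chunks[0]
    entries = [c.split(',') for c in chunks[1:]]
    return username, entries


def to_long_rows(lines):
    """Flatten variable-width lines into (username, parts) rows."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        username, entries = parse_line(line)
        for parts in entries:
            # One output row per tab-separated field of the input line
            rows.append((username, tuple(parts)))
    return rows


# Made-up sample lines (assumption: NOT copied from the real file)
sample = [
    u"alice\teng,5\tnld,3\n",
    u"bob\tdeu,N\n",
]
rows = to_long_rows(sample)
```

Because every row now has exactly two elements, something like `pd.DataFrame(rows, columns=['username', 'entry'])` should be able to load the result, where the column names are again just placeholders.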

Upvotes: 1
