Reputation: 3
This question is about Python 2.7 and the pandas library. I downloaded this file: http://language.media.mit.edu/data/public/wikipedia_userlang_iso639-3.zip. It is a text file that uses both tabs and commas as delimiters. I have searched extensively without finding a solution.
I want to separate the fields using pandas. This gives errors:
import pandas as pd
df = pd.read_table('wikipedia_userlang_iso639-3.tsv', sep='\t')
print df[:10]
because the file also has commas.
Help is much appreciated!
Upvotes: 0
Views: 1434
Reputation: 298196
That file can't be parsed as a CSV file because each row doesn't have a fixed number of fields (it ranges from 2 to 241). You'll have to parse it yourself and decide how you want to handle the variable number of languages for each user:
import codecs

with codecs.open('wikipedia_userlang_iso639-3.tsv', 'r', 'utf-8') as handle:
    for line in handle:
        chunks = line.strip().split('\t')
        username = chunks[0]
        languages = [c.split(',') for c in chunks[1:]]
        # Do something with the above variables
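One possible way to "do something" with those variables is to flatten the ragged rows into one record per user and language chunk, which gives every record the same shape. The following is only a sketch: the function names iter_user_languages and load_user_languages are made up for illustration, and it assumes (as the loop above does) that the first tab-separated field is the username and every later field is a comma-separated group.

```python
import io


def iter_user_languages(lines):
    """Yield (username, fields) pairs, one per tab-separated chunk after the username.

    Assumes the first tab-separated field is the username and each
    remaining field is a comma-separated group, as in the loop above.
    """
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        chunks = line.split('\t')
        username = chunks[0]
        for chunk in chunks[1:]:
            yield username, chunk.split(',')


def load_user_languages(path):
    """Read a UTF-8 file and return all (username, fields) pairs as a list."""
    with io.open(path, 'r', encoding='utf-8') as handle:
        return list(iter_user_languages(handle))
```

Because every pair now has exactly two elements, the list returned by load_user_languages('wikipedia_userlang_iso639-3.tsv') can be passed to pd.DataFrame with two column names of your choosing, so the variable number of languages per user becomes a variable number of rows instead of columns.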
Upvotes: 1