Reputation: 641
I have comma corrupt csv file that can be read by microsoft excel file, but can't be read well using pandas dataframe, I got macro solution, but I want solution that works in python as well, here's first 5 lines of my data
mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer,doctoral advisor,"languages spoken, written or signed",academic degree,doctoral student,manner of death,position held,field of work,award received,Erdős number,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
Roger Joseph Boscovich,"['physicist', 'astronomer', 'mathematician', 'philosopher', 'diplomat', 'poet', 'theologian', 'priest', 'polymath', 'historian', 'scientist', 'writer', 'cleric', 'university teacher']",['Republic of Ragusa'],"Dubrovnik, Republic of Ragusa",13 February 1787,['Pontifical Gregorian University'],['Pontifical Gregorian University'],"['Milan', 'Habsburg Empire']","['Royal Society', 'Russian Academy of Sciences', 'Russian Academy of Sciences']",['Pontifical Gregorian University'],,['Latin'],,,,,,['Fellow of the Royal Society'],,['human'],['male'],False,18,May,1711,False,13,February,1787
Emma Previato,['mathematician'],"['United States of America', 'Italy']",Badia Polesine,,"['Harvard University', 'University of Padua']","['Boston University', 'University of Padua']",,['American Mathematical Society'],"['Boston University', 'University of Padua']",['David Mumford'],,,,,,,,,['human'],['female'],False,,,1952,False,,,
Feodor Deahna,['mathematician'],,,1844,,,,,,,,,,,,['differential geometry'],,,['human'],['male'],False,,,1815,False,,,1844
Denis Henrion,"['publisher', 'mathematician']",['France'],,1640,,,,,,,['French'],,,,,,,,['human'],['male'],True,,,1500,False,,,1640
Here's working macro solutoion
For i1 = 1 To 9000
dump = ""
For i2 = 1 To 29
dump = dump & "," & Cells(i1, i2).Value
Next i2
Cells(i1, 30).Value = Mid(dump, 2, 1000)
Next i1
How to convert macro solution to pandas solution?
Upvotes: 0
Views: 91
Reputation: 46759
To convert the list representations used in some of the cells:
import pandas as pd
import csv
import ast
data = []
with open('input.csv', 'rb') as f_input:
for row in csv.reader(f_input):
for index, v in enumerate(row):
if v.startswith('['):
row[index] = ', '.join(ast.literal_eval(v))
data.append(row)
print pd.DataFrame(data)
For example a cell containing the text:
['United States of America', 'Italy']
Would become:
United States of America, Italy
The dataframe would appear as:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
0 mathematicians occupation country of citizenship place of birth date of death educated at employer place of death member of employer doctoral advisor languages spoken, written or signed academic degree doctoral student manner of death position held field of work award received Erdős number instance of sex or gender approx. date of birth day of birth month of birth year of birth approx. date of death day of death month of death year of death
1 Roger Joseph Boscovich physicist, astronomer, mathematician, philosop... Republic of Ragusa Dubrovnik, Republic of Ragusa 13 February 1787 Pontifical Gregorian University Pontifical Gregorian University Milan, Habsburg Empire Royal Society, Russian Academy of Sciences, Ru... Pontifical Gregorian University Latin Fellow of the Royal Society human male False 18 May 1711 False 13 February 1787
2 Emma Previato mathematician United States of America, Italy Badia Polesine Harvard University, University of Padua Boston University, University of Padua American Mathematical Society Boston University, University of Padua David Mumford human female False 1952 False
3 Feodor Deahna mathematician 1844 differential geometry human male False 1815 False 1844
4 Denis Henrion publisher, mathematician France 1640 French human male True 1500 False 1640
Upvotes: 1