milka1117
milka1117

Reputation: 521

Fixing data in csv file

I was given a csv file with purchase data but it has a problem: it has 4 columns, all separated by commas but one column with a price in it has many values with comma used as decimal separator. So in the end when I'm trying to read the file it reads these lines as having 5 columns and runs into an error. Something like this:

transaction id,user id,purchase price,purchase date
1009497,490408,10,41674
1077573,490408,8,95,41676

So pd.read_csv is able to read the labels, read the first line, but it stops at the second line, because it thinks I am giving it 5 columns instead of 4. What would be the most efficient way to fix my data? Its not possible to change all decimal separators from comma to dot by hand.

UPDATE: I am thinking about reading each line as string and then counting commas in each line and if it has 4 commas then I would use regex to take replace data around that comma as "." instead of ","

Upvotes: 0

Views: 124

Answers (2)

spYder
spYder

Reputation: 317

I would do like this, when I try to replicate your issue I have the following DF:

   transaction id  user id  purchase price  purchase date Unnamed: 4
0         1009497   490408              10          41674        nan
1         1077573   490408               8             95    41676.0

# So basically I get a new column " Unnamed:4" 

df['Unnamed: 4'] = df['Unnamed: 4'].astype(str) # Convert to string..... 
df['purchase date'] = df['purchase date'].astype(str)


df.loc[df['Unnamed: 4'] != 'nan', 'purchase price'] = df['purchase price'].astype(str) + '.' + df['purchase date'] # When it's not nan, will merge with the purchase price


df.loc[df['Unnamed: 4'] != 'nan', 'purchase date'] = df['Unnamed: 4'].str.split('.').str[0] # When it's not nan, will reassign the purchase date

#Just drop the last column....

df.drop(columns=['Unnamed: 4'])

# You can return the purchase price to float 

df['purchase price'] = df['purchase price'].astype(float)

Upvotes: 0

Kostas Charitidis
Kostas Charitidis

Reputation: 3103

If you're sure that it is only the purchase price field with this issue you can do the following although if your file is big it will take a while but it works:

import pandas as pd

with open('your_csv.csv', 'r') as f:
    file_text = f.readlines()

with open('your_csv.csv', 'w') as f:
    for line in file_text:
        if len(line.split(',')) > 4:
            line = '%s,%s,%s.%s,%s' % tuple([i for i in line.split(',')])
        f.write(line)

csv = pd.read_csv('your_csv.csv')
print(csv)

Upvotes: 2

Related Questions