86_951
86_951

Reputation: 57

Clean messy data with varying deliniator from a text file

I have cleaned a .csv file with four columns; all the data resided in the first column:

    "Plot Title: 10862077                           ",,,
"#  ""Date Time"," GMT-04:00""  ""Temp", �C (LGR S/N: 10862077," SEN S/N: 10862077)""   Coupler Detached (LGR S/N: 10862077)    Coupler Attached (LGR S/N: 10862077)    Host Connected (LGR S/N: 10862077)  Stopped (LGR S/N: 10862077) End Of File (LGR S/N: 10862077)"
"1  9/8/2016 15:47  23.256                  ",,,
"2  9/8/2016 15:47      Logged              ",,,
"3  9/8/2016 15:52          Logged  Logged      ",,,
"4  9/8/2016 15:53      Logged              ",,,
"5  9/8/2016 16:02  22.681                  ",,,

Above was the original, and below is how I output the data--into a text file--separated by '\n':

    ('#\t"Date Time',)
    ('1\t9/8/2016 15:47\t23.256\t\t\t\t\t',)
    ('2\t9/8/2016 15:47\t\tLogged\t\t\t\t',)
    ('3\t9/8/2016 15:52\t\t\tLogged\tLogged\t\t',)
    ('4\t9/8/2016 15:53\t\tLogged\t\t\t\t',)
    ('5\t9/8/2016 16:02\t22.681\t\t\t\t\t',)

The desired output would look something like this, in .csv form:

    (Date, Time, Temperature)
    (9/8/2016, 15:47, 23.256)

Background

I am a new to python (started learning in June, 2017) and I am helping out a friend to clean up this data for use in research. This data originates from a temperature sensor somewhere out in the ocean. I would really appreciate some help getting to the finish line.

I have searched for approaches, although my severe lack of exposure and experience with python is plainly evident in this project.

My initial approach to getting the desired output was to create an if statement to replace a predefined string of \t or \t\t with a (,) and to delete multiple \t's and Logged. I have since removed those attempts from my code and progressed to built-in functions for a solution (.replace and .rstrip and .split) to no avail.

My Code

Disclaimer: I plan to tidy-up (make more pythonic) once I am out of the testing phase. Here is what I have so far, commented out code is either failed attempts or notes for myself:

import pandas as pd

# Open data file and name it:
    # Read file with PANDAS csv reader
    # Make data into a DataFrame with PANDAS
    # Close file

# Open file to write and name it:
        # Iterate rows into tuples (for performance per docs), remove added name/index
        # Strip out trailing, empty columns after C:1
        # Write to new text file with '\n'
    # Close file

with open('BAD_data.csv', 'r') as csvfile:
    reader = pd.read_csv(csvfile)
    data_frm = pd.DataFrame(reader)
    csvfile.close()
with open('improved_data.txt', 'w') as imp_writeDat:
    for row in data_frm.itertuples(index=False, name=None):
        clean_row = str(row[:1])
        imp_writeDat.write(clean_row + '\n')
    imp_writeDat.close()

with open('improved_data.txt', 'r') as imp_readDat:
    data2 = imp_readDat.read()
      print data2.rstrip('\t')
    # print data3.replace('\t\t\t\t\t', '')
    # print imp_readDat.replace(' ', ',')
    # print imp_readDat.replace('\t\t\tLogged\t\t\t', '')
    # print imp_readDat.replace('\t\tLogged\t\t\t\t', '')
    # print imp_readDat.replace('\t\t\tLogged\t\t\t', '')
    # print imp_readDat.replace('\t\t\t\tLogged\t\t', '')
    # print imp_readDat.replace('\t\t\t\t\tLogged\tLogged', '')

The commented out code directly above did not produce any difference in the output.

Upvotes: 2

Views: 900

Answers (1)

jezrael
jezrael

Reputation: 863751

Use:

df = pd.read_csv('BAD_data.csv', 
                 encoding='ISO-8859-1', #if not necessary omit it
                 sep='[\t+,]', #multiple separators
                 header=[0,1], #read first 2 rows to Multiindex
                 engine='python', 
                 dtype=str) #all values convart to strings for NOT change floats

#remove " in first column
df.iloc[:, 0] = df.iloc[:, 0].str.strip('"')
#replace nan strings to NaN
df = df.replace('nan', np.nan)
#remove " and whitespaces in columns
a = df.columns.get_level_values(0).str.strip('" ')
a = np.where(a.str.startswith('Unnamed'), np.nan, a)
b = df.columns.get_level_values(1).str.strip('" ')
df.columns  = [a,b]

#print (df.head())

#write to csv
df.to_csv('Good_data.csv')

Upvotes: 1

Related Questions