Reputation: 71

Changing Headers in .csv files

Right now I am trying to read in data which is provided in a messy to read-in format. Here is an example

#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]

When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.

However, right now I have 30+ files, split between two different folders and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].

The excel files are in a folder one path lower than the file that is supposed to read them (i.e. folder 1 contains set of code below, and folder 2 contains the excel files, folder 1 contains folder 2)

Here is what I have right now:

#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in 
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located

def reader (sets, df, dataLabels, skip, newHeader, pathName):
     for i in range(len(sets)):
        
        df_temp = pd.read_csv(glob.glob(pathName+ sets[i]+".csv"), sep=r'\s*,', skiprows = skip, engine = 'python')[:-1] 
        df_temp.column.value[0] = [newHeader]
        for j in range(len(dataLabels)):
           df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]],errors = 'coerce')       
        df.append(df_temp)       
     return df

When I run my code, I run into the error:

No columns to parse from file

I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.

Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.

Upvotes: 1

Answers (3)

Panagiotis Kanavos

Reputation: 131729

If you're lucky, you can use Kraigolas' answer to treat those lines as comments.

In other cases you may be able to use the skiprows argument to skip header columns:

df= pd.read_csv(path,skiprows=10,skipfooter=2,names=["x","y"])

And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.

Unfortunately I also have very unfortunate files where the number of headings change.

In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names in this case are taken from the first "good" row and the types from the first data row

This is certainly not fast, it's a last resort solution. If I had a better solution I'd use it:

data = df
if(first_col not in df.columns):
    # Skip rows until we find the first col header
    for i, row in df.iterrows():
        if row[0] == first_col:
            data = df.iloc[(i + 1):].reset_index(drop=True)
            # Read the column names
            series = df.iloc[i]
            series = series.str.strip()
            data.columns = list(series)

            # Use only existing column types
            types = {k: v for k, v in dtype.items() if k in data.columns}
            # Apply the column types again
            data = data.astype(dtype=types)
            break
return data

In this case the condition is finding the first column name (first_col) in the first cell.

This can be adopted to use different conditions, eg looking for the first numeric cell:

columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
    if row[0].isnumeric():
        data = df.iloc[(i + 1):].reset_index(drop=True)
        # Apply names and types
        data.columns = columns
        data = data.astype(dtype=dtypes)
            break
return data

Upvotes: 1

SamR

Reputation: 20512

Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension:

files = [glob.glob(f"{pathName}{set_num}.csv") for set_num in sets]
df = pd.concat([pd.read_csv(file, comment="#", names = ["x", "y"]) for file in files])

Upvotes: 1

Kraigolas

Reputation: 5590

If the #[DATA] row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with # and then specify your own names:

import pandas as pd

df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])

which gives

Upvotes: 5

Changing Headers in .csv files

Answers (3)

Related Questions