Reputation: 71
Right now I am trying to read in data which is provided in a messy to read-in format. Here is an example
#Name,
#Comment,""
#ExtComment,""
#Source,
[Data]
1,2
3,4
5,6
#[END_OF_FILE]
When working with one or two of these files, I have manually changed the ['DATA'] header to ['x', 'y'] and am able to read in data just fine by skipping the first few rows and not reading the last line.
However, right now I have 30+ files, split between two different folders and I am trying to figure out the best way to read in the files and change the header of each file from ['DATA'] to ['x', 'y'].
The excel files are in a folder one path lower than the file that is supposed to read them (i.e. folder 1 contains set of code below, and folder 2 contains the excel files, folder 1 contains folder 2)
Here is what I have right now:
#sets - refers to the set containing the name of each file (i.e. [file1, file2])
#df - the dataframe which you are going to store the data in
#dataLabels - the headers you want to search for within the .csv file
#skip - the number of rows you want to skip
#newHeader - what you want to change the column headers to be
#pathName - provide path where files are located
def reader (sets, df, dataLabels, skip, newHeader, pathName):
for i in range(len(sets)):
df_temp = pd.read_csv(glob.glob(pathName+ sets[i]+".csv"), sep=r'\s*,', skiprows = skip, engine = 'python')[:-1]
df_temp.column.value[0] = [newHeader]
for j in range(len(dataLabels)):
df_temp[dataLabels[j]] = pd.to_numeric(df_temp[dataLabels[j]],errors = 'coerce')
df.append(df_temp)
return df
When I run my code, I run into the error:
No columns to parse from file
I am not quite sure why - I have tried skipping past the [DATA] header and I still receive that error.
Note, for this example I would like the headers to be 'x', 'y' - I am trying to make a universal function so that I could change it to something more useful depending on what I am measuring.
Upvotes: 1
Views: 1498
Reputation: 131219
If you're lucky, you can use Kraigolas' answer to treat those lines as comments.
In other cases you may be able to use the skiprows
argument to skip header columns:
df= pd.read_csv(path,skiprows=10,skipfooter=2,names=["x","y"])
And yes, I do have an unfortunate file with a 10-row heading and 2 rows of totals.
Unfortunately I also have very unfortunate files where the number of headings change.
In this case I used the following code to iterate until I find the first "good" row, then create a new dataframe from the rest of the rows. The names in this case are taken from the first "good" row and the types from the first data row
This is certainly not fast, it's a last resort solution. If I had a better solution I'd use it:
data = df
if(first_col not in df.columns):
# Skip rows until we find the first col header
for i, row in df.iterrows():
if row[0] == first_col:
data = df.iloc[(i + 1):].reset_index(drop=True)
# Read the column names
series = df.iloc[i]
series = series.str.strip()
data.columns = list(series)
# Use only existing column types
types = {k: v for k, v in dtype.items() if k in data.columns}
# Apply the column types again
data = data.astype(dtype=types)
break
return data
In this case the condition is finding the first column name (first_col) in the first cell.
This can be adopted to use different conditions, eg looking for the first numeric cell:
columns = ["x", "y"]
dtypes = {"x":"float64", "y": "float64"}
data = df
# Skip until we find the first numeric value
for i, row in df.iterrows():
if row[0].isnumeric():
data = df.iloc[(i + 1):].reset_index(drop=True)
# Apply names and types
data.columns = columns
data = data.astype(dtype=dtypes)
break
return data
Upvotes: 1
Reputation: 20250
Expanding Kraigolas's answer, to do this with multiple files you can use a list comprehension:
files = [glob.glob(f"{pathName}{set_num}.csv") for set_num in sets]
df = pd.concat([pd.read_csv(file, comment="#", names = ["x", "y"]) for file in files])
Upvotes: 1
Reputation: 5560
If the #[DATA]
row is to be replaced regardless, just ignore it. You can just tell pandas to ignore lines that start with #
and then specify your own names:
import pandas as pd
df = pd.read_csv('test.csv', comment='#', names=['x', 'y'])
which gives
x y
0 1 2
1 3 4
2 5 6
Upvotes: 5