Reputation: 101
I am trying to read a CSV file with pandas. The file, which I downloaded from LinkedIn Campaign Manager, seems to be in a strange format. Can you help me read it normally? Here is the code:
import glob
import os
import pandas as pd
path = r'C:\Users\FilePath' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))
dfAllDataLI = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
Here is the error:
UnicodeDecodeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11340/2382686370.py in <module>
3 path = r'C:\Users\n' # use your path
4 all_files = glob.glob(os.path.join(path, "*.csv"))
----> 5 dfAllDataLI = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
6 dfAllDataLI = dfAllDataLI.fillna('')
7
c:\Userspackages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
c:\Usersshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
292 ValueError: Indexes have overlapping values: ['a']
293 """
--> 294 op = _Concatenator(
295 objs,
296 axis=axis,
c:\Useronda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
346 objs = [objs[k] for k in keys]
...
c:\Useda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
c:\Users\ackages\pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Here is the content of the file:
Campaign Performance Report (in UTC)
Report Start: April 1, 2022, 12:00 AM
Report End: April 19, 2022, 11:59 PM
Date Generated: September 7, 2022, 1:12 PM
Start Date (in UTC)	Account Name	Campaign Group Name	Campaign Group ID	Campaign Name	Campaign ID	Campaign Type	Campaign Start Date	Campaign Group Start Date	Campaign End Date	Total Budget	Clicks	Impressions	Average CPM	Average CPC	Avg. Last Day Reach	Video Completions
4/19/2022	Wiener Stadtwerke GmbH_iprospect	WST_Content_Promotion_2022	622214964	14.04. | Spendeaktion UKR | reach	194421704	Sponsored Update	4/19/2022	3/8/2022	4/30/2022	600	23	3109	17.22	2.33	3096	58
Upvotes: 1
Views: 1263
Reputation: 168967
The file has 5 non-CSV rows before the column header.
Happily, read_csv allows you to skip those lines. You'll also need to specify the text encoding (it's UTF-16LE, not UTF-8) and the separator (the file is tab-separated):
import pandas as pd
df = pd.read_csv('csv file.csv', skiprows=5, encoding='utf-16le', sep='\t')
print(df.columns)
outputs
Index(['Start Date (in UTC)', 'Account Name', 'Campaign Group Name',
'Campaign Group ID', 'Campaign Name', 'Campaign ID', 'Campaign Type',
'Campaign Start Date', 'Campaign Group Start Date', 'Campaign End Date',
'Total Budget', 'Clicks', 'Impressions', 'Average CPM', 'Average CPC',
'Avg. Last Day Reach', 'Video Completions'],
dtype='object')
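To apply the same options to every file in a folder, as the question's glob loop does, something like the following sketch might work. It assumes every export has the same layout (5 preamble rows, UTF-16LE, tab-separated); the function name read_linkedin_reports and the preamble_rows parameter are made up for illustration and are not part of pandas.

```python
import glob
import os

import pandas as pd


def read_linkedin_reports(folder, preamble_rows=5):
    """Read every *.csv export in `folder` and stack them into one DataFrame.

    Assumption: all files share the layout described above
    (`preamble_rows` non-CSV rows, UTF-16LE text, tab-separated columns).
    """
    all_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not all_files:
        # pd.concat raises a less helpful ValueError on an empty list
        raise FileNotFoundError(f"no .csv files found in {folder!r}")
    frames = [
        pd.read_csv(f, skiprows=preamble_rows, encoding="utf-16le", sep="\t")
        for f in all_files
    ]
    return pd.concat(frames, ignore_index=True)
```

With the path from the question, the call would be dfAllDataLI = read_linkedin_reports(path). If some exports turn out to have a different number of preamble rows or a different encoding, those files would need their own options.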
Upvotes: 1