Reputation: 38135
I have three csv dataframes of tweets, each ~5M tweets. The following code for concatenating them exists with low memory error. My machine has 32GB memory. How can I assign more memory for this task in pandas?
df1 = pd.read_csv('tweets.csv')
df2 = pd.read_csv('tweets2.csv')
df3 = pd.read_csv('tweets3.csv')
frames = [df1, df2, df3]
result = pd.concat(frames)
result.to_csv('tweets_combined.csv')
The error is:
$ python concantenate_dataframes.py
sys:1: DtypeWarning: Columns (0,1,2,3,4,5,6,8,9,10,11,12,13,14,19,22,23,24) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "concantenate_dataframes.py", line 19, in <module>
df2 = pd.read_csv('tweets2.csv')
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
UPDATE: tried the suggestions in the answer and still get error
$ python concantenate_dataframes.py
Traceback (most recent call last):
File "concantenate_dataframes.py", line 18, in <module>
df1 = pd.read_csv('tweets.csv', low_memory=False, error_bad_lines=False)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 943, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
I am running the code on Ubuntu 20.04 OS
Upvotes: 0
Views: 185
Reputation: 347
I think this is problem with malformed data (some data not structure properly in tweets2.csv
) for that you can use error_bad_lines=False
and try to chnage engine from c to python like engine='python'
ex : df2 = pd.read_csv('tweets2.csv', error_bad_lines=False)
or
ex : df2 = pd.read_csv('tweets2.csv', engine='python')
or maybe
ex : df2 = pd.read_csv('tweets2.csv', engine='python', error_bad_lines=False)
but I recommand to identify those revord and repair that.
And also if you want hacky way to do this than use
2) https://askubuntu.com/questions/656039/concatenate-multiple-files-without-headerenter link description here
Upvotes: 0