Reputation: 39
I am trying to read several (>1000) .txt files (approx. 700 MB each on average; delimited, header-less, CSV-like, without commas or another obvious separator) and merge them into one pandas dataframe (in order to then run an analysis on the entire dataset).
I am running this through SSH on an HPC server, on which I requested 50 GB RAM, 1 node and 1 task per node (that was all just a wild guess, as I have never done this before).
So far, my idea was this:
import pandas as pd

li = []
for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)
but after a few hours, around the approx. 360th file, the process gets killed and I get this error message:
numpy.core._exceptions.MemoryError: Unable to allocate 1.11 GiB for an array with shape (10, 14921599) and data type float64
Do you have any idea how to load and merge the data in a more memory-efficient way? (I assume that just requesting more RAM still would not get me through the entire set of .txt files!?)
Also, I would like to save the resulting dataframe in a 'memory-efficient' way afterwards; do you know the best way/format (CSV?) to do that?
Any help would be much appreciated!
Upvotes: 3
Views: 1120
Reputation: 1
Q : "How can I ... Any help would be much appreciated!"
A :
Best follow the laws of the ECONOMY-of-COMPUTING :
Your briefly sketched problem has, without question, immense "setup" costs, with an unspecified amount of useful work to be computed on an unspecified HPC ecosystem.
Even without the hardware & rental details ( the devil is always hidden in the detail(s), and one can easily pay absurd amounts of money trying to make a covertly "shared" platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, all the more so when the overall computing strategy was poorly designed ), I cannot resist quoting the so-called 1st Etore's Law of the Evolution of Systems' Dynamics :
If we open a can of worms,
the only way to put them back
is to use a bigger can
Closing our eyes to the accumulating inefficiencies is the worst of sins, as devastatingly exponential growth of all costs, time & resources, incl. energy consumption, is common in such complex systems, often built from many levels of stacked inefficiencies.
Sorry if all this was already known to you; I am just trying to build some common ground as a platform on which further argumentation can stand rock-solid. More details are here, and this is only a necessary beginning, as more problems will definitely come from real-world O( Mx * Ny * ... )-scaling issues in further modelling.
0.1 ns - CPU NOP - a DO-NOTHING instruction
0.5 ns - CPU L1 dCACHE reference (1st introduced in late 80-ies )
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
3~4 ns - CPU L2 CACHE reference (2020/Q1)
7 ns - CPU L2 CACHE reference
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
2,500 ns - Read 10 kB sequentially from MEMORY------ HPC-node
25,000 ns - Read 100 kB sequentially from MEMORY------ HPC-node
250,000 ns - Read 1 MB sequentially from MEMORY------ HPC-node
2,500,000 ns - Read 10 MB sequentially from MEMORY------ HPC-node
25,000,000 ns - Read 100 MB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
250,000,000 ns - Read 1 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
2,500,000,000 ns - Read 10 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
25,000,000,000 ns - Read 100 GB sequentially from MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
_____________________________________________________________________________own_CPU/DDR__________________________
| | | | |
| | | | ns|
| | | us|
| | ms|
| s|
h|
500,000 ns - Round trip within a same DataCenter ------- HPC-node / HPC-storage latency on each access
20,000,000 ns - Send 2 MB over 1 Gbps NETWORK
200,000,000 ns - Send 20 MB over 1 Gbps NETWORK
2,000,000,000 ns - Send 200 MB over 1 Gbps NETWORK
20,000,000,000 ns - Send 2 GB over 1 Gbps NETWORK
200,000,000,000 ns - Send 20 GB over 1 Gbps NETWORK
2,000,000,000,000 ns - Send 200 GB over 1 Gbps NETWORK
____________________________________________________________________________via_LAN_______________________________
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
10,000,000 ns - DISK seek spent to start file-I/O on spinning disks on any next piece of data seek/read
30,000,000 ns - DISK 1 MB sequential READ from a DISK
300,000,000 ns - DISK 10 MB sequential READ from a DISK
3,000,000,000 ns - DISK 100 MB sequential READ from a DISK
30,000,000,000 ns - DISK 1 GB sequential READ from a DISK
300,000,000,000 ns - DISK 10 GB sequential READ from a DISK
3,000,000,000,000 ns - DISK 100 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
| | | | |
| | | | ns|
| | | us|
| | ms|
| s|
h|
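To get a feeling for these orders of magnitude, here is a minimal back-of-envelope sketch in Python. It uses only the indicative rates from the table above and the ~25 GB working figure from the figure below; all the numbers are assumptions, not measurements of your system:

NS_PER_GB_DISK_READ = 30_000_000_000   # ~30 s per GB, sequential spinning-disk read ( table above )
NS_PER_GB_LAN_1GBPS = 10_000_000_000   # ~10 s per GB over a 1 Gbps network ( table above )

dataset_gb  = 25                       # the ~25+ GB working figure used in the figure below
disk_passes = 3                        # e.g. read on localhost, store on "HPC", re-read into pandas

total_ns = ( dataset_gb * disk_passes * NS_PER_GB_DISK_READ
           + dataset_gb * NS_PER_GB_LAN_1GBPS )
print( f"I/O & transfer alone : ~{total_ns / 1e9 / 60:.0f} minutes, before any useful work starts" )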
Given these elements, the end-to-end computing strategy can and should be improved.
A naive figure says more than a thousand words :
localhost |
: file-I/O ~ 25+GB SLOWEST/EXPENSIVE
: 1st time 25+GB file-I/O-s
: |
: | RAM |
: | |
+------+ |IOIOIOIOIOI|
|.CSV 0| |IOIOIOIOIOI|
|+------+ |IOIOIOIOIOI|
||.CSV 1| |IOIOIOIOIOI|
||+------+ |IOIOIOIOIOI|-> local ssh()-encrypt+encapsulate-process
|||.CSV 2| |IOIOIOIOIOI| 25+GB of .CSV
+|| | |IOIOIOIOIOI|~~~~~~~|
|| | |IOIOIOIOIOI|~~~~~~~|
+| | | |~~~~~~~|
| | | |~~~~~~~|
+------+ | |~~~~~~~|
... | |~~~~~~~|-> LAN SLOW
... | | WAN SLOWER
... | | transfer of 30+GB to "HPC" ( ssh()-decryption & file-I/O storage-costs omited for clarity )
+------+ | | | 30+GB file-I/O ~ 25+GB SLOWEST/EXPENSIVE
|.CSV 9| | |~~~~~~~~~~~~~~~| 2nd time 25+GB file-I/O-s
|+------+ | |~~~~~~~~~~~~~~~|
||.CSV 9| | |~~~~~~~~~~~~~~~|
||+------+ | |~~~~~~~~~~~~~~~|
|||.CSV 9| | |~~~~~~~~~~~~~~~|
+|| 9| | |~~~~~~~~~~~~~~~|
|| 9| | |~~~~~~~~~~~~~~~|
+| | | |~~~~~~~~~~~~~~~|
| | | |~~~~~~~~~~~~~~~|-> file-I/O into python
+------+ | | all .CSV file to RAM ~ 25+GB SLOWEST/EXPENSIVE
| |***| 3rd time 25+GB file-I/O-s
| | RAM .CSV to df CPU work
| |***| df to LIST new RAM-allocation + list.append( df )-costs
| |***| + 25+GB
| |***|
many hours | |***|
[SERIAL] flow ...| |***|
/\/\/\/\/\/\/\/\/\/|\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|***|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
| crashed |***|
| on about |***|
| 360-th file |***|
| |***|->RAM ~ 50~GB with a LIST of all 25+GB dataframes held in LIST
| | CPU +mem-I/O costs LIST to new 25+GB dataframe RAM-allocation & DATA-processing
| |~~~~~| mem-I/O RAM| :: GB
| |~~~~~| mem-I/O |RAM flow of ~ 50+GB over only 2/3/? mem-I/O HW-channels
| |~~~~~| only if "HPC"
| |~~~~~| is *NOT* a "shared"-rental of cloud HW,
| |~~~~~| remarketed as an "HPC"-illusion
| |~~~~~|
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| <...some amount of some usefull work --"HPC"-processing the ~ 25+GB dataframe...>
| <...some amount of some usefull work ...>
| <...some amount of some usefull work the more ...>
| <...some amount of some usefull work the better ...>
| <...some amount of some usefull work as ...>
| <...some amount of some usefull work it ...>
| <...some amount of some usefull work dissolves to AWFULLY ...>
| <...some amount of some usefull work HIGH ...>
| <...some amount of some usefull work SETUP COSTS ...>
| <...some amount of some usefull work ...>
| <...some amount of some usefull work --"HPC"-processing the ~ 25+GB dataframe...>
| :::::::::::::?
| :::::::::::::?
| :::::::::::::?
| |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~| 4th time 25+GB file-I/O-s
| |~~~~~|
| |~~~~~|->file left on remote storage (?)
| |
| O?R
| |
| |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~| 5th time 25+GB file-I/O-s
| |~~~~~|
| |~~~~~|
| |~~~~~|
| |~~~~~|-> RAM / CPU ssh()-encrypt+encapsulate-process
| |????????????| 25+GB of results for repatriation
| |????????????| on localhost
| |????????????|
| |????????????|
| |????????????|-> LAN SLOW
| | WAN SLOWER
| | transfer of 30+GB from "HPC" ( ssh()-decryption & file-I/O storage-costs omited for clarity )
| | | 30+GB file-I/O ~ 25+GB SLOWEST/EXPENSIVE
| |~~~~~~~~~~~~~~~| 6th time 25+GB file-I/O-s
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
| |~~~~~~~~~~~~~~~|
SUCCESS ? | |~~~~~~~~~~~~~~~|-> file transferred back and stored on localhost storage
after |
how many |
failed |
attempts |
having |
how high |
recurring|
costs |
for any |
next |
model|
recompute|
step(s) |
|
|
All |
that |
( at what overall |
[TIME]-domain |
& "HPC"-rental |
costs ) |
Tips :
Wherever possible, prefer stream-computing formulated algorithms ( .mean(), .sum(), .min(), .max(), or even .rsi(), .EMA(), .TEMA(), .BollingerBands() ... and many more alike ) - they reduce the RAM-allocations, can be & shall be pre-computed ( once ) & minimise the [SERIAL]-one-after-another processing-pipeline latency.
Use memory-mapped numpy-tools instead, where all the array syntax & methods remain the same, yet they can, by design, work without moving all data at once from disk into physical RAM ( using this was my life-saving trick ever since, the more so when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware ).
For more details on going in the direction of RAM-protecting, memory-mapped np.ndarray processing, with all the smart numpy-vectorised and other high-performance-tuned tricks, read this :

>>> print( np.memmap.__doc__ )
Create a memory-map to an array stored in a *binary* file on disk.

Memory-mapped files are used for accessing small segments of large files
on disk, without reading the entire file into memory. NumPy's memmap's
are array-like objects. (...)
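As an illustration only ( a minimal sketch; it assumes the merged numeric data has already been dumped once into a flat binary float64 file, here hypothetically named 'merged_data.f64', with a known column count ) :

import numpy as np

N_COLS = 10                                  # known column count ( cf. the (10, 14921599) shape in the error )

data = np.memmap( 'merged_data.f64',         # hypothetical file name, a binary float64 dump of the merged data
                  dtype = np.float64,
                  mode  = 'r'
                  ).reshape( -1, N_COLS )    # array-like view; nothing is loaded into RAM eagerly

col_means = data.mean( axis = 0 )            # reductions let the OS page data in & out on demand,
col_maxes = data.max(  axis = 0 )            # instead of keeping the whole dataset resident in RAM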
Upvotes: 0
Reputation: 1183
As you said, you have so many files and they need so much memory, so I suggest loading the files one by one and saving them all into a single file in append mode (appending each file's data to the previously saved data), like this:
for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    df.to_csv('./data.csv', header=None, index=None, mode='a')
After saving all of the files into a single file, you can now read that single file as a dataframe like this:
df = pd.read_csv('./data.csv', header=None, index_col=False)
After that, if you still run into memory issues when reading this file, you can use a chunked reader like this:
chunksize = 10 ** 6
with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        ...  # do what you want with each chunk; only `chunksize` rows are in memory at a time
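For example, if the analysis can be expressed as running aggregates, a sketch like this (assuming purely numeric, header-less data) keeps only one chunk in memory at a time:

import pandas as pd

chunksize = 10 ** 6
col_sums, n_rows = None, 0

with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        # accumulate per-column sums and the total row count, chunk by chunk
        col_sums = chunk.sum() if col_sums is None else col_sums + chunk.sum()
        n_rows += len(chunk)

col_means = col_sums / n_rows   # per-column means over the whole dataset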
Upvotes: 1