Polyhedronic

Reputation: 63

Reading large text file into a dataframe for data analysis in Python

I know similar questions have been asked before, but I still cannot figure out the best way to process the data for my program.

I have a large text file (50,000 to 5,000,000 lines of text). I need to process each line of this file and write it into a DataFrame so that I can do some data analysis on it.

The DataFrame has 9 columns (mostly floats, some strings), and the number of rows is roughly equal to the number of lines in the input file.

Currently, I am reading the file line by line using "with open(...)", extracting the required data with regex, and writing it as a row into the DataFrame. Since all of this runs inside a for loop, it takes forever to complete.

What is the best way to do this? Any pointers or sample programs? Should I even be using a DataFrame?

Here is my code.

    def gcodetodf(self):
        with open(self.inputfilepath, 'r') as ifile:
            lflag = False
            for item in ifile:
                layermatch = self.layerpattern.match(item)
                self.tlist = item.split(' ')
                self.clist = re.split(r"(\w+)", item)

                # stop at the 'end' marker or one layer past the requested range
                if layermatch and (str(self.tlist[2][:-1]) == 'end' or int(self.tlist[2][:-1]) == (self.endlayer + 1)):
                    break

                # start collecting rows once the start layer has been reached
                if (layermatch and int(self.tlist[2][:-1]) == self.startlayer) or lflag is True:
                    lflag = True

                    # lookup keyed by whether each pattern matched; if several
                    # patterns match, the last True entry in the dict wins
                    map_gcpat = {bool(self.gonepattern.match(item)): self.gc_g1xyef,
                                 bool(self.gepattern.match(item)): self.gc_g1xye,
                                 bool(self.gtrpattern.match(item)): self.gc_g1xyf,
                                 bool(self.resetextpattern.match(item)): self.gc_g92e0,
                                 bool(self.ftpattern.match(item)): self.gc_ftype,
                                 bool(self.toolcompattern.match(item)): self.gc_toolcmt,
                                 bool(self.layerpattern.match(item)): self.gc_laycmt,
                                 bool(self.zpattern.match(item)): self.gc_g1z}

                    # call the matched handler, or self.contd if nothing matched
                    map_gcpat.get(True, self.contd)()

        # print(self.newdataframe)

An example function that writes to the DataFrame looks like this:

def gc_g1xye(self):
    self.newdataframe = self.newdataframe.append(
        {'Xc': float(self.tlist[1][1:]), 'Yc': float(self.tlist[2][1:]), 'Zc': self.gc_z,
         'E': float(self.tlist[3][1:]),
         'F': None, 'FT': self.ft_var, 'EW': self.tc_ew, 'LH': self.tc_lh, 'Layer': self.cmt_layer},
        ignore_index=True)

sample input file:

........
G1 X159.8 Y140.2 E16.84505
G1 X159.8 Y159.8 E17.56214
M204 S5000
M205 X30 Y30
G0 F2400 X159.6 Y159.8
G0 X159.33 Y159.33
G0 X159.01 Y159.01
M204 S500
M205 X20 Y20
;TYPE:SKIN
G1 F1200 X140.99 Y159.01 E18.22142
G1 X140.99 Y140.99 E18.8807
G1 X159.01 Y140.99 E19.53999
G1 X159.01 Y159.01 E20.19927
M204 S5000
M205 X30 Y30
G0 F2400 X150.21 Y150.21
M204 S500
M205 X20 Y20
G1 F1200 X149.79 Y150.21 E20.21464
G1 X149.79 Y149.79 E20.23
G1 X150.21 Y149.79 E20.24537
G1 X150.21 Y150.21 E20.26073
M204 S5000
M205 X30 Y30
G0 F2400 X150.61 Y150.61
M204 S500
M205 X20 Y20
G1 F1200 X149.39 Y150.61 E20.30537
G1 X149.39 Y149.39 E20.35
G1 X150.61 Y149.39 E20.39464
..........

Upvotes: 0

Views: 691

Answers (1)

Lidae

Reputation: 318

Beware that DataFrame.append returns a copy of your old DataFrame with the new rows added: it does not work in place. Constructing a DataFrame row by row using append therefore takes O(n^2) time instead of O(n), which is rather bad if you have 5 million rows...
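If you want to see the quadratic behaviour for yourself, a rough timing sketch along these lines should show it (the row count and column names here are arbitrary, and the slow path only runs on pandas versions that still have DataFrame.append, i.e. before 2.0):

    import time
    import pandas as pd

    # Arbitrary stand-in rows; the real columns don't matter for the timing.
    rows = [{'Xc': float(i), 'Yc': float(i), 'E': float(i)} for i in range(20000)]

    # Quadratic: every append copies the whole frame before adding one row
    # (needs pandas < 2.0, where DataFrame.append still exists).
    start = time.perf_counter()
    df_slow = pd.DataFrame()
    for row in rows:
        df_slow = df_slow.append(row, ignore_index=True)
    print('append per row:', time.perf_counter() - start)

    # Linear: collect plain dicts, build the frame once at the end.
    start = time.perf_counter()
    df_fast = pd.DataFrame(rows)
    print('build once:    ', time.perf_counter() - start)

The gap between the two numbers keeps widening as the row count grows, which is exactly the O(n^2) vs O(n) difference.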

What you want to do instead is to append each row to a list first (a list of dicts is fine), and then create the DataFrame object from that once all the parsing is done. This will be much faster because appending to a list takes amortized constant time, so your total complexity should be O(n) instead.

def gc_g1xye(self):
    self.data.append(
        {'Xc': float(self.tlist[1][1:]), 'Yc': float(self.tlist[2][1:]), 'Zc': self.gc_z,
         'E': float(self.tlist[3][1:]),
         'F': None, 'FT': self.ft_var, 'EW': self.tc_ew, 'LH': self.tc_lh, 'Layer': self.cmt_layer})

...

# Once the parsing is done (assuming "import pandas as pd" at the top of the module):
self.newdataframe = pd.DataFrame(self.data)

Is this the best way of doing it? It looks like a good start to me. Should you be using a DataFrame? From what you say you want to do with the data once you've parsed it, a DataFrame sounds like a good option.

As a random unrelated tip, I recommend the tqdm package for showing a progress bar of your for-loop. It's super easy to use, and it helps you in judging whether it's worth waiting for that loop to finish!
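For example, a minimal sketch of wrapping the file iterator (the filename is just a placeholder for your own input path; a plain file object has no known length, so tqdm shows a line count and rate rather than a percentage unless you pass total=):

    from tqdm import tqdm

    # tqdm wraps any iterable and prints a live progress line to stderr.
    with open('input.gcode', 'r') as ifile:   # placeholder path
        for item in tqdm(ifile, desc='Parsing G-code'):
            pass  # per-line parsing goes here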

Upvotes: 1
