Polyhedronic

Reputation: 63

Reading large text file into a dataframe for data analysis in Python

I know similar questions have been asked before, but I still cannot figure out the best way to process the data for my program.

I have a large text file (50,000 to 5,000,000 lines of text). I need to process each line of this file and write it into a DataFrame so that I can do some data analysis on it.

The DataFrame has 9 columns (mostly floats, some strings), and the number of rows is roughly equal to the number of lines in the input file.

Currently, I am reading the file line by line using "with open(...)", extracting the required data with regex, and writing it as a row into the DataFrame. Since all of this runs inside a for loop, it takes forever to complete.

What is the best way to do this? Any pointers or sample programs? Should I even be using a DataFrame?

Here is my code.

    def gcodetodf(self):
        with open(self.inputfilepath, 'r') as ifile:
            lflag = False
            for item in ifile:
                layermatch = self.layerpattern.match(item)
                self.tlist = item.split(' ')
                self.clist = re.split(r"(\w+)", item)

                # stop at the 'end' marker or one layer past the requested range
                if layermatch and (str(self.tlist[2][:-1]) == 'end' or int(self.tlist[2][:-1]) == (self.endlayer + 1)):
                    break

                # start collecting rows once the start layer has been reached
                if (layermatch and int(self.tlist[2][:-1]) == self.startlayer) or lflag is True:
                    lflag = True

                    # lookup keyed by whether each pattern matched; if several
                    # patterns match, the last True entry in the dict wins
                    map_gcpat = {bool(self.gonepattern.match(item)): self.gc_g1xyef,
                                 bool(self.gepattern.match(item)): self.gc_g1xye,
                                 bool(self.gtrpattern.match(item)): self.gc_g1xyf,
                                 bool(self.resetextpattern.match(item)): self.gc_g92e0,
                                 bool(self.ftpattern.match(item)): self.gc_ftype,
                                 bool(self.toolcompattern.match(item)): self.gc_toolcmt,
                                 bool(self.layerpattern.match(item)): self.gc_laycmt,
                                 bool(self.zpattern.match(item)): self.gc_g1z}

                    # call the matched handler, or self.contd if nothing matched
                    map_gcpat.get(True, self.contd)()

        # print(self.newdataframe)

An example function that writes to the DataFrame looks like this:

def gc_g1xye(self):
    self.newdataframe = self.newdataframe.append(
        {'Xc': float(self.tlist[1][1:]), 'Yc': float(self.tlist[2][1:]), 'Zc': self.gc_z,
         'E': float(self.tlist[3][1:]),
         'F': None, 'FT': self.ft_var, 'EW': self.tc_ew, 'LH': self.tc_lh, 'Layer': self.cmt_layer},
        ignore_index=True)

sample input file:

........
G1 X159.8 Y140.2 E16.84505
G1 X159.8 Y159.8 E17.56214
M204 S5000
M205 X30 Y30
G0 F2400 X159.6 Y159.8
G0 X159.33 Y159.33
G0 X159.01 Y159.01
M204 S500
M205 X20 Y20
;TYPE:SKIN
G1 F1200 X140.99 Y159.01 E18.22142
G1 X140.99 Y140.99 E18.8807
G1 X159.01 Y140.99 E19.53999
G1 X159.01 Y159.01 E20.19927
M204 S5000
M205 X30 Y30
G0 F2400 X150.21 Y150.21
M204 S500
M205 X20 Y20
G1 F1200 X149.79 Y150.21 E20.21464
G1 X149.79 Y149.79 E20.23
G1 X150.21 Y149.79 E20.24537
G1 X150.21 Y150.21 E20.26073
M204 S5000
M205 X30 Y30
G0 F2400 X150.61 Y150.61
M204 S500
M205 X20 Y20
G1 F1200 X149.39 Y150.61 E20.30537
G1 X149.39 Y149.39 E20.35
G1 X150.61 Y149.39 E20.39464
..........

Upvotes: 0

Views: 691

Answers (1)

Lidae

Reputation: 318

Beware that DataFrame.append returns a copy of your old DataFrame with the new rows added: it does not work in place. Constructing a DataFrame row by row using append therefore takes O(n^2) time instead of O(n), which is rather bad if you have 5 million rows...
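If you want to see the quadratic behaviour for yourself, a rough timing sketch along these lines should show it (the row count and column names here are arbitrary, and the slow path only runs on pandas versions that still have DataFrame.append, i.e. before 2.0):

    import time
    import pandas as pd

    # Arbitrary stand-in rows; the real columns don't matter for the timing.
    rows = [{'Xc': float(i), 'Yc': float(i), 'E': float(i)} for i in range(20000)]

    # Quadratic: every append copies the whole frame before adding one row
    # (needs pandas < 2.0, where DataFrame.append still exists).
    start = time.perf_counter()
    df_slow = pd.DataFrame()
    for row in rows:
        df_slow = df_slow.append(row, ignore_index=True)
    print('append per row:', time.perf_counter() - start)

    # Linear: collect plain dicts, build the frame once at the end.
    start = time.perf_counter()
    df_fast = pd.DataFrame(rows)
    print('build once:    ', time.perf_counter() - start)

The gap between the two numbers keeps widening as the row count grows, which is exactly the O(n^2) vs O(n) difference.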

What you want to do instead is to append each row to a list first (a list of dicts is fine), and then create the DataFrame object from that once all the parsing is done. This will be much faster because appending to a list takes amortized constant time, so your total complexity should be O(n) instead.

def gc_g1xye(self):
    self.data.append(
        {'Xc': float(self.tlist[1][1:]), 'Yc': float(self.tlist[2][1:]), 'Zc': self.gc_z,
         'E': float(self.tlist[3][1:]),
         'F': None, 'FT': self.ft_var, 'EW': self.tc_ew, 'LH': self.tc_lh, 'Layer': self.cmt_layer})

...

# Once the parsing is done (assuming "import pandas as pd" at the top of the module):
self.newdataframe = pd.DataFrame(self.data)

Is this the best way of doing it? It looks like a good start to me. Should you be using a DataFrame? From what you say you want to do with the data once you've parsed it, a DataFrame sounds like a good option.

As a random unrelated tip, I recommend the tqdm package for showing a progress bar of your for-loop. It's super easy to use, and it helps you in judging whether it's worth waiting for that loop to finish!
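For example, a minimal sketch of wrapping the file iterator (the filename is just a placeholder for your own input path; a plain file object has no known length, so tqdm shows a line count and rate rather than a percentage unless you pass total=):

    from tqdm import tqdm

    # tqdm wraps any iterable and prints a live progress line to stderr.
    with open('input.gcode', 'r') as ifile:   # placeholder path
        for item in tqdm(ifile, desc='Parsing G-code'):
            pass  # per-line parsing goes here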

Upvotes: 1
