senor_stees
senor_stees

Reputation: 23

Reading txt files with no delimiter and irregular width

I'm trying to read in a large number of text files and clean up into a dataframe. The files do not include delimiters, have irregular widths due to the inclusion of some rows with data that corresponds to the subsequent set of rows.

Here are 2 samples:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0326806  WASHOUT BASIN                                1.000    L SUM
 1216      Copenhaver Construction, Inc.                                1,000.0000        1,000.00        1,000.00
 1320      D. Construction, Inc.                                        1,500.0000        1,500.00        1,500.00
 3069      K-Five Construction Corporation                              1,000.0000        1,000.00        1,000.00
 3702      Martam Construction Incorporated                             1,500.0000        1,500.00        1,500.00
 4741      Phoenix Corporation of the Quad Cities                       5,000.0000        5,000.00        5,000.00
 4786      Pir Tano Construction Company, Inc.                          1,200.0000        1,200.00        1,200.00
 1560      R. W. Dunteman Company                                         450.0000          450.00          450.00
 5378      Schroeder Asphalt Services, Inc.                             5,100.0000        5,100.00        5,100.00

    X0327036  BIKE PATH REM                              120.000    SQ YD
 1216      Copenhaver Construction, Inc.                                   16.0000        1,920.00        1,920.00
 1320      D. Construction, Inc.                                           20.0000        2,400.00        2,400.00
 3069      K-Five Construction Corporation                                  5.0000          600.00          600.00
 3702      Martam Construction Incorporated                                10.0000        1,200.00        1,200.00
 4741      Phoenix Corporation of the Quad Cities                          14.0000        1,680.00        1,680.00
 4786      Pir Tano Construction Company, Inc.                             32.0000        3,840.00        3,840.00
 1560      R. W. Dunteman Company                                          12.8400        1,540.80        1,540.80
 5378      Schroeder Asphalt Services, Inc.                                18.0000        2,160.00        2,160.00


                           

and another file here:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00

I'm open to using R or Python and have tried a variety of approaches with base R, readr & data.table as well as pandas & looping over rows with open() with little success. My delimiter usage has been wrong as my results have either parsed every single space into a column or given me with a single column with all contents from each row.

Is there a clean way of accomplishing this? Thanks.

Upvotes: 0

Views: 1782

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269754

Define a function, Read, to read in the lines of one group. It removes empty lines, combines the first remaining line of each group with the others and parses it into a data frame. We have used the letters vector to define the column names but you can replace it with anything you like.

Now read the input file, trim whitespace off the ends removing junk in the first 4 lines, split it into groups, apply Read to each group and then combine the individual data frames into one overall data frame.

library(magrittr)  # use pipes
library(readr) # read_lines, read_delim

Read <- function(x) x %>%
  Filter(nzchar, .) %>%
  { paste(.[-1], "", .[1])) } %>%
  gsub("  +", ";", .) %>%
  I %>%
  read_delim(delim = ";", col_names = letters)

DF <- "data.txt" %>%
   read_lines(skip = 4) %>%
   trimws %>%
   by(cumsum(!nzchar(.)), Read) %>%
   do.call("rbind", .)

Note

Create input file for testing.

Lines <- "    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00"
writeLines(Lines, "data.txt")

Update

Simplified code.

Upvotes: 0

RJ Adriaansen
RJ Adriaansen

Reputation: 9629

Here is a working workflow:

  1. split the text by multiple newlines (and process all items in the list except the first one that only contains headers)
  2. Use pandas read_fwf to read the first line (item data) as a dataframe by identifying the fixed-width fields of the columns
  3. Do the same for the rest of the text (bidder data)
  4. Concatenate the two dataframes and append to a list
  5. Concatenate all dataframes in the list to one df

Code:

import re
import pandas as pd

data = '''
    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00'''

#with open('filename.txt') as f:
#    data = f.read()

tables = [i for i in re.split(r'\n\n+', data)[1:] if i]

dfs= []
for i in tables:
    item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])

    headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
    colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
    buyers_df = pd.read_fwf(io.StringIO(i), names=headings, index=False, colspecs=colspecs, skiprows=1, thousands=',')
    dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())
    
df = pd.concat(dfs)

Output:

ITEM NBR ITEM DESCRIPTION QUANTITY UNIT OF MEASURE BIDR NBR BIDDER NAME UNIT PRICE BIDDER EXTENSION CALCULATED EXTENSION
0 X0320050 CONSTRUCTN LAYOUT SPL 1 L SUM 2341 Builders Paving, LLC 5000 5000 5000
1 X0320050 CONSTRUCTN LAYOUT SPL 1 L SUM 3020 J. A. Johnson Paving Company 5000 5000 5000
2 X0320050 CONSTRUCTN LAYOUT SPL 1 L SUM 280 Peter Baker & Son Co. 1500 1500 1500
0 X0327611 REM & REIN BRIC PAVER 55 SQ FT 2341 Builders Paving, LLC 20 1100 1100
1 X0327611 REM & REIN BRIC PAVER 55 SQ FT 3020 J. A. Johnson Paving Company 40 2200 2200
2 X0327611 REM & REIN BRIC PAVER 55 SQ FT 280 Peter Baker & Son Co. 20 1100 1100

Upvotes: 1

Tanmay Shrivastava
Tanmay Shrivastava

Reputation: 579

Try this

import pandas as pd

data=open("filename.txt","r").read()
df = pd.DataFrame({"data": data.split()})
print(df)

Upvotes: 0

Related Questions