Reputation: 23

Reading txt files with no delimiter and irregular width

I'm trying to read in a large number of text files and clean up into a dataframe. The files do not include delimiters, have irregular widths due to the inclusion of some rows with data that corresponds to the subsequent set of rows.

Here are 2 samples:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0326806  WASHOUT BASIN                                1.000    L SUM
 1216      Copenhaver Construction, Inc.                                1,000.0000        1,000.00        1,000.00
 1320      D. Construction, Inc.                                        1,500.0000        1,500.00        1,500.00
 3069      K-Five Construction Corporation                              1,000.0000        1,000.00        1,000.00
 3702      Martam Construction Incorporated                             1,500.0000        1,500.00        1,500.00
 4741      Phoenix Corporation of the Quad Cities                       5,000.0000        5,000.00        5,000.00
 4786      Pir Tano Construction Company, Inc.                          1,200.0000        1,200.00        1,200.00
 1560      R. W. Dunteman Company                                         450.0000          450.00          450.00
 5378      Schroeder Asphalt Services, Inc.                             5,100.0000        5,100.00        5,100.00

    X0327036  BIKE PATH REM                              120.000    SQ YD
 1216      Copenhaver Construction, Inc.                                   16.0000        1,920.00        1,920.00
 1320      D. Construction, Inc.                                           20.0000        2,400.00        2,400.00
 3069      K-Five Construction Corporation                                  5.0000          600.00          600.00
 3702      Martam Construction Incorporated                                10.0000        1,200.00        1,200.00
 4741      Phoenix Corporation of the Quad Cities                          14.0000        1,680.00        1,680.00
 4786      Pir Tano Construction Company, Inc.                             32.0000        3,840.00        3,840.00
 1560      R. W. Dunteman Company                                          12.8400        1,540.80        1,540.80
 5378      Schroeder Asphalt Services, Inc.                                18.0000        2,160.00        2,160.00

and another file here:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00

I'm open to using R or Python and have tried a variety of approaches with base R, readr & data.table as well as pandas & looping over rows with open() with little success. My delimiter usage has been wrong as my results have either parsed every single space into a column or given me with a single column with all contents from each row.

Is there a clean way of accomplishing this? Thanks.

Upvotes: 0

Answers (3)

G. Grothendieck

Reputation: 269754

Define a function, Read, to read in the lines of one group. It removes empty lines, combines the first remaining line of each group with the others and parses it into a data frame. We have used the letters vector to define the column names but you can replace it with anything you like.

Now read the input file, trim whitespace off the ends removing junk in the first 4 lines, split it into groups, apply Read to each group and then combine the individual data frames into one overall data frame.

library(magrittr)  # use pipes
library(readr) # read_lines, read_delim

Read <- function(x) x %>%
  Filter(nzchar, .) %>%
  { paste(.[-1], "", .[1])) } %>%
  gsub("  +", ";", .) %>%
  I %>%
  read_delim(delim = ";", col_names = letters)

DF <- "data.txt" %>%
   read_lines(skip = 4) %>%
   trimws %>%
   by(cumsum(!nzchar(.)), Read) %>%
   do.call("rbind", .)

Note

Create input file for testing.

Lines <- "    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00"
writeLines(Lines, "data.txt")

Update

Simplified code.

Upvotes: 0

RJ Adriaansen

Reputation: 9629

Here is a working workflow:

split the text by multiple newlines (and process all items in the list except the first one that only contains headers)
Use pandas read_fwf to read the first line (item data) as a dataframe by identifying the fixed-width fields of the columns
Do the same for the rest of the text (bidder data)
Concatenate the two dataframes and append to a list
Concatenate all dataframes in the list to one df

Code:

import re
import pandas as pd

data = '''
    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00'''

#with open('filename.txt') as f:
#    data = f.read()

tables = [i for i in re.split(r'\n\n+', data)[1:] if i]

dfs= []
for i in tables:
    item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])

    headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
    colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
    buyers_df = pd.read_fwf(io.StringIO(i), names=headings, index=False, colspecs=colspecs, skiprows=1, thousands=',')
    dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())
    
df = pd.concat(dfs)

Output:

	ITEM NBR	ITEM DESCRIPTION	QUANTITY	UNIT OF MEASURE	BIDR NBR	BIDDER NAME	UNIT PRICE	BIDDER EXTENSION	CALCULATED EXTENSION
0	X0320050	CONSTRUCTN LAYOUT SPL	1	L SUM	2341	Builders Paving, LLC	5000	5000	5000
1	X0320050	CONSTRUCTN LAYOUT SPL	1	L SUM	3020	J. A. Johnson Paving Company	5000	5000	5000
2	X0320050	CONSTRUCTN LAYOUT SPL	1	L SUM	280	Peter Baker & Son Co.	1500	1500	1500
0	X0327611	REM & REIN BRIC PAVER	55	SQ FT	2341	Builders Paving, LLC	20	1100	1100
1	X0327611	REM & REIN BRIC PAVER	55	SQ FT	3020	J. A. Johnson Paving Company	40	2200	2200
2	X0327611	REM & REIN BRIC PAVER	55	SQ FT	280	Peter Baker & Son Co.	20	1100	1100

Upvotes: 1

Tanmay Shrivastava

Reputation: 579

Try this

import pandas as pd

data=open("filename.txt","r").read()
df = pd.DataFrame({"data": data.split()})
print(df)

Upvotes: 0

Reading txt files with no delimiter and irregular width

Answers (3)

Note

Update

Related Questions