Reputation: 23
I'm trying to read in a large number of text files and clean up into a dataframe. The files do not include delimiters, have irregular widths due to the inclusion of some rows with data that corresponds to the subsequent set of rows.
Here are 2 samples:
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0326806 WASHOUT BASIN 1.000 L SUM
1216 Copenhaver Construction, Inc. 1,000.0000 1,000.00 1,000.00
1320 D. Construction, Inc. 1,500.0000 1,500.00 1,500.00
3069 K-Five Construction Corporation 1,000.0000 1,000.00 1,000.00
3702 Martam Construction Incorporated 1,500.0000 1,500.00 1,500.00
4741 Phoenix Corporation of the Quad Cities 5,000.0000 5,000.00 5,000.00
4786 Pir Tano Construction Company, Inc. 1,200.0000 1,200.00 1,200.00
1560 R. W. Dunteman Company 450.0000 450.00 450.00
5378 Schroeder Asphalt Services, Inc. 5,100.0000 5,100.00 5,100.00
X0327036 BIKE PATH REM 120.000 SQ YD
1216 Copenhaver Construction, Inc. 16.0000 1,920.00 1,920.00
1320 D. Construction, Inc. 20.0000 2,400.00 2,400.00
3069 K-Five Construction Corporation 5.0000 600.00 600.00
3702 Martam Construction Incorporated 10.0000 1,200.00 1,200.00
4741 Phoenix Corporation of the Quad Cities 14.0000 1,680.00 1,680.00
4786 Pir Tano Construction Company, Inc. 32.0000 3,840.00 3,840.00
1560 R. W. Dunteman Company 12.8400 1,540.80 1,540.80
5378 Schroeder Asphalt Services, Inc. 18.0000 2,160.00 2,160.00
and another file here:
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00
I'm open to using R or Python and have tried a variety of approaches with base R, readr & data.table as well as pandas & looping over rows with open() with little success. My delimiter usage has been wrong as my results have either parsed every single space into a column or given me with a single column with all contents from each row.
Is there a clean way of accomplishing this? Thanks.
Upvotes: 0
Views: 1782
Reputation: 269754
Define a function, Read, to read in the lines of one group. It removes empty lines, combines the first remaining line of each group with the others and parses it into a data frame. We have used the letters vector to define the column names but you can replace it with anything you like.
Now read the input file, trim whitespace off the ends removing junk in the first 4 lines, split it into groups, apply Read to each group and then combine the individual data frames into one overall data frame.
library(magrittr) # use pipes
library(readr) # read_lines, read_delim
Read <- function(x) x %>%
Filter(nzchar, .) %>%
{ paste(.[-1], "", .[1])) } %>%
gsub(" +", ";", .) %>%
I %>%
read_delim(delim = ";", col_names = letters)
DF <- "data.txt" %>%
read_lines(skip = 4) %>%
trimws %>%
by(cumsum(!nzchar(.)), Read) %>%
do.call("rbind", .)
Create input file for testing.
Lines <- " ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00"
writeLines(Lines, "data.txt")
Simplified code.
Upvotes: 0
Reputation: 9629
Here is a working workflow:
Code:
import re
import pandas as pd
data = '''
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00'''
#with open('filename.txt') as f:
# data = f.read()
tables = [i for i in re.split(r'\n\n+', data)[1:] if i]
dfs= []
for i in tables:
item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])
headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
buyers_df = pd.read_fwf(io.StringIO(i), names=headings, index=False, colspecs=colspecs, skiprows=1, thousands=',')
dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())
df = pd.concat(dfs)
Output:
ITEM NBR | ITEM DESCRIPTION | QUANTITY | UNIT OF MEASURE | BIDR NBR | BIDDER NAME | UNIT PRICE | BIDDER EXTENSION | CALCULATED EXTENSION | |
---|---|---|---|---|---|---|---|---|---|
0 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 2341 | Builders Paving, LLC | 5000 | 5000 | 5000 |
1 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 3020 | J. A. Johnson Paving Company | 5000 | 5000 | 5000 |
2 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 280 | Peter Baker & Son Co. | 1500 | 1500 | 1500 |
0 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 2341 | Builders Paving, LLC | 20 | 1100 | 1100 |
1 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 3020 | J. A. Johnson Paving Company | 40 | 2200 | 2200 |
2 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 280 | Peter Baker & Son Co. | 20 | 1100 | 1100 |
Upvotes: 1
Reputation: 579
Try this
import pandas as pd
data=open("filename.txt","r").read()
df = pd.DataFrame({"data": data.split()})
print(df)
Upvotes: 0