Reputation: 159
We have a table (in a .doc
document) that looks as follows:
item_number | item_code | description | unit | QUANTITY | BID | AMOUNT |
---|---|---|---|---|---|---|
1 | 074016 | CONSTRUCTION SITE MANAGEMENT | LS | LUMP SUM | 24,826.49 | 24,826.49 |
2 | 074017 | PREPARE WATER POLLUTION CONTROL PROGRAM | LS | 30 | 125.38 | 3,761.40 |
3 | 840521 | 4" THERMOPLASTIC TRAFFIC STRIPE (BROKEN 6-1) | SQFT | LUMP SUM | .19 | 32.30 |
I imported the text in the .doc
file and am now using regex to extract the table. When imported, the table text looks as follows:
1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
2 074017 PREPARE WATER POLLUTION CONTROL LS LUMP SUM 708.63 708.63
PROGRAM
3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
4 074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
5 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
6 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
7 152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
8 153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00
I am trying to create a pattern that can capture different variable values in different regex groups. Right now, the pattern I have is ^(\s{6}|\s{7})(\d+)\s+(\d+)\s+(([A-Z.]{2}[^\n\d]*[A-Z)]\s{2}))
. But, it also captures LS
and LUMP SUM
in the third group (description).
Code:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools
text = '...
074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
074017 PREPARE WATER POLLUTION CONTROL LS LUMP SUM 708.63 708.63
PROGRAM
074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00
015299 LEAD COMPLIANCE PLAN (STRIPE REMOVAL) LS LUMP SUM 828.25 828.25
374002 ASPHALTIC EMULSION (FOG SEAL COAT) TON 18 1,013.60 18,244.80
390095 REPLACE ASPHALT CONCRETE SURFACING CY 160 277.89 44,462.40
390137 RUBBERIZED HOT MIX ASPHALT (GAP GRADED) TON 9,650 101.05 975,132.50
394050 RUMBLE STRIP STA 180 26.38 4,748.40
015300 REPLACE AIR MARKER EA 50 139.27 6,963.50
840504 4" THERMOPLASTIC TRAFFIC STRIPE LF 146,000 .36 52,560.00'
# creating a list: each item is a row from the dataset
text = re.split(r'(?ms)\n\s+\d+', text)
Is there a way to capture different variables in different groups? Any help would be appreciated. Thank you!
Edit 1: The code misses datasets that look as follows:
1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 240.00 240.00
2 074019 PREPARE STORM WATER POLLUTION LS LUMP SUM 2,300.00 2,300.00
PREVENTION PLAN
3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 12 240.00 2,880.00
4 074041 STREET SWEEPING LS LUMP SUM 1,700.00 1,700.00
5 074042 TEMPORARY CONCRETE WASHOUT (PORTABLE) LS LUMP SUM 370.00 370.00
6 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 7,100.00 7,100.00
7 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 35,900.00 35,900.00
8 120165 CHANNELIZER (SURFACE MOUNTED) EA 40 20.00 800.00
9 128650 PORTABLE CHANGEABLE MESSAGE SIGN EA 4 2,200.00 8,800.00
10 129000 TEMPORARY RAILING (TYPE K) LF 960 27.50 26,400.00
11 129100 TEMPORARY CRASH CUSHION MODULE EA 56 127.00 7,112.00
12 150662 REMOVE METAL BEAM GUARD RAILING LF 1,390 3.00 4,170.00
13 153210 REMOVE CONCRETE CY 2 660.00 1,320.00
14 015310 REMOVE BRIDGE APPROACH GUARD RAILING LF 200 6.30 1,260.00
15 156585 REMOVE CRASH CUSHION EA 1 300.00 300.00
16 160101 CLEARING AND GRUBBING LS LUMP SUM 2,500.00 2,500.00
17 190110 LEAD COMPLIANCE PLAN LS LUMP SUM 850.00 850.00
18 (F) 510502 MINOR CONCRETE (MINOR STRUCTURE) CY 4 2,740.00 10,960.00
19 820118 GUARD RAILING DELINEATOR EA 12 15.00 180.00
20 839303 SINGLE THRIE BEAM BARRIER (STEEL POST) LF 3,630 22.00 79,860.00
I suspect this could be due to the (F) there. Is there a way to tackle this? Thank you so much!
Edit 2:
#import sys
#sys.modules[__name__].__dict__.clear()
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools
from io import StringIO
# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc-test')
text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")
# splitting by contract number
nob = text.split('BID OPENING DATE')
del nob[0]
# create a dataframe to store the data
# dff = pd.DataFrame(columns = ['contract_number', 'item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT'])
dff = pd.DataFrame(columns = ['0', '1', '2', '3', '4', '5', '6', '7'])
# file level loop starts here
dataframes = dict()
for i in range(len(nob)):
try:
txt = nob[i]
# contract number
cn1 = re.search('CONTRACT NUMBER\s+(.........)', txt)
cn2 = re.search('CONTRACT NUMBER\n+(.*)', txt)
if not (cn1 is None):
cn = cn1.group(1)
elif not (cn2 is None):
cn = cn2.group(1)
else:
cn = "Not captured"
# getting the contract proposal of low bidder table
hi = re.split('(?ms)C O N T R A C T\s+P R O P O S A L\s+O F\s+L O W\s+B I D D E R(.*?)S U M M A R Y', txt)
hi = hi[1]
# splitting again
hi = re.split('(?ms)---(\s+\n.*?\s*)TOTAL', hi)
hi = hi[1]
hi.replace('(F)', ' ')
hi.replace('(S)', ' ')
df = pd.read_fwf(StringIO(hi), header=None, dtype=str)
# df.columns = ['item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT']
df['item_number'] = df['item_number'].ffill()
df = df.fillna('').groupby('item_number').agg(lambda x: ' '.join(x).strip()).reset_index()
df['contract_number'] = cn
i = df.columns.get_loc('item_number')
# dff = dff.append(df, ignore_index = True)
dataset = pd.read_csv('dataset.csv')
dataframes['ok'] = dataset
dff = dff.append(df, ignore_index = True)
except Exception as e:
print(e)
print('Error in contract number: ' + cn)
# print(dff)
Upvotes: 1
Views: 52
Reputation: 195543
To join correct the table you can try next example:
from io import StringIO
txt = '''\
1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
2 074017 PREPARE WATER POLLUTION CONTROL LS LUMP SUM 708.63 708.63
PROGRAM
3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
4 074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
5 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
6 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
7 152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
8 153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00'''
df = pd.read_fwf(StringIO(txt), header=None, dtype=str)
df.columns = ['item_number', 'item_code', 'description', 'unit', 'QUANTITY', 'BID', 'AMOUNT']
df['item_number'] = df['item_number'].ffill()
df = df.fillna('').groupby('item_number').agg(lambda x: ' '.join(x).strip()).reset_index()
print(df)
Prints:
item_number item_code description unit QUANTITY BID AMOUNT
0 1 074016 CONSTRUCTION SITE MANAGEMENT LS LUMP SUM 24,826.49 24,826.49
1 2 074017 PREPARE WATER POLLUTION CONTROL PROGRAM LS LUMP SUM 708.63 708.63
2 3 074038 TEMPORARY DRAINAGE INLET PROTECTION EA 30 125.38 3,761.40
3 4 074041 STREET SWEEPING LS LUMP SUM 10,379.25 10,379.25
4 5 120090 CONSTRUCTION AREA SIGNS LS LUMP SUM 9,880.75 9,880.75
5 6 120100 TRAFFIC CONTROL SYSTEM LS LUMP SUM 10,932.61 10,932.61
6 7 152440 ADJUST MANHOLE TO GRADE EA 110 453.42 49,876.20
7 8 153103 COLD PLANE ASPHALT CONCRETE PAVEMENT SQYD 143,000 1.37 195,910.00
Upvotes: 2