Dhar_
Dhar_

Reputation: 71

How to put information from text into CSV with Python

I am using re to search a html document and output all the text between two given strings. The text returned has the water usage rates of different towns which I need to automatically put into a csv table format. How can I do this? Here is my code so far:

file = open(r'PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.html', 'r', encoding='utf8')
contents = file.read()
#Converts the html into a soup object
soup = BS(contents, 'html.parser')
rawText = soup.get_text(strip=True, separator='\n')

#Searches the soup object for two given strings and returns the text in between
finalText = re.search(r'19\s+domestic and stock rights(.*?)20\s+native title rights', rawText, flags=re.S|re.I).group(1)

print(finalText)  

Here is the output text which needs to be scraped for the info and put into csv:

The water requirements of persons entitled to domestic and stock rights in these water
sources are estimated to total 4,385 megalitres per year (hereafter
ML/year
),
distributed as follows:
(a)
91 ML/year in the Adjungbilly/Bombowlee/Brungle Water Source,
(b)
75 ML/year in the Billabung Water Source,
(c)
72 ML/year in the Bredbo Water Source,
Page 25
Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
(d)
82 ML/year in the Burkes/Bullenbung Water Source,
(e)
45 ML/year in the Burrinjuck Dam Catchment Water Source,
(f)
0 ML/year the Burrumbuttock Water Source,
(g)
67 ML/year in the Gilmore/Sandy Water Source,
(h)
39 ML/year in the Goobarragandra Water Source,
(i)
28 ML/year in the Goodradigbee Water Source,
(j)
113 ML/year in the Hillas Water Source,
(k)
100 ML/year in the Houlaghans Water Source,
(l)
291 ML/year in the Jugiong Water Source,
(m)  75 ML/year in the Kyeamba Water Source,
(n)
178 ML/year in the Lake George Water Source,
(o)
44 ML/year the Lower Billabong Water Source,
(p)
169 ML/year the Lower Billabong Anabranch Water Source,
(q)
156 ML/year the Middle Billabong Water Source,
(r)
103 ML/year in the Molonglo Water Source,
(s)
73 ML/year the Mountain Water Source,
(t)
2 ML/year in the Murrumbidgee (Balranald to Weimby) Water Source,
(u)
34 ML/year in the Murrumbidgee (Gogeldrie to Waldaira) Water Source,
(v)
92 ML/year in the Murrumbidgee Central (Burrinjuck to Gogeldrie) Water
Source,
(w)  218 ML/year in the Murrumbidgee I Water Source,
(x)
133 ML/year in the Murrumbidgee II Water Source,
(y)
116 ML/year in the Murrumbidgee III Water Source,
(z)
73 ML/year in the Murrumbidgee North Water Source,
(aa)  476 ML/year in the Murrumbidgee Western Water Source,
(ab)  92 ML/year in the Muttama Water Source,
(ac)  150 ML/year in the Numeralla East Water Source,

This is what the table should look like

Town    Water Usage
Billabung 75 ML
Muttam    92 ML

etc..

Upvotes: 1

Views: 73

Answers (3)

deadshot
deadshot

Reputation: 9071

Try this

import regex
import pandas as pd

pattern = regex.compile('\d+ ML(?=\/year)|(?<= in the|the )[\w \/\(\)]+(?=\s+Water Source|Water)')
columns = ('Water Usage', 'Town')

res = [dict(zip(columns, pattern.findall(line))) for line in finalText.splitlines() if pattern.match(line)]
df = pd.DataFrame(res)
df.to_csv('output.csv', index=False) #this will save the output to csv file
print(df)

Output:

   Water Usage                                              Town
0        91 ML                    Adjungbilly/Bombowlee/Brungle 
1        75 ML                                        Billabung 
2        72 ML                                           Bredbo 
3        82 ML                                Burkes/Bullenbung 
4        45 ML                         Burrinjuck Dam Catchment 
5         0 ML                                    Burrumbuttock 
6        67 ML                                    Gilmore/Sandy 
7        39 ML                                   Goobarragandra 
8        28 ML                                     Goodradigbee 
9       113 ML                                           Hillas 
10      100 ML                                       Houlaghans 
11      291 ML                                          Jugiong 
12      178 ML                                      Lake George 
13       44 ML                                  Lower Billabong 
14      169 ML                        Lower Billabong Anabranch 
15      156 ML                                 Middle Billabong 
16      103 ML                                         Molonglo 
17       73 ML                                         Mountain 
18        2 ML               Murrumbidgee (Balranald to Weimby) 
19       34 ML             Murrumbidgee (Gogeldrie to Waldaira) 
20       92 ML   Murrumbidgee Central (Burrinjuck to Gogeldrie) 
21      133 ML                                  Murrumbidgee II 
22      116 ML                                 Murrumbidgee III 
23       73 ML                               Murrumbidgee North 

Upvotes: 1

Siva Shanmugam
Siva Shanmugam

Reputation: 658

This is working for the given file.

If you are going to use it for other docs, you have to do bit changes in some lines (marked in the code).

Here is the code.

f = open("sample.txt", 'r').readlines()[5:] #initial unwanted lines
for i in f:
    i= [x for x in i.rstrip().split(" ") if x!=" "]
    if len(i)!=1 and i[0] not in ['Page','Water']: # list will have first word of unwanted lines in the middle of the file text
        if '(' in i[0]:
            print (i[6],i[2],i[3])
        else:
            print (i[4],i[0],i[1])

Output:

Adjungbilly/Bombowlee/Brungle 91 ML/year
Billabung 75 ML/year
Bredbo 72 ML/year
Burkes/Bullenbung 82 ML/year
Burrinjuck 45 ML/year
Water 0 ML/year
Gilmore/Sandy 67 ML/year
Goobarragandra 39 ML/year
Goodradigbee 28 ML/year
Hillas 113 ML/year
Houlaghans 100 ML/year
Jugiong 291 ML/year
Kyeamba 75 ML/year
Lake 178 ML/year
Billabong 44 ML/year
Billabong 169 ML/year
Billabong 156 ML/year
Molonglo 103 ML/year
Water 73 ML/year
Murrumbidgee 2 ML/year
Murrumbidgee 34 ML/year
Murrumbidgee 92 ML/year
Murrumbidgee 218 ML/year
Murrumbidgee 133 ML/year
Murrumbidgee 116 ML/year
Murrumbidgee 73 ML/year
Murrumbidgee 476 ML/year
Muttama 92 ML/year
Numeralla 150 ML/year

Upvotes: 0

Stefan
Stefan

Reputation: 1934

From the output you gave it is not clear what exactly you want to extract. Provided the finalText you posted and what I assumed you actually want you could do something like:

import re

l = re.findall(r'(\d+)\s(ML)/year\s(.*) Water', finalText)
header = [('Water', 'Usage', 'Town')]

data = header + l

with open('your.csv', 'w') as f:
    for line in data:
        f.write(f"{line[2]},{line[0]},{line[1]}\n")

your.csv will look like

Town,Water,Usage
Adjungbilly/Bombowlee/Brungle,91,ML
Billabung,75,ML
Bredbo,72,ML
Burkes/Bullenbung,82,ML
Burrinjuck Dam Catchment,45,ML
the Burrumbuttock,0,ML
Gilmore/Sandy,67,ML
Goobarragandra,39,ML
Goodradigbee,28,ML
Hillas,113,ML
Houlaghans,100,ML
Jugiong,291,ML
Kyeamba,75,ML
Lake George,178,ML
the Lower Billabong,44,ML
the Lower Billabong Anabranch,169,ML
the Middle Billabong,156,ML
Molonglo,103,ML
the Mountain,73,ML
Murrumbidgee (Balranald to Weimby),2,ML
Murrumbidgee (Gogeldrie to Waldaira),34,ML
Murrumbidgee Central (Burrinjuck to Gogeldrie),92,ML
Murrumbidgee I,218,ML
Murrumbidgee II,133,ML
Murrumbidgee III,116,ML
Murrumbidgee North,73,ML
Murrumbidgee Western,476,ML
Muttama,92,ML
Numeralla East,150,ML

Upvotes: 1

Related Questions