regex extract data from raw text

Question

I work in a hotel. Here is a raw file from the reports I have. I need to extract data in order to get something like data['roomNumber'] = ('paxNumber', isbb)

Here is a sample that concerns only two rooms, 10 and 12, so the data I need should be BreakfastData = {'10':['2','BB'],'12':['1','BB']}

1) roomNumber: a line that 'starts and ends with a number', or 'starts with a number followed by one or more spaces and then a string'.
2) paxNumber: the two numbers just before the 'VA' string.
3) isbb: defined by the 'BB' or 'HPDJ' occurrence, which can be found between two '/'. But sometimes the format is not clean, so it can be '/HPDJ/' or '/ HPDJ /' or '/ HPDJ/' etc.

10 PxxxxD,David,Mme, Mr T- EXPEDIA TRAVEL

08.05.17 12.05.17 TP

SUP DBL / HPDJ / DEBIT CB AGENCE - NR

2 0 VA

NR

12

LxxxxSH,Claudia,Mrs

08.05.17 19.05.17 TP

1 0 VA

NR BB

SUP SGL / BB / EN ATTENTE DE VIREMENT- EVITER LA 66 -

.... etc

Edit: latest attempt

import re
data = {}
pax=''
r = re.compile(r"(\d+)\W*(\d+)\W*VA")
r2 = re.compile(r"/\s*(BB|HPDJ)\s*/")
r3 = re.compile(r"\d+\n")
r4 = re.compile(r"\d+\s+\w")
PATH = "/home/ryms/regextest"

with open(PATH, 'rb') as raw:
    text=raw.read()
#roomNumber = re.search(r4, text).group()
#roomNumber2 = re.search(r3, text).group()
roomNumber = re.search(r4, text).group().split()[0]
roomNumber2 = re.search(r3, text).group().split()[0]

pax = re.findall(r, text)
adult = pax[0]; enfant = pax[1]
# if enfant is '0':
#   pax=adult
# else:
#   pax=(str(adult)+'+'+str(enfant))
bb = re.findall(r2, text)       #On recherche BB ou HPDJ
data[roomNumber]=pax,bb

print(data)
print(roomNumber)
print(roomNumber2)

returns:

{'10': ([('2', '2'), ('1', '1')], ['HPDJ', 'BB'])}
10
12
[Finished in 0.1s]

How can I get both room numbers in my output? I have had a lot of trouble with this issue and with read(), readline(), and readlines(). What is the trick?

Once I have all the raw data, how do I get the proper BreakfastData{}? Should I use zip()? At the beginning I wanted to split the file and then parse it, but I tried so many things that I got lost. And for that I need a regex that matches both patterns.

Somil · Accepted Answer

In the first case, you want to select the two numbers that are followed by 'VA'. You can do it like this:

 r = re.compile(r"(\d+)\W*(\d+)\W*VA")
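As a quick check, here is a minimal sketch of that pattern run against the 'VA' lines from the question. The `sample` string and the names `pax_re` and `pairs` are only illustrative; the real file may be laid out differently.

```python
import re

# Two groups of digits, optionally separated by non-word characters,
# immediately before the literal "VA".
pax_re = re.compile(r"(\d+)\W*(\d+)\W*VA")

# Lines taken from the sample in the question (illustrative only).
sample = "2 0 VA\n\nNR\n\n12\n\n1 0 VA\n"

# findall returns one (first_number, second_number) tuple per "VA" line.
pairs = pax_re.findall(sample)
print(pairs)  # [('2', '0'), ('1', '0')]
```

Each tuple holds the two numbers in front of 'VA', so the first element is the value you want as paxNumber.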

In the second case, you can get HPDJ or BB like this:

r = re.compile(r"/\s*(HPDJ|BB)\s*/")

This will handle all the cases you mentioned: '/HPDJ/', '/ HPDJ /', or '/ HPDJ/'.
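To get one entry per room, one possible approach is to find every line that starts a room record, cut the text into one chunk per room, and apply the two patterns above to each chunk. The sketch below is only an illustration under a few assumptions: the room pattern `ROOM_RE` is my reading of rule 1 in the question, the function name `parse_breakfast` is made up, and both 'BB' and 'HPDJ' are stored as 'BB' because that is what the expected BreakfastData in the question shows.

```python
import re

# Assumption (rule 1): a room line is either a number alone on its line,
# or a number followed by spaces and then a letter.
ROOM_RE = re.compile(r"^(\d+)(?:[ \t]+[A-Za-z]|[ \t]*$)", re.MULTILINE)
PAX_RE = re.compile(r"(\d+)\W*(\d+)\W*VA")
BOARD_RE = re.compile(r"/\s*(BB|HPDJ)\s*/")

SAMPLE = """10 PxxxxD,David,Mme, Mr T- EXPEDIA TRAVEL

08.05.17 12.05.17 TP

SUP DBL / HPDJ / DEBIT CB AGENCE - NR

2 0 VA

NR

12

LxxxxSH,Claudia,Mrs

08.05.17 19.05.17 TP

1 0 VA

NR BB

SUP SGL / BB / EN ATTENTE DE VIREMENT- EVITER LA 66 -
"""


def parse_breakfast(text):
    """Return {roomNumber: [paxNumber, 'BB']} for rooms with breakfast."""
    data = {}
    starts = list(ROOM_RE.finditer(text))
    for i, match in enumerate(starts):
        # A record runs from this room line up to the next room line.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        record = text[match.start():end]
        pax = PAX_RE.search(record)
        board = BOARD_RE.search(record)
        if pax is None or board is None:
            continue  # skip rooms without a pax line or a breakfast marker
        # Assumption: HPDJ and BB both count as breakfast, stored as 'BB'.
        data[match.group(1)] = [pax.group(1), "BB"]
    return data


BreakfastData = parse_breakfast(SAMPLE)
print(BreakfastData)  # {'10': ['2', 'BB'], '12': ['1', 'BB']}
```

Because each room is parsed from its own chunk, there is no need for zip() to pair the results back up, and reading the whole file once with read() is enough. If the real file is opened in binary mode under Python 3, decode it to text first or use bytes patterns.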
