r01_mage

Reputation: 31

regex extract data from raw text

I work in a hotel. Below is a raw file from the reports I have. I need to extract data so that I end up with something like data['roomNumber'] = ('paxNumber', isbb)

Here is a sample that concerns only two rooms, 10 and 12, so the data I need should be BreakfastData = {'10': ['2', 'BB'], '12': ['1', 'BB']}

1) roomNumber: 'starts and ends with a number' or 'starts with a number followed by one or more spaces and then a string'.
2) paxNumber: the two numbers just before the 'VA' string.
3) isbb: defined by the occurrence of 'BB' or 'HPDJ', which can be found between two '/'. Sometimes the formatting is inconsistent, so it can be '/HPDJ/' or '/ HPDJ /' or '/ HPDJ/', etc.

10 PxxxxD,David,Mme, Mr T- EXPEDIA TRAVEL

08.05.17 12.05.17 TP

SUP DBL / HPDJ / DEBIT CB AGENCE - NR

2 0 VA

NR

12

LxxxxSH,Claudia,Mrs

08.05.17 19.05.17 TP

1 0 VA

NR BB

SUP SGL / BB / EN ATTENTE DE VIREMENT- EVITER LA 66 -

.... etc

Edit: latest attempt

import re
data = {}
pax=''
r = re.compile(r"(\d+)\W*(\d+)\W*VA")
r2 = re.compile(r"/\s*(BB|HPDJ)\s*/")
r3 = re.compile(r"\d+\n")
r4 = re.compile(r"\d+\s+\w")
PATH = "/home/ryms/regextest"

with open(PATH, 'rb') as raw:
    text=raw.read()
#roomNumber = re.search(r4, text).group()
#roomNumber2 = re.search(r3, text).group()
roomNumber = re.search(r4, text).group().split()[0]
roomNumber2 = re.search(r3, text).group().split()[0]

pax = re.findall(r, text)
adult = pax[0]; enfant = pax[1]
# if enfant is '0':
#   pax=adult
# else:
#   pax=(str(adult)+'+'+str(enfant))
bb = re.findall(r2, text)       #On recherche BB ou HPDJ
data[roomNumber]=pax,bb

print(data)
print(roomNumber)
print(roomNumber2)

Output:

{'10': ([('2', '2'), ('1', '1')], ['HPDJ', 'BB'])}
10
12
[Finished in 0.1s]

How can I get both room numbers in my output? I am having a lot of trouble with the \n issue and with read(), readline(), and readlines(). What is the trick?

Once I have all the raw data, how do I build the proper BreakfastData dict? Should I use zip()? At the beginning I wanted to split the file and then parse it, but I tried so many things that I got lost. For that I need a regex that matches both patterns.

Upvotes: 0

Views: 751

Answers (2)

Somil

Reputation: 1941

In the first case, you want to select the two numbers that are followed by 'VA'. You can do it like this:

 r = re.compile(r"(\d+)\W*(\d+)\W*VA")

In the second case, you can get HPDJ or BB like this:

r = re.compile(r"/\s*(HPDJ|BB)\s*/")

This will handle all the cases you mentioned: '/HPDJ/', '/ HPDJ /' or '/ HPDJ/'.
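
As a quick check, here is a small sketch that runs both patterns against lines taken from the sample in the question. The names pax_re, board_re and variants are only illustrative; the test strings are copied or shortened from the question.

```python
import re

# Two numbers immediately before 'VA' (pattern from this answer).
pax_re = re.compile(r"(\d+)\W*(\d+)\W*VA")
# 'BB' or 'HPDJ' between two slashes, with optional spaces on either side.
board_re = re.compile(r"/\s*(HPDJ|BB)\s*/")

pax = pax_re.search("2 0 VA").groups()
print(pax)

# Spacing variants listed in the question, plus one line from the sample.
variants = ["/HPDJ/", "/ HPDJ /", "/ HPDJ/", "SUP SGL / BB / EN ATTENTE DE VIREMENT"]
found = [board_re.search(v).group(1) for v in variants]
print(found)
```

Because \s* also accepts zero characters, the same pattern covers the variants with and without spaces around the code.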

Upvotes: 1

Abid Hasan

Reputation: 658

The regular expression to get the text before the VA is as follows:

r = re.compile(r"(.*) VA")

Then the "number" (which will be a string) will be stored in the first group of the search match object, once you run the search.
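
For illustration, a minimal sketch of that search on the "2 0 VA" line from the question. The variable names are made up here, and calling the two values adults and children is an assumption based on the adult/enfant names in the question's code.

```python
import re

# Capture everything on the line before ' VA'.
pax_line_re = re.compile(r"(.*) VA")

match = pax_line_re.search("2 0 VA")
before_va = match.group(1)   # the whole text before ' VA', as one string

# The group holds both numbers together, so split on whitespace to separate them.
adults, children = before_va.split()
print(repr(before_va), adults, children)
```

Note that this pattern captures the whole text before " VA", so the two numbers still have to be separated afterwards, for example with split() as above.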

I am not quite sure what the room number even is, because your description is a bit unclear, so I cannot help with that unless you clarify.
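
That said, if the rule in the question is read literally (a room number is a line that is either only digits, or digits followed by spaces and then a name), one possible way to build the requested dict is to walk the text line by line and treat each room-number line as the start of a new block. This is only a sketch under that assumption: ROOM_RE, parse_breakfast and SAMPLE are hypothetical names, the other two patterns are the ones from the question, and it has only been checked against the two-room sample, not a real report.

```python
import re

# Sample copied from the question (blank lines omitted).
SAMPLE = """\
10 PxxxxD,David,Mme, Mr T- EXPEDIA TRAVEL
08.05.17 12.05.17 TP
SUP DBL / HPDJ / DEBIT CB AGENCE - NR
2 0 VA
NR
12
LxxxxSH,Claudia,Mrs
08.05.17 19.05.17 TP
1 0 VA
NR BB
SUP SGL / BB / EN ATTENTE DE VIREMENT- EVITER LA 66 -
"""

# Assumption: a room line is digits only, or digits + spaces + a letter.
ROOM_RE = re.compile(r"^(\d+)(?:\s+[A-Za-z].*)?$")
PAX_RE = re.compile(r"(\d+)\W*(\d+)\W*VA")
BOARD_RE = re.compile(r"/\s*(BB|HPDJ)\s*/")


def parse_breakfast(text):
    """Return {room: [first_pax_number, 'BB' or None]} for each room block."""
    data = {}
    room = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        room_match = ROOM_RE.match(line)
        if room_match:
            # A new room block starts here.
            room = room_match.group(1)
            data[room] = [None, None]
            continue
        if room is None:
            continue  # ignore anything before the first room line
        pax_match = PAX_RE.search(line)
        if pax_match:
            data[room][0] = pax_match.group(1)
        if BOARD_RE.search(line):
            # The question's expected output reports both BB and HPDJ as 'BB'.
            data[room][1] = "BB"
    return data


breakfast_data = parse_breakfast(SAMPLE)
print(breakfast_data)
```

Working line by line with splitlines() sidesteps the \n handling that a single regex over the whole text runs into. For a real file, reading it in text mode (for example open(PATH).read()) and passing the result to parse_breakfast should behave the same way, though any other line made up only of digits would wrongly start a new room.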

Upvotes: 0
