Chennai Cheetah
Chennai Cheetah

Reputation: 363

R extract specific word after keyword

How do I extract a specific word after keyword in R.

I have the following input text which contains details about policy. I need to extract specific words value like FirstName , SurName , FatherName and dob. input.txt

In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail [email protected]
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé

The below code works for me but the problem is if line order changes everything get changed. Is there a way to extract keyword value irrespective of line order. ?

Current Code

path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))

fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))

Upvotes: 1

Views: 61

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

You can combine the text together as one string and extract the values based on pattern in the data. This approach will work irrespective of the line number in the data provided the pattern in the data is always valid for all the files.

my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John  Woo"

#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"

#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"

#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"

Upvotes: 1

Related Questions