Girish HM

Reputation: 41

How to process an unstructured text file using Spark

I want to process a text file with a Spark RDD. The file contains data like this:

----------------------------*-----------------------

   state:xx             sub:z    |Basic info

company:abc        rate:123      |

----------------------------*------------------------

                     Date: 12-03-2019

I expect the output to be in the following format:

State:XX
Sub:z
Company:abc
rate:123
Date:12-03-2019

When I tried to remove the '-' characters with data1=data.ReplaceAll('-',""), it also removed the hyphens from the date, giving 12032019, but the date should stay as 12-03-2019. I also don't know how to move sub:z, company:abc and rate:123 onto new lines. Please help.

Upvotes: 1

Views: 288

Answers (1)

zhang-yuan

Reputation: 448

Without further details, here are my suggestions:

  1. Remove the lines that start with -. You should get something like this:
state:xx sub:z |Basic info
company:abc rate:123 |
Date: 12-03-2019
  2. Then remove everything after the |:
state:xx sub:z
company:abc rate:123
Date: 12-03-2019
  3. Replace the blank spaces with a newline (\n).

    I'm not sure whether Date: is followed by a blank space.

    If so, replace 'Date: ' with 'Date:' first.

state:xx
sub:z
company:abc
rate:123
Date:12-03-2019
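
The steps above can be sketched roughly as follows. This is only a minimal Python sketch under some assumptions: the separator lines contain nothing but - and * characters, keys are single words, and values contain no spaces. The names clean_line, SEPARATOR and PAIR are made up for illustration. Instead of replacing spaces with newlines, it uses a regular expression to pull out each key:value pair, which also takes care of the 'Date: ' case and leaves the hyphens inside the date alone.

```python
import re

# Assumption: separator lines consist only of '-' and '*' characters.
SEPARATOR = re.compile(r"^[-*]+$")
# A key:value pair; optional spaces after the colon are dropped ("Date: x" -> "Date:x").
PAIR = re.compile(r"(\w+):\s*(\S+)")


def clean_line(line):
    """Return the key:value pairs found on one input line as a list of strings."""
    text = line.strip()
    # Step 1: skip blank lines and separator lines.
    if not text or SEPARATOR.match(text):
        return []
    # Step 2: drop everything after the first '|'.
    text = text.split("|", 1)[0]
    # Step 3: emit one "key:value" string per pair instead of splitting on spaces.
    return [f"{key}:{value}" for key, value in PAIR.findall(text)]


# The sample lines from the question.
sample = [
    "----------------------------*-----------------------",
    "",
    "   state:xx             sub:z    |Basic info",
    "",
    "company:abc        rate:123      |",
    "",
    "----------------------------*------------------------",
    "",
    "                     Date: 12-03-2019",
]

result = [pair for line in sample for pair in clean_line(line)]
print("\n".join(result))
```

With PySpark the same function could probably be passed to flatMap, for example sc.textFile("input.txt").flatMap(clean_line), assuming sc is an existing SparkContext and input.txt is a placeholder path. Because flatMap flattens the returned lists, each pair ends up as its own RDD element.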

Hope this helps.

Upvotes: 1
