Kundan Kumar

Reputation: 2002

Splitting a string array in Scala

I have an array of strings:

Array[String] = Array(Monthend_Date Stf_Ttl Staffno Name    Surname Full_Name   MGr_Staffno Manager_Name    Cluster Consolidate Level3  Division    Region  Area    Branch  BranchID    COGNOS_unit Job_Family  Staff_Category  PosID   Position    PattersonGrade  Age Gender  Race    Disabled___Not_Disabled DTI_Race    DTI_EE_level    Staff_count FTE_HeadcountPrfl_Hay   Prfl_Hay_Ptrsn_Grd  Office  Hrc_Stf_No  Stf_Grd Stf_Term_Dt Term_Rsn_Desc   Prfl_Job_Desc   Br_Brnd Br_Cl_Id    Hr_Br_Ind   Txn_Tp_Desc To_Txn_Tp_Desc  Stf_Bnd_ID  Pos_Lvl Pos_Vrtl_Ind    Job_Fnctn_Desc  Job_Fnctn_ID    Grd_Code    Grd_Desc    Grd_Pay_Grp Qlfn_Desc   Qlfn_Desc_Other Prfl_Crr_Desc   Prfl_Crr_ID Core / Support  Rem_Srvy_ID Rem_Srvy_Desc   Gwhc_Mo_End_Dt  LastName    Level1Stf_No    Level1Name  Level2Stf_No    Level2Name  Level3Stf_No    Level3Name  Level4Stf_No    Level4Name  Level5Stf_No    Level5...

Here the delimiter is a tab. When I split it using "\t" as the delimiter, it works fine:

Array[String] = Array(Monthend_Date, Stf_Ttl, Staffno, Name, Surname, Full_Name, MGr_Staffno, Manager_Name, Cluster, Consolidate, Level3, Division, Region, Area, Branch, BranchID, COGNOS_unit, Job_Family, Staff_Category, PosID, Position, PattersonGrade, Age, Gender, Race, Disabled___Not_Disabled, DTI_Race, DTI_EE_level, Staff_count, FTE_Headcount, Prfl_Hay, Prfl_Hay_Ptrsn_Grd, Office, Hrc_Stf_No, Stf_Grd, Stf_Term_Dt, Term_Rsn_Desc, Prfl_Job_Desc, Br_Brnd, Br_Cl_Id, Hr_Br_Ind, Txn_Tp_Desc, To_Txn_Tp_Desc, Stf_Bnd_ID, Pos_Lvl, Pos_Vrtl_Ind, Job_Fnctn_Desc, Job_Fnctn_ID, Grd_Code, Grd_Desc, Grd_Pay_Grp, Qlfn_Desc, Qlfn_Desc_Other, Prfl_Crr_Desc, Prfl_Crr_ID, Core / Support, Rem_Srvy_ID, Rem_Srvy_Desc, Gwhc_Mo_End_Dt, LastName, Level1Stf_No, Level1Name, Level2Stf_No, Level2Na...

But when I split it using the pipe "|" as the delimiter, I get the following result:

Array("", M, o, n, t, h, e, n, d, _, D, a, t, e, "  ", S, t, f, _, T, t, l, "   ", S, t, a, f, f, n, o, "   ", N, a, m, e, "    ", S, u, r, n, a, m, e, "   ", F, u, l, l, _, N, a, m, e, " ", M, G, r, _, S, t, a, f, f, n, o, "   ", M, a, n, a, g, e, r, _, N, a, m, e, "    ", C, l, u, s, t, e, r, "   ", C, o, n, s, o, l, i, d, a, t, e, "   ", L, e, v, e, l, 3, "  ", D, i, v, i, s, i, o, n, "    ", R, e, g, i, o, n, "  ", A, r, e, a, "    ", B, r, a, n, c, h, "  ", B, r, a, n, c, h, I, D, "    ", C, O, G, N, O, S, _, u, n, i, t, "   ", J, o, b, _, F, a, m, i, l, y, "  ", S, t, a, f, f, _, C, a, t, e, g, o, r, y, "  ", P, o, s, I, D, " ", P, o, s, i, t, i, o, n, "    ", P, a, t, t, e, r, s, o, n, G, r, a, d, e, "  ", A, g, e, "   ", G, e, n, d, e, r, "  ", R, a, c, e, "    ", D, i, s, a, b, l, e, d, _, _,...

Why is the string being split at each character? There is no pipe separator anywhere in the string array.

What is the correct way to split on a pipe?

Splitting with a comma "," gives the following output:

Array(Monthend_Date Stf_Ttl Staffno Name    Surname Full_Name   MGr_Staffno Manager_Name    Cluster Consolidate Level3  Division    Region  Area    Branch  BranchID    COGNOS_unit Job_Family  Staff_Category  PosID   Position    PattersonGrade  Age Gender  Race    Disabled___Not_Disabled DTI_Race    DTI_EE_level    Staff_count FTE_Headcount   Prfl_Hay    Prfl_Hay_Ptrsn_Grd  Office  Hrc_Stf_No  Stf_Grd Stf_Term_Dt Term_Rsn_Desc   Prfl_Job_Desc   Br_Brnd Br_Cl_Id    Hr_Br_Ind   Txn_Tp_Desc To_Txn_Tp_Desc  Stf_Bnd_ID  Pos_Lvl Pos_Vrtl_Ind    Job_Fnctn_Desc  Job_Fnctn_ID    Grd_Code    Grd_Desc    Grd_Pay_Grp Qlfn_Desc   Qlfn_Desc_Other Prfl_Crr_Desc   Prfl_Crr_ID Core / Support  Rem_Srvy_ID Rem_Srvy_Desc   Gwhc_Mo_End_Dt  LastName    Level1Stf_No    Level1Name  Level2Stf_No    Level2Name  Level3Stf_No    Level3Name  Level4Stf_No    Level4Name  Level5Stf_No...

Upvotes: 1

Views: 1899

Answers (1)

sberry

Reputation: 131978

If you pass the delimiter as a String (in double quotes), split treats it as a regular expression. The pipe | is the regex alternation operator, so the pattern "|" means "empty string OR empty string". That pattern matches the empty position between every pair of characters, so the string is split at each character:

scala> val m = Array[String]("foo bar", "bar foo")
m: Array[String] = Array(foo bar, bar foo)

scala> m.flatMap(_.split("|"))
res1: Array[String] = Array("", f, o, o, " ", b, a, r, "", b, a, r, " ", f, o, o)

Either of these should work. The first escapes the pipe inside the regex; the second passes a Char, which is matched literally:

scala> m.flatMap(_.split("""\|"""))
res2: Array[String] = Array(foo bar, bar foo)

scala> m.flatMap(_.split('|'))
res3: Array[String] = Array(foo bar, bar foo)
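
As a further sketch of the same idea, the snippet below splits a pipe-delimited row three equivalent ways and also shows what happens when the delimiter is absent. The object name SplitDemo and the sample rows are made up for illustration (they only borrow the first three column names from the question); Pattern.quote is the standard java.util.regex helper that turns any string into a literal pattern.

```scala
import java.util.regex.Pattern

object SplitDemo {
  // Hypothetical sample rows, not the asker's real data.
  val tabRow: String  = "Monthend_Date\tStf_Ttl\tStaffno"
  val pipeRow: String = "Monthend_Date|Stf_Ttl|Staffno"

  // A Char delimiter is matched literally, never as a regex.
  val byChar: Array[String] = pipeRow.split('|')

  // Escaping the pipe makes the regex match a literal '|'.
  val byEscaped: Array[String] = pipeRow.split("""\|""")

  // Pattern.quote builds a literal pattern from an arbitrary delimiter string.
  val byQuoted: Array[String] = pipeRow.split(Pattern.quote("|"))

  // The tab row contains no pipe, so a literal split returns it unchanged
  // as a single-element array.
  val noPipe: Array[String] = tabRow.split('|')

  // Splitting the tab row on its real delimiter.
  val byTab: Array[String] = tabRow.split("\t")

  def main(args: Array[String]): Unit = {
    println(byChar.mkString(", "))
    println(noPipe.length)
  }
}
```

Pattern.quote is probably the safest choice when the delimiter comes from configuration or user input, since it also neutralises other regex metacharacters such as . or *.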

Upvotes: 6
