Reputation: 112
I have a DataFrame with a column like this:
Title
"Over the Hill,to the Poorhouse"
"Wilson"
"Darling Lili"
"The Ten Commandments"
"12 Angry Men"
"Twelve Monkeys"
"1776"
"1941"
"Chacun sa nuit"
"2001: A Space Odyssey"
"20,000 Leagues Under the Sea"
"20,000 Leagues Under the Sea"
"24,7: Twenty Four Seven"
"Twin Falls Idaho"
"Three Kingdoms: Resurrection of the Dragon"
.......
.......
I would like to transform this column into an array of words, like this:
[Over, the, Hill, to, the, Poorhouse]
[Wilson]
[Darling, Lili]
[The, Ten, Commandments]
[12, Angry, Men]
[Twelve, Monkeys]
[1776]
[1941]
[Chacun, sa, nuit]
[2001, , A, Space, Odyssey]
[20, 000, Leagues, Under, the, Sea]
[20, 000, Leagues, Under, the, Sea]
[24, 7, , Twenty, Four, Seven]
[Twin, Falls, Idaho]
[Three, Kingdoms, , Resurrection, of, the, Dragon]
so that I end up with these two columns:
Title                                       Title_Words
Over the Hill,to the Poorhouse              [Over, the, Hill, to, the, Poorhouse]
Wilson                                      [Wilson]
Darling Lili                                [Darling, Lili]
The Ten Commandments                        [The, Ten, Commandments]
12 Angry Men                                [12, Angry, Men]
Twelve Monkeys                              [Twelve, Monkeys]
1776                                        [1776]
1941                                        [1941]
Chacun sa nuit                              [Chacun, sa, nuit]
2001: A Space Odyssey                       [2001, , A, Space, Odyssey]
20,000 Leagues Under the Sea                [20, 000, Leagues, Under, the, Sea]
20,000 Leagues Under the Sea                [20, 000, Leagues, Under, the, Sea]
24,7: Twenty Four Seven                     [24, 7, , Twenty, Four, Seven]
Twin Falls Idaho                            [Twin, Falls, Idaho]
Three Kingdoms: Resurrection of the Dragon  [Three, Kingdoms, , Resurrection, of, the, Dragon]
The problem is that a string can contain several different separators: spaces, commas, and colons.
How can this be done?
Upvotes: 0
Views: 59
Reputation: 6323
Try this:
df.withColumn("Title_Words", split(col("Title"), "\\s+|[,:]"))
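To see what that pattern does, here is a minimal sketch in plain Python. It uses the standard `re` module as a stand-in for Spark's `split` (which uses Java regular expressions), on the assumption that the two behave the same for a pattern this simple. The helper name `title_words` is made up for illustration. The pattern splits on a run of whitespace or on a single comma or colon, so a colon followed by a space produces two adjacent separators and therefore an empty element, exactly as in the desired output in the question.

```python
import re

# Same pattern as above: one or more whitespace characters,
# OR a single comma or colon.
PATTERN = r"\s+|[,:]"


def title_words(title):
    """Hypothetical helper: split a title on whitespace runs, commas, and colons."""
    return re.split(PATTERN, title)


titles = [
    "Over the Hill,to the Poorhouse",
    "2001: A Space Odyssey",
    "20,000 Leagues Under the Sea",
    "24,7: Twenty Four Seven",
]

for t in titles:
    # ": " is matched as two separators (":" and then " "),
    # which leaves an empty string between them.
    print(t, "->", title_words(t))
```

If the empty elements are not wanted, they can be filtered out afterwards, for example with `[w for w in title_words(t) if w]`.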
Upvotes: 2