Vadim Smolyakov

Reputation: 1197

Split line based on regex in Julia

I'm interested in splitting a line using a regular expression in Julia. My input is a corpus in Blei's LDA-C format, where each line consists of a docId followed by wordID:wordCNT pairs. For example, a document with five words is represented as follows:

186 0:1 12:1 15:2 3:1 4:1

I'm looking for a way to aggregate words and their counts into separate arrays, i.e. my desired output:

words =  [0, 12, 15, 3, 4]
counts = [1,  1,  2, 1, 1]

I've tried using m = match(r"(\d+):(\d+)",line). However, it only finds the first pair 0:1. I'm looking for something similar to Python's re.compile(r'[ :]').split(line). How would I split a line based on regex in Julia?

Upvotes: 3

Views: 933

Answers (3)

mbauman

Reputation: 31342

There's no need to use regex here; Julia's split function allows using multiple characters to define where the splits should occur:

julia> v = split(line, [':',' '])
11-element Array{SubString{String},1}:
 "186"
 "0"
 "1"
 "12"
 "1"
 "15"
 "2"
 "3"
 "1"
 "4"
 "1"

julia> words = v[2:2:end]
5-element Array{SubString{String},1}:
 "0"
 "12"
 "15"
 "3"
 "4"

julia> counts = v[3:2:end]
5-element Array{SubString{String},1}:
 "1"
 "1"
 "2"
 "1"
 "1"

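Note that both slices still hold SubStrings. If integer arrays are wanted, as in the question, one possible follow-up is to broadcast parse over the slices. This is a minimal sketch, assuming Julia 0.6 or later (for the dot syntax) and that every field is a valid integer:

```julia
line = "186 0:1 12:1 15:2 3:1 4:1"
v = split(line, [':', ' '])

# v[1] is the document ID; word IDs sit at indices 2, 4, ... and
# counts at indices 3, 5, ... parse.() converts each SubString to an Int.
words  = parse.(Int, v[2:2:end])
counts = parse.(Int, v[3:2:end])
```
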
Upvotes: 6

oyd11

Reputation: 433

As Matt B. mentions, there's no need for a Regex here, since Julia's split() accepts an array of chars.

However, when a Regex really is needed, the same split() function just works, similar to what others suggest here:

line = "186 0:1 12:1 15:2 3:1 4:1"
s = split(line, r":| ")
words = s[2:2:end]
counts = s[3:2:end]

I recently had to do exactly that in some Unicode processing code, where the split characters were "combined characters" and thus could not fit in Julia's single-quoted Char literals. That meant building the Regex from a list of strings:

split_chars = ["bunch","of","random","delims"]
line = "line_with_these_delims_in_the_middle"
r_split = Regex( join(split_chars, "|") )
split( line, r_split )
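
One caveat with joining delimiters this way: any delimiter containing a regex metacharacter (such as . or |) would be interpreted as a pattern rather than literal text. A possible workaround, sketched here with made-up delimiters, is to wrap each one in PCRE's \Q...\E literal quoting. It assumes no delimiter itself contains \E.

```julia
# Hypothetical delimiters; "||" and "." contain regex metacharacters.
delims = ["||", ".", "--"]

# Wrap each delimiter in \Q...\E so it is matched literally,
# then join the alternatives with "|".
r_split = Regex(join(("\\Q" * d * "\\E" for d in delims), "|"))

parts = split("a||b.c--d", r_split)
```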

Upvotes: 1

Vadim Smolyakov

Reputation: 1197

I discovered the eachmatch method that returns an iterator over the regex matches. An alternative solution is to iterate over each match:

words, counts = Int64[], Int64[]
for m in eachmatch(r"(\d+):(\d+)", line)
    wd, cnt = m.captures
    push!(words,  parse(Int64, wd))
    push!(counts, parse(Int64, cnt))
end
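
If the loop feels verbose, the same idea can be written with comprehensions. This is only a sketch: it collects the matches first (the name ms is just illustrative) and assumes every pair on the line is well formed.

```julia
line = "186 0:1 12:1 15:2 3:1 4:1"

# Materialize the match iterator so it can be traversed twice.
ms = collect(eachmatch(r"(\d+):(\d+)", line))

# The leading docId has no colon, so the pattern skips it.
words  = [parse(Int, m.captures[1]) for m in ms]
counts = [parse(Int, m.captures[2]) for m in ms]
```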

Upvotes: 5
