Reputation: 1197
I'm interested in splitting a line using a regular expression in Julia. My input is a corpus in Blei's LDA-C format, where each line consists of a docId followed by wordID:wordCNT pairs.
For example, a document with five distinct words is represented as follows:
186 0:1 12:1 15:2 3:1 4:1
I'm looking for a way to aggregate words and their counts into separate arrays, i.e. my desired output:
words = [0, 12, 15, 3, 4]
counts = [1, 1, 2, 1, 1]
I've tried using m = match(r"(\d+):(\d+)",line). However, it only finds the first pair, 0:1. I'm looking for something similar to Python's re.compile(r'[ :]').split(line). How would I split a line based on a regex in Julia?
Upvotes: 3
Views: 933
Reputation: 31342
There's no need to use regex here; Julia's split function allows using multiple characters to define where the splits should occur:
julia> v = split(line, [':',' '])
11-element Array{SubString{String},1}:
"186"
"0"
"1"
"12"
"1"
"15"
"2"
"3"
"1"
"4"
"1"
julia> words = v[2:2:end]
5-element Array{SubString{String},1}:
"0"
"12"
"15"
"3"
"4"
julia> counts = v[3:2:end]
5-element Array{SubString{String},1}:
"1"
"1"
"2"
"1"
"1"
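The slices above are still substrings, whereas the question asks for integer arrays. A minimal sketch of the conversion, assuming every token is a valid integer and using the sample line from the question:

```julia
line = "186 0:1 12:1 15:2 3:1 4:1"
v = split(line, [':', ' '])

# Broadcast parse over the substrings to get Vector{Int} results.
words = parse.(Int, v[2:2:end])
counts = parse.(Int, v[3:2:end])
```

parse throws an error on a token that is not an integer, so malformed lines would need separate handling.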
Upvotes: 6
Reputation: 433
As Matt B. mentions, there's no need for a regex here, since Julia's split() can take an array of chars.
However, when a regex is needed, the same split() function just works, similar to what others suggest here:
line = "186 0:1 12:1 15:2 3:1 4:1"
s = split(line, r":| ")
words = s[2:2:end]
counts = s[3:2:end]
I recently had to do exactly that in some Unicode processing code, where the split characters were "combined characters" and thus could not fit in Julia's single-quoted Char literals:
split_chars = ["bunch","of","random","delims"]
line = "line_with_these_delims_in_the_middle"
r_split = Regex( join(split_chars, "|") )
split( line, r_split )
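As a small runnable illustration of the same idea, here is a sketch with two hypothetical multi-character delimiters (the delimiters and the input string are made up for this example, not taken from the code I was working on):

```julia
# Hypothetical delimiters, chosen to contain no regex metacharacters.
delims = ["--", "::"]
line = "alpha--beta::gamma"

# Joining with "|" builds an alternation, equivalent to r"--|::".
r_split = Regex(join(delims, "|"))
parts = split(line, r_split)
```

If a delimiter contains regex metacharacters such as . or |, it would need to be escaped before being joined into the pattern.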
Upvotes: 1
Reputation: 1197
I discovered the eachmatch method, which returns an iterator over the regex matches. An alternative solution is to iterate over each match:
words, counts = Int64[], Int64[]
for m in eachmatch(r"(\d+):(\d+)", line)
    wd, cnt = m.captures
    push!(words, parse(Int64, wd))
    push!(counts, parse(Int64, cnt))
end
Upvotes: 5