Reputation: 79
(Julia and general programming newb)
I'm trying to read a directory full of JSON files containing lots of HTML pages (about 30), regex-match short strings (many per file, up to 60k in total), and write them all to one big file, which I'll try to parse later so I can add the results to a MySQL DB.
Here's my code:
patFilename = r"[0-9]+_[0-9]+.json"
patID = r"\/entry\/[0-9]+\/go"
filenames = readdir("C:/getentries/data/")
caseIDs = []
for filename in filenames
    if match(patFilename, filename) === nothing
        continue
    end
    file = open("C:/getentries/data/" * filename)
    case = read(file, String)
    push!(caseIDs, match(patID, case))
end
println(caseIDs)
touch("C:/getentries/data/caseIDs.txt")
open("C:/getentries/data/caseIDs.txt", "w") do caseID
    println(caseID, caseIDs)
end
No errors are thrown but only a few strings are written to the file. So I'm assuming something's going wrong as I try to collect all the strings.
I thought I could try the approach suggested in my last question but this didn't help - although that's likely due to my complete inexperience!
May I ask if anyone has any thoughts?
Upvotes: 2
Views: 164
Reputation: 20288
It's hard to say without a minimal, reproducible example. But my guess is that, since you're calling `match` once per file, you're only getting the first match in each file. Instead, you could call `eachmatch` to get an iterator over all matches in the file contents.
This would look something like the following:
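dir = "C:/getentries/data/"
matches = RegexMatch[]
for filename in readdir(dir)
    # Skip files that don't match the filename pattern from the question
    occursin(patFilename, filename) || continue
    # Note that you forgot to close the file in your original example
    # Using higher-level functions such as this method of `read` may be safer
    # readdir returns bare names, so join them with the directory
    str = read(joinpath(dir, filename), String)
    # Loop over all matches of the regexp found in the string
    for m in eachmatch(patID, str)
        push!(matches, m)
    end
end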
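To make the difference concrete, here is a minimal sketch that runs without your data. The sample string and the temporary output path are made up for illustration; only the pattern is taken from the question. It also writes one match per line, which will probably be easier to parse later than printing the whole array at once.

```julia
# Hypothetical sample text standing in for the contents of one data file
sample = "<a href=\"/entry/101/go\">a</a> <a href=\"/entry/202/go\">b</a> <a href=\"/entry/303/go\">c</a>"

patID = r"/entry/[0-9]+/go"

# `match` returns only the first hit (or `nothing` if there is none)
first_hit = match(patID, sample)

# `eachmatch` iterates over every non-overlapping hit;
# `.match` extracts the matched text from each RegexMatch
all_hits = [m.match for m in eachmatch(patID, sample)]

# Write one match per line to a temporary file (hypothetical output location)
outpath = joinpath(mktempdir(), "caseIDs.txt")
open(outpath, "w") do io
    for hit in all_hits
        println(io, hit)
    end
end

# Read the file back to check what was written
written = readlines(outpath)
```

Here `first_hit` holds only the first ID, while `all_hits` and `written` hold all three.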
Upvotes: 2