Bryan White

Reputation: 334

Regex: How do I delete all but every fourth word in a text file?

I'm doing a rather chaotic experiment with a goofy Markov Chain twitter bot. The current version of the bot opens a CSV file of my tweet archive, strips out things like links and whatnot and leaves only plain text. Works like a charm. Love it!

require 'csv'

PATH_TO_TWEETS_CSV   = 'tweets.csv'
PATH_TO_TWEETS_CLEAN = 'liber_markov.txt'

csv_text = CSV.parse(File.read(PATH_TO_TWEETS_CSV))

File.open(PATH_TO_TWEETS_CLEAN, 'w') do |file|
  csv_text.reverse.each do |row|
    tweet_text = row[5].gsub(/(?:f|ht)tps?:\/[^\s]+/, '').gsub(/\n/,' ')
    file.write("#{tweet_text}\n")
  end
end

However.

I'd like to take an insane step forward and sift through the file a second time, stripping out all but every fourth word, effectively removing 75% of the content. Is there a regex that can handle that?

Upvotes: 1

Views: 59

Answers (3)

Jordan Running

Reputation: 106077

The accepted answer is fine, but since you asked about regular expressions, I thought I'd show you how it can be done. Here's a Regexp to start with:

/((\S+\s+){3})\S+\s*/

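I've chosen to take "word" to mean any sequence of non-whitespace characters. The pattern matches a word (\S+) followed by one or more whitespace characters (\s+), three times over, and captures those three words as group 1. It then matches a fourth word followed by zero or more whitespace characters (zero so that it can match the last word in the string). Replacing each match with '\1' therefore keeps the first three words of every group and drops the fourth. Here's how you would use it: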

tweet_text = "I'm doing a rather chaotic experiment with a goofy Markov Chain twitter bot."
tweet_text.gsub(/((\S+\s+){3})\S+\s*/, '\1')
# => "I'm doing a chaotic experiment with goofy Markov Chain bot."
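
If the goal is the reverse, keeping only every fourth word as the question describes, the capture can be moved to the fourth word instead. The following is a minimal sketch rather than part of the original answer; the method name `keep_every_fourth` is made up for illustration, and it assumes that leftover words that don't fill a group of four should be discarded.

```ruby
# Hypothetical helper: keep only the 4th, 8th, 12th, ... word of a string.
# "Word" means any run of non-whitespace characters.
def keep_every_fourth(text)
  # Skip three words (non-capturing group), then capture the fourth.
  # scan returns one single-element array per match, hence flatten.
  text.scan(/(?:\S+\s+){3}(\S+)/).flatten.join(' ')
end

tweet_text = "I'm doing a rather chaotic experiment with a goofy Markov Chain twitter bot."
puts keep_every_fourth(tweet_text)
# prints: rather a twitter
```

The trailing "bot." disappears here because it is the 13th word and never completes a fourth group.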

Upvotes: 0

chills42

Reputation: 14523

I'd probably do it using each_slice:

File.open(PATH_TO_TWEETS_CLEAN, 'w') do |file|
  csv_text.reverse.each do |row|
    tweet_text = row[5].gsub(/(?:f|ht)tps?:\/[^\s]+/, '').gsub(/\n/,' ')
    tweet_text = tweet_text.split.each_slice(4).map(&:first).join(' ')
    file.write("#{tweet_text}\n")
  end
end
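
To see what the each_slice line does on its own, here is a small standalone sketch using the sentence from the question as sample input (the variable names `words`, `groups`, and `kept` are only for illustration). It suggests that this approach keeps the first word of each group of four, and that a short final group still contributes its first word.

```ruby
words = "I'm doing a rather chaotic experiment with a goofy Markov Chain twitter bot.".split

# each_slice(4) groups the words in fours; the last group may be shorter.
groups = words.each_slice(4).to_a
# groups.last is ["bot."] here, because 13 words do not divide evenly by 4.

# map(&:first) keeps the 1st, 5th, 9th, ... word.
kept = groups.map(&:first).join(' ')
puts kept
# prints: I'm chaotic goofy bot.
```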

Upvotes: 0

moveson

Reputation: 5213

I don't know about a regex solution specifically, but you could do this:

File.open(PATH_TO_TWEETS_CLEAN, 'w') do |file|
  csv_text.reverse.each do |row|
    clean_text = row[5].gsub(/(?:f|ht)tps?:\/[^\s]+/, '').gsub(/\n/,' ')
    tweet_text = clean_text.split.select.with_index { |_, i| i % 4 == 0 }.join(' ')
    file.write("#{tweet_text}\n")
  end
end
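
Which word of each group survives depends on the remainder you compare against. As a rough sketch (the helper `every_nth_word` and its `n` and `offset` parameters are hypothetical, not something from the question), the same select.with_index idea can be wrapped so the starting position is adjustable:

```ruby
# Hypothetical helper: keep every nth word, starting at a zero-based offset.
def every_nth_word(text, n: 4, offset: 0)
  # select without a block returns an Enumerator; with_index supplies
  # the position, and the block's result decides whether a word is kept.
  text.split.select.with_index { |_, i| i % n == offset }.join(' ')
end

sample = "I'm doing a rather chaotic experiment with a goofy Markov Chain twitter bot."
puts every_nth_word(sample)             # prints: I'm chaotic goofy bot.
puts every_nth_word(sample, offset: 3)  # prints: rather a twitter
```

With the default offset of 0 this behaves like the code above; an offset of 3 keeps the literal fourth, eighth, and twelfth words instead.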

Upvotes: 1
