Reputation: 151
I am trying to implement a fast Markov Chain made specifically for text generation. Every state of the chain holds a ChainItem which is just a wrapper for Vec. I want to use an arena allocator to hold these ChainItems to avoid the drop
of thousands of ChainItems and faster allocation. I decided to use bumpalo and things are not going so well.
markov.rs:
/// Represents a Markov Chain that is designed to generate text.
///
/// [Wikipedia](https://en.wikipedia.org/wiki/Markov_chain)
pub struct MarkovChain<'a> {
pub items: HashMap<String, ChainItem<'a>>,
pub state_size: usize,
regex: &'a Regex,
arena: Bump,
}
impl<'a> MarkovChain<'a> {
/// Create an empty MarkovChain.
///
/// The hash map of the MarkovChain is initially created with a capacity of 0, so it will not allocate until it
/// is first inserted into.
#[allow(dead_code)]
pub fn new(state_size: usize, regex: &Regex) -> MarkovChain {
MarkovChain {
items: HashMap::<String, ChainItem>::new(),
state_size,
regex,
arena: Bump::new(),
}
}
/// Create an empty `HashMap` with the specified capacity.
///
/// The hash map of the MarkovChain will be able to hold at least `capacity` elements without
/// reallocating. If `capacity` is 0, the hash map will not allocate.
pub fn with_capacity(state_size: usize, capacity: usize, regex: &Regex) -> MarkovChain {
MarkovChain {
items: HashMap::with_capacity(capacity),
state_size,
regex,
arena: Bump::new(),
}
}
/// Add text as training data.
pub fn add_text(&'a mut self, text: &str) {
let tokens = self.regex.find_iter(text);
let mut prev: VecDeque<&str> = VecDeque::with_capacity(self.state_size + 100);
let mut prev_buf: String = String::with_capacity(255);
for t in tokens {
for i in 1..=prev.len() {
prev_buf.clear();
for (i, s) in prev.iter().rev().take(i).rev().enumerate() {
if i > 0 {
prev_buf.push(' ')
}
prev_buf.push_str(s);
}
let t = ustr(t.as_str());
if let Some(ci) = self.items.get_mut(&prev_buf) {
ci.add(t);
} else {
self.items.insert(prev_buf.clone(), ChainItem::new(t, &self.arena));
}
}
if prev.len() == self.state_size {
prev.pop_front();
}
prev.push_back(t.as_str());
}
}
// Bla bla
}
/// Wrapper for Vec<Ustr> to make some operations easier.
pub struct ChainItem<'a> {
pub items: bumpalo::collections::Vec<'a, Ustr>,
}
impl<'a> ChainItem<'a> {
/// Create a ChainItem, which will also contain `s`.
pub fn new(s: Ustr, bump: &'a Bump) -> ChainItem<'a> {
ChainItem { items: bumpalo::vec![in bump;s] }
}
/// Add item.
pub fn add(&mut self, s: Ustr) {
self.items.push(s);
}
/// Get a random item.
pub fn get_rand(&self) -> Ustr {
*self.items
// get a random item from the Vec
.choose(&mut rand::thread_rng())
.unwrap()
}
}
When I try to do this on main
// Reads every file into a string
let contents = files.filter_map(|f| read_to_string(f.path()).ok());
// Creating the Markov Chain
let markov_chain = contents.fold(
MarkovChain::with_capacity(2, 8_000_000, &WORD_REGEX),
|mut a, s| {
a.add_text(&s);
a
},
);
I get this:
error[E0597]: `a` does not live long enough
--> src/main.rs:39:4
|
38 | |mut a, s| {
| -----
| |
| binding `a` declared here
| has type `markov::MarkovChain<'1>`
39 | a.add_text(&s);
| ^-------------
| |
| borrowed value does not live long enough
| argument requires that `a` is borrowed for `'1`
40 | a
41 | },
| - `a` dropped here while still borrowed
error[E0505]: cannot move out of `a` because it is borrowed
--> src/main.rs:40:4
|
38 | |mut a, s| {
| -----
| |
| binding `a` declared here
| has type `markov::MarkovChain<'1>`
39 | a.add_text(&s);
| --------------
| |
| borrow of `a` occurs here
| argument requires that `a` is borrowed for `'1`
40 | a
| ^ move out of `a` occurs here
Some errors have detailed explanations: E0505, E0597.
For more information about an error, try `rustc --explain E0505`.
error: could not compile `markov-chain` (bin "markov-chain") due to 2 previous errors
Cargo.toml:
[package]
name = "markov-chain"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
hashbrown = "0.14.5"
once_cell = "1.19.0"
rand = "0.8.5"
regex = "1.10.4"
ustr = "1.0.0"
bumpalo = { version = "3.16.0", features = ["collections"] }
[target.'cfg(not(target_env = "msvc"))'.dependencies]
tikv-jemallocator = "0.5"
[package.metadata.clippy]
# This sets all pedantic lints to be treated as errors
deny = ["clippy::pedantic"]
Am I doing somethimng wrong or is this a limitation of arenas or bumpalo
? Also, I am open for suggestions regarding which arena to use if some other arena allocator is more suited for this usecase.
Upvotes: 0
Views: 38