Reputation:
I'm hearing the terms "negative sampling" and "subsampling" used in conjunction with word2vec a lot.
Before I attempt to mess with word2vec, I'm trying to go back through the papers that reference word embeddings and start from the beginning. The paper trail has landed me here:
https://gul.gu.se/public/pp/public_courses/course77642/published/1497871737091/resourceId/37659332/content/UploadedResources/lecture10-slides-word2vec_sungmin_VT17.pdf (Google "Efficient Estimation of Word Representations in Vector Space" if you don't trust links.)
which states:
(I'm familiar with all the bullet points except the first.)
The only material I've found on negative sampling and subsampling has been inside articles about word2vec, which is exactly what I'm trying to avoid.
If anyone could explain these terms or point me in the right direction, it would be greatly appreciated :).
Edit: the subsampling tag itself leads to this definition:
"Subsampling is a resampling procedure akin to the bootstrap in which fewer than all observations are being drawn with replacement (vs. the original sample size used in the textbook bootstrap method). For creating samples out of your existing data, please consider "sampling" tag instead." --- a concrete example of this would be great.
Upvotes: 1
Views: 2072
Reputation:
I finally found something for negative sampling. If you studied computer science and know all about "connect the dots", a.k.a. graphs, this link will be very helpful for anyone who wants a concrete example:
(or google: "mastering java for data science negative sampling")
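Since the linked chapter isn't quoted above, here is a minimal sketch of what negative sampling looks like in the word2vec setting. Everything here (the toy corpus, the `sample_negatives` helper) is my own illustration, not from the book: for each (target, context) pair you also draw k "negative" words from the unigram distribution raised to the 3/4 power, which is the weighting the original word2vec code uses.

```python
import random
from collections import Counter

# Toy corpus; in practice this would be a large text collection.
corpus = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(corpus)
vocab = list(counts)

# word2vec draws negatives from the unigram distribution raised to
# the 3/4 power, which flattens it slightly toward rare words.
weights = [counts[w] ** 0.75 for w in vocab]

def sample_negatives(target, k, rng=random):
    """Draw k 'negative' words that are not the target word."""
    negatives = []
    while len(negatives) < k:
        w = rng.choices(vocab, weights=weights, k=1)[0]
        if w != target:
            negatives.append(w)
    return negatives

# The model is then trained to score ("fox", context) pairs high and
# ("fox", negative) pairs low, instead of normalizing over the vocab.
print(sample_negatives("fox", 3))
```

The point is efficiency: instead of computing a softmax over the whole vocabulary for every training pair, you only update the target word, the context word, and k sampled negatives.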
For subsampling, I'll be using it for NLP, so this was the most relevant resource:
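For the NLP flavor specifically, here is a minimal sketch of word2vec's frequent-word subsampling (my own toy example, not from the resource above): each occurrence of word w is kept with probability sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a threshold (the paper suggests around 1e-5; the larger value below is just so the toy corpus visibly drops words).

```python
import random
from collections import Counter

# Toy corpus dominated by one very frequent word.
corpus = ("the " * 50 + "fox dog").split()
total = len(corpus)
freq = {w: c / total for w, c in Counter(corpus).items()}
t = 0.01  # threshold; word2vec typically uses ~1e-5 on real corpora

def keep_prob(w):
    # Frequent words (freq >> t) are kept with low probability;
    # rare words are kept with probability near (or capped at) 1.
    return min(1.0, (t / freq[w]) ** 0.5)

# Each occurrence is kept or dropped independently.
kept = [w for w in corpus if random.random() < keep_prob(w)]
print(len(kept), "of", len(corpus), "tokens kept")
```

So unlike the statistics-tag definition (resampling a dataset), word2vec's subsampling just thins out very frequent words like "the" before training, which speeds things up and improves the embeddings of rarer words.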
Upvotes: 2