Reputation: 415
I've been searching on the web, and found media such as CNN and NPR provide links to access to their transcripts. To obtain them requires writing something like a crawler which is not so convenient. The reason is that I'm trying to use some transcripts of TV show, interview, radio, movie as training data in my natural language processing projects. So I'm wondering whether there's any collection or database freely available on the web so that I can download all of them at once without writing a crawler by myself?
Upvotes: 3
Views: 2285
Reputation: 1683
I would recommend the British National Corpus. I would also mention the American National Corpus, but the transcripts there are only of phone calls or face to face conversations - no news, tv shows, etc.
You also mentioned CNN and NPR. There are transcripts from 1996 as an LDC corpus here.
Upvotes: 2