Reputation: 666
I have a folder of subtitle files for TV series, and I would like to end up with one subtitle file per episode. My problem is that some of the files are subtitles for the same episode but have different names, for example:
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.720p.HDTV.x264-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.902.720p.HDTV.x264.MOMENTUM.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.9X02.HDTV.XviD-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.HDTV.XviD-MOMENTUM.srt
so they are very similar, but not 100% identical.
How can I remove the duplicate documents and keep just one subtitle file per distinct episode?
I would attach what I tried but unfortunately I'm pretty clueless...
Upvotes: 0
Views: 377
Reputation: 3382
You can use the cosine similarity between the documents.
The assumption is that similar documents will have a high cosine similarity, so you can apply a threshold above which two documents are considered duplicates.
For example, if those are your documents:
1. "the child went home today, and his mother waited for him"
2. "My car is big, so said my mother"
3. "The kid went to his house today, while his mama waited for him to come"
I use the text_to_vector and get_cosine functions from vpekar's answer and do the following:
>>> v1 = text_to_vector("the child went home today, and his mother waited for him")
>>> v2 = text_to_vector("My car is big, so said my mother")
>>> v3 = text_to_vector("The kid went to his house today, while his mama waited for him to come")
and the cosine similarities between the vectors are:
>>> get_cosine(v1,v2)
0.10660035817780521
>>> get_cosine(v1,v3)
0.48420012470625223
>>> get_cosine(v2,v3)
0.0
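The text_to_vector and get_cosine helpers are not reproduced in this post. A minimal sketch that yields the numbers above, assuming a plain bag-of-words count with no lowercasing or stemming (the regex and the function bodies are my assumption of what such helpers look like, not a verbatim copy of the linked answer):

```python
import math
import re
from collections import Counter

# Assumption: a "word" is any run of letters, digits or underscores.
WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Turn a text into a bag-of-words vector (word -> count). Case is kept as-is."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is similar to nothing
    return numerator / (norm1 * norm2)


v1 = text_to_vector("the child went home today, and his mother waited for him")
v2 = text_to_vector("My car is big, so said my mother")
v3 = text_to_vector("The kid went to his house today, while his mama waited for him to come")
print(get_cosine(v1, v2), get_cosine(v1, v3), get_cosine(v2, v3))
```

Because nothing is lowercased here, "The" and "the" count as different words. For real subtitles you would probably want to normalize case first, which would change the exact scores.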
As you can see, documents 1 and 3 are the most similar, and thus may be subtitles of the same episode. So, to summarize:
1. You need to make (n choose 2) comparisons, i.e. check every possible pair of documents.
2. If the cosine similarity between two documents is higher than a threshold, which you will have to find by trial and error, the subtitles are probably of the same episode and you should remove one of them.
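The two steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the documents are already loaded into a dict of name to text, the helper bodies are my own bag-of-words versions, the name drop_near_duplicates is hypothetical, and the threshold of 0.4 is an arbitrary value picked for this toy example, not a recommendation.

```python
import math
import re
from collections import Counter
from itertools import combinations

WORD = re.compile(r"\w+")  # assumption: words are runs of \w characters


def text_to_vector(text):
    """Bag-of-words vector: word -> count."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors."""
    numerator = sum(vec1[w] * vec2[w] for w in set(vec1) & set(vec2))
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return numerator / (norm1 * norm2) if norm1 and norm2 else 0.0


def drop_near_duplicates(docs, threshold):
    """Return the names to keep, given docs as {name: text}.

    Every pair is visited once; when a pair scores above the threshold,
    the later document of the pair is dropped.
    """
    vectors = {name: text_to_vector(text) for name, text in docs.items()}
    removed = set()
    for a, b in combinations(docs, 2):  # all (n choose 2) pairs
        if a in removed or b in removed:
            continue  # one side is already gone, nothing to decide
        if get_cosine(vectors[a], vectors[b]) > threshold:
            removed.add(b)
    return [name for name in docs if name not in removed]


docs = {
    "doc1": "the child went home today, and his mother waited for him",
    "doc2": "My car is big, so said my mother",
    "doc3": "The kid went to his house today, while his mama waited for him to come",
}
kept = drop_near_duplicates(docs, threshold=0.4)  # 0.4 is a made-up value
print(kept)
```

With this threshold, doc3 is dropped as a near-duplicate of doc1 and the other two are kept. Which copy you keep (first seen, longest, the non-HI version, and so on) is a separate choice; this sketch simply keeps the first one.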
Upvotes: 6