oshribr

Reputation: 666

Remove similar documents in Python

I have a folder of TV series subtitles, and I would like to keep one subtitle file per episode. My problem is that some of the files are subtitles for the same episode but have different names, for example:

/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.720p.HDTV.x264-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.902.720p.HDTV.x264.MOMENTUM.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.9X02.HDTV.XviD-MOMENTUM.HI.srt
/data/netfilx/reality_subtitle/Top Chef/Top.Chef-Texas.S09E02.HDTV.XviD-MOMENTUM.srt

So the files are very similar, but not 100% identical.

How can I remove the duplicate documents and keep only one subtitle file per distinct episode?
I would attach what I have tried, but unfortunately I'm pretty clueless...

Upvotes: 0

Views: 377

Answers (1)

Binyamin Even

Reputation: 3382

You can use the cosine similarity between the documents.

The assumption is that documents from the same episode will have a high cosine similarity, so you can apply a threshold above which two documents are considered duplicates.

For example, suppose these are your documents:

1. "the child went home today, and his mother waited for him"
2. "My car is big, so said my mother"
3. "The kid went to his house today, while his mama waited for him to come"

Using the text_to_vector and get_cosine functions from vpekar's answer, I do the following:

>>> v1 = text_to_vector("the child went home today, and his mother waited for him")
>>> v2 = text_to_vector("My car is big, so said my mother")
>>> v3 = text_to_vector("The kid went to his house today, while his mama waited for him to come")

and the cosine similarities between the vectors are:

>>> get_cosine(v1,v2)
0.10660035817780521

>>> get_cosine(v1,v3)
0.48420012470625223

>>> get_cosine(v2,v3)
0.0
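For reference, here is a minimal sketch of what two such helpers might look like. It is not necessarily identical to vpekar's code: the `\w+` tokenizer and the case-sensitive word counts are assumptions on my part. Each document becomes a bag-of-words count vector, and the similarity is the dot product divided by the product of the vector lengths.

```python
import math
import re
from collections import Counter

# Assumption: a "word" is any run of letters, digits or underscores.
WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Turn a text into a bag-of-words vector (word -> count, case-sensitive)."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors."""
    common = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in common)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if not norm1 or not norm2:
        return 0.0  # an empty document is similar to nothing
    return numerator / (norm1 * norm2)


v1 = text_to_vector("the child went home today, and his mother waited for him")
v2 = text_to_vector("My car is big, so said my mother")
v3 = text_to_vector("The kid went to his house today, while his mama waited for him to come")

print(get_cosine(v1, v2))  # roughly 0.107
print(get_cosine(v1, v3))  # roughly 0.484
print(get_cosine(v2, v3))  # 0.0
```

Because the counts are case-sensitive, "the" and "The" are treated as different words here; lowercasing the text first would raise the similarity of sentences 1 and 3 a little.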

Documents 1 and 3 are clearly the most similar, so they may be subtitles of the same episode. To summarize:

1. Compare every possible pair of documents, which takes (n choose 2) = n(n-1)/2 comparisons.
2. If the cosine similarity between two documents is higher than a threshold (which you will have to find by trial and error),
    the subtitles are probably of the same episode, and you should remove one of them.
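The two steps above can be sketched roughly as follows. The function name `drop_near_duplicates`, the file names and the threshold of 0.4 are all hypothetical and chosen only to fit the three toy sentences; the helper functions are my own approximation of the ones used above, repeated so that the snippet runs on its own.

```python
import math
import re
from collections import Counter
from itertools import combinations

WORD = re.compile(r"\w+")  # assumption: simple word tokenizer


def text_to_vector(text):
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    numerator = sum(vec1[w] * vec2[w] for w in set(vec1) & set(vec2))
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    return numerator / (norm1 * norm2) if norm1 and norm2 else 0.0


def drop_near_duplicates(docs, threshold):
    """Return the names to keep from docs (a dict of name -> text).

    Every pair is compared once; when a pair scores above the threshold,
    the second document of the pair is dropped.
    """
    vectors = {name: text_to_vector(text) for name, text in docs.items()}
    removed = set()
    for a, b in combinations(docs, 2):
        if a in removed or b in removed:
            continue  # one of them is already gone, nothing to decide
        if get_cosine(vectors[a], vectors[b]) > threshold:
            removed.add(b)
    return [name for name in docs if name not in removed]


# Hypothetical file names standing in for real subtitle files.
docs = {
    "ep1_a.srt": "the child went home today, and his mother waited for him",
    "ep2.srt": "My car is big, so said my mother",
    "ep1_b.srt": "The kid went to his house today, while his mama waited for him to come",
}
print(drop_near_duplicates(docs, threshold=0.4))  # ['ep1_a.srt', 'ep2.srt']
```

With real subtitle files you would read each file's text into the dictionary first, and the right threshold will almost certainly be different from the one used here.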

Upvotes: 6
