Reputation: 23
I am currently cleaning data from text files that contain transcriptions of speech from daily conversations. Some of the files are multilingual; a few examples of multilingual portions look like this:
around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too
so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine
There can be several such other languages in one file.
Going back to the first example, what I am trying to do is remove "<tamil>", "அம்மா:" and "</tamil>", keeping just the English pronunciation of the word. I have tried replacing <tamil> with "", but I am unsure how to approach removing the Tamil words.
The expected output would be:
around that area, ammaa would have cooked too
so at least need to pao liang tang,then I told them that it is fine
How would I go about doing so?
Upvotes: 0
Views: 47
Reputation: 4062
Yes, please try this:
content = "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"
def is_markup(s):  # tag fragment such as "<tamil", or an HTML entity such as "&amp;"
    return s.strip().startswith('<') or (s.strip().startswith('&') and s.strip().endswith(';'))
# Pad the angle brackets, then split on '>' so every tag fragment becomes its own word
chunks = [item.strip() for item in content.replace('<', ' <').replace('>', '> ').split('>')]
ft = ' '.join(word for chunk in chunks if not is_markup(chunk) for word in chunk.split() if not is_markup(word))
outputs = ft.encode('ascii', 'ignore')  # drops the Tamil script but keeps the ':'
print(outputs.decode('utf-8'))
Output:
around that area, :ammaa would have cooked too
The output is not complete: the final string still contains leftovers such as the ":" and some punctuation. Please clean those up yourself with a regex; I've posted about 99% of the answer.
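For the remaining cleanup, a single regex pass may be enough. The following is only a minimal sketch, assuming every tagged span has the shape <lang>native:romanized</lang> with a plain ASCII colon and no nested tags. The names TAGGED and keep_romanized are made up for illustration.

```python
import re

# Matches <lang>native:romanized</lang>. Group 1 is the tag name (the closing
# tag must repeat it); group 2 is the romanized text after the colon.
# Assumption: ASCII ':' separator and no nested tags.
TAGGED = re.compile(r"<(\w+)>[^<:]*:([^<]*)</\1>")

def keep_romanized(text):
    """Replace each tagged span with only its romanized part."""
    def repl(m):
        # Add a space if the tag directly follows a non-space character,
        # e.g. "area,<tamil>..." becomes "area, ammaa".
        prev = text[m.start() - 1] if m.start() > 0 else " "
        pad = "" if prev.isspace() else " "
        return pad + m.group(2).strip()
    return TAGGED.sub(repl, text)

print(keep_romanized("around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"))
print(keep_romanized("so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine"))
```

On the two sample lines from the question this prints the expected output. Because the tag name is captured rather than hard-coded, the same pattern should cover other language tags in the same file, as long as they follow the same shape.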
Upvotes: 1