Damian Tan

Reputation: 23

Editing data enclosed in tags in a text file

I am currently cleaning data from text files. The files contain transcriptions of speech from daily conversations. Some of the files are multilingual; a few examples of multilingual portions look like this:

around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too
so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine

There can be several such languages in one file.

Going back to the first example, what I am trying to do is remove "<tamil>", "அம்மா:" and "</tamil>", keeping just the English pronunciation of the word. I have tried replacing <tamil> with "", but I am unsure how to approach removing the Tamil words.

The expected output would be:

around that area, ammaa would have cooked too
so at least need to pao liang tang,then I told them that it is fine

How would I go about doing so?

Upvotes: 0

Views: 47

Answers (1)

Bhargav

Reputation: 4062

Yes, please try this:

content = "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too"

# Pad the angle brackets with spaces, split on '>', then drop every token
# that looks like a tag ('<...') or an HTML entity ('&...;').
ft = ' '.join([word for line in [item.strip() for item in content.replace('<',' <').replace('>','> ').split('>') if not (item.strip().startswith('<') or (item.strip().startswith('&') and item.strip().endswith(';')))] for word in line.split() if not (word.strip().startswith('<') or (word.strip().startswith('&') and word.strip().endswith(';')))])
# Encoding to ASCII with errors ignored discards the non-ASCII (Tamil) characters.
outputs = ft.encode('ascii', 'ignore')

print(outputs.decode('utf-8'))

output

around that area, :ammaa would have cooked too

This is not the complete output. As you can see, the final string still contains some extra things, such as the ":" and some punctuation, so please clean those up yourself using a regex. I've posted 99% of the answer.
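For the remaining cleanup, one possible approach is a single regular expression that handles the tag, the native-script word, and the ":" in one pass. This is only a sketch: it assumes every tagged span has the form <lang>native:romanized</lang> with exactly one ":" separator and no nested tags, and the names TAGGED and keep_romanized are made up for illustration.

```python
import re

# Assumed span shape: <lang>native:romanized</lang>.
# \1 forces the closing tag to repeat the opening tag name.
TAGGED = re.compile(r"<(\w+)>[^:<]*:([^<]*)</\1>")

def keep_romanized(text):
    """Replace each tagged span with only its romanized part."""
    def repl(match):
        word = match.group(2).strip()
        before = match.string[:match.start()]
        # Add a space only when the tag was glued to the preceding text.
        needs_space = bool(before) and not before[-1].isspace()
        return (" " if needs_space else "") + word
    return TAGGED.sub(repl, text)

lines = [
    "around that area,<tamil>அம்மா:ammaa</tamil> would have cooked too",
    "so at least need to <mandarin>跑两趟:pao liang tang</mandarin>,then I told them that it is fine",
]
for line in lines:
    print(keep_romanized(line))
```

Unlike the encode('ascii', 'ignore') step, this leaves non-ASCII text outside the tags untouched. Spans that do not match the assumed shape (for example, a tag without a ":") are left as they are.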

Upvotes: 1
