Reputation: 143
I'm trying to form sentences from single words in a dataframe (sometimes ending with .?!), and recognize that U. or S. is not the end of the sentence.
data = {
"start_time": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.9, 2.1, 2.3, 2.5],
"end_time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4],
"word": [
"WHERE",
"ARE",
"YOU?",
"I",
"AM",
"U.",
"S.",
"OK,",
"COOL!",
"YES",
"IT",
"IS.",
],
}
df = pd.DataFrame(data, columns=["start_time", "end_time", "word"])
The dataframe looks like:
s_time e_time word
0.1 0.2 WHERE
0.3 0.4 ARE
0.5 0.6 YOU?
0.7 0.8 I
0.9 1.0 AM
1.1 1.2 U.
1.3 1.4 S.
1.5 1.6 OK,
1.7 1.8 COOL!
1.9 2.0 YES
2.1 2.2 IT
2.3 2.4 IS.
The result I want to get looks like:
s_time e_time sentence
0.1 0.6 WHERE ARE YOU?
0.7 1.4 I AM U. S.
1.5 1.8 OK, COOL!
1.9 2.4 YES IT IS.
I am stuck with how to get U. S. in one sentence.
Any suggestion would be much appreciated and really thanks for anyone help!
Upvotes: 0
Views: 565
Reputation: 13488
You could try this:
# Initialize variables
new_data = {"start_time": [], "end_time": [], "sentence": []}
sentence = []
start_time = None
# Iterate on the dataframe
for i, row in df.iterrows():
# Initialize start_time
if not start_time:
start_time = row["start_time"]
if (
not row["word"].endswith("?")
and not row["word"].endswith("!")
and not row["word"].endswith("S.")
):
# If word is not ending a phrase, get it
sentence.append(row["word"])
else:
# Pause iteration and update new_data with start_time, end_time
# and completed sentence
new_data["start_time"].append(start_time)
new_data["end_time"].append(row["end_time"])
sentence.append(row["word"])
new_data["sentence"].append(" ".join(sentence))
# Reset variables
start_time = None
sentence = []
new_df = pd.DataFrame(new_data, columns=["start_time", "end_time", "sentence"])
print(new_df)
# Outputs
start_time end_time sentence
0 0.1 0.6 WHERE ARE YOU?
1 0.7 1.4 I AM U. S.
2 1.5 1.8 OK, COOL!
3 2.1 2.4 YES IT IS.
Upvotes: 1