Mark J.

Reputation: 143

How to form sentences from single words in a dataframe?

I'm trying to build sentences from single words stored in a dataframe. Some words end with ., ? or !, which marks the end of a sentence, but the periods in an abbreviation such as U. S. should not split the sentence.

import pandas as pd

data = {
    "start_time": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.9, 2.1, 2.3, 2.5],
    "end_time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4],
    "word": [
        "WHERE",
        "ARE",
        "YOU?",
        "I",
        "AM",
        "U.",
        "S.",
        "OK,",
        "COOL!",
        "YES",
        "IT",
        "IS.",
    ],
}
df = pd.DataFrame(data, columns=["start_time", "end_time", "word"])

The dataframe looks like:

start_time  end_time  word
0.1         0.2       WHERE
0.3         0.4       ARE
0.5         0.6       YOU?
0.7         0.8       I
0.9         1.0       AM
1.1         1.2       U.
1.3         1.4       S.
1.5         1.6       OK,
1.7         1.8       COOL!
1.9         2.0       YES
2.1         2.2       IT
2.3         2.4       IS.

The result I want to get looks like:

start_time  end_time  sentence
0.1         0.6       WHERE ARE YOU?
0.7         1.4       I AM U. S.
1.5         1.8       OK, COOL!
1.9         2.4       YES IT IS.

I'm stuck on how to keep U. S. together in one sentence.

Any suggestions would be much appreciated, and thanks in advance to anyone who helps!

Upvotes: 0

Views: 565

Answers (1)

Laurent

Reputation: 13488

You could try this:

# Initialize variables
new_data = {"start_time": [], "end_time": [], "sentence": []}
sentence = []
start_time = None

# Iterate on the dataframe
for i, row in df.iterrows():
    # Initialize start_time (compare with None so a 0.0 timestamp is not skipped)
    if start_time is None:
        start_time = row["start_time"]
    # "S." closes "U. S." and, in this sample, also matches "IS."
    if not row["word"].endswith(("?", "!", "S.")):
        # If word is not ending a phrase, get it
        sentence.append(row["word"])
    else:
        # Pause iteration and update new_data with start_time, end_time
        # and completed sentence
        new_data["start_time"].append(start_time)
        new_data["end_time"].append(row["end_time"])
        sentence.append(row["word"])
        new_data["sentence"].append(" ".join(sentence))
        # Reset variables
        start_time = None
        sentence = []

new_df = pd.DataFrame(new_data, columns=["start_time", "end_time", "sentence"])

print(new_df)
# Outputs
   start_time  end_time        sentence
0         0.1       0.6  WHERE ARE YOU?
1         0.7       1.4      I AM U. S.
2         1.5       1.8       OK, COOL!
3         2.1       2.4      YES IT IS.
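
Note that the "S." check above is tailored to this sample: it closes "U. S." and happens to match "IS.", but a sentence ending in any other word with a period would not be closed. A more general rule could be sketched as follows. It assumes that an abbreviation is always a run of single capital letters, each followed by a period, so a word like "U." only ends a sentence when the next word is not another such initial. That rule and the helper name words_to_sentences are my own assumptions, not something stated in the question, and the sample times follow the table shown in the question.

```python
import pandas as pd


def words_to_sentences(df: pd.DataFrame) -> pd.DataFrame:
    """Group one-word rows into sentences (hypothetical helper, a sketch only)."""
    word = df["word"]
    # Assumption: an initial is exactly one capital letter plus a period ("U.", "S.")
    is_initial = word.str.fullmatch(r"[A-Z]\.")
    # True when the following word is also an initial (False on the last row)
    next_is_initial = is_initial.shift(-1, fill_value=False)
    # Words that end with sentence punctuation
    ends_with_punct = word.str.contains(r"[.?!]$")
    # An initial followed by another initial does not close the sentence
    is_end = ends_with_punct & ~(is_initial & next_is_initial)
    # Sentence id = number of sentence ends seen before the current row
    sentence_id = is_end.shift(1, fill_value=False).cumsum().rename("sentence_id")
    # Trailing words without closing punctuation end up in a final group
    return (
        df.groupby(sentence_id)
        .agg(
            start_time=("start_time", "first"),
            end_time=("end_time", "last"),
            sentence=("word", " ".join),
        )
        .reset_index(drop=True)
    )


# Sample data with the times taken from the table in the question
df = pd.DataFrame(
    {
        "start_time": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 2.3],
        "end_time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4],
        "word": ["WHERE", "ARE", "YOU?", "I", "AM", "U.", "S.",
                 "OK,", "COOL!", "YES", "IT", "IS."],
    }
)
result = words_to_sentences(df)
print(result)
```

This avoids iterrows and does not hard-code "S.", but it is still a heuristic: multi-letter abbreviations such as "MR." would be treated as sentence ends, and a sentence that ends with an initial and is followed by one starting with an initial would be merged.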

Upvotes: 1
