peru_45

Reputation: 310

How to extract date in Month D, Yr format using regex?

I have a timestamp on a video in a format similar to Thurs May 7 10:21:02 1998, and I'd like to extract this piece of text from the video. Note: the day name may be 3-5 characters long (e.g. Wed, Thurs) and the day of the month may be 1-2 digits.

I looked for similar questions on this platform but couldn't find one that uses regex to extract a date in this particular format while handling the spaces and the varying number of characters in the day name and the day of the month.

Here is my attempt:

import re

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open(file))

# date_time = re.findall(r'\d{2}:\d{2}:\d{2}', text)  # works fine; extracts the time as desired
date_time = re.findall(r'\d{3,4} \d{3} \d{1,2} \d{2}:\d{2}:\d{2} \d{4}', text)  # doesn't work

print("timestamp: ", date_time)

Upvotes: 1

Views: 1429

Answers (1)

Wiktor Stribiżew

Reputation: 626738

You can use the following pattern:

\w+\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}

See the regex demo. Details:

  • \w+ - one or more word chars
  • \s+ - one or more whitespaces
  • \w{3} - three word chars
  • \s+ - one or more whitespaces
  • \d{1,2} - one or two digits
  • \s+ - one or more whitespaces
  • \d{2}:\d{2}:\d{2} - two digits, :, two digits, : and two digits
  • \s+ - one or more whitespaces
  • \d{4} - four digits.
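
As a minimal sketch of how the pattern might be dropped into the code from the question: the original attempt fails because \d matches digits only, so \d{3,4} and \d{3} can never match Thurs or May. The sample_text string below is a made-up stand-in for the pytesseract output, and the variable names are illustrative only.

```python
import re

# Pattern from the answer: day name, month, day of month, time, year
timestamp_re = re.compile(r'\w+\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4}')

# Pattern from the question, kept for comparison (\d cannot match letters)
original_re = re.compile(r'\d{3,4} \d{3} \d{1,2} \d{2}:\d{2}:\d{2} \d{4}')

# Hypothetical OCR output; in the question this comes from
# pytesseract.image_to_string(Image.open(file))
sample_text = "Camera 2\nThurs May 7 10:21:02 1998\nREC"

date_time = timestamp_re.findall(sample_text)
no_match = original_re.findall(sample_text)

print("timestamp: ", date_time)    # ['Thurs May 7 10:21:02 1998']
print("original attempt: ", no_match)  # []
```

Because \s+ is used instead of a literal space, the pattern should also tolerate the extra spaces or line breaks that OCR output sometimes contains between fields.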

Upvotes: 2
