Robert Andrews

Reputation: 1249

Regex to capture transcript speaker names before colon

From text transcripts, I want to capture the names of all speakers. Each name begins at the start of a line and ends at ": " (i.e. a colon and a space).

Optionally, for even finer control, it may be safe to assume the name ends at the first colon followed by two spaces.

Example text:

Julian Z.:          What's really exciting is the opportunity to be more intelligent about how you approach trying to reach your consumer. In a world where digital and the use of digital has exploded, to be able to have one-on-one conversations in the digital world, and to be able to eventually translate that into the TV space, whether that be addressable or data-driven, is really fantastic. Because at the end of the day, you want your brand, in our case, our networks, to be able to have a relationship with the consumer. Data is a proxy to allow for that to occur.
            From an advertiser perspective, obviously now the ability to go to the broadcast networks and have a data-driven buy has absolutely blown up and proliferated. That's with us. That's with some of our competitors. Obviously, we think we're the best at it, but neither here nor there. I think it's a really wonderful foundational approach for advertisers to take. I think it's a great advancement in the market.
                As a spender of money, and as somebody who is trying to get people to engage with our brands, the ability to use data to really have, again, these really one-on-one, unique conversations, and to be able to deliver creative content that's relevant for individual consumers, that's driven by what we know about the consumer, now, ultimately, where we can reach them effectively and in environments where we know they're engaged, is really a great, tremendous advancement. You'll see by our ratings numbers, which are on the upswing, that approach has really had a direct impact on what our linear ratings have resulted in.

Speaker 2:          Great. Tell us a little bit about Viacom. It's a lot of fans, a lot of passion in people. How do you define the audience in broad strokes? How do they respond to advertising and what are some of the concerns that consumers have around ads?

Julian Z.:          Well, I think, again, when you're talking about how we're reaching fans, it is using intelligence, and information, and data, not only to profile who our fans are, but ultimately where they're best reached. Our job is to deliver great, compelling content, which we believe we're really, really good at. 
                In order to do that, there's the linear side of the equation, but of course we want to make sure that we're reaching our fans in digital as well, and that there's a 360 kind of fan experience. We believe holistically that our fans are really the base of what we're trying to do. We're trying to please and create value for our fans. The more we engage with them, and the more we know about them, the better we're able to deliver customized content that fits their need. 
                Ultimately, as a content creator, what's more exciting than to delivery really great content to people that they really, really engage with and they build relationships with? That's all you can really hope for is, somebody that creates content, is to be able to develop compelling content and content that your audience really wants to engage with.

Speaker 2:          When you look at targeting, is that a cross-platform? Where does that targeting happen?

Julian Z.:          It absolutely is cross-platform. Of course, there is natural addressability in the digital market, because it is much more of a one-to-one. But now you see a lot of the MVPDs have obviously opened up addressable inventory. A lot of the MVPDs now have matured their addressable footprint, which allows you now to have a digital-like, not exactly the same obviously, but a digital-like experience in the linear space, to deliver content to the consumer or advertising to the consumer when it's relevant and when it's going to have the most impact for your message. 
                Ultimately, it's absolutely cross-platform because addressability is all about having that conversation, having that direct one-to-one with your audience. Our partners on the MVPD side have really matured over the last several years as of regard to addressable, and now you can have that 360 experience of having a conversation in linear and in digital that really is addressable. 

Example strings to be captured are: Julian Z. and Speaker 2. Names will vary from text to text. I need every name present, not just the first. As you can see, names may include mixed-case letters, punctuation characters and digits.

I will eventually want to deduplicate the names, since they repeat throughout the text, but I am shelving that for now to keep this question focused on the capture.

I have tried many patterns over the last day or two.

E.g. ^[^:]+\s* with /g comes close, but it only captures the first Julian Z., whereas I want every name. I am out of ideas and would like to learn how to do this.

Upvotes: 2

Views: 237

Answers (2)

anubhava

Reputation: 785541

You can use this regex based on a negated character class:

/^\w[^:\n]*/mg

RegEx Breakup:

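  • ^\w: Match a word character at the start of a line (the m flag makes ^ match at every line start, so indented continuation lines are skipped)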
  • [^:\n]*: Match zero or more of any character that is not a colon and not a newline.

Code:

var names = inputData.transcript.match(/^\w[^:\n]*/mg) || [];
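
As a minimal, self-contained sketch of the same pattern: the `transcript` string below is a shortened stand-in for the question's text, and `extractSpeakers` is a hypothetical helper name, not something from the original code. Note that the pattern does not itself require a colon, so a line that starts with a word character but contains no colon would be returned whole; the sample has no such lines.

```javascript
// Shortened stand-in for the transcript in the question (assumed sample data).
const transcript = [
  "Julian Z.:          What's really exciting is the opportunity.",
  "            From an advertiser perspective, obviously now.",
  "",
  "Speaker 2:          Great. Tell us a little bit about Viacom.",
  "",
  "Julian Z.:          Well, I think, again.",
].join("\n");

// Hypothetical helper: returns every run of non-colon characters that
// begins with a word character at the start of a line.
function extractSpeakers(text) {
  return text.match(/^\w[^:\n]*/mg) || [];
}

const names = extractSpeakers(transcript);
console.log(names); // [ 'Julian Z.', 'Speaker 2', 'Julian Z.' ]
```

The indented continuation line and the blank lines are skipped because neither begins with a word character.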

Upvotes: 1

Miguel

Reputation: 20633

Regex to match any characters up until the first colon:

/^.*?(?=:)/gm

https://regex101.com/r/3uyXMM/3

^: match from beginning of line

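.: match any character except a newline

*?: non-greedy quantifier, so the match stops at the first colon (see next line)

(?=:): positive lookahead, meaning the next character must be a colon, but the colon is not included in the match

g: global, so all matches are returned instead of stopping after the first

m: multiline, so ^ matches at the start of every line rather than only at the start of the input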
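
The effect of the flags can be checked with a small sketch. The `sample` string is assumed data shortened from the question, with one continuation line deliberately given a colon to show a limitation of the lazy pattern. The `strict` pattern is an assumed variant, not part of this answer: it requires the name to start with a non-space character and to be followed by a colon and at least one space or tab. The `Set` step is one possible way to do the deduplication the question mentions as a later step.

```javascript
// Assumed sample data, shortened from the question's transcript.
const sample = [
  "Julian Z.:          It absolutely is cross-platform.",
  "                Ultimately, note: this continuation line has a colon.",
  "Speaker 2:          When you look at targeting?",
  "Julian Z.:          Well, I think, again.",
].join("\n");

// Without the m flag, ^ only matches at the very start of the input,
// so a pattern like the question's attempt finds a single name.
const withoutM = sample.match(/^[^:]+/g) || [];

// The pattern above: lazily match up to the first colon on each line.
// It also picks up the indented continuation line, because that line has a colon.
const lazy = sample.match(/^.*?(?=:)/gm) || [];

// Assumed stricter variant: capture group 1 is the name; the line must not
// start with whitespace, and the colon must be followed by spaces or tabs.
const strict = Array.from(
  sample.matchAll(/^(\S[^:\n]*):[ \t]+/gm),
  (m) => m[1]
);

// A Set keeps the first occurrence of each name, in order.
const unique = [...new Set(strict)];

console.log(withoutM); // [ 'Julian Z.' ]
console.log(lazy.length); // 4
console.log(unique); // [ 'Julian Z.', 'Speaker 2' ]
```

If continuation lines in real transcripts never contain a colon, the lazy pattern alone is enough; the stricter variant only matters when they do.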

Upvotes: 2
