George Black

Reputation: 35

Join lines after specific word till another specific word

I have a .txt file of a transcript that looks like this

MICHAEL: blablablabla.

further talk by Michael.

more talk by Michael.

VALERIE: blublublublu.

Valerie talks more.

MICHAEL: blibliblibli.

Michael talks again.

........

This pattern continues for up to 4000 lines, with up to seven different speakers rather than just two, each with a unique name written in upper-case letters (as in the example above). For some text mining I need to rearrange this .txt file in the following way:

  1. Join the lines following one speaker - but only the ones that still belong to him - so that the above file looks like this:

    MICHAEL: blablablabla. further talk by Michael. more talk by Michael.
    
    VALERIE: blublublublu. Valerie talks more.
    
    MICHAEL: blibliblibli. Michael talks again.
    
  2. Sort the now properly joined lines in the .txt file alphabetically, so that all lines spoken by a speaker are together. However, the sort should not reorder the sentences spoken by one speaker (after each speaker's lines have been grouped together).

I know some basic vim commands, but not enough to figure this out, especially the first step. I do not know what kind of pattern I can use in vim so that it only joins the lines of each speaker.

Any help would be greatly appreciated!

Upvotes: 1

Views: 155

Answers (4)

yolenoyer

Reputation: 9465

Here is a script solution to your problem.

It's not well tested, so I added some comments so you can fix it easily.

To make it run, just:

  • fill the g:speakers var at the top of the script with the uppercase names you need;
  • source the script (ex: :sav /tmp/script.vim|so %);
  • run :call JoinAllSpeakLines() to join the lines by speakers;
  • run :call SortSpeakLines() to sort.

You may adapt the different patterns to better fit your needs, for example adding some space tolerance (\u\{2,}\s*\ze:).

Here is the code:

" Fill the following array with all the speakers names:
let g:speakers = [ 'MICHAEL', 'VALERIE', 'MATHIEU' ]
call sort(g:speakers)


function! JoinAllSpeakLines()
" In the whole file, join all the lines between two uppercase speaker names 
" followed by ':', first inclusive:
    silent g/\u\{2,}:/call JoinSpeakLines__()
endf

function! SortSpeakLines()
" Sort the whole file by speaker, keeping the order for
" each speaker.
" Must be called after JoinAllSpeakLines().

    " Create a new dict, with one key for each speaker:
    let speakerlines = {}
    for speaker in g:speakers
        let speakerlines[speaker] = []
    endfor

    " For each line in the file:
    for line in getline(1,'$')
        let speaker = GetSpeaker__(line)
        if speaker == ''
            continue
        endif
        " Add the line to the right speaker:
        call add(speakerlines[speaker], line)
    endfor

    " Delete everything in the current buffer:
    normal gg"_dG

    " Add the sorted lines, speaker by speaker:
    for speaker in g:speakers
        call append(line('$'), speakerlines[speaker])
    endfor

    " Delete the first (empty) line in the buffer:
    normal gg"_dd
endf

function! GetOtherSpeakerPattern__(speaker)
" Returns a pattern which matches all speaker names, except the
" one given as a parameter.
    " Create a new list with a:speaker removed:
    let others = copy(g:speakers)
    let idx = index(others, a:speaker)
    if idx != -1
        call remove(others, idx)
    endif
    " Create and return the pattern, which looks like
    " this : "\v<MICHAEL>:|<VALERIE>:..."
    call map(others, 'printf("<%s>:",v:val)')
    return '\v' . join(others, '|')
endf

function! GetSpeaker__(line)
" Returns the uppercase name followed by a ':' in a line
    return matchstr(a:line, '\u\{2,}\ze:')
endf

function! JoinSpeakLines__()
" When cursor is on a line with an uppercase name, join all the
" following lines until another uppercase name.
    let speaker = GetSpeaker__(getline('.'))
    if speaker == ''
        return
    endif
    normal V
    " Search for other names after the cursor line:
    let srch = search(GetOtherSpeakerPattern__(speaker), 'W')
    if srch == 0
        " For the last one only:
        normal GJ
    else
        normal kJ
    endif
endf

Upvotes: 0

hek2mgl

Reputation: 158270

In vim you might take a two-step approach. First, replace all newlines with spaces:

:%s/\n\+/ /g

Then insert a newline before each UPPERCASE: term except the first one:

:%s/ \([[:upper:]]\+:\)/\r\1/g

For the sorting you can leverage the UNIX sort program:

:%!sort

You can combine them using a pipe symbol:

:%s/\n\+/ /g | %s/ \([[:upper:]]\+:\)/\r\1/g | %!sort

and map them to a key in your vimrc file:

:nnoremap <F5> :%s/\n\+/ /g \| %s/ \([[:upper:]]\+:\)/\r\1/g \| %!sort<CR>

If you press F5 in normal mode, the transformation happens. Note that the | needs to be escaped in the nnoremap command.
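
The effect of the two substitutions can be checked outside Vim. The following is a minimal Python sketch, not part of the original answer: it assumes `[A-Z]+` as a stand-in for Vim's `[[:upper:]]\+`, and the function name `rejoin_by_speaker` and the sample text are made up for illustration.

```python
import re

def rejoin_by_speaker(text):
    """Flatten the text, then break it again before each speaker tag."""
    # Step 1: replace every run of newlines with one space (like :%s/\n\+/ /g).
    flat = re.sub(r"\n+", " ", text)
    # Step 2: start a new line before each " NAME:" tag
    # (like :%s/ \([[:upper:]]\+:\)/\r\1/g).
    broken = re.sub(r" ([A-Z]+:)", r"\n\1", flat)
    # Step 1 leaves a trailing space where the final newline was; this
    # sketch strips it, which the Vim commands themselves do not do.
    return broken.strip()

sample = (
    "MICHAEL: blablablabla.\n\nfurther talk by Michael.\n\n"
    "VALERIE: blublublublu.\n\nValerie talks more.\n"
)
print(rejoin_by_speaker(sample))
```

Note that step 2 breaks before any upper-case word followed by a colon, so an upper-case word inside the dialogue that happens to be followed by ":" would also start a new line.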

Upvotes: 0

Taren

Reputation: 46

Alright, first the answer:

:g/^\u\+:/,/\n\u\+:\|\%$/join

And now the explanation:

  • g stands for global and executes the following command on every line that matches
  • /^\u\+:/ is the pattern :g searches for: ^ is the start of the line, \u is an upper-case character, \+ means one or more matches, and : is, unsurprisingly, a literal :
  • then comes the tricky bit: we give the executed command a range, from the match to some other pattern match. /\n\u\+:\|\%$/ has two parts separated by the alternation \| . \n\u\+: is a newline followed by the previous pattern, i.e. it matches the line before the next speaker. \%$ is the end of the file
  • join does what it says on the tin

So to put it together: For each speaker, join until the line before the next speaker or the end of the file.
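
The same idea ("for each speaker, join until the line before the next speaker or the end of the file") can be sketched outside Vim. This is a minimal Python illustration of the logic, not a translation of the :g command; `SPEAKER`, `join_speaker_blocks` and the sample lines are hypothetical names, and `[A-Z]+` is assumed as the equivalent of `\u\+`.

```python
import re

SPEAKER = re.compile(r"^[A-Z]+:")  # assumed equivalent of Vim's ^\u\+:

def join_speaker_blocks(lines):
    """Join each speaker line with the lines that follow it, up to the next speaker."""
    joined = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip the blank separator lines
        if SPEAKER.match(line) or not joined:
            joined.append(line)        # a speaker tag starts a new output line
        else:
            joined[-1] += " " + line   # continuation of the current speaker
    return joined

sample_lines = [
    "MICHAEL: blablablabla.", "", "further talk by Michael.", "",
    "VALERIE: blublublublu.", "", "Valerie talks more.",
]
for row in join_speaker_blocks(sample_lines):
    print(row)
```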

The closest to the sorting I know of is

:sort /\u\+:/ r

which sorts only by speaker name but reverses the other lines, so it isn't really what you are looking for.
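
What the question asks for is a stable sort keyed on the speaker name only: lines with the same speaker keep their original relative order. As a hedged illustration outside Vim, here is a minimal Python sketch; `speaker_of`, `sort_by_speaker` and the sample list are made-up names, and the input is assumed to be already joined to one line per utterance.

```python
import re

def speaker_of(line):
    """Return the leading upper-case name of a joined line, or '' if there is none."""
    m = re.match(r"[A-Z]+(?=:)", line)
    return m.group(0) if m else ""

def sort_by_speaker(lines):
    # sorted() is guaranteed to be stable, so lines with equal keys
    # (the same speaker) keep their original relative order.
    return sorted(lines, key=speaker_of)

joined = [
    "MICHAEL: blablablabla. further talk by Michael.",
    "VALERIE: blublublublu. Valerie talks more.",
    "MICHAEL: blibliblibli. Michael talks again.",
]
for line in sort_by_speaker(joined):
    print(line)
```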

Upvotes: 3

user2705585

Reputation:

Well, I don't know much about vim, but I was able to match the lines belonging to a particular speaker, and here is the regex for that.

Regex: /([A-Z]+:)([A-Za-z\s\.]+)(?!\1)$/gm

Explanation:
([A-Z]+:) captures the speaker's name which contains only capital letters.

([A-Za-z\s\.]+) captures the dialogue.

(?!\1)$ uses a backreference to the speaker's name and checks whether the next speaker is the same as the last one. If not, it matches until the new speaker is found.

I hope this will help you with matching at least.
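
To try the regex on the sample from the question, here is a minimal Python sketch. It assumes Python's `re` module as the regex engine, with the /gm flags expressed as `finditer()` plus `re.MULTILINE`; `PATTERN`, `speaker_blocks` and `sample` are names invented for the illustration.

```python
import re

# The regex from the answer above, unchanged.
PATTERN = re.compile(r"([A-Z]+:)([A-Za-z\s\.]+)(?!\1)$", re.MULTILINE)

def speaker_blocks(text):
    """Return each matched speaker block with its whitespace collapsed to single spaces."""
    return [" ".join(m.group(0).split()) for m in PATTERN.finditer(text)]

sample = (
    "MICHAEL: blablablabla.\n\nfurther talk by Michael.\n\n"
    "VALERIE: blublublublu.\n\nValerie talks more.\n\n"
    "MICHAEL: blibliblibli.\n\nMichael talks again.\n"
)
for block in speaker_blocks(sample):
    print(block)
```

On this sample each block ends where the next speaker's colon begins, because ":" is not in the character class. The class `[A-Za-z\s\.]` also excludes commas, digits, apostrophes and question marks, so a real transcript would likely need a wider class.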

Upvotes: 0
