David542

Reputation: 110382

Re-order copyright with regex

I need to position the year of copyright at the beginning of a string. Here are possible inputs I would have:

(c) 2012 10 DC Comics
2012 DC Comics
10 DC Comics. 2012
10 DC Comics , (c) 2012.
10 DC Comics, Copyright 2012
Warner Bros, 2011
Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
...etc...

From these inputs, I need to always have the output in the same format -

2012. 10 DC Comics.
2011. Warner Bros.
2011. Stanford and Sons, Ltd. Inc. All Rights Reserved
etc...

How would I do this with a combination of string formatting and regex?

This needs to be cleaned up, but this is what I am currently doing:

### copyright
copyright = value_from_key(sd_wb, 'COPYRIGHT', n).strip()
m = re.search('[0-2][0-9][0-9][0-9]', copyright)
try:
    year = m.group(0)
except AttributeError:
    copyright=''
else:
    copyright = year + ". " + copyright.replace(year,'')
    copyright = copyright.rstrip('.').strip() + '.'

if copyright:
    copyright=copyright.replace('\xc2\xa9 ','').replace('&amp;', '&').replace('(c)','').replace('(C)','').replace('Copyright', '')
    if not copyright.endswith('.'):
        copyright = copyright + '.'
    copyright = copyright.replace('  ', ' ')

Upvotes: 5

Views: 284

Answers (4)

tchrist

Reputation: 80423

This program:

from __future__ import print_function
import re

tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
)

for input in tests:
    print("<", input)
    output = re.sub(r'''
            (?P<lead> (?: \S .*? \S )?? )
            [\s.,]*
            (?: (?: \( c \) | copyright ) \s+ )?
            (?P<year> (?:19|20)\d\d )
            [\s.,]?
        ''', r"\g<year>. \g<lead>", input, count=1, flags=re.I | re.X)
    print(">", output, "\n")

when run under Python 2.7 or 3.2, produces this output:

< (c) 2012 DC Comics
> 2012. DC Comics 

< DC Comics. 2012
> 2012. DC Comics 

< DC Comics, (c) 2012.
> 2012. DC Comics 

< DC Comics, Copyright 2012
> 2012. DC Comics 

< (c) 2012 10 DC Comics
> 2012. 10 DC Comics 

< 10 DC Comics. 2012
> 2012. 10 DC Comics 

< 10 DC Comics , (c) 2012.
> 2012. 10 DC Comics 

< 10 DC Comics, Copyright 2012
> 2012. 10 DC Comics 

< Warner Bros, 2011
> 2011. Warner Bros 

< Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
> 2011. Stanford and Sons, Ltd. Inc All Rights Reserved. 

Which appears to be what you were looking for.
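Note that the desired output in the question ends every line with a period, which the substitution alone does not add. A minimal sketch of one way to finish the job, assuming it is acceptable to collapse whitespace and force a single trailing period (the names `PATTERN` and `year_first` are made up for this illustration):

```python
import re

# Same pattern as in the program above, compiled once.
PATTERN = re.compile(r'''
        (?P<lead> (?: \S .*? \S )?? )
        [\s.,]*
        (?: (?: \( c \) | copyright ) \s+ )?
        (?P<year> (?:19|20)\d\d )
        [\s.,]?
    ''', re.I | re.X)

def year_first(text):
    """Hypothetical helper: move the year to the front, end with one period."""
    moved = PATTERN.sub(r'\g<year>. \g<lead>', text, count=1)
    # Collapse runs of whitespace, then normalize the trailing punctuation.
    moved = ' '.join(moved.split())
    return moved.rstrip(' .,') + '.'

if __name__ == '__main__':
    for line in ('(c) 2012 10 DC Comics', '10 DC Comics , (c) 2012.', 'Warner Bros, 2011'):
        print(year_first(line))
```

This still drops the period after "Inc" in the Stanford example, because the pattern consumes the punctuation right before the "(C)" marker. A string with no year comes back unchanged apart from the trailing period.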

Upvotes: 2

andrew cooke

Reputation: 46872

this is messy, and i am not sure that you'll get a perfect solution, but you can get most of the way there by doing three things:

  1. targeting the copyright, rather than the rest of the text, and defining a "standard" for your regexps that gives you the same set of results for each match

  2. ordering a list of different regexps with | which will match the first it can (left to right) because, for example, you want to match "(c) 2012" before "2012".

  3. adding a separate, final phase to clean up punctuation and spaces.

for the first part i would suggest you need to return three things: before, year, and after where either before or after might not exist, but together they give you what you want as a result, except for the year.

in other words, using b, y and a for before, year and after:

(c) 2012 10 DC Comics
    yyyy aaaaaaaaaaaa

2012 DC Comics
yyyy aaaaaaaaa

10 DC Comics , (c) 2012.
bbbbbbbbbbbb       yyyy

Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.
bbbbbbbbbbbbbbbbbbbbbbbbbbbb     yyyy  aaaaaaaaaaaaaaaaaaaa

(note that we don't name the "(c)" etc because you don't want that).

so, given the above, a first stab at the regexp might be:

(?i)(?:(?P<before>.*)\s*Copyright\s*(?P<year>\d{4})(?P<after>.*)|
       (?P<before>.*)\s*\(c\)\s*(?P<year>\d{4})(?P<after>.*)|
       (?P<before>.*)\s*(?P<year>\d{4})(?P<after>.*))

where you should ignore linebreaks. the idea is that we try the "Copyright" first, then "(c)" and finally just "2012" (the initial (?i) is to get case insensitive matching). and your code will need to create a result from the match with something like:

d = match.groupdict()
# groupdict() holds every named group; one that did not take part is None
d['year'] + ' ' + (d['before'] or '') + ' ' + (d['after'] or '')

or, using .sub(), something like:

re.sub(..., r'\g<year> \g<before> \g<after>', ...)

finally, you will probably find that you need another pass to remove strange punctuation (remove any commas followed immediately by a period, replace multiple spaces with one, etc).
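one caveat: python's re module refuses to compile a pattern that defines the same group name more than once, so the regexp above cannot be used verbatim. a rough sketch of the same three steps, assuming an optional marker group is an acceptable stand-in for the three alternatives (the names `COPYRIGHT_RE` and `reorder` are invented for this example, and the cleanup rules are only a guess at what you need):

```python
import re

# Steps 1 and 2: one set of named groups. The optional marker is tried
# before the bare year, so "(c) 2012" wins over "2012" on its own.
COPYRIGHT_RE = re.compile(
    r'^(?P<before>.*?)'
    r'(?:(?:\(c\)|copyright)\s*)?'
    r'\b(?P<year>\d{4})\b'
    r'(?P<after>.*)$',
    re.IGNORECASE,
)

def reorder(text):
    """Hypothetical helper: return 'year. rest.' or '' when no year is found."""
    m = COPYRIGHT_RE.match(text.strip())
    if m is None:
        return ''
    rest = m.group('before') + ' ' + m.group('after')
    # Step 3: separate cleanup pass for punctuation and spaces.
    rest = re.sub(r'\s+([.,])', r'\1', rest)     # no space before . or ,
    rest = re.sub(r'[.,]+([.,])', r'\1', rest)   # ",." or ".." becomes one mark
    rest = ' '.join(rest.split()).strip(' .,')
    if not rest:
        return m.group('year') + '.'
    return '%s. %s.' % (m.group('year'), rest)

if __name__ == '__main__':
    print(reorder('Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.'))
```

the lazy `before` group is what makes the ordering work: it stops at the earliest point where a marker plus year, or failing that a bare year, can match.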

Upvotes: 1

Ethan Furman

Reputation: 69160

How about an answer that doesn't use regex?

tests = (
    '(c) 2012 DC Comics',
    'DC Comics. 2012',
    'DC Comics, (c) 2012.',
    'DC Comics, Copyright 2012',
    '(c) 2012 10 DC Comics',
    '10 DC Comics. 2012',
    '10 DC Comics , (c) 2012.',
    '10 DC Comics, Copyright 2012',
    'Warner Bros, 2011',
    'Stanford and Sons, Ltd. Inc. (C) 2011. All Rights Reserved.',
    )

def reorder_copyright(text):
    words = text.split()
    for i, word in enumerate(words):
        # a "(c)" or "Copyright" marker is followed directly by the year
        if word.lower() in ('(c)','copyright'):
            year = words[i+1]
            company = ' '.join(words[:i] + words[i+2:])
            break
    else:
        # no marker found: assume the year is the last word
        # (a leading bare year such as '2012 DC Comics' is not handled)
        year = words[-1]
        company = ' '.join(words[:-1])
    # trim stray punctuation left over from the split
    year = year.strip(' ,.')
    company = company.strip(' ,.')
    return "%s. %s." % (year, company)

if __name__ == '__main__':
    for line in tests:
        print(reorder_copyright(line))

Upvotes: 2

alan

Reputation: 4852

Search

^\(c\)\s+(?P<year>\d{4})\s+(?P<digits>\d{2}).*$|^(?P<digits>\d{2}).*(?P<year>\d{4})\.?

Replace

\g<year>. \g<digits> DC Comics.

This works with any four-digit year (not just 2012), and any two-digit number (not just 10). Don't know if you needed that or not. It's too ugly to explain :)

Edit: the OP changed both the inputs and outputs after I posted this answer, so it will not work. Move along, nothing to see here.

Upvotes: 1
