Vinayak

Reputation: 776

How to maintain links when migrating from one CMS to another

Background: We are migrating a website hosted on a custom .NET-based CMS to WordPress.

The Problem: The content in the various posts contains links to other content in the CMS. These links were inserted manually and contain the entire URL, starting from http. While we have moved all the post content to WordPress using a PHP script, the links within the content still point to the old URLs. Since the URL structure has changed, there does not seem to be a programmatic way of replacing the links.

Example of the old URL: http://www.example.com/doing-this-and-that-1234.aspx

Example of the new URL: http://www.example.com/categoryname/doing-this-and-that/

Request: I need ideas on how we can handle this without having to change all the links manually.

Thanks in advance.

Upvotes: 0

Views: 359

Answers (3)

Blake

Reputation: 841

I'm doing something similar at the moment, migrating a huge static HTML store to run on Django (it's painful and bloody).

Our solution isn't anything particularly elegant. During migration of each page, we note the old URL and the new URL and add them to a redirect database. Once we've migrated all of our content to the new backend and URL structure, we run a script that identifies all links in the document with these XPath selectors:

 //a/@href
 //img/@src
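The extraction step can be sketched in pure Python. We use lxml's XPath in practice, but a stdlib-only equivalent with html.parser (an assumption for illustration, targeting the same two attributes) looks like this:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the attributes the XPath selectors above target:
    //a/@href and //img/@src."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:
            self.links.append(attrs['src'])

collector = LinkCollector()
collector.feed('<p><a href="http://www.example.com/a-1234.aspx">x</a>'
               '<img src="/img/pic.jpg"></p>')
print(collector.links)
# -> ['http://www.example.com/a-1234.aspx', '/img/pic.jpg']
```

Every href and src collected this way then gets looked up in the redirect table.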

Next we pull up the redirects from our redirect table and replace the links with the regexes below.

import re

#escape regex metacharacters in the source link; re.escape covers
#'#', '.', '/', ':' and every other special character in one call
link = re.escape(link)

#compile a regex around the source link and replace all existing hrefs
repl_regex = r'href\s*=\s*["\'\s]*(%s)["\'\s]*' % link
markup = re.sub(repl_regex, 'href="%s"' % dst_url, markup)

#repeat for images
repl_regex = r'src\s*=\s*["\'\s]*(%s)["\'\s]*' % link
markup = re.sub(repl_regex, 'src="%s"' % dst_url, markup)

The above is written in Python; it sounds like you're using PHP and a .NET language, but the same approach translates directly.

Now, while this method is probably more work than you'd like and requires a little more upfront preparation, it has two advantages:

1) By comparing every link in a document to a redirect table, you will be able to more easily identify missing pages / missing redirects

2) SEO. Instead of making the googlebot recrawl your entire site, simply provide 301 redirects against your redirect table
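To make the second point concrete, here is a minimal sketch of serving 301s from the redirect table. The dict and function names are mine for illustration; in practice the lookup would hit the migration database and be wired into the web server or WordPress:

```python
# redirect table built during migration: old path -> new path
redirects = {
    '/doing-this-and-that-1234.aspx': '/categoryname/doing-this-and-that/',
}

def redirect_for(path):
    """Return (status, location) for a requested old path: a 301 with
    the new URL if the path is in the table, otherwise a 404."""
    new_path = redirects.get(path)
    if new_path is None:
        return (404, None)
    return (301, new_path)

print(redirect_for('/doing-this-and-that-1234.aspx'))
# -> (301, '/categoryname/doing-this-and-that/')
```

Any old path that falls through to the 404 branch is a missing redirect worth logging.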

Let me know if you have any questions.

Upvotes: 1

Ludovic Kuty

Reputation: 4954

If you can build a mapping between the number in the URL and a category name, then it is feasible: search all files with a regex that finds URLs of the form http://www.example.com/doing-this-and-that-1234.aspx and replace them with the new URL.

Regex:

(http://www\.example\.com/.*?)-(\d+)\.aspx
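Applied in Python with re.sub and a replacement function, using a hypothetical id-to-category table (which would have to come from the old CMS database), a variant of this regex with the base URL split out could look like:

```python
import re

# hypothetical mapping from the numeric id in the old URL to its
# WordPress category name, exported from the old CMS database
categories = {'1234': 'categoryname'}

pattern = re.compile(r'(http://www\.example\.com/)(.*?)-(\d+)\.aspx')

def to_new_url(m):
    base, slug, page_id = m.groups()
    return '%s%s/%s/' % (base, categories[page_id], slug)

old = 'http://www.example.com/doing-this-and-that-1234.aspx'
print(pattern.sub(to_new_url, old))
# -> http://www.example.com/categoryname/doing-this-and-that/
```

Running pattern.sub over each post body rewrites every matching link in one pass.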

Upvotes: 0

Uphill_ What '1

Reputation: 683

I can't think of a really good way to do this, but here is a thought. You could run a command-line script that loops over all the pages, then over all the links in each page, and shows the user the original link and a "suggested" link. The suggested link would be the new URL format with the most common category name, with the option to switch to any of the other category names.
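The suggestion step could be sketched like this. The category list and URL pattern are assumptions (ordered most-common-first, and old URLs following the /slug-1234.aspx shape from the question); the interactive prompt around it is left out:

```python
import re

# hypothetical list of WordPress category names, most common first
category_names = ['categoryname', 'othercategory']

def suggest(old_url):
    """Return candidate new URLs for an old CMS link, best guess first;
    an empty list means the link does not match the old URL pattern."""
    m = re.match(r'http://www\.example\.com/(.*?)-\d+\.aspx$', old_url)
    if not m:
        return []
    slug = m.group(1)
    return ['http://www.example.com/%s/%s/' % (c, slug)
            for c in category_names]

print(suggest('http://www.example.com/doing-this-and-that-1234.aspx'))
```

The script would show the first suggestion by default and let the user pick one of the alternatives.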

If you don't want to write the script, you can alternatively use a text editor like Notepad++ or vim/gvim. In Notepad++ you would use Replace with "Search Mode" set to "Regular expression", and in vim you would use the confirm flag of the substitute command (:%s/foo/bar/gc).

Upvotes: 1
