So, you decided to upgrade to RoboHelp from a previous version. Beautiful. I did, too! And it’s great!
Except for all that leftover dirty html!
I am very excited by the ability to disallow in-line styling, but years of in-line styling by a wide variety of authors have left our project with some very messy HTML. If you find yourself in the same predicament, keep reading and I’ll show you some great ways to clean it up!
WARNING: Continue at your own risk. Make lots of backups. If you break it, it’s not my fault, I warned you. And I will continue to warn you many times throughout this document.
These instructions are really for more advanced users and you should have a fairly decent amount of HTML knowledge before you do this.
Back it up!
BACK UP EVERYTHING ALL THE TIME. Again, BACK UP EVERYTHING ALL THE TIME!
Before you do anything, please make a copy of your project. Or maybe two copies. During the process of upgrading my project from RoboHelp 2017 to the new, sleek, amazing RoboHelp 2019, I actually made a copy at several points, especially while doing mass find/replace tasks.
Since this topic isn’t really about the upgrade process so much as it’s about the clean-up of your HTML after the upgrade, I’ll just give you a brief outline of my upgrade process:
- Remove project from source control (and archive RH 2017 version).
- Make a copy of the project.
- Clean up the project to get it ready for the upgrade (make sure no file names contain spaces, do some general clean-up, organize topics, things of that nature).
- Make a copy of the project (so I don’t have to start all over if the upgrade fails).
- Clean your in-line styles. I originally did this later in the process, but I realized after a lot of work that you really should do it first to avoid screwing up your Conditional Build Tags. (Make a new copy of your project after every replace. If something goes wrong, you’ll be glad you did!)
- Upgrade the project (Make sure your RH 2019 version has all service packs before you do this, or, like me, you’ll need to go back to that copy and start again after you apply them!)
- Make a copy of the new project.
- Spend some time figuring out the new format and getting used to the new layout. It’s beautiful. Gush about it to your coworkers and anyone who will listen.
- Realize that all the inline styles from the project are causing things to look funky.
- Deep sigh because you know cleaning up the inline styles is going to take FOR-EV-ER!!
- Realize you have options!
Regex is my savior
You know and I know that making all of these changes in RoboHelp is going to take forever. Cleaning up HTML is a horrendous thing. And because RoboHelp now uses HTML5, I want to take advantage of that! I turned to Google for some help, and the overwhelming answer was “use a regular expression.” It took me some time to figure out how to do this, but it really is easy once you know what you’re doing, and I’m hoping to save you the trouble of trying to figure it out on your own.
Notepad++ to the rescue!
We’re going to use Notepad++ to do this cleanup, and here’s why:
- Notepad++ allows you to use regular expressions.
- Notepad++ allows you to filter the type of files you’re making replacements in.
- Notepad++ allows you to make changes to an entire directory’s worth of files and sub-folders.
- Notepad++ is free.
We’re going to use an awesome search feature in Notepad++. I actually recommend that, as you go through these, you do the find/replace one at a time for the first 10 or 15 replacements, just to make sure nothing is going wrong. I’ll list some potential roadblocks at the end of this article.
This is advanced stuff! (A disclaimer)
I’m not taking responsibility if you screw this up. I am merely telling you how to do it because I had some trouble finding an exact answer that wasn’t bent towards programmers. I had a programmer (Hi, Robert!) help me tweak the expression because I don’t know much at all about regular expressions. (I always say, I know enough to be dangerous…)
I’m not telling you this to keep you from trying this. No, not at all. I’m telling you this because I want you to BACK IT UP!
Back it up!
Back up your files again. I can’t stress this enough. When you find/replace all, you don’t get an undo. Really, best practice will be to make a copy of the entire directory before each mass find/replace. It’s a lot of work, but less work than, say, manually editing 800+ topics in your project to undo a find/replace that wasn’t … accurate.
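If copying the folder by hand before every replace gets tedious, the copy can be scripted. This is only a minimal sketch in Python, not part of my original process: the function name `backup_project` and the timestamped folder naming are my own assumptions, and the paths in the usage comment are placeholders.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_project(project_dir: str, backup_root: str) -> Path:
    """Copy the whole project folder into a new timestamped folder.

    Returns the path of the copy. The original folder is not modified.
    """
    source = Path(project_dir)
    # Timestamp (down to microseconds) keeps successive backups apart.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    target = Path(backup_root) / f"{source.name}-backup-{stamp}"
    # copytree creates the target folder and fails if it already exists,
    # so an older backup is never overwritten.
    shutil.copytree(source, target)
    return target


# Hypothetical usage (placeholder paths, adjust to your own project):
# backup_project(r"C:\Docs\MyHelpProject", r"D:\Backups")
```

Run it once before each mass find/replace so there is always a copy from just before the change.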
Open search and set parameters!
WARNING: BE VERY, VERY CAREFUL WITH THIS, PARTICULARLY WITH THE MATCHED PAIRS. Some replacements have gone badly, requiring me to clean them up. This will happen if you don’t think your replacement through fully, and is why you should go one at a time if you aren’t sure!
Also of note, watch out for Conditional Build Tags (CBTs). Before the upgrade to RoboHelp 2019, CBTs have the special RoboHelp tag: <?rh-cbt_start condition="XXX" ?>. After you upgrade to RoboHelp 2019, CBTs become an attribute on the HTML tag itself, so you’ll have something like <td data-condition="CBTName">. A find expression that strips everything inside the tag will strip the condition along with the style. Because of this, it’s probably best to do your clean-up BEFORE you upgrade to RoboHelp 2019.
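To see which topics carry CBTs before you run anything broad, a quick scan helps. The following is a rough Python sketch under my own assumptions: it looks only for the two markers quoted above (`<?rh-cbt_start` and `data-condition=`), and the function name `files_with_cbts` is made up for illustration.

```python
import re
from pathlib import Path
from typing import List

# The two CBT markers mentioned in the text: the pre-upgrade processing
# instruction and the post-upgrade data-condition attribute.
CBT_MARKERS = re.compile(r"<\?rh-cbt_start|data-condition\s*=")


def files_with_cbts(root: str, pattern: str = "*.htm") -> List[Path]:
    """Return every file under root (including sub-folders) containing a CBT marker."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        if CBT_MARKERS.search(text):
            hits.append(path)
    return hits
```

Any file this returns deserves a manual look before a wildcard replace touches it.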
- Open Notepad++.
- Open search:
- CTRL + F on your keyboard and click the Find in Files tab, or
- Search > Find in Files….
- In the Find what: field, enter a phrase from the FIND EXPRESSION column in the chart below.
- In the Replace with: field, enter what you want to replace the found string with (some suggestions are in the REPLACE EXPRESSION column in the chart below).
- In the Filters: field, enter a file type if you want to limit the find/replace option to certain file types. I strongly recommend using this option. I put *.htm in the field to limit my search to HTML files created by RoboHelp. Without a filter, it’s possible the search will find/replace something you don’t want replaced. (One of my searches had a match in a .png file, and a replace would have corrupted the image.)
- In the Directory: field, browse to the parent folder of the HTML files you want to perform the find/replace operation on.
- In the Search Mode section, select the Regular expression radio button.
After all of this, I usually do a Find All first, then double-click a file from the find results and switch to the Replace tab. I then just do Find Next and Replace until I’m satisfied that the search is going to cover all desired instances without blowing things up. Once I’m sure of that, I switch back to the Find in Files tab and click Replace in Files.
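The same "look before you replace" habit can be sketched outside Notepad++ as well. This Python example is only an illustration under assumptions: `preview_matches` is a hypothetical helper, and Python’s `re` module is not the same regex engine Notepad++ uses, so treat its results as a sanity check and not a guarantee. It walks a folder and its sub-folders, looks only at `*.htm` files, and reports what an expression would match without changing anything.

```python
import re
from pathlib import Path


def preview_matches(root, find_expr, pattern="*.htm", limit=5):
    """Return {path: [matched strings]} for files under root. No file is modified."""
    regex = re.compile(find_expr)
    report = {}
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        matches = [m.group(0) for m in regex.finditer(text)]
        if matches:
            # Keep only the first few matches per file to keep the report short.
            report[path] = matches[:limit]
    return report


# Hypothetical usage (placeholder path):
# for path, found in preview_matches(r"C:\Docs\MyHelpProject", r"<td (.*?)>").items():
#     print(path, found)
```

The file filter plays the same role as the Filters: field: files such as images are never opened.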
All of the examples in the chart below are real examples that I found in the project html files and the real expressions I used to get it cleaned up.
|Dirty HTML||Clean HTML||FIND EXPRESSION||REPLACE EXPRESSION||Notes|
|<td style="text-align: center;">||<td>||Option 1: <td style="text-align: center;"> Option 2: <td (.*?)>||<td>||Fairly common in-line style, centering the contents of a table cell. Because the open/close tag will remain the same, this find/replace shouldn’t pose many problems.|
After the upgrade to RH19, use Option 1 and find/replace exact strings. Option 2 is still useful as a search only: run it with Find All, then weed through the results to find the individual tags you need to clean up.
Before the upgrade to RH19, you should be safe to use Option 2. In Option 2, the (.*?) is essentially a wildcard that matches everything between <td and the closing >, so it finds every <td> tag with an in-line style (or any other attribute) and replaces the whole thing with a bare <td>. It only matches across line breaks if the “. matches newline” box in the Search Mode section is checked, so, as usual, preview your search results before you do the replace all.
You can extend this search/replace to any HTML tag where you want to keep the base tag but drop its in-line styles. You’ll still want your <td> tags, your <p> tags, your <li> tags, etc., but you probably won’t want to keep any styles associated with them, so this is an efficient way of cleaning those up.
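To make the difference between "before upgrade" and "after upgrade" concrete, here is a small Python sketch of what the Option 2 expression does to each kind of tag. It is an illustration only: Python’s `re` module stands in for Notepad++’s engine, `re.DOTALL` plays the role of letting the dot cross line breaks, and the function name is my own.

```python
import re

# Option 2 from the chart: "<td " followed by anything up to the first ">".
TD_WITH_ATTRS = re.compile(r"<td (.*?)>", re.DOTALL)


def strip_td_attributes(html: str) -> str:
    """Replace every <td ...> opening tag with a bare <td>."""
    return TD_WITH_ATTRS.sub("<td>", html)


before_upgrade = '<td style="text-align: center;">Total</td>'
after_upgrade = '<td data-condition="CBTName" style="text-align: center;">Total</td>'

print(strip_td_attributes(before_upgrade))  # <td>Total</td>
print(strip_td_attributes(after_upgrade))   # <td>Total</td>
```

Both lines come out identical, which is exactly the problem after the upgrade: the condition disappears together with the style.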
|<span style="font-style: italic;">||<em>Italicized Content</em>||<span style="font-style: italic;">((.*?))</span>||<em>$1</em>||Replace text formatted as italic using a <span> tag with the same text wrapped in an <em> tag instead.|
In the find expression, (.*?) matches whatever content sits between the opening and closing tags, and the extra parentheses around it capture that content as group 1, which means you want to preserve the text.
In the replace expression, the $1 tells Notepad++ to take the preserved content and insert it here in the replacement string.
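Here is the same capture-and-reuse idea as a Python sketch, in case seeing it run helps. Two assumptions to flag: Python writes the back-reference as `\1` where Notepad++ uses `$1`, and I use a single pair of parentheses, which is enough to capture the text.

```python
import re

# Capture everything between the italic <span> and the first </span> after it.
ITALIC_SPAN = re.compile(
    r'<span style="font-style: italic;">(.*?)</span>', re.DOTALL
)


def italic_span_to_em(html: str) -> str:
    """Swap the italic <span> wrapper for <em>, keeping the captured text."""
    return ITALIC_SPAN.sub(r"<em>\1</em>", html)


sample = '<p>Click <span style="font-style: italic;">Save As</span> to continue.</p>'
print(italic_span_to_em(sample))  # <p>Click <em>Save As</em> to continue.</p>
```

Because the match stops at the first closing </span>, a span nested inside the italic span would end the match early, so preview those cases by hand.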
|<span style="font-style: bold;">||<strong>Bold Content</strong>||<span style="font-style: bold;">((.*?))</span>||<strong>$1</strong>||Replace text formatted as bold using a <span> tag with the same text wrapped in a <strong> tag instead.|
|<span>Content that has no formatting</span>||Just content, no tags.||<span>((.*?))</span>||$1||Find all unformatted text wrapped in a <span> tag and remove that tag while keeping the content.|
|<li><p>List item content.</p></li>||<li>List item content.</li>||<li><p>((.*?))</p></li>||<li>$1</li>||Removes <p> tags from inside list item tags. |
I really don’t understand why RoboHelp wants to put a paragraph tag on everything. In my personal opinion, it doesn’t belong inside list items, table cells, or many of the other places RoboHelp likes to put it, and they carried that issue over into RH19 (though there are some feature requests asking them to stop, and I’ll let you know if they do!). I like my HTML clean, and I think list styles have enough to deal with competing with the ol and ul tags without throwing the paragraph tag into the mix.
Replace every li in both the find and replace expressions with td to clean up paragraph tags inside table cells. Be careful: some table cells have more than one paragraph inside them, in which case manual removal is necessary.
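The multi-paragraph case is where the plain wildcard bites, so here is a hedged Python sketch of a stricter variant. It is not the expression from the chart: it adds a lookahead that refuses to match when another <p> or </p> appears inside the captured text, so cells or list items with several paragraphs are left alone for manual clean-up. The function name is made up for illustration, and I have only tried the pattern with Python’s `re` module.

```python
import re


def unwrap_single_paragraph(html: str, tag: str) -> str:
    """Remove a <p> wrapper directly inside <tag>, but only if it is the sole paragraph.

    The (?!</?p[\\s>]) lookahead stops the capture from running across another
    <p> or </p>, so multi-paragraph items do not match at all.
    """
    pattern = re.compile(
        rf"<{tag}><p>((?:(?!</?p[\s>]).)*?)</p></{tag}>", re.DOTALL
    )
    return pattern.sub(rf"<{tag}>\1</{tag}>", html)


print(unwrap_single_paragraph("<li><p>List item content.</p></li>", "li"))
# <li>List item content.</li>

# Two paragraphs in one cell: left untouched for manual review.
print(unwrap_single_paragraph("<td><p>One</p><p>Two</p></td>", "td"))
# <td><p>One</p><p>Two</p></td>
```

With the simpler <td><p>((.*?))</p></td> expression, the two-paragraph cell would match and come out with unbalanced </p> and <p> tags inside it, which is why those cells need a manual pass.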
Please add a comment if you think of any more helpful examples of using RegEx to clean up old RoboHelp html before you upgrade to RH19! Again, read all the warnings, and back up everything!