
Page Matching

When you create a lot of A/B testing scenarios, one of the most important things is to get page matching right. Here I guide you through an example.


Most of my work revolves around conversion rate optimization for websites and apps. In my near decade of doing this, I've used various optimization tools (Monetate, Adobe Target, Google Optimize, etc.), and in my experience they all have a flaw, some more than others, when it comes to matching the pages you want to test. Most platforms assume you are only ever going to match a complete URL; actual use "in the field" says this is almost never the case. Target, for instance, has a problem where if you say match this URL: https://www.mysite.com/search it will only match that exact URL. It will not match https://www.mysite.com/search?a_bunch_of_query_strings=value, so if you are testing something on the search page, this can be problematic. There are various workarounds, and each tool has different methods for getting this to work, but you have to know this is the problem before you "press play".

There are also situations on some sites where the full URL might not be https. Granted, sites should definitely use TLS for everything now, but like all things, not every site is set up with it, though adoption is ramping up dramatically. Another problem is that not all sites use www., or have www. forward to the bare domain (or vice versa), so again this causes page matching issues within many testing platforms because they aren't really geared for all sites.

I do have a solution, or at least part of one, since a few platforms still won't handle this at all, but for those that do, it helps.

I want to say one word to you. Just one word. RegEx.

Ok, technically that's two words... well, it's a portmanteau. Are portmanteaus multiple words? IDK, something for a future post. Needless to say, Regular Expressions are generally thought of as dangerous or complex, and many developers tend to shy away from them. But Regular Expressions are incredibly useful when done right, and if done right, you need never do it again. I love "smarter not harder" solutions to problems, even if they take a bit of extra time to create in the short term, because they save a tremendous amount of time in the long run.

Now I am not, by any means, a RegExPert, but I've had to deal with it from time to time. So I know a few things. Enough to make me very dangerous, but at least I know that I am. Let us start from the beginning (of a regular expression, that is). I needed a defined start to the pattern so that stray URL entries I hadn't thought of, or that just shouldn't be there, wouldn't match. I've found myself facepalming so many times at web developers using "illegal" characters in URL strings. Sadly, I've even seen enterprise level platforms doing things that should just not be.

Ready.Set.Go.

The start of the RegEx is the shorthand code ^. This indicates that matching must start from this point, meaning the string you are matching has to begin with whatever immediately follows, not 1,397 characters into it or even just 2 characters into it. Character 0 of the string has to match what immediately follows the ^ in the regular expression.
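
As a quick illustration (in JavaScript, purely for demonstration):

/^https/.test("https://tracy.is")         // true - the string starts with "https"
/^https/.test("go to https://tracy.is")   // false - "https" is in there, just not at character 0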

Next we need to make sure we say what all websites have to say to start with: http or https. Now we could just write a group (http|https), which says this entry can be either http or https, but that's a lot of characters and I like to simplify. Both http and https contain http, so if we take that and just say the s is optional, or more rightly "nothing or s", we get the simplified grouping http(|s). We could also say it the opposite way, "s or nothing", as http(s|), but the first form is easier to visually scan since the pattern mirrors how a URL is actually written.
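
A quick sanity check of that grouping:

/^http(|s)/.test("http://tracy.is")    // true - the optional s is absent
/^http(|s)/.test("https://tracy.is")   // true - the optional s is present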

For those unfamiliar with the "pipe", or as I like to call it "une pipe" as in "ceci n'est pas une pipe" (oh, I could go on for hours with surrealist puns, but I digress): the pipe, | on the keyboard, is often used to mean the logical operator "or". In JavaScript we use two pipes, written ||, to mean or. With RegEx, though, you only need the one.
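
In other words, the same idea expressed both ways:

const scheme = "https";
scheme === "http" || scheme === "https";   // true - JavaScript "or" uses two pipes
/^http(|s)$/.test(scheme);                 // true - RegEx "or" only needs the one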

Escape is not his plan. I must face him, alone.

The next hurdle we come to is the pair of slashes //. Since slashes delimit RegEx expressions, we need to "escape" them. To do so we just preface each slash with a backslash, like so: \/\/. The next part of the URL is the www, which may or may not be there, so we need to group it like we did the http. The . that trails after the www has to go inside that grouping too, since URLs without the www don't have it, and because . is also a special character in RegEx we need to escape it as well. Our resulting grouping should look like this: (www\.|).
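
A note on why that escaping matters: in RegEx an unescaped . matches any single character, not just a literal dot. A quick illustration:

/www\./.test("www.tracy.is")   // true - matches the literal "www."
/www./.test("wwwxtracy")       // also true - the unescaped . happily matches the x
/www\./.test("wwwxtracy")      // false - the escaped \. insists on a real dot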

Now comes the easy part: just place the domain name for the site followed by an escaped . and the TLD extension for the domain, like microsoft\.com or tracy\.is. It's fairly straightforward here, but after this is where it gets really complicated.
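
With my own domain standing in, the pattern so far is ^http(|s):\/\/(www\.|)tracy\.is, and a quick check shows it accepting both forms:

/^http(|s):\/\/(www\.|)tracy\.is/.test("https://www.tracy.is")   // true - with www
/^http(|s):\/\/(www\.|)tracy\.is/.test("http://tracy.is")        // true - without www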

It's worked so far, but we're not out of this yet.

At this point we need to address the end of the URL string. We want this regular expression to be somewhat universal, or at least I do, so we need to handle the usual endings of URLs. We have the plain path end: tracy.is. The trailing-slash path end: tracy.is/. The path end with a query string: tracy.is?something=darkside. The path end with the slash and a query string: tracy.is/?something=else. And finally we have the oddly placed hash at the end: tracy.is/#bookmark.

With all this, how do you expect to match it all? It takes a bit of thought and sometimes trial and error. The hash variant, for example, I didn't think of until I happened upon a site that had hashes at the end of every URL, and sometimes hashes followed by query strings! It was a mess.

To take care of the actual end of the URL string we use the RegEx code for the end of the string, our handy dandy "root of all evil": $. The dollar sign in RegEx means that the string you are trying to match must end at this point. For the plain path end and the trailing-slash path end, we place the $ in a group alongside \/$.

With our group ($|\/$) for an absolute ending, we now must look at all the other endings.
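
Just to see where we stand with only those absolute endings in place (a quick JavaScript check, tracy.is standing in for your domain):

/^http(|s):\/\/(www\.|)tracy\.is($|\/$)/.test("https://tracy.is")                  // true - bare home page
/^http(|s):\/\/(www\.|)tracy\.is($|\/$)/.test("https://tracy.is/")                 // true - trailing slash
/^http(|s):\/\/(www\.|)tracy\.is($|\/$)/.test("https://tracy.is/?something=else")  // false - query strings are still unhandled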

Are we there yet?

Since none of the other endings are predictable in content, length, or positioning, we need to leave the next group open ended. Now if you are feeling a little clever, you might be wondering why we need them at all; we could just leave the "absolute ending" open ended. The problem with that is, well, it's just plain wrong. A matching rule like ^http(|s):\/\/(www\.|)tracy\.is will match the home page just fine, and it will match the home page with hashes and query strings, but it will also match this page and that page and every other page on the site as well. We don't want that. If this were an A/B test, the data would be all wrong and you'd need to start from scratch once you found out. We need the open ended grouping to ensure that no other folders get pulled into a match.
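
To make that concrete, here is the open-ended version doing exactly the wrong thing (the second URL stands in for any interior page on the site):

/^http(|s):\/\/(www\.|)tracy\.is/.test("https://tracy.is/#bookmark")        // true - good
/^http(|s):\/\/(www\.|)tracy\.is/.test("https://tracy.is/some-other-page")  // also true - bad, this is not the home page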

First we check for the presence (or absence) of the trailing slash: (\/|). We then see if what follows is a query string or a hash: (\?|#). We group these two together since they rely upon each other: ((\/|)(\?|#)). The next step, and it is the last, is to append this to our previous group with a preceding "or" logical operator: ($|\/$|((\/|)(\?|#))). Now we are done!

Our final page matching pattern should look like this:

et voilà

^http(|s):\/\/(www\.|)yourdomain\.tld($|\/$|((\/|)(\?|#)))
substitute yourdomain with your actual domain and tld with your Top Level Domain (.com, .net, .org, etc.)
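
If you'd like to sanity check the finished pattern before you "press play", a short JavaScript sketch (with tracy.is standing in for yourdomain.tld) could look like this:

const homePage = /^http(|s):\/\/(www\.|)tracy\.is($|\/$|((\/|)(\?|#)))/;

// every ending variant of the home page should match
["https://tracy.is",
 "https://tracy.is/",
 "http://www.tracy.is?something=darkside",
 "https://tracy.is/?something=else",
 "https://tracy.is/#bookmark"
].every(url => homePage.test(url));    // true

// other pages and other domains should not
["https://tracy.is/some-other-page",
 "https://tracy.island.com/"
].some(url => homePage.test(url));     // false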

That's all there really is to it. It only took 1420 words to explain it. But now that you have it down, you can do fancy things like place language directories in there as optionals, like (|\/(en|is|es|zh)) right between the \.tld and the ($| ... which will handle multiple home pages based upon language, if your site is structured like that.

examples:

https://tracy.is		//normal root home page
https://tracy.is/en		//english version of the home page
https://tracy.is/es		//spanish version of the home page
https://tracy.is/zh		//chinese version of the home page
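
And as a quick check of that language-directory variant (again a JavaScript sketch, with my domain standing in for yours):

const langHomePage = /^http(|s):\/\/(www\.|)tracy\.is(|\/(en|is|es|zh))($|\/$|((\/|)(\?|#)))/;

langHomePage.test("https://tracy.is/en")       // true - english home page
langHomePage.test("https://tracy.is/zh/")      // true - trailing slash still works
langHomePage.test("https://tracy.is/es/blog")  // false - a page inside the spanish section, not a home page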

With that in mind, you can go a step further and play around with subdomains other than www and pages other than the home page. However, that usually isn't necessary, as most tools have simple lookups like "URL contains" or similar. I find home pages to be the toughest because they usually don't have a distinctive address that isn't also present in every single URL on the site.
