URL GREP – Simplified!

In an earlier post, “Get to grips with GREP”, I spoke about my early experiments with GREP (Regular Expressions) and how amazingly useful they prove to be. Recently I started revisiting some of the GREPs I had been using in the magazine I produce for RCI and realised I could simplify them.

Think Laterally
Initially, when grappling with GREP, the approach tends to be a literal one, but thinking laterally can be far more elegant. For instance, the code I posted for finding URLs was as follows:

(\w+|\w+-|\w+/)*(\w+\.[\l\u]+(\.[\l\u]+)*)(/\w+(/\w+|-\w+)*)*

The magazine (Endless Vacation) has a house style in which URLs do not feature www. (or http://, for that matter) at the start, so I could not use those as ‘cues’ for website addresses, i.e. I couldn’t search for ‘anything that follows “www.”’, for example. To solve this problem, the code above searches for combinations of all the things that may turn up (slashes, hyphenated words, various combinations of dots (periods) and words or numbers, etc.). My new code does the opposite:

([^\s]+)\.([^\s]+)

It looks for anything that isn’t a space, followed by a dot (period), followed by anything that isn’t a space. In this way it finds any combination of letters, numbers, hyphens, underscores (and anything else) which has at least one dot (period) in it. If the dot has a space after it, then that indicates a sentence rather than a URL and it is ignored.
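
To see the behaviour described above outside InDesign, here is a minimal Python sketch. It assumes Python's re module treats this particular pattern the same way InDesign's GREP does. The function name, the sample sentence and the address in it are all made up for illustration.

```python
import re

# The simplified GREP: one or more non-space characters, a literal dot,
# then one or more non-space characters.
URL_LIKE = re.compile(r"([^\s]+)\.([^\s]+)")

def find_url_like(text):
    """Return every run of non-space text containing a dot that is not followed by a space."""
    return [m.group(0) for m in URL_LIKE.finditer(text)]

# Hypothetical sample: one address plus two ordinary sentence endings.
sample = "Book at example-resort.com/offers today. Call us for details."
print(find_url_like(sample))
```

The full stops after "today" and "details" are skipped because nothing but a space (or the end of the text) follows them. The same logic means that anything with a dot inside it, such as a decimal number, would probably be picked up too, so it is worth checking the results against real copy.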

Simple and Effective
This solution had an added benefit. I had separate GREPs to find ‘http://’ and ‘https://’ (in case they had been left in) and also to find email addresses. This simpler GREP finds them all, so not only is the code shorter, it also removes the need for two other GREPs.
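
As a rough check of that claim, the sketch below runs the same pattern against an http:// address, an https:// address and an email address. It is written in Python rather than InDesign, so it assumes the two regex engines agree on this pattern. The helper name and the example.com addresses are placeholders.

```python
import re

# Same simplified pattern: non-spaces, a literal dot, non-spaces.
PATTERN = re.compile(r"([^\s]+)\.([^\s]+)")

def first_match(text):
    """Return the first URL-like match in text, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

# Hypothetical samples covering the cases that used to need separate GREPs.
print(first_match("See http://example.com/deals now"))
print(first_match("Log in at https://example.com today"))
print(first_match("Email info@example.com for details"))
print(first_match("No addresses here."))
```

Because the pattern only cares about non-space characters around a dot, the colon, slashes and @ sign are swept up along with everything else, which is why no separate searches are needed.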

I’ve also posted this on indesignsecrets.com and I’d welcome any feedback about it.



Pixooma