Recently I needed a regular expression that matches URLs. Not just complete URLs like "http://www.google.com": incomplete ones like "google.com" had to match too. I found many examples on the internet, but none of them matched everything I needed. So I started with a simple example and began testing.
First I created a test sample with these strings:
- www.google.com
- google.com
- http://google.com.br
- testing www.google.com in the middle of the phrase
- testing words.separated by dot
- another test
- WWW.GOOGLE.COM
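To try candidate patterns against this sample without leaving C++, a small harness along these lines may help. It is only a sketch: it uses std::regex with ECMAScript syntax instead of egrep, matches case-insensitively, and the names kSamples and matching_lines are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// The sample strings from the list above.
const std::vector<std::string> kSamples = {
    "www.google.com",
    "google.com",
    "http://google.com.br",
    "testing www.google.com in the middle of the phrase",
    "testing words.separated by dot",
    "another test",
    "WWW.GOOGLE.COM",
};

// Returns every line in which `pattern` is found anywhere in the line,
// which is roughly what egrep reports for each input line.
// Case-insensitive matching is an assumption of this sketch.
std::vector<std::string> matching_lines(
    const std::string& pattern,
    const std::vector<std::string>& lines = kSamples) {
    const std::regex re(pattern,
                        std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> out;
    for (const auto& line : lines) {
        if (std::regex_search(line, re)) {
            out.push_back(line);
        }
    }
    return out;
}
```

For example, matching_lines("://") should return only "http://google.com.br", since that is the one sample with a scheme.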
The first example I tested was this:
(?(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)
egrep returned only 3 matches:
www.google.com
http://google.com.br
testing www.google.com in the middle of the phrase
The second try was this:
"([a-z].*://)|([a-z]|[0-9]).*\.[a-z0-9]+"
But now it also matched "testing words.separated by dot": the ".*" runs across spaces, so any dot followed by letters or digits is enough.
Still not what I needed, but getting better.
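Because the matches will be in the middle of a text, I found out that I need to check for single words, since links don't contain spaces in the middle. I also limited the part after the last dot to two to four word characters followed by a word boundary, which rules out "words.separated". So I arrived at my last, working regex: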
"\b(\w.*://)|((\w).*\.\w{2,4})\b"
I use the Boost libraries in my code, so the check was simple to write. The code is:
#include <boost/regex.hpp>
#include <string>

// msg (the text to scan) and link (the result) are std::string
// variables defined elsewhere.
boost::regex ex("\\b(\\w.*://)|((\\w).*\\.\\w{2,4})\\b",
                boost::regex::perl | boost::regex::icase);
boost::smatch what;  // match_results over std::string::const_iterator
boost::match_flag_type flags = boost::match_default;
if (boost::regex_search(msg, what, ex, flags)) {
    link = what[0].str();  // what[0] holds the whole match
    // ... do what you want with the link
}
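One thing to watch: in this pattern "." also matches spaces, and the two alternatives are tried separately, so the text in what[0] can be wider or narrower than the URL itself. For the sample sentence the match starts at "testing", and for a URL with a scheme the first alternative alone should stop right after "://". If only the link is wanted, a stricter variant could look like the sketch below. It is not the pattern from above: it swaps "." for \S so a match cannot cross whitespace, makes the scheme an optional prefix, and uses std::regex instead of Boost to stay dependency-free. The names kLinkPattern and first_link are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumed variant of the pattern: an optional "scheme://" prefix,
// then non-space characters ending in a dot plus 2 to 4 word
// characters at a word boundary.
static const std::regex kLinkPattern(
    R"(\b(\w+://)?\w\S*\.\w{2,4}\b)",
    std::regex::ECMAScript | std::regex::icase);

// Returns the first link-like token found in msg,
// or an empty string when there is none.
std::string first_link(const std::string& msg) {
    std::smatch what;
    if (std::regex_search(msg, what, kLinkPattern)) {
        return what[0].str();  // whole match, without surrounding words
    }
    return std::string();
}
```

With this variant, first_link("testing www.google.com in the middle of the phrase") should return "www.google.com". It still stops at the last short dotted segment, so a path after the domain is not included in the result.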