Thursday, April 28, 2011

Regex to match URL/URI

Recently I needed to create a regular expression that matches URLs. Not just complete URLs like "http://www.google.com": incomplete ones like "google.com" were also expected to match. I found many examples on the internet, but none of them matched everything I needed. So I started with a simple example and began testing.

First I created a test sample with these strings:
  • www.google.com
  • google.com
  • http://google.com.br
  • testing www.google.com in the middle of the phrase
  • testing words.separated by dot
  • another test
  • WWW.GOOGLE.COM

The first example I tested was this:
(?(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)

egrep returned only 3 matches:
www.google.com
http://google.com.br
testing www.google.com in the middle of the phrase
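
The egrep run above can be approximated in C++ with std::regex. This is only a sketch under assumptions: the stray "?" after the first "(" is dropped, because std::regex's default ECMAScript grammar does not accept "(?(", and the function names are hypothetical helpers, not part of the original test.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns true when the first-attempt pattern matches somewhere in `line`,
// which is how egrep decides whether to print a line. Case-sensitive.
bool matches_first_attempt(const std::string& line) {
    // Assumption: the '?' after the first '(' is removed so that the
    // pattern is valid ECMAScript; the rest is copied as written.
    static const std::regex pattern(
        "((http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)");
    return std::regex_search(line, pattern);
}

// Keeps only the lines the pattern matches (hypothetical helper).
std::vector<std::string> matching_lines(const std::vector<std::string>& lines) {
    std::vector<std::string> kept;
    for (const std::string& line : lines) {
        if (matches_first_attempt(line)) kept.push_back(line);
    }
    return kept;
}
```

Run over the seven sample strings, matching_lines keeps the same three lines. "google.com" and "WWW.GOOGLE.COM" are missed because the pattern requires a lowercase "http://" or "www" prefix.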

The second try was with this:
"([a-z].*://)|([a-z]|[0-9]).*\.[a-z0-9]+"

But now it also matched "testing words.separated by dot", because of the dot between two words.

Still not what I needed, but it was getting better.
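
The false positive is easy to reproduce. The sketch below assumes std::regex in place of egrep, and matches_second_attempt is a hypothetical name; the pattern itself is the second attempt copied unchanged.

```cpp
#include <regex>
#include <string>

// Returns true when the second-attempt pattern matches somewhere in `line`.
bool matches_second_attempt(const std::string& line) {
    // Either "<letter>...://" or "<letter or digit>...<dot><letters/digits>".
    // The ".*" in the second branch is free to run across spaces.
    static const std::regex pattern(
        R"(([a-z].*://)|([a-z]|[0-9]).*\.[a-z0-9]+)");
    return std::regex_search(line, pattern);
}
```

Here matches_second_attempt("google.com") is true, which is the improvement, but matches_second_attempt("testing words.separated by dot") is true as well: any lowercase letter, followed by anything, followed by a dot and one more letter satisfies the second branch.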

Because the matches will be in the middle of a text, I found out that I need to check single words, since links don't contain spaces in the middle. So I came to my last, working regex:
"\b(\w.*://)|((\w).*\.\w{2,4})\b"

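To check this last pattern against the sample strings without Boost, here is a minimal sketch that assumes std::regex with the ECMAScript grammar and the icase flag. contains_link is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Returns true when the final pattern finds something link-like in `line`.
bool contains_link(const std::string& line) {
    // Same pattern as above; icase mirrors the case-insensitive matching
    // needed for inputs such as "WWW.GOOGLE.COM".
    static const std::regex pattern(
        R"(\b(\w.*://)|((\w).*\.\w{2,4})\b)",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(line, pattern);
}
```

With this pattern the five link-bearing samples are detected, while "testing words.separated by dot" and "another test" are not. The dotted-words line is rejected because "separated" is longer than four characters, so \w{2,4} can never end on a word boundary there.
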
I use the Boost libraries in my code, so the check was simple to implement. The code is:

#include <boost/regex.hpp>
#include <string>

// msg is the std::string to scan; link is a std::string declared elsewhere.
boost::regex ex("\\b(\\w.*://)|((\\w).*\\.\\w{2,4})\\b",
                boost::regex::perl | boost::regex::icase);
boost::smatch what;  // match_results for std::string iterators
boost::match_flag_type flags = boost::match_default;
if (boost::regex_search(msg, what, ex, flags)) {
    link = what[0].str();  // what[0] is the whole match
    //...
    // do what you want with the link
    //
}
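
One thing to watch for when the match is used as the link itself: the snippet above returns only the first match, and because ".*" can cross spaces, the matched text can include neighboring words (for the sample sentence it comes out as "testing www.google.com"). The sketch below is a variation, not the pattern from this post. It assumes std::regex instead of Boost, replaces ".*" with "\S*" so a match stays inside one word, makes the scheme an optional prefix, and collects every match. extract_links is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every link-like word in `text`.
std::vector<std::string> extract_links(const std::string& text) {
    // Optional "scheme://", then a run of non-space characters that ends
    // in a dot followed by 2 to 4 word characters and a word boundary.
    static const std::regex pattern(
        R"(\b(?:\w+://)?\w\S*\.\w{2,4}\b)",
        std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> links;
    // sregex_iterator walks over all non-overlapping matches in order.
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        links.push_back(it->str());
    }
    return links;
}
```

For "testing www.google.com in the middle of the phrase" this yields only "www.google.com", and for "http://google.com.br" it yields the whole URL. Like the original, it relies on the 2 to 4 character ending to reject ordinary dotted words, so longer top-level domains would be missed.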
