Regular Expressions. Remember it, write it, test it.
- modified:
- reading: 4 minutes
I should say that I’m fan of regular expressions. Whenever I see the problem, which I can solve with Regex, I felt a burning desire to do it and going to write new test for new regex. Previously I had installed SharpDevelop Studio just for good regular expression tool in it (Why VS doesn’t have one?). But now I’m a little wiser, and for each Regex I write a separate test.
I find it difficult to remember the syntax of regular expressions (I don’t write them very often); I always forget which character is responsible for the beginning of the line, etc. So I use external small and easy articles like this “Regular expressions - An introduction”.
Now I want to show you little samples of regular expressions and want to show you how to test these samples.
Examples
Check language of text
On my website (https://www.outcoldman.com) I display my messages from Twitter, but once I separated all content of the site in Russian and English, I decided to separate messages from Twitter too. I got a very easy way for this: if text has Russian letter then the message is Russian, else English. Most likely you do not know Russian, so I'll do the opposite, if there is an English letter - it's English, otherwise the Russian. I wrote next regular expression for this:
const string RegexIsEnglish = @"[A-Za-z]+"
Of course, you can solve this problem with foreach or for operators with watching every letter that it in some range of letters. But this regex you can use very easy, just type this:
Regex.IsMatch(text, RegexIsRussian)
Regex is class from namespace System.Text.RegularExpressions
Finding and replacing urls in text
Previous sample is very easy. Let try to solve more difficult problem like finding http urls in text. Doing this job without regex is not an easy solution. So we will look how to solve this with regex:
const string RegexUrl = @"(https?://(www.)?([\w\-]+)(\.([\w\-]+))+([\w\\\/\.?&%=\-+]*))"
Now you can check if text has http url in it with Regex.IsMatch or find all of this http urls with Regex.Matches, but we can do more with this regex, we can replace each url with anchor – html link:
Regex.Replace(text, RegexUrl, "<a href='$1'>$1</a>");
Instead of $1 we will have finding http urls (condition in first brackets).
Also we can change hashtags and user nicks in twitter messages with next regex constructions:
const string RegexTwitterUser = @"(@([A-Za-z0-9_]+))";
Regex.Replace(text, RegexTwitterUser, "<a href='http://twitter.com/$2'>$1</a>")
const string RegexTwitterTag = @"(#([A-Za-z0-9_]+))";
Regex.Replace(text, RegexTwitterTag, "<a href='http://twitter.com/#search?q=%23$2'>$1</a>");
In second example we use $1 and $2, where $2 – it is one that we have in second brackets (hashtag without # and nickname without @).
Also I can show you many samples of using Regex. Developers often use regular expressions in validation process, with them and default asp.net (or other technologies) very easy validate user input (email, phones, post indexes, etc). You can find many ready to use regex solutions in internet.
Unit test for regex
I always write combinatory unit tests for regex (combinatory tests – methods with parameters where test runner put values from some data sources). So let look on nUnit (now I’m using this one, because I couldn’t make MbUnit work with Gallio and R# 5 in VS 2010).
public string[][] Urls
{
get
{
return new[]
{
new[] {"Hello http://google.com bye",
"Hello http://google.com bye"},
new[] {"Hello http://mail.google.com bye",
"Hello http://mail.google.com bye"},
new[] {"Hello http://google.com/test/test.aspx",
"Hello http://google.com/test/test.aspx"},
new[] {"Hello http://g0ogle.com/test/test.aspx?q=7&b=0",
"Hello http://g0ogle.com/test/test.aspx?q=7&b=0"}
};
}
}
[Test]
public void twitter_text_to_html_convert([ValueSource("Urls")] string[] url)
{
string replace = HtmlParser.ReplaceHref(url[0]);
Assert.AreEqual(url[1], replace);
}
As you saw this is really small test – just one method twitter_text_to_html_convert. And this test are using data source Urls, in which I always place new test values. Method twitter_text_to_html_convert will be executed as many times as the count of items in data source.
Some time ago I found that my regex didn’t find urls with dash (‘-’). I changed little my regex, and then added new values in data source Urls for test:
new[]
{
"Hello http://ninja-assassin-movie.warnerbros.com",
"Hello http://ninja-assassin-movie.warnerbros.com"
}
Then I run test and checked that all old values are ok.