Evgeny Pokhilko's Weblog

Programmer's den

regular expressions for XML tags


Regular expressions are powerful and can substitute for XML libraries in simple tasks. Say you need to select all elements with a specific name from an XML file. Below is a sample that does that. The program reads XMLFile1.xml, selects elements with three different names, and prints them and their content to the console.

XMLFile1.xml:

<?xml version="1.0" encoding="utf-8" ?>
<Main>
  <Item Name="item1"/>

  <Item Name ="item2">
    <Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>

  </Item>
  <Item>
    <Item></Item>
    <Item></Item>
  </Item>
</Main>

C# code:

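    using System;
    using System.IO;
    using System.Text.RegularExpressions;

    class Program
    {
        public Program(string xml)
        {
            _xml = xml;
        }

        string _xml;

        // Prints every element named tagName that the expression can match.
        void PrintTags(string tagName)
        {
            string expression =
                @"(<{0}\/>)|" + // <tagName/>
                @"(<{0}\s[^>]*?\/>)|" + // <tagName[space]BlaBla.../>
                @"(<{0}>[\s\S]*?<\/{0}\s*>)|" + // <tagName>BlaBla...</tagName>
                @"(<{0}\s[\s\S]*?>[\s\S]*?<\/{0}\s*>)"; // <tagName[space]BlaBla...>BlaBla...</tagName>

            // Escape the name so that characters such as '.' are matched literally.
            Regex regex = new Regex(String.Format(expression, Regex.Escape(tagName)));
            Match match = regex.Match(_xml);
            // Check Success first so that nothing is printed when there is no match.
            while (match.Success)
            {
                Console.WriteLine("tag: {0}", tagName);
                Console.WriteLine(match.Value);
                match = match.NextMatch();
            }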
        }

        void Run()
        {
            PrintTags("Item");
            PrintTags("Component");
            PrintTags("Components");
        }

        static void Main(string[] args)
        {
            Program program = new Program(File.ReadAllText("XMLFile1.xml"));
            program.Run();
            Console.Read();
        }
    }

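I had to define expressions for four cases (see the comments in the PrintTags method).
The output below leaves out the matches for the last Item element, which contains nested Item elements; that case is discussed afterwards.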

Output:

tag: Item
<Item Name="item1"/>
tag: Item
<Item Name ="item2">
    <Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>

  </Item>
tag: Component
<Component/>
tag: Component
<Component/>
tag: Component
<Component Name="component10">
        <SubComponent/>
      </Component>
tag: Components
<Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>
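
The four cases from the PrintTags comments can be checked one at a time. The sketch below reuses the same expression and reports which of the four alternatives produced the first match. The class AlternativeDemo and the method MatchedAlternative are hypothetical names introduced only for this illustration; they are not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternativeDemo
{
    // Same four-part pattern as in PrintTags; {0} is replaced by the tag name.
    const string Expression =
        @"(<{0}\/>)|" +
        @"(<{0}\s[^>]*?\/>)|" +
        @"(<{0}>[\s\S]*?<\/{0}\s*>)|" +
        @"(<{0}\s[\s\S]*?>[\s\S]*?<\/{0}\s*>)";

    // Hypothetical helper: returns the 1-based number of the alternative
    // that produced the first match, or 0 when nothing matched.
    public static int MatchedAlternative(string xml, string tagName)
    {
        Match match = Regex.Match(xml, String.Format(Expression, Regex.Escape(tagName)));
        if (!match.Success) return 0;
        for (int i = 1; i <= 4; i++)
        {
            // Each alternative is wrapped in its own capturing group.
            if (match.Groups[i].Success) return i;
        }
        return 0;
    }

    static void Main()
    {
        string[] samples =
        {
            "<Item/>",
            "<Item Name=\"a\"/>",
            "<Item>x</Item>",
            "<Item Name=\"a\">x</Item>"
        };
        foreach (string sample in samples)
        {
            Console.WriteLine("{0} -> alternative {1}",
                sample, MatchedAlternative(sample, "Item"));
        }
    }
}
```

Each sample is meant to exercise exactly one alternative, in the order in which the alternatives appear in the expression.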

However, this code won’t work if the XML file contains nested elements with identical names. See the example below.

XML:

<Item>
    <Item></Item>
    <Item></Item>
</Item>

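If you call the PrintTags method with “Item”, the first match is <Item><Item></Item>: the outer opening tag paired with the first inner closing tag. This happens because the lazy quantifier stops at the first closing tag it finds; the regular expression doesn’t count opening and closing tags.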
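
One way to do the counting is to match single tags rather than whole elements and to keep the opening tags that are still unclosed on a stack. The following is only a sketch of that idea, not a replacement for a real parser: the class NestedTags and the method FindElements are hypothetical names, and the sketch assumes that attribute values never contain a '>' character and that the input has no comments or CDATA sections.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NestedTags
{
    // Hypothetical helper: returns every element named tagName, including
    // elements nested inside elements of the same name. Inner elements are
    // returned before the element that contains them.
    public static List<string> FindElements(string xml, string tagName)
    {
        // Each match is a single tag: a closing tag, or an opening tag
        // that may be self-closing. The lookahead keeps "Item" from
        // matching a longer name such as "Items".
        Regex token = new Regex(
            @"<(?<close>/)?" + Regex.Escape(tagName) +
            @"(?=[\s/>])[^>]*?(?<self>/)?>");

        var openStarts = new Stack<int>();   // offsets of unclosed opening tags
        var elements = new List<string>();

        foreach (Match m in token.Matches(xml))
        {
            if (m.Groups["close"].Success)
            {
                if (openStarts.Count == 0) continue;   // stray closing tag
                int start = openStarts.Pop();
                elements.Add(xml.Substring(start, m.Index + m.Length - start));
            }
            else if (m.Groups["self"].Success)
            {
                elements.Add(m.Value);                 // <tagName ... />
            }
            else
            {
                openStarts.Push(m.Index);              // <tagName ...>
            }
        }
        return elements;
    }

    static void Main()
    {
        string xml = "<Item>\n    <Item></Item>\n    <Item></Item>\n</Item>";
        foreach (string element in FindElements(xml, "Item"))
        {
            Console.WriteLine("tag: Item");
            Console.WriteLine(element);
        }
    }
}
```

For the nested example above, this returns the two inner elements first and the complete outer element last, because an element is emitted when its closing tag is reached.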

Code for this post

June 20, 2008 | .NET

2 Comments »

  1. Hi.
    Please don’t use this method even for the simplest of tasks related to XML. Regular expressions are powerful, true, but when you have to work with XML, DO USE an XML library! In .NET 3.5 you can use LINQ to XML. It’s so easy to work with XML using LINQ to XML.
    Using RegEx will create more problems than benefits.

    Comment by ppetrov | June 30, 2008

  2. You are right. I said “they can substitute” but I didn’t recommend doing it if it’s possible to use a good library. LINQ to XML is my favorite one.

    Comment by evpo | July 1, 2008
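
For comparison, here is a minimal sketch of the LINQ to XML approach that both comments mention, applied to the nested example from the post. The class name LinqToXmlSample and the helper CountElements are hypothetical; XDocument.Parse and Descendants are the standard System.Xml.Linq calls.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class LinqToXmlSample
{
    // Hypothetical helper: counts the elements with the given name,
    // at any nesting depth, including the root element.
    public static int CountElements(string xml, string tagName)
    {
        XDocument document = XDocument.Parse(xml);
        return document.Descendants(tagName).Count();
    }

    static void Main()
    {
        string xml = "<Item>\n    <Item></Item>\n    <Item></Item>\n</Item>";
        XDocument document = XDocument.Parse(xml);

        // Descendants walks the whole tree, so nested elements with the
        // same name are returned as separate elements.
        foreach (XElement item in document.Descendants("Item"))
        {
            Console.WriteLine("tag: {0}", item.Name);
            Console.WriteLine(item);
        }
    }
}
```

Because the parser builds a tree, nesting is handled without any counting. Note that printing an XElement re-serializes it, so the whitespace may differ from the source text.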

