Evgeny Pokhilko's Weblog

Programmer's den

regular expressions for XML tags

Regular expressions are powerful and can substitute for an XML library in simple tasks. Say you need to select all elements with a specific name from an XML file. Below is a sample that does that. The program reads XMLFile1.xml, selects three kinds of elements (Item, Component and Components) and prints them with their content to the console.

XMLFile1.xml:

<?xml version="1.0" encoding="utf-8" ?>
<Main>
  <Item Name="item1"/>

  <Item Name ="item2">
    <Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>

  </Item>
  <Item>
    <Item></Item>
    <Item></Item>
  </Item>
</Main>

C# code:

    using System;
    using System.IO;
    using System.Text.RegularExpressions;

    class Program
    {
        public Program(string Xml)
        {
            _xml = Xml;
        }

        string _xml;

        void PrintTags(string tagName)
        {
            string expression =
                @"(<{0}\/>)|" + // gets <tagName/>
                @"(<{0}\s[^>]*?\/>)|" + //<tagName[space]BlaBla.../>
                @"(<{0}>[\s\S]*?<\/{0}\s*>)|" + //<tagName>BlaBla...</tagName>
                @"(<{0}\s[\s\S]*?>[\s\S]*?<\/{0}\s*>)"; //<tagName[space]BlaBla...>BlaBla...</tagName>

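            // Escape the tag name so characters such as '.' are matched literally.
            Regex regex = new Regex(String.Format(expression, Regex.Escape(tagName)));

            // Check Success before printing; otherwise a tag with no matches
            // would still print one empty entry.
            for (Match match = regex.Match(_xml); match.Success; match = match.NextMatch())
            {
                Console.WriteLine("tag: {0}", tagName);
                Console.WriteLine(match.Value);
            }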
        }

        void Run()
        {
            PrintTags("Item");
            PrintTags("Component");
            PrintTags("Components");
        }

        static void Main(string[] args)
        {
            Program program = new Program(File.ReadAllText("XMLFile1.xml"));
            program.Run();
            Console.Read();
        }
    }

I had to define expressions for four cases (see the comments in the PrintTags method).
The following is the output (the nested Item block at the end of the file is left out here and discussed further down):

Output:

tag: Item
<Item Name="item1"/>
tag: Item
<Item Name ="item2">
    <Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>

  </Item>
tag: Component
<Component/>
tag: Component
<Component/>
tag: Component
<Component Name="component10">
        <SubComponent/>
      </Component>
tag: Components
<Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>
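
To see which of the four cases produced a given match, the capturing groups can be inspected. The following is only a sketch: the class and method names (AlternativeDemo, MatchedAlternative) are made up for illustration and are not part of the program above; the pattern itself is the one from PrintTags.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternativeDemo
{
    // Same four alternatives as in PrintTags, each in its own capturing group.
    const string Expression =
        @"(<{0}\/>)|" +                              // 1: <tagName/>
        @"(<{0}\s[^>]*?\/>)|" +                      // 2: <tagName attributes/>
        @"(<{0}>[\s\S]*?<\/{0}\s*>)|" +              // 3: <tagName>...</tagName>
        @"(<{0}\s[\s\S]*?>[\s\S]*?<\/{0}\s*>)";      // 4: <tagName attributes>...</tagName>

    // Hypothetical helper: returns the 1-based number of the alternative
    // that produced the first match, or 0 if nothing matched.
    public static int MatchedAlternative(string xml, string tagName)
    {
        Regex regex = new Regex(String.Format(Expression, Regex.Escape(tagName)));
        Match match = regex.Match(xml);
        if (!match.Success)
            return 0;

        for (int i = 1; i <= 4; i++)
        {
            if (match.Groups[i].Success)
                return i;
        }
        return 0;
    }

    static void Main()
    {
        string[] samples =
        {
            "<Item/>",
            "<Item Name=\"item1\"/>",
            "<Item></Item>",
            "<Item Name=\"item2\"><Components/></Item>"
        };

        foreach (string sample in samples)
            Console.WriteLine("{0} -> case {1}", sample, MatchedAlternative(sample, "Item"));
    }
}
```

The four samples in Main are chosen so that each one exercises a different case, in the order the cases appear in the pattern.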

However, this code won’t work if the XML file contains nested elements with identical names. See the example below.

XML:

<Item>
    <Item></Item>
    <Item></Item>
</Item>

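If you call the PrintTags method with “Item”, the first match is the outer <Item> followed by the first inner <Item></Item>, so the match ends at the first closing tag instead of the outer one. It happens because the lazy quantifier stops at the first </Item> it finds; the regular expression doesn’t count opened and closed tags.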
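
One possible way around this is to do the counting in code. The sketch below is not from the original program: NestedTagFinder and FindOutermost are hypothetical names. It assumes plain tags only (no comments, CDATA sections, or ">" inside attribute values). A regular expression finds each opening, closing and self-closing tag, and a depth counter decides where an outermost element ends.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NestedTagFinder
{
    // Hypothetical helper: returns every outermost <tagName> element,
    // keeping nested elements of the same name inside it.
    public static List<string> FindOutermost(string xml, string tagName)
    {
        string name = Regex.Escape(tagName);

        // One token per match. The self-closing form is tried first so that
        // "<Item/>" is never mistaken for an opening tag.
        Regex token = new Regex(
            @"(?<self><" + name + @"(?:\s[^>]*)?/>)|" +
            @"(?<close></" + name + @"\s*>)|" +
            @"(?<open><" + name + @"(?:\s[^>]*)?>)");

        List<string> result = new List<string>();
        int depth = 0;   // number of currently open <tagName> elements
        int start = -1;  // index where the current outermost element begins

        for (Match m = token.Match(xml); m.Success; m = m.NextMatch())
        {
            if (m.Groups["self"].Success)
            {
                // A self-closing tag is a whole element only at the top level.
                if (depth == 0)
                    result.Add(m.Value);
            }
            else if (m.Groups["open"].Success)
            {
                if (depth == 0)
                    start = m.Index;
                depth++;
            }
            else if (depth > 0) // closing tag; stray ones at depth 0 are ignored
            {
                depth--;
                if (depth == 0)
                    result.Add(xml.Substring(start, m.Index + m.Length - start));
            }
        }
        return result;
    }

    static void Main()
    {
        string xml = "<Item>\n    <Item></Item>\n    <Item></Item>\n</Item>";
        foreach (string element in FindOutermost(xml, "Item"))
            Console.WriteLine(element);
    }
}
```

For the nested example above this returns the whole outer Item as a single element. Unlike PrintTags, it reports only outermost elements; listing the inner ones as well would mean running the same scan on each result's content.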

Code for this post

June 20, 2008 | .NET