Evgeny Pokhilko’s Weblog

Dedicated to software development

regular expressions for XML tags

Regular expressions are powerful and can substitute for XML libraries in simple tasks. Say you need to select all elements with a specific name from an XML file. Below is a sample that does that. The program reads XMLFile1.xml, selects elements with three different names, and prints them and their content to the console.

XMLFile1:

<?xml version="1.0" encoding="utf-8" ?>
<Main>
  <Item Name="item1"/>

  <Item Name ="item2">
    <Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>

  </Item>
  <Item>
    <Item></Item>
    <Item></Item>
  </Item>
</Main>

C# code:

    using System;
    using System.IO;
    using System.Text.RegularExpressions;

    class Program
    {
        public Program(string xml)
        {
            _xml = xml;
        }

        // The whole XML document as a single string.
        string _xml;

        void PrintTags(string tagName)
        {
            // {0} is replaced with the tag name; each alternative is one capture group.
            string expression =
                @"(<{0}\/>)|" + // gets <tagName/>
                @"(<{0}\s[^>]*?\/>)|" + //<tagName[space]BlaBla.../>
                @"(<{0}>[\s\S]*?<\/{0}\s*>)|" + //<tagName>BlaBla...</tagName>
                @"(<{0}\s[\s\S]*?>[\s\S]*?<\/{0}\s*>)"; //<tagName[space]BlaBla...>BlaBla...</tagName>

            // Escape the name so that characters such as '.' are matched literally.
            Regex regex = new Regex(String.Format(expression, Regex.Escape(tagName)));
            Match match = regex.Match(_xml);
            // Check Success before printing so that nothing is printed when there is no match.
            while (match.Success)
            {
                Console.WriteLine("tag: {0}", tagName);
                Console.WriteLine(match.Value);
                match = match.NextMatch();
            }
        }

        void Run()
        {
            PrintTags("Item");
            PrintTags("Component");
            PrintTags("Components");
        }

        static void Main(string[] args)
        {
            Program program = new Program(File.ReadAllText("XMLFile1.xml"));
            program.Run();
            Console.Read(); // keep the console window open
        }
    }

I had to define expressions for four cases (see the comments in the PrintTags method).
The following is the output:

Output:

tag: Item
<Item Name="item1"/>
tag: Item
<Item Name ="item2">
    <Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>

  </Item>
tag: Component
<Component/>
tag: Component
<Component/>
tag: Component
<Component Name="component10">
        <SubComponent/>
      </Component>
tag: Components
<Components>
      <Component/>
      <Component/>
      <Component Name="component10">
        <SubComponent/>
      </Component>
    </Components>
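
Each of the four alternatives in the PrintTags expression sits in its own pair of parentheses, so the regex engine numbers them as groups 1 to 4. The sketch below shows one way to find out which case produced a given match. The class name MatchKindDemo, the Describe helper and the inline sample string are made up for illustration and are not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchKindDemo
{
    // Same four alternatives as in PrintTags; {0} is the tag name.
    const string Expression =
        @"(<{0}\/>)|" +                          // group 1: <tagName/>
        @"(<{0}\s[^>]*?\/>)|" +                  // group 2: <tagName .../>
        @"(<{0}>[\s\S]*?<\/{0}\s*>)|" +          // group 3: <tagName>...</tagName>
        @"(<{0}\s[\s\S]*?>[\s\S]*?<\/{0}\s*>)";  // group 4: <tagName ...>...</tagName>

    // For every match, returns the number (1 to 4) of the alternative that matched.
    public static List<int> Describe(string xml, string tagName)
    {
        var result = new List<int>();
        var regex = new Regex(String.Format(Expression, Regex.Escape(tagName)));
        for (Match m = regex.Match(xml); m.Success; m = m.NextMatch())
        {
            for (int i = 1; i <= 4; i++)
            {
                if (m.Groups[i].Success)
                {
                    result.Add(i);
                    break;
                }
            }
        }
        return result;
    }

    static void Main()
    {
        // One element for each of the four cases, in order.
        string xml = "<A/><A x=\"1\"/><A>text</A><A x=\"1\">text</A>";
        Console.WriteLine(String.Join(",", Describe(xml, "A"))); // prints 1,2,3,4
    }
}
```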

However, the PrintTags method doesn’t work correctly if the XML file contains nested elements with identical names. See the example below.

XML:

<Item>
    <Item></Item>
    <Item></Item>
</Item>

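If you call the PrintTags method with “Item”, the first match is <Item> followed by only the first inner <Item></Item>, and the outer closing tag is left out. It happens because the lazy quantifier stops at the first </Item> it finds: the regular expression doesn’t count opening and closing tags.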
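
The .NET regex engine can do this kind of counting with balancing group definitions: a named group pushes a capture for every nested opening tag, and (?<-name>...) pops one for every closing tag. Below is a minimal sketch of that idea, not a replacement for a real parser. The names NestedTagDemo and FindElements are hypothetical, and the pattern assumes that attribute values contain no '>' character and that the input has no comments or CDATA sections.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NestedTagDemo
{
    // Returns every outermost element named tagName, including nested content.
    public static List<string> FindElements(string xml, string tagName)
    {
        string name = Regex.Escape(tagName);
        // Opening tag that is not self-closing, e.g. <Item> or <Item Name="x">
        string open = "<" + name + @"(?=[\s>])[^>]*(?<!/)>";
        // Closing tag, e.g. </Item>
        string close = "</" + name + @"\s*>";
        // Self-closing tag, e.g. <Item/> or <Item Name="x"/>
        string selfClosing = "<" + name + @"(?=[\s/])[^>]*/>";

        string pattern =
            selfClosing + "|" +
            open +
            "(?>" +
                "(?<depth>" + open + ")|" +               // nested opening tag: push
                "(?<-depth>" + close + ")|" +             // nested closing tag: pop
                "(?!" + open + "|" + close + @")[\s\S]" + // any other character
            ")*" +
            "(?(depth)(?!))" +                            // fail if tags are unbalanced
            close;

        var result = new List<string>();
        foreach (Match m in Regex.Matches(xml, pattern))
        {
            result.Add(m.Value);
        }
        return result;
    }

    static void Main()
    {
        string xml = "<Item>\n    <Item></Item>\n    <Item></Item>\n</Item>";
        foreach (string element in FindElements(xml, "Item"))
        {
            Console.WriteLine("tag: Item");
            Console.WriteLine(element);
        }
    }
}
```

On the nested sample above this yields a single match that covers the whole outer element. The inner elements are not reported separately, because regex matches do not overlap.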

Code for this post

June 20, 2008 | Posted by evpo | .NET