Sunday, February 3, 2013

Clean Invalid XML Characters in C#

Here's a cool way to clean large XML files that contain invalid XML characters.
Note: the stream `from` is the original XML file, while the stream `to` is the new XML file with the invalid characters removed.
Code:
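// Requires: using System.IO; using System.Text.RegularExpressions;
private void Copy(Stream from, Stream to)
{
       // Read the whole source into memory, strip invalid characters, then write the result.
       TextReader reader = new StreamReader(from);
       TextWriter writer = new StreamWriter(to);
       writer.WriteLine(CleanInvalidXmlChars(reader.ReadToEnd()));
       writer.Flush();
}

public static string CleanInvalidXmlChars(string text)
{
       // Keep tab, LF, CR and the ranges U+0020-U+D7FF and U+E000-U+FFFD; remove everything else.
       // The .NET \x escape takes exactly two hex digits, so four-digit code points use \u.
       // Surrogate code units (and therefore characters above U+FFFF) are removed as well.
       string re = @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";
       return Regex.Replace(text, re, "");
}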
Source: http://social.msdn.microsoft.com/Forums/
Post: Invalid character returned from webservice
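
The Copy method above reads the entire file into memory with ReadToEnd, which can be costly for very large inputs, and a character class limited to the 16-bit range also strips valid surrogate pairs. The following is a minimal sketch, not the method from the MSDN thread, of a streaming alternative that filters one character at a time and keeps well-formed surrogate pairs. The class name XmlCleaner, the helper IsLegalBmpChar and the file names in the usage comment are hypothetical.

```csharp
using System;
using System.IO;

public static class XmlCleaner
{
    // True for single UTF-16 code units allowed by the XML 1.0 Char production.
    private static bool IsLegalBmpChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    // Copies reader to writer one character at a time, dropping invalid characters.
    public static void Copy(TextReader reader, TextWriter writer)
    {
        char? pendingHigh = null; // high surrogate waiting for its low partner
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (pendingHigh.HasValue)
            {
                char high = pendingHigh.Value;
                pendingHigh = null;
                if (char.IsLowSurrogate(c))
                {
                    // Well-formed pair: a character in U+10000-U+10FFFF, which XML 1.0 allows.
                    writer.Write(high);
                    writer.Write(c);
                    continue;
                }
                // Unpaired high surrogate is dropped; c is evaluated below.
            }
            if (char.IsHighSurrogate(c))
            {
                pendingHigh = c;
            }
            else if (IsLegalBmpChar(c))
            {
                writer.Write(c);
            }
            // Anything else (control characters, lone low surrogates, U+FFFE, U+FFFF) is skipped.
        }
        writer.Flush();
    }
}

// Usage with files (hypothetical paths):
//   using (var reader = new StreamReader("input.xml"))
//   using (var writer = new StreamWriter("clean.xml"))
//   {
//       XmlCleaner.Copy(reader, writer);
//   }
```

Because StreamReader buffers its input, reading one character at a time stays reasonably fast while keeping memory use flat regardless of file size.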

5 comments:

  1. There is a mistake in the expression:
    is: \x10000-x10FFFF
    should be: \x10000-\x10FFFF

  2. Sorry, actually it should be:

    const string expression = @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";

    The range \x10000-\x10FFFF lies beyond what a single UTF-16 code unit can hold, so it can't be expressed in a .NET Regex character class.
    However, it should be OK for most applications to strip these.

    Furthermore, four-digit hex values must be prefixed with \u, not \x.

    This is the unit test:

    [TestMethod]
    public void CleanInvalidCharacters()
    {
        var input = "";

        // illegal range 1 (except nl, cr, tab)
        for (int i = 0; i < 0x20; i++)
        {
            input += ((char)i);
        }
        const string legalCharactersBelowX20 = "\t\n\r";

        // illegal range 2
        for (int i = 0xD800; i < 0xE000; i++)
        {
            input += ((char)i);
        }

        // illegal range 3
        for (int i = 0xFFFE; i < 0x10000; i++)
        {
            input += ((char)i);
        }

        // some legal characters
        var someLegalSampleCharacters = "";
        someLegalSampleCharacters += " abcdefghijklmnopqrstuvwxyzäöüABCDEFGHIJKLMNOPQRSTUVWXZYZÄÖÜ0123456789%&.-_";
        someLegalSampleCharacters += "\uFFFD"; // xFFFD as an example for a high range legal character
        input += someLegalSampleCharacters;

        var output = input.CleanXml10InvalidCharacters();

        Assert.AreEqual(_stringToHex(legalCharactersBelowX20 + someLegalSampleCharacters), _stringToHex(output));
    }

    // format as hex for easier debugging when the test fails
    private string _stringToHex(string s)
    {
        var sb = new StringBuilder();
        foreach (var t in s)
        {
            sb.Append(Convert.ToInt32(t).ToString("x") + " ");
        }
        return sb.ToString();
    }
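
The test above calls an extension method, CleanXml10InvalidCharacters, that the comment does not show. A minimal sketch of what it might look like follows, assuming it simply applies the expression quoted earlier in this comment. The class name XmlStringExtensions and the field names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class XmlStringExtensions
{
    // Matches every UTF-16 code unit outside tab, LF, CR, U+0020-U+D7FF and U+E000-U+FFFD.
    private const string Expression = @"[^\x09\x0A\x0D\u0020-\uD7FF\uE000-\uFFFD]";

    // Built once and reused; RegexOptions.Compiled trades startup cost for faster matching.
    private static readonly Regex InvalidChars = new Regex(Expression, RegexOptions.Compiled);

    // Assumed shape of the method the test calls: returns text with the matched characters removed.
    public static string CleanXml10InvalidCharacters(this string text)
    {
        return InvalidChars.Replace(text, string.Empty);
    }
}
```

Note that this expression also removes both halves of a valid surrogate pair, which matches the remark above that characters beyond U+FFFF are stripped.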

  3. Cool!
    Thank you for pointing out the corrections. When I tested them on large XML files, they worked just fine.

    Your modified regex pattern, declared as a constant and scoped to UTF-16, is a version I hadn't thought of.

    Thanks.. :)

    Greg

  4. Can you provide a usage example? I have a large XML file I want to load, but I'm not sure how to call Copy with the XML file as a stream. Also, I'm converting this to VB.NET.

    Replies
    1. Hi,

      Here's a detailed example from MSDN, from which I derived the contents of this post. It has both C# and VB.NET samples.
      http://msdn.microsoft.com/en-us/library/system.web.services.protocols.soapextension(v=vs.100).aspx


      Psycho Genes
