Archive for March, 2012

Pretty print, console output

One of my current personal projects is a console application. I use the excellent command line parser that is built into bizark to parse my arguments. Bizark includes code to format output so that each line is wrapped nicely on word boundaries. Wrapping text may sound like a very easy task, but things quickly get complicated. A good briefing of the problem is given here, and the article concludes that “No word-wrapping algorithm is perfect”, which, of course, must be taken as a challenge. 🙂

I sat down and scratched my head for a while and then came up with this one:

using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class TextUtils
{
    // ref: http://stackoverflow.com/questions/521146/c-sharp-split-string-but-keep-split-chars-separators
    public static IEnumerable<string> Tokenize(this string text, string delims = @".,;\s")
    {
        // Split after every delimiter so each piece ends with at most one delimiter.
        var tokens = Regex.Split(text, "(?<=[" + delims + "])");
        foreach (var token in tokens)
        {
            // Separate the piece into its text part and its trailing delimiter.
            var match = Regex.Match(token, "(?<tok>[^" + delims + "]*)(?<del>[" + delims + "]?)$");
            if (match.Success)
            {
                if (match.Groups["tok"].Length > 0)
                    yield return match.Groups["tok"].Value;
                if (match.Groups["del"].Length > 0)
                    yield return match.Groups["del"].Value;
            }
        }
    }

    private static string ExpandTabs(this string val, int startIndex, int tabSize)
    {
        var result = new StringBuilder();
        foreach (var ch in val.ToCharArray())
        {
            if (ch == '\t')
            {
                int i = startIndex + result.Length;
                int count = tabSize - i % tabSize;
                result.Append(new string(' ', count));
            }
            else
            {
                result.Append(ch);
            }
        }

        return result.ToString();
    }

    private static IEnumerable<string> SplitLongerThan(this IEnumerable<string> tokens, int maxLength)
    {
        foreach (var token in tokens)
        {
            int i = 0;
            while (i + maxLength < token.Length)
            {
                yield return token.Substring(i, maxLength);
                i += maxLength;
            }

            yield return token.Substring(i);
        }
    }

    public static IEnumerable<string> MakeLines(this IEnumerable<string> tokens, int maxCol = 80, int tabSize = 4)
    {
        var line = new StringBuilder();
        foreach (var token in tokens.SplitLongerThan(maxCol))
        {
            if (token == "\r\n" || token == "\n")
            {
                yield return line.ToString();
                line = new StringBuilder();
            }
            else
            {
                var tokenEx = token.ExpandTabs(line.Length, tabSize);
                if (line.Length + tokenEx.Length < maxCol)
                {
                    line.Append(tokenEx);
                }
                else
                {
                    // The token does not fit: emit the current line and start a new one.
                    // Skip the emit while the line is still empty (a token as wide as
                    // maxCol arriving first) so that no spurious blank line is produced.
                    if (line.Length > 0)
                        yield return line.ToString();
                    line = new StringBuilder();
                    tokenEx = token.TrimStart().ExpandTabs(line.Length, tabSize);
                    line.Append(tokenEx);
                }
            }
        }

        if (line.Length > 0)
            yield return line.ToString();
    }
}

which should be used as:

string text = File.ReadAllText("SampleText.txt");
foreach (var line in text.Tokenize().MakeLines())
{
    if (line.Length == 80)
        Console.Write(line);
    else
        Console.WriteLine(line);
}

 

While it may not be perfect (doh), I think it is readable, and it avoids both indexing hell and an excess of corner cases. The entire project can be downloaded here:

https://skydrive.live.com/embed?cid=B46559FE42938868&resid=B46559FE42938868%21398&authkey=ALUNjE4U3oUKSGo
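
For readers who want to see the greedy rule in isolation, here is a much simplified sketch. It is not the TextUtils code: it only splits on spaces, ignores tabs and overlong words, and the names GreedyWrapSketch and Wrap are made up for this illustration. It just shows the core decision of starting a new line when the next word would push past the column limit.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, simplified illustration of greedy word wrapping.
static class GreedyWrapSketch
{
    public static IEnumerable<string> Wrap(string text, int maxCol)
    {
        var line = new StringBuilder();
        var words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
        {
            // The +1 accounts for the space that would join the word to the line.
            if (line.Length > 0 && line.Length + 1 + word.Length > maxCol)
            {
                yield return line.ToString();
                line.Clear();
            }

            if (line.Length > 0)
                line.Append(' ');
            line.Append(word);
        }

        if (line.Length > 0)
            yield return line.ToString();
    }

    static void Main()
    {
        // Prints three lines:
        //   No word-wrapping
        //   algorithm is
        //   perfect
        foreach (var line in Wrap("No word-wrapping algorithm is perfect", 16))
            Console.WriteLine(line);
    }
}
```

Unlike MakeLines, this sketch lets a line fill the limit exactly and never breaks inside a word, so a word longer than maxCol simply ends up on an overlong line of its own.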

Categories: Uncategorized

Tokenize a string using PowerShell and regular expressions

Just a snippet for creating tokens from a given string and a set of delimiters.

function tokenize($text, $delims = ".,;\s")
{
    # ref: http://stackoverflow.com/questions/521146/c-sharp-split-string-but-keep-split-chars-separators
    [regex]::Split($text, "(?<=[$delims])") | % { 
        if ($_[-1] -match "[$delims]")
        {
            if ($_.Length -gt 1) 
            {
                $_.SubString(0, $_.Length - 1)
            }
            $matches[0]
        }
        else
        {
            $_
        }
    }   
}

Sample usage:

tokenize "daniel;;;robert,johanna.torkil anna`tigor   adam"

Which will create the following tokens:

"daniel"; ";"; ";"; ";"; "robert"; ","; "johanna"; "."; "torkil"; " "; "anna"; "    "; "igor"; " "; " "; " "; "adam"

Categories: Uncategorized
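
Both the PowerShell function and the C# Tokenize method lean on the same first step: splitting on a zero-width lookbehind, so that each delimiter stays attached to the end of the piece before it. The following is a small C# sketch of just that step, on a made-up input; the class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of the "split but keep the separators" step.
static class LookbehindSplitDemo
{
    // Splits at every position that is preceded by a delimiter. Because the
    // lookbehind consumes no characters, the delimiter itself is kept.
    public static string[] SplitKeepingDelimiters(string text, string delims = @".,;\s")
    {
        return Regex.Split(text, "(?<=[" + delims + "])");
    }

    static void Main()
    {
        // Prints: [a,] [b;] [c ] [d]
        foreach (var piece in SplitKeepingDelimiters("a,b;c d"))
            Console.Write("[" + piece + "] ");
        Console.WriteLine();
    }
}
```

Note that an input ending in a delimiter yields a trailing empty piece, so a caller may want to filter empty strings out before using the result.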