Extract only viewable text from HTML with C#

If you need to extract the viewable text from an HTML page, you can use the code below. It uses HtmlAgilityPack.

The other methods I’ve found online extract the text by deleting the script and style nodes from the HtmlNode tree, which isn’t acceptable to me (I need them).

The code also uses a regex to collapse the repeated whitespace that invariably appears when the HTML is cleaned up.
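To show the whitespace step on its own, here is a minimal sketch that needs nothing but the base class library. The class name `WhitespaceDemo` and the method name `Collapse` are made up for this illustration and are not part of the code further down.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Two or more whitespace characters in a row (\s covers spaces, tabs, \r and \n).
    private static readonly Regex RepeatedWhitespace = new Regex(@"\s{2,}");

    // Replaces every run of repeated whitespace with a single space.
    public static string Collapse(string text)
    {
        return RepeatedWhitespace.Replace(text, " ");
    }

    public static void Main()
    {
        string raw = "Hello \r\n\r\n   world \t again";
        Console.WriteLine(Collapse(raw)); // prints "Hello world again"
    }
}
```

Note that a single whitespace character is left untouched, so a lone newline between two words stays a newline. If you want every whitespace character normalized to a space, a pattern like `\s+` would probably be the one to try.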

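using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using NUnit.Framework; // assumed: the [Test] attribute below looks like NUnit

// \s already matches \n and \r, and neither Singleline nor IgnoreCase affects this pattern.
private static readonly Regex _removeRepeatedWhitespaceRegex = new Regex(@"\s{2,}");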

[Test]
public void Extract_all_text_from_webpage()
{
    HtmlDocument document = new HtmlDocument();
    document.Load(new MemoryStream(File.ReadAllBytes(@"c:\slask\bob.html")));
    Console.WriteLine(ExtractViewableTextCleaned(document.DocumentNode));
}

public static string ExtractViewableTextCleaned(HtmlNode node)
{
    string textWithLotsOfWhiteSpaces = ExtractViewableText(node);
    return _removeRepeatedWhitespaceRegex.Replace(textWithLotsOfWhiteSpaces, " ");
}

public static string ExtractViewableText(HtmlNode node)
{
    StringBuilder sb = new StringBuilder();
    ExtractViewableTextHelper(sb, node);
    return sb.ToString();
}

private static void ExtractViewableTextHelper(StringBuilder sb, HtmlNode node)
{
    if (node.Name != "script" && node.Name != "style")
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            AppendNodeText(sb, node);
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            ExtractViewableTextHelper(sb, child);
        }
    }
}

private static void AppendNodeText(StringBuilder sb, HtmlNode node)
{
    string text = ((HtmlTextNode)node).Text;
    if (!string.IsNullOrWhiteSpace(text))
    {
        sb.Append(text);

        // If the last char isn't whitespace, add a space; otherwise words
        // that are only separated by tags would end up glued together.
        char last = text[text.Length - 1];
        if (last != '\t' && last != '\n' && last != ' ' && last != '\r')
        {
            sb.Append(' ');
        }
    }
}
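
As a quick check of the point about scripts and styles, here is a small sketch that parses an in-memory string with `HtmlDocument.LoadHtml` instead of reading a file. The class name `ViewableTextDemo`, the `Walk` helper and the sample markup are made up for this illustration; the traversal is a condensed version of the helper above, and it assumes HtmlAgilityPack is referenced.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class ViewableTextDemo
{
    public static string ExtractViewableText(HtmlNode node)
    {
        var sb = new StringBuilder();
        Walk(sb, node);
        return sb.ToString();
    }

    // Condensed traversal: skip script/style subtrees, collect text nodes.
    private static void Walk(StringBuilder sb, HtmlNode node)
    {
        if (node.Name == "script" || node.Name == "style")
        {
            return;
        }

        if (node.NodeType == HtmlNodeType.Text)
        {
            sb.Append(((HtmlTextNode)node).Text).Append(' ');
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            Walk(sb, child);
        }
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(
            "<html><head><style>p { color: red; }</style></head>" +
            "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>");

        Console.WriteLine(ExtractViewableText(document.DocumentNode).Trim());

        // The traversal only reads the tree, so the script node should still be there.
        Console.WriteLine(document.DocumentNode.SelectSingleNode("//script") != null);
    }
}
```

Because nothing is removed from the tree, the same `HtmlDocument` can still be used afterwards for anything that needs the scripts and styles.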

About mfagerlund
Writes code in my sleep - and sometimes it even compiles!

3 Responses to Extract only viewable text from HTML with C#

  1. Martin says:

    i had the need for something similar once – and also a need to learn regex.. so i made this little application: http://www.martinwardener.com/regex/. of course, there was a lot of criticism on stackoverflow about using regex for this: http://stackoverflow.com/questions/3951485/regex-extracting-readable-non-code-text-and-urls-from-html-documents – but, as they failed to grasp, it was a tech test. although it’s possible to break it, it works surprisingly well (and fast).

  2. Martin says:

    ..as a side note – i was using it for this: http://www.martinwardener.com/booyaa/transformer.aspx. it was a toy tool, where you can supply a url along with a list of words (comma separated) to replace with another list of words (also comma separated and in the same order) – to create “spoof” websites that look like their originals, except with certain words in the content replaced. like this: http://www.martinwardener.com/booyaa/?article=33 🙂
