Extract only viewable text from HTML with C#

If you need to extract the viewable text from an HTML page, you can use the code below. It uses HtmlAgilityPack.

I’ve found other methods online to extract the text, but they delete the script and style nodes from the HtmlNode tree, which isn’t acceptable to me (I need them).

The code also uses a regex to remove the repeated white-spaces that invariably appear when the HTML is cleaned up.

private static Regex _removeRepeatedWhitespaceRegex = new Regex(@"(\s|\n|\r){2,}", RegexOptions.Singleline | RegexOptions.IgnoreCase);

[Test]
public void Extract_all_text_from_webpage()
{
    HtmlDocument document = new HtmlDocument();
    document.Load(@"c:\slask\bob.html");
    Console.WriteLine(ExtractViewableTextCleaned(document.DocumentNode));
}

public static string ExtractViewableTextCleaned(HtmlNode node)
{
    string textWithLotsOfWhiteSpaces = ExtractViewableText(node);
    return _removeRepeatedWhitespaceRegex.Replace(textWithLotsOfWhiteSpaces, " ");
}

public static string ExtractViewableText(HtmlNode node)
{
    StringBuilder sb = new StringBuilder();
    ExtractViewableTextHelper(sb, node);
    return sb.ToString();
}

private static void ExtractViewableTextHelper(StringBuilder sb, HtmlNode node)
{
    if (node.Name != "script" && node.Name != "style")
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            AppendNodeText(sb, node);
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            ExtractViewableTextHelper(sb, child);
        }
    }
}

private static void AppendNodeText(StringBuilder sb, HtmlNode node)
{
    string text = ((HtmlTextNode)node).Text;
    if (string.IsNullOrWhiteSpace(text) == false)
    {
        sb.Append(text);

        // If the last char isn't a white-space, add one - otherwise words
        // separated only by tags would run into each other.
        if (char.IsWhiteSpace(text[text.Length - 1]) == false)
        {
            sb.Append(" ");
        }
    }
}
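To see the whitespace cleanup in isolation, here is a minimal self-contained sketch that applies the same pattern as `_removeRepeatedWhitespaceRegex` to a string with the kind of gaps that tag removal leaves behind (the input string is made up for the example):

```csharp
using System;
using System.Text.RegularExpressions;

class WhitespaceCleanupDemo
{
    static void Main()
    {
        // Same pattern as _removeRepeatedWhitespaceRegex above: any run of
        // two or more whitespace characters collapses to a single space.
        var regex = new Regex(@"(\s|\n|\r){2,}", RegexOptions.Singleline | RegexOptions.IgnoreCase);

        string raw = "Hello \r\n\t  world,   this is\n\nspaced   out.";
        string cleaned = regex.Replace(raw, " ");

        Console.WriteLine(cleaned); // prints: Hello world, this is spaced out.
    }
}
```

Note that single spaces are left untouched; only runs of two or more whitespace characters are collapsed.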

Redust Site-Audit

I’m proud to announce that we’ve added a new tool to Redust called Redust Site-Audit. Redust Site-Audit automates the process of repeatedly checking that your pages look as good as possible to search engines.

Site-Audit will help you:

  • Discover all pages on your site
  • Scan those pages for things that will lower your position in the SERP (Search Engine Results Page)
  • Generate a comprehensive report of what was found on your pages
  • Sort the pages from worst to best, so you’ll know what pages to start working on.

Read more about Site-Audit in this blog post.

Based on Redust SEO-Rules

Redust Site-Audit is based on Redust SEO-Rules, a tool that scans individual pages for SEO rule violations; Site-Audit applies the same checks to entire sites, from tens of pages to thousands.

Most importantly, Site-Audit will help you quickly and easily find violations that may harm your entire site:

  • Verify that the page hasn’t been flagged as malware by Google
    • This can happen if someone hacks your site or uploads bad code in an open forum
    • Google will drop your page from its search results if this happens!
    • Google may punish your entire domain if this happens
    • Google may even drop your entire domain and all pages on your site
    • Firefox will refuse to open your page
  • Verify that the page hasn’t been flagged as phishing by Google
  • Verify that the page doesn’t link to any pages that have been flagged as malware or phishing by Google
    • Even if your page is clean, you’ll be guilty by association for linking
    • Google may punish the SERP for your entire domain if you link to bad pages
    • The linked page might have been safe when you added the link, but what if it’s been taken over by bad guys? How would you notice?
  • Verify that the page doesn’t contain offensive or derogatory language – even in the source code
  • Verify that the page doesn’t contain dead links to images/pages that have been removed
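As an illustration of what a dead-link check involves (this is my own sketch, not Redust’s actual implementation): pull the href/src URLs out of the page, then probe each one and treat a failed request as a dead link. The regex extraction below is deliberately simplistic; a real tool would use an HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

class DeadLinkCheckSketch
{
    // Naive extraction of absolute link/image targets from the HTML source.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        var regex = new Regex("(?:href|src)=\"(http[^\"]+)\"", RegexOptions.IgnoreCase);
        foreach (Match m in regex.Matches(html))
            links.Add(m.Groups[1].Value);
        return links;
    }

    // Probe a URL with a HEAD request; a WebException (404/410, DNS failure,
    // unreachable host) counts as dead.
    public static bool IsDead(string url)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD";
            using (request.GetResponse())
                return false; // got a response: the link is alive
        }
        catch (WebException)
        {
            return true;
        }
    }

    static void Main()
    {
        string html = "<a href=\"http://example.com/page\">x</a> <img src=\"http://example.com/img.png\">";
        foreach (string link in ExtractLinks(html))
            Console.WriteLine(link); // each candidate would then be passed to IsDead
    }
}
```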

Download Stock Quotes, Free, Easy and with Source Code

Every once in a while, I need stock quotes for some idea I want to try out. There are several methods for downloading quotes that cost money – but the data is freely available from Yahoo, so I figured I’d use that.

Turns out that 93 lines of code are enough – and that includes downloading a list of Nasdaq stock symbols.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace Cluster.ConsoleApp
{
    public static class MinimalStockDownloader
    {
        public static void DownloadStocks(string path)
        {
            // Allows you to download more items in parallel
            ServicePointManager.DefaultConnectionLimit = 10;
            List<string> symbols = GetStockSymbols();

            // Run in parallel - speeds things up considerably...
            Parallel.ForEach(
                symbols,
                symbol =>
                {
                    Console.WriteLine("Downloading {0}: Working...", symbol);
                    string suri = string.Format(
                        "http://ichart.finance.yahoo.com/table.csv?s={0}&g=d&ignore=.csv", 
                        symbol);                    
                    try
                    {
                        string quotes = Get(new Uri(suri));
                        File.WriteAllText(path + @"\" + symbol + ".csv", quotes);                    
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine(
                            "  Failed to download {0} with exception: \n{1}", 
                            symbol, 
                            e);
                    }
                    
                    Console.WriteLine(
                        "Downloading {0}: Done!", 
                        symbol);
                });
        }

        public static List<string> GetStockSymbols()
        {
            // You'll find a complete list of stocks at
            // http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download
            // referred to by http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ
            string stocksCsv = Get(new Uri("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"));

            List<string> symbols = new List<string>();
            Regex regexObj = new Regex("^\"(.*?)\"", RegexOptions.Singleline | RegexOptions.Multiline);
            Match matchResult = regexObj.Match(stocksCsv);
            while (matchResult.Success)
            {
                symbols.Add(matchResult.Groups[1].Value);
                matchResult = matchResult.NextMatch();
            }

            // First item is the text "Symbol".
            symbols = symbols.Skip(1).ToList();
            Console.WriteLine("Found {0} symbols", symbols.Count);
            return symbols;
        }

        public static string Get(Uri address)
        {
            Console.WriteLine("  Get {0}...", address);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(address);
            request.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.CacheIfAvailable);

            try
            {
                using (WebResponse response = request.GetResponse())
                using (var stream = response.GetResponseStream())
                using (var reader = new StreamReader(stream, Encoding.Default))
                {
                    return reader.ReadToEnd();
                }
            }
            catch
            {
                request.Abort();
                throw;
            }
        }
    }
}
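Each downloaded file is a plain CSV, so consuming the data takes only a few more lines. The sketch below assumes Yahoo’s historical-quote header (Date,Open,High,Low,Close,Volume,Adj Close) and pulls out the closing prices; the sample rows are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

class QuoteCsvParser
{
    // Parses (date, close) pairs from CSV text in Yahoo's historical format:
    // Date,Open,High,Low,Close,Volume,Adj Close
    public static List<Tuple<DateTime, decimal>> ParseCloses(string csv)
    {
        var closes = new List<Tuple<DateTime, decimal>>();
        string[] lines = csv.Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 1; i < lines.Length; i++) // skip the header row
        {
            string[] fields = lines[i].Split(',');
            closes.Add(Tuple.Create(
                DateTime.Parse(fields[0], CultureInfo.InvariantCulture),
                decimal.Parse(fields[4], CultureInfo.InvariantCulture)));
        }
        return closes;
    }

    static void Main()
    {
        string sample =
            "Date,Open,High,Low,Close,Volume,Adj Close\n" +
            "2013-05-02,60.10,61.00,59.80,60.50,1200000,60.50\n" +
            "2013-05-01,59.90,60.30,59.50,60.10,1000000,60.10\n";
        foreach (var quote in ParseCloses(sample))
            Console.WriteLine("{0:yyyy-MM-dd}: {1}", quote.Item1, quote.Item2);
    }
}
```

InvariantCulture matters here: the files use `.` as the decimal separator regardless of the machine’s locale.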