Remove Google Stopwords From String

If you’re anything like me, you’ve often wondered, “How do I remove Google stopwords from my strings?” Well, maybe you haven’t, but this is the method I came up with to do just that.

Google stopwords are words that are ignored in your query and when your web pages are indexed: things like “a”, “about”, and “above”. Here’s a list of stopwords assumed to be used by search engines: http://www.ranks.nl/resources/stopwords.html and here’s another: http://www.link-assistant.com/seo-stop-words.html.

My method converts this text:

This is a text with lots and lots of stopwords. It also
talks a bit about load balancing. Load balancing is not a stopword.

To this text:

      text   lots   lots   stopwords.
talks   bit   load balancing. Load balancing       stopword.

Note that there are many double spaces, which we’ll take care of next.

The actual regex for clearing out stopwords looks like this (I’m only using a few of the words here):

(?<=(\A|\s|\.|,|!|\?))(a|able|about|above|abroad)(?=(\s|\z|\.|,|!|\?))

Btw, I use the awesome Regex Buddy for all my regex needs. Regex Buddy, you’re my buddy too!

The regex uses look-ahead and look-behind, which not all regex engines support, and regex syntax may differ between engines. The reason it uses look-ahead and look-behind is that we don’t want the regex to “eat” the word separators. If it did, two stopwords in a row would share a separator: the first match would consume it, leaving the second word without the leading separator it needs, so every second stopword in a run would be missed. Believe me, it’s true, or use (you guessed it) Regex Buddy to confirm.
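
To make the “every second word” problem concrete, here is a minimal sketch that runs the same replacement twice: once with the separators as part of the match, and once with them only asserted by look-behind and look-ahead. The three-word list and the names LookaroundDemo, ReplaceConsuming and ReplaceWithLookaround are made up for this illustration, and the separator set is cut down to whitespace and string boundaries.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical three-word stopword list, just for this demonstration.
    private const string Words = "this|is|a";

    // Variant 1: the separators are part of the match, so they are consumed.
    public static string ReplaceConsuming(string input)
    {
        var regex = new Regex(@"(\A|\s)(" + Words + @")(\s|\z)", RegexOptions.IgnoreCase);
        return regex.Replace(input, " ");
    }

    // Variant 2: the separators are only asserted, never consumed.
    public static string ReplaceWithLookaround(string input)
    {
        var regex = new Regex(@"(?<=(\A|\s))(" + Words + @")(?=(\s|\z))", RegexOptions.IgnoreCase);
        return regex.Replace(input, " ");
    }

    public static void Main()
    {
        const string text = "This is a text";

        // Brackets make the leading spaces visible in the console.
        Console.WriteLine("[" + ReplaceConsuming(text) + "]");
        Console.WriteLine("[" + ReplaceWithLookaround(text) + "]");
    }
}
```

In the first variant the match for “This” swallows the space that “is” needs in front of it, so “is” survives. The second variant removes all three stopwords.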

If you’re clever, like me, you’ll be able to read the regex like it was plain English… OK, it’s obtuse and obfuscated, but it works.

I generated it like this (after using the aforementioned Regex Buddy to determine the format):

string[] stopwords = File.ReadAllLines("EnglishStopwords.txt");

string regexCode = 
  @"(?<=(\A|\s|\.|,|!|\?))(" + 
  string.Join("|", stopwords) + 
  @")(?=(\s|\z|\.|,|!|\?))";

Regex regex = new Regex(regexCode, RegexOptions.Singleline | RegexOptions.IgnoreCase);

string cleaned = regex.Replace(indata, " ");
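
One thing to watch out for when the pattern is built straight from a file: a blank line becomes an empty alternative, and a word containing regex metacharacters is read as a pattern rather than as text. The following is a sketch of one way to guard against both, not the code I actually ran. BuildStopwordRegex is a hypothetical helper name, and the sample list (including “i.e.”) is assumed purely to show the escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopwordPattern
{
    // Hypothetical helper (not part of the original code): builds the same
    // look-behind/look-ahead pattern, but trims, filters and escapes the words first.
    public static Regex BuildStopwordRegex(IEnumerable<string> stopwords)
    {
        string[] alternatives = stopwords
            .Select(w => w.Trim())
            .Where(w => w.Length > 0)        // drop blank lines from the file
            .Select(w => Regex.Escape(w))    // treat ".", "+", "(" etc. literally
            .ToArray();

        if (alternatives.Length == 0)
            throw new ArgumentException("The stopword list is empty.", nameof(stopwords));

        string pattern =
            @"(?<=(\A|\s|\.|,|!|\?))(" +
            string.Join("|", alternatives) +
            @")(?=(\s|\z|\.|,|!|\?))";

        return new Regex(pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Assumed sample list; a real one would come from File.ReadAllLines.
        Regex regex = BuildStopwordRegex(new[] { "this", "is", "", "a", "i.e." });

        // Brackets make the leftover spaces visible in the console.
        Console.WriteLine("[" + regex.Replace("This is a good idea, i.e. a text", " ") + "]");
    }
}
```

Without the escaping, the entry “i.e.” would also match “idea”, because each unescaped dot matches any character.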

As for removing the double spaces, use this:

Regex removeDoubleWhiteSpace = 
  new Regex(@"\s{2,}", RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Reuse the existing variable; declaring "cleaned" a second time would not compile.
cleaned = 
  removeDoubleWhiteSpace.Replace(cleaned, " ");
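
Because \s also matches tabs and line breaks, this step folds multi-line text onto one line wherever a line break sits next to other whitespace, and a run of spaces at the very start shrinks to a single leading space. Here is a small sketch of the same step that also trims the ends. The Trim() call and the names WhitespaceCleanup and Collapse are my additions for illustration, not part of the code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCleanup
{
    // Collapses every run of two or more whitespace characters (spaces, tabs,
    // line breaks) into one space, then trims the ends. The Trim() call is an
    // addition for illustration; the original code does not trim.
    public static string Collapse(string input)
    {
        return Regex.Replace(input, @"\s{2,}", " ").Trim();
    }

    public static void Main()
    {
        // Assumed sample: roughly what the stopword step leaves behind.
        string cleaned = "      text   lots   lots   stopwords.  \r\ntalks   bit";

        // Brackets make any remaining leading or trailing space visible.
        Console.WriteLine("[" + Collapse(cleaned) + "]");
    }
}
```

A single line break with no whitespace next to it is left alone, since the pattern needs at least two whitespace characters in a row.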

So, this is a rather compact version of what I’m doing:

public static void Test()
{
    string text =
        @"This is a text with lots and lots of stopwords. It also" + Environment.NewLine +
            "talks a bit about load balancing. Load balancing is not a stopword.";

    Console.WriteLine("Before:");
    Console.WriteLine(text);

    string[] stopwords = File.ReadAllLines(@"c:\slask\EnglishStopwords.txt");

    string regexCode = 
        @"(?<=(\A|\s|\.|,|!|\?))(" + 
        string.Join("|", stopwords) + 
        @")(?=(\s|\z|\.|,|!|\?))";

    Regex regex = new Regex(regexCode, RegexOptions.Singleline | RegexOptions.IgnoreCase);
            
    string cleaned = regex.Replace(text, " ");

    Console.WriteLine("\nAfter remove stopwords:");
    Console.WriteLine(cleaned);

    Regex removeDoubleWhiteSpace =
        new Regex(@"\s{2,}", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    cleaned =
        removeDoubleWhiteSpace.Replace(cleaned, " ");

    Console.WriteLine("\nAfter remove double white spaces:");
    Console.WriteLine(cleaned);
}

And the output looks like this:

Before:
This is a text with lots and lots of stopwords. It also
talks a bit about load balancing. Load balancing is not a stopword.

After remove stopwords:
      text   lots   lots   stopwords.
talks   bit   load balancing. Load balancing       stopword.

After remove double white spaces:
text lots lots stopwords. talks bit load balancing. Load balancing stopword.

About mfagerlund
Writes code in my sleep - and sometimes it even compiles!
