Using "Natural": An NLP Module for node.js
Last year, however, brought a new platform to my hobby work: node.js. My, node and its community were young but maturing rapidly.
When the need for natural language facilities arose, I found the pickings pretty slim. I have to be honest: that's *exactly* what I was hoping for, an opportunity to sink my teeth into the algorithms themselves and contribute them back to a young, but growing, community.
Thus I began work on "natural", a module of basic natural language processing algorithms for node.js. The idea was loosely based on the Python NLTK in that all algorithms are in the same package. Initially I didn't think "natural" could be as complete as the NLTK, but as my own understanding and community contributions picked up I've become much more hopeful. Also, merging with Rob Ellis's node-nltools back in August of 2011 strengthened "natural" further by rapidly bringing new algorithms and features into the fold.
As of version 0.1.5 Rob, other contributors, and I have managed to get the following feature list together:
- Stemming
  - Porter
  - Lancaster
- Phonetic
  - SoundEx
  - Metaphone
  - Double Metaphone
- Classification
  - Naive Bayes
  - Logistic Regression
- String Distance
  - Levenshtein (thanks Sid Nallu)
  - Jaro-Winkler (thanks Adam Phillabaum)
  - Dice's Coefficient (thanks John Crepezzi)
- Tokenization
  - Treebank
  - Word
  - Word-Punctuation
- Inflection
  - Numeric
  - Nouns Singular/Pluralization
  - Present-tense verb Singular/Pluralization
- tf*idf
- n-grams
- WordNet
I'll not cover every single module and feature in this article, but will instead outline what's the most commonly used and most mature.
Installing
Like most node modules, "natural" is packaged for npm and can be installed from the command line as such:
npm install natural
If you want to install from source (or contribute for that matter) it can be found here on GitHub.
Stemming
The first class of algorithms I'd like to outline is stemming. Stemming is the process of reducing a word to a root (not necessarily the morphological root). In other words, the idea is to boil all conjugations, tenses and forms down to a single root word. That root may not end up looking exactly like the English root, but should be close enough for comparison.
Stemming is a typical step in preparing text for use by other algorithms or storage such as classification or even full-text indexing. Both the Lancaster and Porter algorithms are supported as of 0.1.5. Here's a basic example of stemming a word with a Porter Stemmer.
var natural = require('natural'),
stemmer = natural.PorterStemmer;
var stem = stemmer.stem('stems');
console.log(stem);
stem = stemmer.stem('stemming');
console.log(stem);
stem = stemmer.stem('stemmed');
console.log(stem);
stem = stemmer.stem('stem');
console.log(stem);
Above I simply required-up the main "natural" module and grabbed the PorterStemmer sub-module from within. Calling the "stem" function takes an arbitrary string and returns the stem. The above code produces the following output:
stem
stem
stem
stem
For convenience stemmers can patch String with methods to simplify the process by calling the attach method. String objects will then have a stem method.
stemmer.attach();
stem = 'stemming'.stem();
console.log(stem);
It's very possible you'd be interested in stemming a string composed of many words, perhaps an entire document. The attach method provides a tokenizeAndStem method to accomplish this. It breaks the owning string up into an array of strings, one for each word, and stems them all. For example:
var stems = 'stems returned'.tokenizeAndStem();
console.log(stems);
produces the output:
[ 'stem', 'return' ]
Note that by default the tokenizeAndStem method omits words that are considered irrelevant (stop words) from the returned array. To instruct the stemmer to keep stop words, pass true to tokenizeAndStem for the keepStops parameter. Consider:
console.log('i stemmed words.'.tokenizeAndStem());
console.log('i stemmed words.'.tokenizeAndStem(true));
outputting:
[ 'stem', 'word' ]
[ 'i', 'stem', 'word' ]
All of the code above would also work with a Lancaster stemmer by requiring the LancasterStemmer module instead, like:
var natural = require('natural'),
stemmer = natural.LancasterStemmer;
Of course the actual stems produced could be different depending on the algorithm chosen. The Lancaster stemmer tends to be a bit more aggressive, resulting in roots that look less like their English equivalents, but will likely perform better.
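To make the idea of "boiling forms down to a shared root" concrete, here's a deliberately naive suffix-stripping sketch. It is not the Porter or Lancaster algorithm and has nothing to do with natural's internals; the toyStem name and its three-suffix rule list are made up purely for illustration.

```javascript
// Toy stemmer: strip one common suffix, then collapse a doubled final consonant.
// Real stemmers apply many ordered rules with conditions on the remaining stem.
function toyStem(word) {
  let stem = word.toLowerCase();
  for (const suffix of ['ing', 'ed', 's']) {
    // Only strip when at least three characters would remain.
    if (stem.endsWith(suffix) && stem.length - suffix.length >= 3) {
      stem = stem.slice(0, -suffix.length);
      break;
    }
  }
  // 'stemm' -> 'stem'
  return stem.replace(/([^aeiou])\1$/, '$1');
}

['stems', 'stemming', 'stemmed', 'stem'].forEach(function (word) {
  console.log(word + ' -> ' + toyStem(word));
});
```

All four forms collapse to the same root here, but a rule set this small mangles plenty of ordinary words (anything ending in a double "s", for instance), which is exactly why you'd reach for a real stemmer.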
Phonetics
Phonetic algorithms are also provided to determine what words sound like and compare them accordingly. The old (and I mean pre-electronic computers old... like 1918 old) SoundEx and the more modern Metaphone/Double Metaphone algorithms are supported as of 0.1.5.
The following example compares the string "phonetics" and the intentional misspelling "fonetix" and determines they sound alike according to the Metaphone module but the same pattern could be applied to the DoubleMetaphone or SoundEx modules.
var natural = require('natural'),
phonetic = natural.Metaphone;
var wordA = 'phonetics';
var wordB = 'fonetix';
if(phonetic.compare(wordA, wordB))
console.log('they sound alike!');
The raw code the phonetic algorithm produces can be retrieved with the process method:
var phoneticCode = phonetic.process('phonetics');
console.log(phoneticCode);
resulting in:
FNTKS
Like the stemming implementations the phonetic modules have an attach method that patches String with shortcut methods, most notably soundsLike for comparison:
phonetic.attach();
if(wordA.soundsLike(wordB))
console.log('they sound alike!');
attach also patches in phonetics and tokenizeAndPhoneticize methods to retrieve the phonetic code for a single word and an entire corpus respectively.
console.log('phonetics'.phonetics());
console.log('phonetics rock'.tokenizeAndPhoneticize());
which outputs:
FNTKS
[ 'FNTKS', 'RK' ]
The above could also use SoundEx by substituting the following in for the require.
var natural = require('natural'),
phonetic = natural.SoundEx;
Note that SoundEx and Metaphone may have trouble with non-English words, but Double Metaphone should have some degree of success with many other languages.
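If you're curious what a phonetic code actually encodes, the classic American Soundex rules are small enough to sketch by hand. This is a standalone illustration under the textbook rules, not natural's SoundEx module, and the soundex function name below is my own.

```javascript
// Textbook American Soundex: first letter plus three digits.
function soundex(word) {
  const codes = {
    b: 1, f: 1, p: 1, v: 1,
    c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
    d: 3, t: 3,
    l: 4,
    m: 5, n: 5,
    r: 6
  };
  const letters = word.toLowerCase().replace(/[^a-z]/g, '');
  if (letters.length === 0) return '';

  let result = letters[0].toUpperCase();
  let previous = codes[letters[0]] || 0;

  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters[i];
    // h and w are skipped and do not separate letters with the same code.
    if (ch === 'h' || ch === 'w') continue;
    const code = codes[ch] || 0; // vowels (and y) code to 0
    if (code !== 0 && code !== previous) result += code;
    previous = code; // a vowel resets this, so a repeated code after it counts again
  }
  return result.padEnd(4, '0');
}

console.log(soundex('Robert'));    // R163
console.log(soundex('Rupert'));    // R163
console.log(soundex('phonetics')); // P532
console.log(soundex('fonetix'));   // F532
```

Because Soundex always keeps the first letter, "phonetics" and "fonetix" get different codes under these rules even though the remaining digits agree. That's one reason the more modern algorithms are usually the better pick for misspellings.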
tf*idf
tf*idf weights can be used to judge how important a given word is to a given document in a broader corpus (collection of documents). There are two components to a tf*idf weight: the term frequency (how often the word occurs in the document) and the inverse document frequency (how rare the word is across the corpus). To guarantee that a frequently-used, albeit semantically less important, word doesn't gain too much favor you'll want to ensure you have many documents in your TfIdf instance.
Consider the following code which adds a few documents to a corpus and then determines how important the words "ruby" and "node" are to them.
var natural = require('natural'),
TfIdf = natural.TfIdf,
tfidf = new TfIdf();
tfidf.addDocument('i code in c.');
tfidf.addDocument('i code in ruby.');
tfidf.addDocument('i code in ruby and node, but node more often.');
tfidf.addDocument('this document is about natural, written in node');
tfidf.addDocument('i code in fortran.');
console.log('node --------------------------------');
tfidf.tfidfs('node', function(i, measure) {
console.log('document #' + i + ' is ' + measure);
});
console.log('ruby --------------------------------');
tfidf.tfidfs('ruby', function(i, measure) {
console.log('document #' + i + ' is ' + measure);
});
The previous code will output the tf*idf weights for "node" and "ruby". The higher the weight the more important the word is to the document.
node --------------------------------
document #0 is 0
document #1 is 0
document #2 is 3.347952867143343
document #3 is 1.6739764335716716
document #4 is 0
ruby --------------------------------
document #0 is 0
document #1 is 1.6739764335716716
document #2 is 1.6739764335716716
document #3 is 0
document #4 is 0
Additionally, you can measure a word against a single document.
console.log(tfidf.tfidf('node', 0 /* document index */));
console.log(tfidf.tfidf('node', 1));
You can also get a list of all terms in a document ordered by their importance.
tfidf.listTerms(4 /* document index */).forEach(function(item) {
console.log(item.term + ': ' + item.tfidf);
});
yielding:
fortran: 1.7047480922384253
code: 1.6486586255873816
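For anyone wondering how the two components combine, here's a rough sketch of the textbook formula: term frequency multiplied by the natural log of (documents in corpus / documents containing the term). I'm assuming that plain variant, a bare-bones tokenizer and no stop-word filtering, so it will not reproduce the exact numbers natural printed above; simpleTfidf and tokenize are names I've made up for the example.

```javascript
// Split on anything that isn't a letter; no stop-word removal.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Textbook tf*idf: (occurrences in the document) * ln(N / documents containing the term).
function simpleTfidf(term, docIndex, documents) {
  const tokenized = documents.map(tokenize);
  const tf = tokenized[docIndex].filter(function (t) { return t === term; }).length;
  const df = tokenized.filter(function (tokens) { return tokens.includes(term); }).length;
  if (tf === 0 || df === 0) return 0;
  return tf * Math.log(documents.length / df);
}

const corpus = [
  'i code in c.',
  'i code in ruby.',
  'i code in ruby and node, but node more often.',
  'this document is about natural, written in node',
  'i code in fortran.'
];

corpus.forEach(function (doc, i) {
  console.log('document #' + i + ' is ' + simpleTfidf('node', i, corpus));
});
```

The shape of the result is the same as before: "node" scores twice as high in the document that mentions it twice, and a word like "in" that appears in every document scores 0 because ln(5/5) is 0.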
Inflection
Basic inflectors are in place to convert nouns between plural and singular forms and to turn integers into string counters (e.g. '1st', '2nd', '3rd', '4th', etc.).
The following example converts the word "radius" into its plural form "radii".
var natural = require('natural'),
nounInflector = new natural.NounInflector();
var plural = nounInflector.pluralize('radius');
console.log(plural);
Singularization follows the same pattern, as illustrated in the following example which converts the word "beers" to its singular form, "beer".
var singular = nounInflector.singularize('beers');
console.log(singular);
Just like the stemming and phonetic modules, an attach method is provided to patch String with shortcut methods.
nounInflector.attach();
console.log('radius'.pluralizeNoun());
console.log('beers'.singularizeNoun());
A NounInflector instance can do custom conversion if you provide expressions via the addPlural and addSingular methods. Because these conversions aren't always symmetric (sometimes more patterns may be required to singularize forms than pluralize) there needn't be a one-to-one relationship between addPlural and addSingular calls.
nounInflector.addPlural(/(code|ware)/i, '$1z');
nounInflector.addSingular(/(code|ware)z/i, '$1');
console.log('code'.pluralizeNoun());
console.log('ware'.pluralizeNoun());
console.log('codez'.singularizeNoun());
console.log('warez'.singularizeNoun());
which would result in:
codez
warez
code
ware
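Here's one way a pattern-driven inflector like this could work under the hood: keep an ordered list of (pattern, replacement) pairs per direction and apply the first one that matches. This is only a guess at the general technique, not natural's actual NounInflector; TinyNounInflector and its default rules are invented for the example.

```javascript
// Hypothetical rule-based inflector: first matching rule wins, newest rules first.
class TinyNounInflector {
  constructor() {
    this.pluralRules = [[/$/, 's']];    // fallback: append "s"
    this.singularRules = [[/s$/i, '']]; // fallback: drop a trailing "s"
  }

  addPlural(pattern, replacement) {
    this.pluralRules.unshift([pattern, replacement]);
  }

  addSingular(pattern, replacement) {
    this.singularRules.unshift([pattern, replacement]);
  }

  pluralize(word) {
    return TinyNounInflector.apply(this.pluralRules, word);
  }

  singularize(word) {
    return TinyNounInflector.apply(this.singularRules, word);
  }

  static apply(rules, word) {
    for (const [pattern, replacement] of rules) {
      if (pattern.test(word)) return word.replace(pattern, replacement);
    }
    return word;
  }
}

const inflector = new TinyNounInflector();
inflector.addPlural(/us$/i, 'i');
inflector.addPlural(/(code|ware)$/i, '$1z');
inflector.addSingular(/(code|ware)z$/i, '$1');

console.log(inflector.pluralize('radius'));  // radii
console.log(inflector.pluralize('code'));    // codez
console.log(inflector.singularize('warez')); // ware
console.log(inflector.singularize('beers')); // beer
```

Keeping the plural and singular rule lists separate is what lets the number of addPlural and addSingular calls differ.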
Here's an example of using the CountInflector module to produce string counters for integers.
var natural = require('natural'),
countInflector = natural.CountInflector;
console.log(countInflector.nth(1));
console.log(countInflector.nth(2));
console.log(countInflector.nth(3));
console.log(countInflector.nth(4));
console.log(countInflector.nth(10));
console.log(countInflector.nth(11));
console.log(countInflector.nth(12));
console.log(countInflector.nth(13));
console.log(countInflector.nth(100));
console.log(countInflector.nth(101));
console.log(countInflector.nth(102));
console.log(countInflector.nth(103));
console.log(countInflector.nth(110));
console.log(countInflector.nth(111));
console.log(countInflector.nth(112));
console.log(countInflector.nth(113));
producing:
1st
2nd
3rd
4th
10th
11th
12th
13th
100th
101st
102nd
103rd
110th
111th
112th
113th
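The only tricky part of the rule behind that output is the "teens" exception: anything ending in 11, 12 or 13 takes "th" regardless of its last digit. Here's a minimal standalone sketch of that rule for non-negative integers; it's my own ordinal function, not CountInflector's source.

```javascript
// English ordinal suffix for a non-negative integer.
function ordinal(n) {
  const lastTwo = n % 100;
  // 11, 12 and 13 (and 111, 112, 113, ...) always take "th".
  if (lastTwo >= 11 && lastTwo <= 13) return n + 'th';
  switch (n % 10) {
    case 1: return n + 'st';
    case 2: return n + 'nd';
    case 3: return n + 'rd';
    default: return n + 'th';
  }
}

[1, 2, 3, 4, 11, 12, 13, 101, 111, 113].forEach(function (n) {
  console.log(ordinal(n));
});
```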
Classification
Classification is currently supported by the Naive Bayes and Logistic Regression algorithms, although natural's Naive Bayes implementation is the more mature of the two. You can use them for tasks like spam detection and sentiment analysis.
There are two fundamental steps involved in using a classifier: training and classification.
The following example takes care of the first step by requiring-up the classifier and training it with data. Naturally, this is only a sample. To do any production tasks you'd want many more training documents (hundreds per class depending on their size).
var natural = require('natural'),
classifier = new natural.BayesClassifier();
classifier.addDocument("my unit-tests failed.", 'software');
classifier.addDocument("tried the program, but it was buggy.", 'software');
classifier.addDocument("the drive has a 2TB capacity.", 'hardware');
classifier.addDocument("i need a new power supply.", 'hardware');
classifier.train();
By default the classifier will tokenize the corpus and stem it with a PorterStemmer. You can use a LancasterStemmer by passing it in to the BayesClassifier constructor as such:
var natural = require('natural'),
stemmer = natural.LancasterStemmer,
classifier = new natural.BayesClassifier(stemmer);
With the classifier trained it can now classify documents via the classify method:
console.log(classifier.classify('did the tests pass?'));
console.log(classifier.classify('did you buy a new drive?'));
resulting in the output:
software
hardware
Similarly the classifier can be trained on arrays rather than strings, bypassing tokenization and stemming. This allows the consumer to perform custom tokenization and stemming if any at all. This is especially useful if the corpus is not English.
classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');
classifier.train();
It's possible to persist and recall the results of a training via the save method:
var natural = require('natural'),
classifier = new natural.BayesClassifier();
classifier.addDocument(['unit', 'test'], 'software');
classifier.addDocument(['bug', 'program'], 'software');
classifier.addDocument(['drive', 'capacity'], 'hardware');
classifier.addDocument(['power', 'supply'], 'hardware');
classifier.train();
classifier.save('classifier.json', function(err, classifier) {
// the classifier is saved to the classifier.json file!
});
The training could then be recalled later with the load method:
var natural = require('natural'),
classifier = new natural.BayesClassifier();
natural.BayesClassifier.load('classifier.json', null, function(err, classifier) {
console.log(classifier.classify('did the tests pass?'));
});
Note that substituting LogisticRegressionClassifier for BayesClassifier should generally work as a drop-in replacement.
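To give a feel for what training and classification amount to, here's a tiny multinomial Naive Bayes sketch with add-one (Laplace) smoothing, working on pre-tokenized arrays like the ones above. It's a textbook formulation written for this article; I'm not claiming it mirrors how natural's BayesClassifier is implemented, and the TinyBayes class is hypothetical.

```javascript
// Minimal multinomial Naive Bayes over token arrays (illustrative only).
class TinyBayes {
  constructor() {
    this.docCounts = new Map();   // label -> number of training documents
    this.tokenCounts = new Map(); // label -> Map(token -> count)
    this.totalTokens = new Map(); // label -> total tokens seen for that label
    this.vocabulary = new Set();
    this.totalDocs = 0;
  }

  // "Training" here is just counting.
  addDocument(tokens, label) {
    if (!this.tokenCounts.has(label)) {
      this.docCounts.set(label, 0);
      this.tokenCounts.set(label, new Map());
      this.totalTokens.set(label, 0);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    const counts = this.tokenCounts.get(label);
    for (const token of tokens) {
      counts.set(token, (counts.get(token) || 0) + 1);
      this.totalTokens.set(label, this.totalTokens.get(label) + 1);
      this.vocabulary.add(token);
    }
    this.totalDocs += 1;
  }

  // Pick the label with the highest log prior + summed log likelihoods.
  classify(tokens) {
    let best = null;
    let bestScore = -Infinity;
    for (const [label, docCount] of this.docCounts) {
      let score = Math.log(docCount / this.totalDocs);
      const counts = this.tokenCounts.get(label);
      const denominator = this.totalTokens.get(label) + this.vocabulary.size;
      for (const token of tokens) {
        // Add-one smoothing keeps unseen tokens from zeroing out a label.
        score += Math.log(((counts.get(token) || 0) + 1) / denominator);
      }
      if (score > bestScore) {
        best = label;
        bestScore = score;
      }
    }
    return best;
  }
}

const tiny = new TinyBayes();
tiny.addDocument(['unit', 'test'], 'software');
tiny.addDocument(['bug', 'program'], 'software');
tiny.addDocument(['drive', 'capacity'], 'hardware');
tiny.addDocument(['power', 'supply'], 'hardware');

console.log(tiny.classify(['unit', 'bug']));    // software
console.log(tiny.classify(['drive', 'power'])); // hardware
```

Working in log space avoids multiplying many small probabilities together, which matters once documents get longer than a couple of tokens.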
n-grams
n-grams are essentially the destructuring of a sentence into overlapping, contiguous lists of n size and are useful for building probabilistic language models. In this case the n-grams are composed of words but outside of "natural" or even natural language processing they could be of other countable objects.
Consider the following examples which illustrate the production of trigrams (n-grams of length 3), bigrams (n-grams of length 2), and arbitrary n-grams using the trigrams, bigrams and ngrams functions respectively.
var natural = require('natural'),
NGrams = natural.NGrams;
console.log(NGrams.trigrams('some other words here'));
console.log(NGrams.trigrams(['some', 'other', 'words', 'here']));
both of which produce:
[ [ 'some', 'other', 'words' ], [ 'other', 'words', 'here' ] ]
console.log(NGrams.bigrams('some words here'));
console.log(NGrams.bigrams(['some', 'words', 'here']));
both of which produce:
[ [ 'some', 'words' ], [ 'words', 'here' ] ]
console.log(NGrams.ngrams('some other words here for you', 4));
which outputs:
[ [ 'some', 'other', 'words', 'here' ],
  [ 'other', 'words', 'here', 'for' ],
  [ 'words', 'here', 'for', 'you' ] ]
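Under the hood an n-gram generator is just a sliding window over the token list. Here's a small standalone sketch of that idea; the slidingNgrams name and the whitespace-only tokenization are my own simplifications, not what natural's NGrams module does internally.

```javascript
// Slide a window of size n across the tokens, one step at a time.
function slidingNgrams(sequence, n) {
  const tokens = Array.isArray(sequence)
    ? sequence
    : sequence.split(/\s+/).filter(Boolean);
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

console.log(slidingNgrams('some words here', 2));
console.log(slidingNgrams(['some', 'other', 'words', 'here'], 3));
```

A sequence of k tokens yields k - n + 1 windows, so asking for n-grams longer than the input simply returns an empty array.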
String Distance
"natural" supplies the Dice's coefficient, Levenshtein distance, and Jaro-Winkler distance algorithms for determining string similarity. These algorithms are concerned with orthographic (spelling) similarity, not necessarily phonetics.
Each algorithm produces a number indicating its perception of similarity, but each is determined differently and can even move in opposite directions. For instance, the more dissimilar two strings are the greater the Levenshtein distance, but Jaro-Winkler considers two totally dissimilar strings to have a value of 0 with identical strings having a value of 1.
The following example shows each algorithm's perception of the difference between the words "execution" and "intention".
var natural = require('natural');
console.log(natural.JaroWinklerDistance('execution', 'intention'));
console.log(natural.LevenshteinDistance('execution', 'intention'));
console.log(natural.DiceCoefficient('execution', 'intention'));
resulting in the output:
0.48148148148148145
8
0.375
Now to consider totally identical strings.
var natural = require('natural');
console.log(natural.JaroWinklerDistance('same', 'same'));
console.log(natural.LevenshteinDistance('same', 'same'));
console.log(natural.DiceCoefficient('same', 'same'));
which yields:
1
0
1
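To show why these measures move in opposite directions, here are textbook sketches of two of them: Levenshtein as a dynamic-programming edit count (larger means more different) and Dice's coefficient as character-bigram overlap (larger means more similar). Both are standalone illustrations rather than natural's code, and the substitutionCost parameter is something I've added for the example.

```javascript
// Edit distance with unit insert/delete cost and a configurable substitution cost.
function levenshtein(a, b, substitutionCost = 1) {
  let previous = Array.from({ length: b.length + 1 }, function (_, j) { return j; });
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : substitutionCost;
      current[j] = Math.min(
        previous[j] + 1,       // deletion
        current[j - 1] + 1,    // insertion
        previous[j - 1] + cost // match or substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

function characterBigrams(s) {
  const out = [];
  for (let i = 0; i < s.length - 1; i++) out.push(s.slice(i, i + 2));
  return out;
}

// 2 * shared bigrams / total bigrams, counting repeated bigrams as a multiset.
function diceCoefficient(a, b) {
  const x = characterBigrams(a.toLowerCase());
  const y = characterBigrams(b.toLowerCase());
  if (x.length + y.length === 0) return a === b ? 1 : 0;
  const remaining = new Map();
  for (const gram of y) remaining.set(gram, (remaining.get(gram) || 0) + 1);
  let shared = 0;
  for (const gram of x) {
    const left = remaining.get(gram) || 0;
    if (left > 0) {
      shared += 1;
      remaining.set(gram, left - 1);
    }
  }
  return (2 * shared) / (x.length + y.length);
}

console.log(levenshtein('execution', 'intention'));     // 5
console.log(levenshtein('execution', 'intention', 2));  // 8
console.log(diceCoefficient('execution', 'intention')); // 0.375
```

With unit costs the pair comes out at 5, while charging 2 per substitution gives the 8 shown earlier. Treat that match as an observation about the numbers rather than a statement about natural's internals; many of the algorithms take parameters that change the result.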
Conclusion and Roadmap
Well, that was a summary of a sizable portion of "natural". Many of the algorithms have additional parameters that can be used to tweak their operation and a few modules weren't represented at all, but the official README can help fill that gap.
There's still plenty in store for "natural". While the current plan is certainly not limited to the following points, these are indeed slated for at least some kind of attention by fall 2012.
- Non-English-specific stemming algorithms
- Pure javascript version
- Maximum entropy classifier
- Clustering algorithms (k-means in development)
- Part of speech tagging
- Punkt sentence segmentation
With the exception of k-means, which is near completion, I'd love community help on nearly every one! To either help out or follow along check out the GitHub repository.