I developed a gem called idftags. Its purpose is to extract the significant words of a string based on the tf-idf algorithm. For every word (or so-called term) of a document, the algorithm determines the term frequency and, based on a collection of documents, computes the inverse document frequency. From these two values we calculate a tf-idf value, which represents the significance of a term.
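
To make the idea concrete, here is a minimal sketch of a tf-idf computation in plain Ruby. It is not the implementation used by idftags: the helper names (tokenize, tfidf_scores, top_terms), the simple regex tokenizer and the smoothed idf variant are all assumptions chosen for illustration.

```ruby
# Illustrative sketch only; NOT the idftags implementation.
# All method names here are hypothetical.

# Split a string into lowercase terms (assumed, very simple tokenizer).
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Return a hash mapping each term of `document` to its tf-idf score,
# using `documents` (an array of strings) as the collection.
def tfidf_scores(document, documents)
  terms  = tokenize(document)
  corpus = documents.map { |doc| tokenize(doc).uniq }
  n      = corpus.size.to_f

  terms.tally.to_h do |term, count|
    tf  = count.to_f / terms.size                        # term frequency
    df  = corpus.count { |doc_terms| doc_terms.include?(term) }
    idf = Math.log((1 + n) / (1 + df)) + 1               # smoothed idf (assumed variant)
    [term, tf * idf]
  end
end

# Return the k terms with the highest tf-idf score.
def top_terms(document, documents, k)
  tfidf_scores(document, documents)
    .sort_by { |_term, score| -score }
    .first(k)
    .map(&:first)
end
```

A term scores high when it occurs often in the document (high tf) but in few documents of the collection (high idf).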

This sounds a little abstract, so I want to show some results based on real-life scenarios.

Challenges

The idea for this gem was born during the Rails Rumble. We created a Rails app, challenge me, where you can create challenges and blog about your progress in beating them. Technically speaking, a challenge is represented by a model that looks like the following:

class Challenge < ActiveRecord::Base

validates :title, :presence => true, :length => {:maximum => 140}
validates :description, :presence => true

# some other code ...

end

To make searching easier, we wanted to add tags to a challenge so that a user can filter challenges by tags. To make things even easier, we wanted to generate the tags from the challenge title, taking all the challenge titles into account. With idftags the code looks like the following:

require 'idftags'

document = challenge.title
documents = Challenge.all.map(&:title)

idftags = IDFTags::IDFTags.new
tags = idftags.tags(document, documents, 3)

At the time of writing we have 16 challenges, yielding the following titles:

[
  "Participate in RailsRumble", 
  "LOOSE WEIGHT", 
  "TRAVELING AROUND THE WORLD", 
  "Win the soccer world championship", 
  "test", 
  "Sleep 8 hours", 
  "Fun in the kitchen", 
  "I want to save money", 
  "I want to be awesome", 
  "Demo challenge", 
  "Sport", 
  "I want to beat cancer", 
  "100 days (public) streak on Github", 
  "Daily UI", 
  "Learn to solve the Rubik's cube blindfolded", 
  "Learn Clojure and release a Application"
]

To test idftags, I created tags for the following document:

'I want to create something amazing and get famous'

After evaluating the tags with different weights, the most common tags were:

  1. famous
  2. get
  3. amazing

Not bad, considering that we only have 16 documents. The algorithm works better with more and larger documents to match a term against. This leads to our next case study.

Stack Overflow

I took the 6 latest questions on Stack Overflow tagged "Ruby on Rails" and extracted both the title and the author's original text as the document base. I do not post them here, but I can add them as a download if you insist ;)

Then I took the latest question, which was

"Can I use api controller for other controller in Ruby on Rails"

and generated tags with idftags. Again, after trying out several weights, the 5 most significant words were:

  1. api
  2. use
  3. controller
  4. rails
  5. ruby

Again, not bad for only 6 larger documents. Note that for both scenarios I used an appropriate bad word lexicon to filter out common words (like 'a', 'is', 'to' and so on).
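
Filtering with such a lexicon can be sketched roughly as follows. This is an assumption about the general technique, not the way idftags does it: the word list, the STOP_WORDS constant and the remove_stop_words helper are made up for this example.

```ruby
require 'set'

# Hypothetical, deliberately tiny lexicon of common words to ignore.
STOP_WORDS = Set.new(%w[a an and i is the to]).freeze

# Tokenize `text` and drop every term found in the lexicon.
# (Illustrative helper, not part of the idftags API.)
def remove_stop_words(text, stop_words = STOP_WORDS)
  text.downcase.scan(/[a-z0-9']+/).reject { |term| stop_words.include?(term) }
end

remove_stop_words('I want to create something amazing and get famous')
# => ["want", "create", "something", "amazing", "get", "famous"]
```

Only the remaining terms are then candidates for tags, which keeps words like 'to' from showing up just because the document collection is small.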

Summary

idftags works as intended, and you now have a starting point for using this gem. Sometimes undesired words do show up as tags (like 'use' in the Stack Overflow example). Still, most of the time it generates useful output, and usually you do not store the generated tags directly but offer them as suggestions to the user, who can accept or decline them.

This brings us to the roadmap. Accepting and declining tags found by the algorithm could help idftags produce more useful tags in the future by applying learning algorithms. This is not implemented, and not even planned yet, but I really want to add it in the future. So maybe there will be some updates rather soon ;)

For more information, check out the official repo, and leave some feedback if you want to support it.