The Similar Stories Plugin v1.0
Created by Ian Stott (I_P_S@hotmail.com)
Based upon the Count Pages plugin by Grant Drake
Requires calibre >= 0.8.18
Aim
You have just finished a great book and you wish to read something else like it. In the past you had to hope that books by the same author or with the same tags would yield suitable candidates from your Calibre library. But your only sure way was to start reading some books and see if they were close enough.
This plug-in helps you to find other books within your Calibre library that are similar to your target book. It does this by examining the full text of the books in your library, making use of current Text Informatics techniques.
Usage
Use the plug-in to find books in your library similar to a target book in the following way:
- The first time you use the plug-in you need to identify a "Custom Column" that will hold the results.
If you have not already created a suitable custom column, do this in the usual way using the "Add your own columns" dialogue box (found by right-clicking on a column heading in the main view).
Select a column type of "floating point numbers" and leave the "Format for numbers" section empty.
The first time you run the plug-in, or when you configure it, you will be able to select this column to hold your results.
- Select the target book. All other selected books will be compared to this one.
- Add to your selection all of the other books that you wish to compare to your target book. If you are having trouble doing this while ensuring that the target book is the first selected, you can edit the Similarity score of your target book (in your selected custom column) so that it is set to 1, then sort by the Similarity score.
- Run the plug-in
- Once it has run, it will ask if you want the results loaded up into your selected custom column. You will have to respond with "Yes" to see the results.
- The results will appear in your selected Custom Column. The higher the score, the better the match that book is to the target. A similarity score of 1 means that the book is a perfect match (and probably the same book). This means that the target book will have a score of 1.
- I find it easiest to sort the books by Similarity score, with the highest (most similar) at the top.
It would probably make life easier for users if you could select a single book to be the target and compare it against the rest of the visible books.
However, I have not been able to identify the function that will return a list of book_ids for all of the books that are currently visible (as opposed to selected).
If some kind person could provide a clue on how to do this, I'll happily update the plug-in behaviour accordingly.
Approach
Warning, this section contains explicit mathematical equations.
To assess the similarity of two documents, you must answer three questions:
- What representation of words is to be used? e.g. is the word "Vampire" to be considered the same as "vampires"? They differ in two ways: one contains an upper-case character, and the other is the plural.
- How are the words of a document to be represented so that they can be compared? Options range from a simple "bit-string" representation, where a 1 at the ith position of the bit-string means the ith word in your vocabulary is present in your document, up to techniques like natural language processing.
- How are you to compare the representations of your two documents, i.e. the similarity metric? The options available depend upon your choice of representation. For example, bit-strings are simple to compare and your options are limited to a few dozen metrics. Representations containing more detail, such as word counts, provide almost endless opportunities for comparison.
In this first release, I have made the following choices:
- Word representation.
I have selected the simplest approach: removing all formatting and punctuation, converting all words to lower case and removing short words (of one or two characters). Everything else is used: this includes the title, authors, the full text and anything else present within your copy of the book.
You are given the choice of which version of a book you wish to use, though this means that you are currently limited to books in MOBI or EPUB format. If your favourite book isn't in one of these formats, let me know and I will add the relevant format - or you can use calibre to convert it to a supported format.
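As a rough sketch, the word-representation step described above (lower-casing, stripping punctuation and formatting, and dropping one- and two-character words) might look like the following. This is not the plug-in's actual code, and the regular expression used is my own assumption:

```python
import re

def tokenize(text):
    # Lower-case the text and keep only alphabetic runs, which
    # strips punctuation and formatting in one pass.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop short words of one or two characters.
    return [w for w in words if len(w) > 2]

print(tokenize("The Vampire's lair -- it was dark!"))
# -> ['the', 'vampire', 'lair', 'was', 'dark']
```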
- Document representation.
- TF-IDF
The TF-IDF method may not be perfect, but it has a number of nice features and works well.
The TF-IDF method is described here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
In summary:
- TF(book,word): Term Frequency. Each word in a book is counted to determine the number of times it appears in the document.
TF = {number of times a word appears in a book} / {number of words in the book}
- IDF: Inverse Document Frequency. A measure of the importance of the word. For example, the word "for" will probably appear in every book that you are comparing, so it is not useful in determining how similar two books are. Conversely, if the word "vampire" appeared in your target book and only 5% of the rest of the books in your library, then it is probably a good measure for assessing similarity.
IDF(word) = log10({Number of Books being searched} / {number of books that contain "word"})
- TF-IDF(book,word) = TF(book,word) * IDF(word)
The plug-in calculates the TF-IDF score for all of the words used in each of the books being compared, as well as for your target book.
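A minimal sketch of the TF and IDF formulas above, with each book reduced to a list of tokens. The function and variable names are mine, not the plug-in's:

```python
import math
from collections import Counter

def tf(book_words):
    # Term frequency: count of each word / total words in the book.
    counts = Counter(book_words)
    total = len(book_words)
    return {w: c / total for w, c in counts.items()}

def idf(all_books):
    # Inverse document frequency across the set of books compared:
    # log10(number of books / number of books containing the word).
    n = len(all_books)
    df = Counter(w for book in all_books for w in set(book))
    return {w: math.log10(n / d) for w, d in df.items()}

def tf_idf(book_words, idf_scores):
    return {w: f * idf_scores[w] for w, f in tf(book_words).items()}

books = [["vampire", "castle", "night"],
         ["castle", "king", "night"],
         ["ship", "sea", "storm"]]
idf_scores = idf(books)
vec = tf_idf(books[0], idf_scores)
```

Note how a word that appears in every book gets an IDF of log10(1) = 0, so it contributes nothing to the similarity, exactly as intended.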
- Binary Fingerprint
A second, simpler (and so probably less suitable) method is to represent each book simply by the presence or absence of a word in the book.
This is like a fingerprint of 1's and 0's denoting the presence or absence of a word within a book.
This is an ideal method for comparing 2 books to see if they are the same. However, as it only looks at the words that are present, as opposed to how often they are used, it is necessarily a crude measure.
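The fingerprint idea can be sketched as a list of 0/1 bits over a shared vocabulary; the vocabulary below is purely illustrative:

```python
def bit_fingerprint(book_words, vocabulary):
    # Position i of the fingerprint is 1 when the i-th word of the
    # shared vocabulary appears anywhere in the book.
    present = set(book_words)
    return [1 if w in present else 0 for w in vocabulary]

vocab = ["castle", "king", "night", "vampire"]
fp = bit_fingerprint(["vampire", "castle", "night", "castle"], vocab)
# -> [1, 0, 1, 1]
```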
- Similarity metric.
As each book can be considered as a vector of floating point numbers, where the length of the vector is the number of distinct words used in all of the books being compared, there are many similarity metrics available. I have implemented several in this version:
- Euclid: This calculates the "Euclidean distance" between two books, a & b, i.e.
Distance^2(a,b) = Sum{ (TFIDF(a,word) - TFIDF(b,word))^2 }
when summed over all words. As this gives a value of 0 when comparing 2 identical books, I have then adjusted the score so that the similarity is given by:
1 - 1000*distance.
The scaling of 1000 is so that there is a decent range of similarity scores. To present a set of results which varied from 1.0000 to 0.9997 would not inspire confidence.
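A sketch of this scaled Euclidean similarity, assuming the TF-IDF vectors are stored as sparse word-to-score dicts (an assumption about the representation, not the plug-in's actual code):

```python
import math

def euclid_similarity(vec_a, vec_b, scale=1000.0):
    # Similarity = 1 - scale * Euclidean distance; the scale factor
    # spreads the scores out so results are not all near 1.0.
    words = set(vec_a) | set(vec_b)
    dist = math.sqrt(sum((vec_a.get(w, 0.0) - vec_b.get(w, 0.0)) ** 2
                         for w in words))
    return 1.0 - scale * dist

# Two identical books are a perfect match with a score of 1.
print(euclid_similarity({"vampire": 0.01}, {"vampire": 0.01}))  # -> 1.0
```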
- Tanimoto:
This is the default method of choice when performing "chemical structure similarities" as opposed to document similarities and works very well in that field.
T(a,b) = A•B / (A•A + B•B - A•B)
Where A is the TF-IDF vector for book a and B is the TF-IDF vector for book b. A•B is the vector dot product of vectors A & B, i.e.:
A•B = Sum{ TFIDF(a,word) * TFIDF(b,word) } over all words.
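The Tanimoto formula above, written over the same sparse-dict representation (again a sketch under that assumption, not the plug-in's actual code):

```python
def dot(a, b):
    # Vector dot product of two sparse TF-IDF vectors (word -> score);
    # words missing from either vector contribute zero.
    return sum(score * b.get(word, 0.0) for word, score in a.items())

def tanimoto(a, b):
    # T(a,b) = A.B / (A.A + B.B - A.B)
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)
```

Identical vectors score exactly 1, and books sharing no words at all score 0.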
- Cosine:
Returns the cosine of the angle between the vectors A and B. This is equal to:
Cos(a,b) = A•B / (|A| * |B|)
Where |A| = sqrt(A•A)
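In code, the cosine similarity divides the dot product by the product of the two vector magnitudes (a sketch over the same sparse-dict representation assumed above):

```python
import math

def cosine(a, b):
    # Cos(a,b) = A.B / (|A| * |B|), with |A| = sqrt(A.A).
    ab = sum(score * b.get(word, 0.0) for word, score in a.items())
    mag = (math.sqrt(sum(s * s for s in a.values())) *
           math.sqrt(sum(s * s for s in b.values())))
    return ab / mag
```

Parallel vectors (one book's scores a uniform multiple of the other's) give exactly 1, and books with no words in common give 0.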
- Tanimoto (binary):
This doesn't use the TF-IDF metric to describe a book, but the far simpler method of just noting the presence or absence of a word in the book.
T(a,b) = Nab / (Na + Nb - Nab)
Na = number of bits (distinct words) in book A; Nb = the same for book B.
Nab = number of bits (words) in both A & B, i.e. the intersection of A & B (or A ∩ B).
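Because the "bits" are just the distinct words of each book, this reduces to set arithmetic (a sketch, not the plug-in's actual code):

```python
def tanimoto_binary(words_a, words_b):
    # T(a,b) = Nab / (Na + Nb - Nab), where Nab is the size of the
    # intersection of the two books' distinct-word sets.
    a, b = set(words_a), set(words_b)
    nab = len(a & b)
    return nab / (len(a) + len(b) - nab)

score = tanimoto_binary(["vampire", "castle", "night"],
                        ["castle", "king", "night"])
# 2 shared words out of (3 + 3 - 2) -> 0.5
```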
- PMRA (PubMed related articles):
This technique is the one used in the related article search feature found in PubMed.
The method is described in full in the paper: PubMed related articles: a probabilistic topic-based model for content similarity.
The method is used to identify the relevant topics of a document
(i.e. the important words that are then used to assess similarity with other documents) in the following fashion:
Whether or not a document is about a particular topic is computed from term frequencies, modelled as Poisson distributions. As the paper puts it: "Unlike previous probabilistic retrieval models, we do not attempt to estimate relevance, but rather our focus is 'relatedness', the probability that a user would want to examine a particular document given known interest in another."
I have adjusted the published method in only one way: the authors accounted for the differences in the length of documents by introducing the term l in equation 8 (using the paper's numbering system):
P(E|k) = (1 + η * (μ/λ)^k * exp(-(μ-λ)*l))^(-1)
This works fine for documents that are the abstracts of papers, where document lengths are in the hundreds or a few thousand words.
However, when equation (8) is applied to documents the length of full novels, i.e. hundreds of thousands of words, then P(E|k) becomes zero for almost every word.
I therefore adjusted equation 8 so that l is no longer the length (in words) of a document but the sqrt({length in words}).
Everything else in the implementation is as described in the paper.
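The adjusted equation (8) can be sketched as follows. Here eta, mu and lam stand for the per-word parameters η, μ and λ of the paper, and the parameter values in the example are purely illustrative, not taken from the paper or the plug-in:

```python
import math

def p_elite(k, doc_length, eta, mu, lam):
    # Adjusted equation (8): the length term l is replaced by
    # sqrt(document length in words), as described above, so the
    # exponential no longer collapses to zero for full-length novels.
    l = math.sqrt(doc_length)
    return 1.0 / (1.0 + eta * (mu / lam) ** k * math.exp(-(mu - lam) * l))

# More occurrences of a word (larger k) should raise the probability
# that the document is "about" that topic (here lam > mu).
low = p_elite(1, 10000, eta=1.0, mu=0.01, lam=0.05)
high = p_elite(5, 10000, eta=1.0, mu=0.01, lam=0.05)
```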
Caveats
There are a number of caveats to this plug-in, some obvious, some not.
The main caveat is that this plug-in only attempts to calculate a similarity score between your target book and the other books that you have selected.
This does not guarantee that the most similar book is at all similar to your target, or even if it is, that you will also enjoy it. For example, if you select "Conan the Barbarian" as your target and the individual works of Shakespeare as the other selected books, you should not be surprised if the resulting most similar book turns out to be very different from your target.
The current methods only utilise the individual words within a book. They do not incorporate any of the "sense" or meaning of the text. This will require far more complex methods.
Perhaps future releases will utilise them to assess how well they perform against these simpler approaches.
Despite all the limitations of the approach (and my implementation), this method does seem to work reasonably well.
For example, to test the approach, I selected a book from Jim Butcher's "Harry Dresden" series and compared it against a large library
of other books, as well as the rest of the series. When sorted by similarity score the other Harry Dresden books
are generally at the top of the similarity list. The other books near the top of the list I am currently reading, to see if they really are similar.
Next Steps
As indicated above, there are a multitude of options at each stage of calculating the similarity between two books. Beyond correcting errors that are discovered, the main effort will be to implement additional options for each of these stages, e.g. removal of plurals, introduction of a bitmap representation, and a decent set of further similarity metrics like Manhattan, Mahalanobis, Pearson correlation coefficient, Dice, etc.
Feedback is welcome, especially if you have a preference on next steps to implement.
Please let me know how well the current methods and implementation perform.
Ian Stott, January 13th 2012