The Tanimoto similarity *ST* is computed according to the definition for bit vectors (see Jaccard index on Wikipedia). For weights \(W_i \in \{0, 1\}\) it is identical to the Jaccard similarity. The Tanimoto similarity can be computed for any term vectors, but for 1 - *ST* to be a proper distance metric satisfying the triangle inequality, \(M_{i,j} \in \{0, W_i\}\) must hold.

The Jaccard similarity is computed using the Tanimoto similarity definition for bit vectors (see Jaccard index on Wikipedia). For the result to be a valid Jaccard similarity, every weight must be either zero or one. If any weight has a different value, a warning is issued.
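
As a rough illustration of the bit-vector definition, the following base R sketch computes pairwise Tanimoto similarities between the columns of a small toy subsumer matrix. The helper name `tanimoto_sketch` and the toy matrix are hypothetical and not part of the documented API. The sketch assumes the dot-product form \(a \cdot b / (|a|^2 + |b|^2 - a \cdot b)\), which for 0/1 vectors reduces to intersection over union. Whether the package uses exactly this form for non-binary weights is an assumption.

```r
# Hypothetical helper, not the package implementation: Tanimoto similarity
# between all pairs of columns (term vectors) of a subsumer matrix M.
tanimoto_sketch <- function(M) {
  M <- as.matrix(M)
  dot <- crossprod(M)   # dot[i, j] = sum over rows of M[, i] * M[, j]
  sq <- diag(dot)       # squared Euclidean norm of each column
  # Note: an all-zero column would yield NaN (0 / 0) here.
  dot / (outer(sq, sq, "+") - dot)
}

# Toy binary matrix: rows are subsuming classes, columns are terms.
M <- matrix(c(1, 1, 1, 0,    # termA
              1, 1, 0, 1,    # termB
              1, 0, 0, 1),   # termC
            nrow = 4,
            dimnames = list(paste0("class", 1:4),
                            c("termA", "termB", "termC")))

st <- tanimoto_sketch(M)
st
```

With 0/1 weights each entry equals the Jaccard index of the two terms' subsumer sets; for example, termA and termB share two of four distinct subsumers.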

The Cosine similarity *SC* is computed using the Euclidean dot product formula
(see Cosine similarity on Wikipedia).
The metric is valid for any term vectors (columns of the subsumer matrix), i.e.,
\(M_{i,j} \in \{0, W_i\}\) is not required. Note that
1 - *SC* is not a proper distance metric, because it violates the triangle
inequality. To obtain a distance metric, first convert *SC* to an angle by taking its arccosine.
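
The dot product formula and the conversion to an angle can be sketched in base R as follows. The helper names `cosine_sketch` and `angle_sketch` and the toy matrix are hypothetical and only illustrate the idea; they are not the package implementation.

```r
# Hypothetical helper: cosine similarity between all pairs of columns of M.
cosine_sketch <- function(M) {
  M <- as.matrix(M)
  dot <- crossprod(M)          # pairwise dot products of the columns
  norms <- sqrt(diag(dot))     # Euclidean norm of each column
  dot / outer(norms, norms)
}

# Hypothetical helper: convert cosine similarities to angles in radians.
angle_sketch <- function(sc) {
  # Clamp to [-1, 1] so rounding error cannot push acos() out of its domain.
  acos(pmin(pmax(sc, -1), 1))
}

# Toy matrix: rows are subsuming classes, columns are terms.
M <- matrix(c(1, 1, 1, 0,    # termA
              1, 1, 0, 1,    # termB
              1, 0, 0, 1),   # termC
            nrow = 4,
            dimnames = list(paste0("class", 1:4),
                            c("termA", "termB", "termC")))

sc <- cosine_sketch(M)
ang <- angle_sketch(sc)
ang
```

The resulting angles can be passed to `as.dist()` in place of 1 - *SC* when a proper distance is needed, for example for hierarchical clustering.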

The Resnik similarity between two terms is the information content (IC) of their most informative common ancestor (MICA), which is the common subsumer with the greatest information content.
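
A minimal base R sketch of this definition follows. It assumes a binary subsumer matrix (rows are subsuming classes, columns are terms) and a vector of relative class frequencies in (0, 1], with IC taken as the negative logarithm of the frequency. The helper name `resnik_sketch`, the class and term names, and the frequencies are all made up for illustration and do not reflect the package internals.

```r
# Hypothetical helper: Resnik similarity as the IC of the most informative
# common subsumer of each pair of terms.
resnik_sketch <- function(M, freqs, base = 10) {
  M <- as.matrix(M) > 0
  ic <- -log(freqs, base = base)   # information content of each class
  n <- ncol(M)
  sim <- matrix(0, n, n, dimnames = list(colnames(M), colnames(M)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      common <- M[, i] & M[, j]    # classes that subsume both terms
      if (any(common)) sim[i, j] <- max(ic[common])
    }
  }
  sim
}

# Toy data: four classes with made-up frequencies, three terms.
M <- matrix(c(1, 1, 1, 0,    # term1: root, c1, c2
              1, 1, 0, 0,    # term2: root, c1
              1, 0, 0, 1),   # term3: root, c3
            nrow = 4,
            dimnames = list(c("root", "c1", "c2", "c3"),
                            c("term1", "term2", "term3")))
freqs <- c(root = 1, c1 = 0.1, c2 = 0.01, c3 = 0.1)

sm <- resnik_sketch(M, freqs)
sm
```

In this toy case term1 and term2 share `root` and `c1`, so their similarity is the IC of `c1`; term1 and term3 share only `root`, whose IC is zero.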

tanimoto_similarity(subsumer_mat = NA, terms = NULL, ...)

jaccard_similarity(subsumer_mat = NA, terms = NULL, ...)

cosine_similarity(subsumer_mat = NA, terms = NULL, ...)

resnik_similarity(subsumer_mat = NA, terms = NULL, ..., wt = term_freqs, wt_args = list(), base = 10)

Argument | Description |
---|---|
subsumer_mat | matrix or data.frame, the vector-encoded matrix M of subsumers for which \(M_{i,j} = W_i, W_i > 0\) (W = weights), if class |
terms | character, optionally the list of terms (as IRIs and/or labels) for which to generate a properly encoded subsumer matrix on the fly. |
... | parameters to be passed on to |
wt | numeric or a function. If numeric, weights for the subsumer terms. For |
wt_args | list, named parameters for the function calculating term frequencies. Ignored if |
base | integer, the base of the logarithm for calculating information content from term frequency. The default is 10. |

A matrix with M[i,j] = similarity of terms *i* and *j*.

Philip Resnik (1995). "Using information content to evaluate semantic
similarity in a taxonomy". Proceedings of the 14th International Joint
Conference on Artificial Intelligence (IJCAI'95). **1**: 448–453

if (FALSE) {
  sm <- jaccard_similarity(terms = c("pelvic fin", "pectoral fin",
                                     "forelimb", "hindlimb",
                                     "dorsal fin", "caudal fin"),
                           .colnames = "label")
  sm

  # e.g., turn into distance matrix, cluster, and plot
  plot(hclust(as.dist(1-sm)))
}

if (FALSE) {
  phens <- get_phenotypes("basihyal bone", taxon = "Cyprinidae")
  sm.ic <- resnik_similarity(terms = phens$id,
                             .colnames = "label", .labels = phens$label,
                             wt_args = list(as = "phenotype", corpus = "taxa"))
  maxIC <- -log10(1 / corpus_size("taxa"))

  # normalize by max IC, turn into distance matrix, cluster, and plot
  plot(hclust(as.dist(1-sm.ic/maxIC)))
}