Determines the frequencies for the given input list of terms, based on the selected corpus.
term_freqs(x, as = c("auto", "entity", "quality", "phenotype"), corpus = c("taxon_annotations", "taxa", "gene_annotations", "genes"), decodeIRI = TRUE, ...)
Argument | Description |
---|---|
x | a vector or list of one or more terms, either as IRIs or as term objects. |
as | the category or categories of the input terms (see |
corpus | the name of the corpus for which to determine frequencies. Supported values are "taxon_annotations", "taxa", "gene_annotations", and "genes". (At present, support for "gene_annotations" is pending support in the Phenoscape API.) The default is "taxon_annotations". |
decodeIRI | boolean. If TRUE (the default), attempt to decode post-composed entity IRIs, and under certain circumstances rewrite the count query according to the results. At present, this is used only for entity IRIs detected as "part_of some X" post-compositions, and only for the "taxon_annotations" corpus. In those cases, the count query will be rewritten to first query for X, then for X including parts, and the resulting count is the result of the latter minus that of the former. The decoding algorithm may be imprecise, so one may want to turn this off, the result of which will usually be a frequency of zero for those IRIs, due to limitations in the Phenoscape KB API. |
... | additional query parameters to be passed to the function querying for counts, see |
a vector of frequencies as floating point numbers (between zero and 1.0), of the same length (and ordering) as the input list of terms.
Depending on the corpus selected, the frequencies are queried directly
from pre-computed counts through the KB API, or are calculated based on
matching row counts obtained from query results. Currently, the Phenoscape KB
has precomputed counts for corpora "taxa" and "genes". Calculated counts for
the "taxon_annotations" corpus are most reliable for phenotype terms and their
subsumers. For entity terms, subsumers can include many generated
post-composed terms (such as "part_of some X", where X is, for example, an
anatomy term), and at least currently these aren't handled correctly by the
Phenoscape KB, resulting in counts of zero for such terms. For some of these
the implementation here will try to rewrite the query (see parameter decodeIRI), but this only works to a limited extent.
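The query rewrite described for decodeIRI comes down to subtracting one count from another. The following is a minimal sketch of that arithmetic only, not the package's implementation. The helper name part_of_count, the count values, and the corpus size are all hypothetical, and dividing by a corpus size to turn a count into a frequency is an assumption made for illustration.

```r
# Hypothetical helper: derive the count for "part_of some X" from two counts
# for X, as described for the decodeIRI parameter.
part_of_count <- function(count_x, count_x_incl_parts) {
  # The count including parts can never be smaller than the count for X alone.
  stopifnot(count_x_incl_parts >= count_x)
  count_x_incl_parts - count_x
}

# Made-up numbers, for illustration only.
count_x <- 120             # matches for X itself
count_x_incl_parts <- 450  # matches for X including its parts
corpus_size <- 10000       # assumed total size of the corpus

n_parts <- part_of_count(count_x, count_x_incl_parts)
freq <- n_parts / corpus_size
```

With these made-up inputs, the count attributed to the parts is the difference of the two query results, and the frequency is that difference relative to the assumed corpus size.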
Accurate term categories are vital for obtaining correct counts and thus
frequencies. Auto-determining term categories yields reasonably accurate
results, but with two caveats. First, it can be time-consuming. Second,
especially for entity terms and their subsumers, it is often not 100% accurate.
The function will try to correct for that by assuming that if not all terms
are determined to be of the same category, but one category holds for more
than 90% of the terms, then that category applies to all terms.
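The majority-based correction of auto-detected categories can be sketched as follows. This is an illustrative reading of the rule, not the package's code: the helper name harmonize_categories and its threshold argument are hypothetical, and the sketch assumes the rule means "if one category covers more than 90 percent of the terms, apply it to all of them".

```r
# Hypothetical helper, not part of the package: if a single category holds for
# more than `threshold` of the terms, assume it applies to every term.
harmonize_categories <- function(categories, threshold = 0.9) {
  # Share of terms assigned to each detected category.
  shares <- table(categories) / length(categories)
  top <- names(shares)[which.max(shares)]
  if (max(shares) > threshold) {
    rep(top, length(categories))
  } else {
    # No category dominates, so leave the detected categories untouched.
    categories
  }
}

# 19 of 20 terms detected as "entity" (95 percent), one as "quality".
detected <- c(rep("entity", 19), "quality")
harmonized <- harmonize_categories(detected)
```

When no category exceeds the threshold, the sketch returns the input unchanged, which corresponds to the case of a legitimately mixed list of terms.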
If the list of terms is legitimately of different categories, it is best to
determine (and possibly correct) categories beforehand, and then pass the
result via the as parameter. If all terms are of the same category and the
category is known beforehand, it saves time and prevents potential errors to
supply this category using the as parameter.
terms <- c("pectoral fin", "pelvic fin", "dorsal fin", "paired fin")
IRIs <- sapply(terms, pk_get_iri, as = "anatomy")
term_freqs(IRIs)
#> [1] 0.001791597 0.002191650 0.004333911 0.004073383

phens <- get_phenotypes(entity = "basihyal bone")
term_freqs(phens$id, as = "phenotype", corpus = "taxon_annotations")
#>  [1] 3.951144e-05 4.938930e-06 1.358206e-05 1.123607e-04 6.173663e-06
#>  [6] 2.592938e-05 6.297136e-05 2.210171e-04 2.222519e-04 1.358206e-05
#> [11] 1.358206e-04 4.938930e-06 6.494693e-04 8.890074e-05 2.617633e-04
#> [16] 5.185877e-05 1.852099e-05 1.185343e-04 2.222519e-05 1.024828e-04

term_freqs(phens$id, as = "phenotype", corpus = "taxa")
#>  [1] 0.002547771 0.000000000 0.002547771 0.002547771 0.001273885 0.001273885
#>  [7] 0.002547771 0.001273885 0.001273885 0.002547771 0.005095541 0.002547771
#> [13] 0.005095541 0.001273885 0.002547771 0.002547771 0.000000000 0.003821656
#> [19] 0.001273885 0.001273885