Determines the frequencies for the given input list of terms, based on the selected corpus.

term_freqs(x, as = c("auto", "entity", "quality", "phenotype"),
  corpus = c("taxon_annotations", "taxa", "gene_annotations", "genes"),
  decodeIRI = TRUE, ...)

Arguments

x

a vector or list of one or more terms, either as IRIs or as term objects.

as

the category or categories of the input terms (see term_category()). Supported categories are "entity", "quality", and "phenotype". The value must be either a single category (applying to all terms) or a vector of categories of the same length as x. If provided as "auto" (or NULL), the category of each term is determined automatically. The default is "auto".
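The two accepted shapes of this argument (one category for all terms, or one category per term) can be sketched as follows. The helper expand_categories() is hypothetical and only illustrates the rule stated above; it is not part of the package and does not cover the "auto" case.

```r
# Hypothetical helper, not part of the package: expands a category
# specification so that there is exactly one category per term.
expand_categories <- function(x, categories) {
  if (length(categories) == 1) {
    # a single category applies to all terms
    rep(categories, length(x))
  } else {
    # otherwise there must be one category per term
    stopifnot(length(categories) == length(x))
    categories
  }
}

terms <- c("pectoral fin", "pelvic fin", "dorsal fin")
expand_categories(terms, "entity")
expand_categories(terms, c("entity", "entity", "quality"))
```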

corpus

the name of the corpus for which to determine frequencies. Supported values are "taxon_annotations", "taxa", "gene_annotations", and "genes". (At present, support for "gene_annotations" is pending support in the Phenoscape API.) The default is "taxon_annotations".

decodeIRI

boolean. If TRUE (the default), attempt to decode post-composed entity IRIs, and under certain circumstances rewrite the count query according to the results. At present, this is used only for entity IRIs detected as "part_of some X" post-compositions, and only for the "taxon_annotations" corpus. In those cases, the count query will be rewritten to first query for X, then for X including parts, and the resulting count is the result of the latter minus that of the former.

The decoding algorithm may be imprecise, so one may want to turn this off. Doing so will usually result in a frequency of zero for those IRIs, due to limitations in the Phenoscape KB API.
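The subtraction described for "part_of some X" post-compositions amounts to simple arithmetic. The following is a minimal sketch: the helper count_part_of() is hypothetical, the counts are made-up numbers, and no query against the Phenoscape KB is performed.

```r
# Hypothetical helper, not part of the package: derives the count for a
# "part_of some X" post-composition from two counts obtained for X.
count_part_of <- function(count_x, count_x_with_parts) {
  # the count including parts can never be smaller than the count for X alone
  stopifnot(count_x_with_parts >= count_x)
  count_x_with_parts - count_x
}

# Made-up counts, for illustration only
count_part_of(count_x = 120, count_x_with_parts = 450)
```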

...

additional query parameters to be passed to the function querying for counts; see pkb_args_to_query(). Currently these are used only for the corpus "taxon_annotations", and the only useful parameter is includeRels, which can be used to include historical and serial homologues in the counts. It can also be used to always include parts for entity terms.

Value

a vector of frequencies as floating point numbers (between zero and 1.0), of the same length (and ordering) as the input list of terms.

Details

Depending on the corpus selected, the frequencies are either queried directly from pre-computed counts through the KB API, or calculated from matching row counts obtained from query results. Currently, the Phenoscape KB has pre-computed counts for the corpora "taxa" and "genes".

Calculated counts for the "taxon_annotations" corpus are most reliable for phenotype terms and their subsumers. For entity terms, subsumers can include many generated post-composed terms (such as "part_of some X", where X is, for example, an anatomy term). At least currently, the Phenoscape KB does not handle these correctly, which results in counts of zero for such terms. For some of these, the implementation here will try to rewrite the query (see parameter decodeIRI), but this works only to a limited extent.
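As a rough illustration of how a count relates to a frequency, the sketch below divides counts by a corpus size. Both the helper counts_to_freqs() and the division itself are assumptions about the general approach, not the package's code. The corpus size of 785 is likewise an assumption, inferred from the "taxa" example output on this page, whose nonzero values are all multiples of 1/785.

```r
# Hypothetical helper, not part of the package: converts raw counts into
# frequencies between 0 and 1 by dividing by the size of the corpus.
counts_to_freqs <- function(counts, corpus_size) {
  stopifnot(corpus_size > 0, all(counts >= 0), all(counts <= corpus_size))
  counts / corpus_size
}

# Assumed corpus size (inferred from the example output, not documented)
counts_to_freqs(c(2, 0, 4), corpus_size = 785)
```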

Note

Accurate term categories are vital for obtaining correct counts and thus frequencies. Auto-determining term categories yields reasonably accurate results, but with caveats. One, it can be time-consuming, and two, especially for entity terms and their subsumers, it is often not 100% accurate. This function will try to correct for that by assuming that if not all terms are determined to be of the same category, but one category holds for more than 90% of the terms, then that category applies to all terms. If the list of terms is legitimately of different categories, it is best to determine (and possibly correct) categories beforehand, and then pass the result in the as parameter. If all terms are of the same category and the category is known beforehand, supplying this category using as saves time and prevents potential errors.
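The majority rule mentioned in this note can be sketched as follows. This is an illustration only: harmonize_categories() is a hypothetical helper, and the threshold of 0.9 reflects the reading that a category must hold for more than 90 percent of the terms, which is an assumption about the exact rule.

```r
# Hypothetical helper, not part of the package: if one category holds for
# more than `threshold` of the terms, assume it holds for all of them.
harmonize_categories <- function(categories, threshold = 0.9) {
  counts <- table(categories)
  shares <- as.vector(counts) / length(categories)
  top <- names(counts)[which.max(shares)]
  if (max(shares) > threshold) {
    rep(top, length(categories))
  } else {
    # no sufficiently dominant category: leave the input unchanged
    categories
  }
}

# 19 of 20 terms auto-detected as "entity", one as "quality"
cats <- c(rep("entity", 19), "quality")
harmonize_categories(cats)
```

When the terms are legitimately of mixed categories, no single category exceeds the threshold and the input is returned unchanged, which mirrors the advice to determine categories beforehand in that case.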

Examples

terms <- c("pectoral fin", "pelvic fin", "dorsal fin", "paired fin")
IRIs <- sapply(terms, pk_get_iri, as = "anatomy")
term_freqs(IRIs)
#> [1] 0.001791597 0.002191650 0.004333911 0.004073383
phens <- get_phenotypes(entity = "basihyal bone")
term_freqs(phens$id, as = "phenotype", corpus = "taxon_annotations")
#>  [1] 3.951144e-05 4.938930e-06 1.358206e-05 1.123607e-04 6.173663e-06
#>  [6] 2.592938e-05 6.297136e-05 2.210171e-04 2.222519e-04 1.358206e-05
#> [11] 1.358206e-04 4.938930e-06 6.494693e-04 8.890074e-05 2.617633e-04
#> [16] 5.185877e-05 1.852099e-05 1.185343e-04 2.222519e-05 1.024828e-04
term_freqs(phens$id, as = "phenotype", corpus = "taxa")
#>  [1] 0.002547771 0.000000000 0.002547771 0.002547771 0.001273885 0.001273885
#>  [7] 0.002547771 0.001273885 0.001273885 0.002547771 0.005095541 0.002547771
#> [13] 0.005095541 0.001273885 0.002547771 0.002547771 0.000000000 0.003821656
#> [19] 0.001273885 0.001273885