Obtains term frequencies for the Phenoscape KB (term_freqs)

Determines the frequencies for the given input list of terms, based on the selected corpus and the type (category) of the terms.

term_freqs(
  x,
  as = c("phenotype", "entity", "anatomical_entity", "quality"),
  corpus = c("taxon-variation", "annotated-taxa", "taxon-annotations", "states",
    "gene-annotations", "genes"),
  decodeIRI = FALSE,
  ...
)

Arguments

x

a vector or list of one or more terms, either as IRIs or as term objects.

as

the category or categories (a.k.a. type) of the input terms (see term_category()). Possible values are "anatomical_entity" (synonymous with "entity"), "quality", and "phenotype". Unambiguous abbreviations are acceptable. The value must be either a single category (applied to all terms) or a vector of categories of the same length as x. The default is "phenotype".

Note that the KB API does not yet support "quality", so this category has been disabled as of v0.3.0. Mixing different categories of terms is also not yet supported and will raise an error.

corpus

the name of the corpus for determining how to count, currently one of the following:

"states" (counts character states),
"taxon-variation" (counts taxa with variation profiles, and thus does not include terminal and other taxa that do not have child taxa with phenotype annotations),
"annotated-taxa" (counts taxa with phenotype annotations, and thus primarily those terminal taxa that have annotations),
"taxon-annotations" (counts phenotype annotations to character states and thus taxa),
"gene-annotations" (counts phenotype annotations to genes or alleles), and
"genes" (counts genes)

Unambiguous abbreviations of corpus names are acceptable. The default is "taxon-variation". Note that "taxon-annotations" and "gene-annotations" are not yet supported by the KB API, so selecting either will result in an error.

Note that previously "taxa" was allowed as a corpus, but is no longer supported. The "taxon-variation" corpus is the equivalent of the deprecated "taxa" corpus.
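
As an illustration of what "unambiguous abbreviation" means for corpus names, the following sketch resolves abbreviations with base R's match.arg(). Whether term_freqs() resolves them in exactly this way is an assumption; resolve_corpus() is a hypothetical helper and not part of the package.

```r
# The corpus names listed above, in the order of the function signature.
corpora <- c("taxon-variation", "annotated-taxa", "taxon-annotations",
             "states", "gene-annotations", "genes")

# Hypothetical helper: partial matching as done by base R's match.arg().
resolve_corpus <- function(corpus) match.arg(corpus, corpora)

# "st" is a prefix of exactly one corpus name, so it resolves.
resolved_states <- resolve_corpus("st")
# "taxon-v" likewise identifies a single corpus.
resolved_variation <- resolve_corpus("taxon-v")
# "taxon" is a prefix of two corpus names, so match.arg() raises an error;
# here the error is caught and turned into NA for demonstration.
resolved_ambiguous <- tryCatch(resolve_corpus("taxon"),
                               error = function(e) NA_character_)
```

In this sketch "taxon" is rejected because it is a prefix of both "taxon-variation" and "taxon-annotations".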

decodeIRI

boolean. This parameter is deprecated (as of v0.3.x) and must be set to FALSE (the default); passing TRUE raises an error. In v0.2.x, a value of TRUE made the function attempt to decode post-composed entity IRIs. Because of changes to the IRIs returned by the Phenoscape KB v2.x API, decoding post-composed entity IRIs is no longer possible. Prior to v0.3.x, the default value for this parameter was TRUE.

...

additional query parameters to be passed to the function querying for counts; see pkb_args_to_query(). Currently (as of v0.3.0) these are not used.

Value

a vector of frequencies as floating point numbers (between zero and 1.0), of the same length (and ordering) as the input list of terms.

Details

Depending on the corpus selected, the frequencies are either queried directly from pre-computed counts through the KB API, or calculated from matching row counts obtained from query results. Currently, the Phenoscape KB has pre-computed counts for the corpora "annotated-taxa", "taxon-variation", "states", and "genes".
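
The relationship between counts and frequencies can be sketched as follows, assuming that a frequency is simply the matching count divided by the corpus size (which is what multiplying by corpus_size() in the Examples undoes). The counts and the corpus size below are made-up values for illustration only, not figures from the KB.

```r
# Hypothetical counts for four terms, and a hypothetical corpus size.
counts <- c(4, 0, 2, 1)
size <- 800

# Frequency = count / corpus size; a term that never occurs has frequency 0.
freqs <- counts / size
freqs

# Multiplying by the corpus size recovers the absolute counts.
recovered <- freqs * size
```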

Note

Accurate term categories are vital for obtaining correct counts and thus frequencies. In earlier (<=0.2.x) releases, auto-determining the term category was an option, but this is no longer supported, in part because it was potentially time-consuming and often inaccurate, in particular for the many post-composed subsumer terms returned by subsumer_matrix(). In the KB v2.0 API, auto-determining the category of a post-composed term is no longer supported. If the list of terms is legitimately of different categories, determine (and possibly correct) categories beforehand using term_category().
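
Because mixing categories in one call raises an error, one possible workaround is to query each category separately and reassemble the results in input order. The sketch below is an assumption about how this could be done, not part of the package: freqs_by_category() is a hypothetical helper, and stub_freqs() stands in for term_freqs() so that the example runs without access to the KB.

```r
# Hypothetical helper: call freq_fun once per category, keeping input order.
freqs_by_category <- function(terms, categories, freq_fun) {
  stopifnot(length(terms) == length(categories))
  out <- numeric(length(terms))
  for (category in unique(categories)) {
    idx <- which(categories == category)
    out[idx] <- freq_fun(terms[idx], as = category)
  }
  out
}

# Stand-in for term_freqs(); returns made-up frequencies per category.
stub_freqs <- function(x, as) {
  if (as == "phenotype") rep(0.1, length(x)) else rep(0.2, length(x))
}

# Placeholder identifiers, not real IRIs.
terms <- c("iri:a", "iri:b", "iri:c")
categories <- c("phenotype", "entity", "phenotype")
mixed <- freqs_by_category(terms, categories, stub_freqs)
```

In real use the categories would presumably come from term_category() after manual checking, and freq_fun would be term_freqs with a fixed corpus.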

In earlier (<=0.2.x) releases, one supported corpus was "taxon_annotations". Its implementation was very slow and potentially inaccurate because it relied on potentially multiple individual KB API queries for each term. This in turn relied on the ability to break down post-composed expressions into their component terms and expressions, which is (at least currently) no longer possible.

Examples

phens <- get_phenotypes(entity = "basihyal bone")
# see which phenotypes we have:
phens$label
#>  [1] "anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) absent" 
#>  [2] "anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) present"
#>  [3] "anterior margin and (part_of some basihyal bone) straight"                                           
#>  [4] "anterior region and (part_of some basihyal bone) increased size"                                     
#>  [5] "anterior region and (part_of some basihyal bone) increased width"                                    
#>  [6] "basibranchial 1 bone position basihyal bone"                                                         
#>  [7] "basibranchial 2 bone interlocked with basihyal bone"                                                 
#>  [8] "basihyal bone absent"                                                                                
#>  [9] "basihyal bone bifurcated"                                                                            
#> [10] "basihyal bone cylindrical"                                                                           
#> [11] "basihyal bone decreased length"                                                                      
#> [12] "basihyal bone elongated"                                                                             
#> [13] "basihyal bone horizontal"                                                                            
#> [14] "basihyal bone increased length"                                                                      
#> [15] "basihyal bone increased size"                                                                        
#> [16] "basihyal bone oblique orientation"                                                                   
#> [17] "basihyal bone present"                                                                               
#> [18] "basihyal bone right angle to basibranchial 1 element"                                                
#> [19] "basihyal bone shape"                                                                                 
#> [20] "basihyal bone shape and (not (Y-shaped))"                                                            
#> [21] "basihyal bone size"                                                                                  
#> [22] "basihyal bone T-shaped"                                                                              
#> [23] "basihyal bone tightly articulated with basibranchial 1 bone"                                         
#> [24] "basihyal bone triangular"                                                                            
#> [25] "basihyal bone Y-shaped"                                                                              
#> [26] "bone fossa and (part_of some basihyal bone) absent"                                                  
#> [27] "bone fossa and (part_of some basihyal bone) size"                                                    
#> [28] "bone fossa and (part_of some basihyal bone) size"                                                    
#> [29] "dorsal surface and (part_of some basihyal bone) circular"                                            
#> [30] "dorsal surface and (part_of some basihyal bone) concave"                                             
#> [31] "dorsal surface and (part_of some basihyal bone) convex"                                              
#> [32] "dorsal surface and (part_of some basihyal bone) convex"                                              
#> [33] "dorsal surface and (part_of some basihyal bone) flat"                                                
#> [34] "dorsal surface and (part_of some basihyal bone) flat"                                                
#> [35] "parasphenoid structure basihyal bone"                                                                
# frequencies by counting taxa:
freqs.t <- term_freqs(phens$id, as = "phenotype", corpus = "taxon-variation")
freqs.t
#>  [1] 0.005012531 0.005012531 0.002506266 0.000000000 0.000000000 0.002506266
#>  [7] 0.000000000 0.002506266 0.002506266 0.002506266 0.001253133 0.001253133
#> [13] 0.002506266 0.001253133 0.001253133 0.002506266 0.005012531 0.002506266
#> [19] 0.006265664 0.001253133 0.002506266 0.002506266 0.000000000 0.003759398
#> [25] 0.001253133 0.001253133 0.001253133 0.001253133 0.001253133 0.001253133
#> [31] 0.001253133 0.001253133 0.001253133 0.001253133 0.001253133
# we can convert this to absolute counts:
freqs.t * corpus_size("taxon-variation")
#>  [1] 4 4 2 0 0 2 0 2 2 2 1 1 2 1 1 2 4 2 5 1 2 2 0 3 1 1 1 1 1 1 1 1 1 1 1
# frequencies by counting character states:
freqs.s <- term_freqs(phens$id, as = "phenotype", corpus = "states")
freqs.s
#>  [1] 0.0000351358 0.0000351358 0.0000351358 0.0000351358 0.0000351358
#>  [6] 0.0000702716 0.0000351358 0.0001405432 0.0000351358 0.0000351358
#> [11] 0.0000351358 0.0000351358 0.0000351358 0.0000351358 0.0000702716
#> [16] 0.0000351358 0.0004567654 0.0000351358 0.0003162222 0.0000351358
#> [21] 0.0001756790 0.0000351358 0.0000351358 0.0001054074 0.0000702716
#> [26] 0.0000351358 0.0000351358 0.0000351358 0.0000351358 0.0000351358
#> [31] 0.0000351358 0.0000351358 0.0000351358 0.0000351358 0.0000702716
# and as absolute counts:
freqs.s * corpus_size("states")
#>  [1]  1  1  1  1  1  2  1  4  1  1  1  1  1  1  2  1 13  1  9  1  5  1  1  3  2
#> [26]  1  1  1  1  1  1  1  1  1  2
# we can compare the absolute counts by computing a ratio
# (Inf marks terms whose taxon count is zero)
freqs.s * corpus_size("states") / (freqs.t * corpus_size("taxon-variation"))
#>  [1] 0.25 0.25 0.50  Inf  Inf 1.00  Inf 2.00 0.50 0.50 1.00 1.00 0.50 1.00 2.00
#> [16] 0.50 3.25 0.50 1.80 1.00 2.50 0.50  Inf 1.00 2.00 1.00 1.00 1.00 1.00 1.00
#> [31] 1.00 1.00 1.00 1.00 2.00
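
The Inf values in the ratio above come from dividing by a taxon count of zero. If missing values are preferred over Inf, one option is to guard the division, as in this sketch. It uses the first six absolute counts from the output above; the choice of NA as the replacement is an assumption about what is convenient downstream.

```r
# First six absolute counts from the example output above.
taxon_counts <- c(4, 4, 2, 0, 0, 2)
state_counts <- c(1, 1, 1, 1, 1, 2)

# Divide only where the denominator is positive; otherwise return NA.
ratio <- ifelse(taxon_counts > 0, state_counts / taxon_counts, NA_real_)
ratio
#> [1] 0.25 0.25 0.50   NA   NA 1.00
```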