Subset functions let you conveniently split your data. subsetByNumber() does not subset by any specific criterion; it simply reduces the number of cells in the object by random selection across the specified grouping variables.
This can be useful if the number of cells is too high for certain machine learning algorithms such as clustering and correlation.
See details for more information.
subsetByNumber(
object,
new_name,
across = c("cell_line", "condition"),
n_by_group = NA,
n_total = NA,
weighted = FALSE,
phase = NULL,
verbose = NULL
)
Argument | Description |
---|---|
object | A valid cypro object. |
new_name | Character value. Denotes the name of the output object. If set to NULL the name of the input object is taken and suffixed with '_subset'. |
across | Character vector. The grouping variables across which to reduce the cell number. This ensures that the randomly selected cells are equally distributed across certain groups. Defaults to 'cell_line' and 'condition'. |
n_by_group | Numeric value or NA. If numeric, denotes the number of cells that are randomly selected from every group. |
n_total | Numeric value or NA. If numeric, denotes the final number of cells that the subsetted object is supposed to contain. The number of cells that are randomly selected per group is calculated accordingly. |
weighted | Logical value. If set to TRUE and the object is subsetted according to |
verbose | Logical. If set to TRUE informative messages regarding the computational progress will be printed. (Warning messages will always be printed.) |
A cypro object that contains the data for the subsetted cells.
Creating subsets of your data affects analysis results such as clustering and correlation, which is why these results are reset in the subsetted object and must be computed again. To prevent inadvertent overwriting, the default directory is reset as well. Make sure to set a new one via setDefaultDirectory().
The mechanism with which you create the subset is stored in the output object. Use printSubsetHistory() to reconstruct the way from the original object to the current one.
subsetByNumber() first unites all grouping variables across which the number of cells is supposed to be reduced into one single new variable. Cell IDs are then grouped by this variable via dplyr::group_by(). The number of cell IDs is then reduced via dplyr::slice_sample(). The exact number of remaining cells can be specified in two different ways, using either argument n_by_group or n_total:
If specified with n_by_group: The numeric value is passed to argument n of dplyr::slice_sample().
E.g. across = 'condition' and n_by_group = 1000: if the cypro object contains 6 different conditions, the returned object contains 6000 randomly selected cells, 1000 of each condition.
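The steps described above can be sketched on a toy data frame. This is a minimal illustration, not cypro's actual implementation: the data frame `cells`, its columns, and the helper column `group` are hypothetical, and tidyr::unite() is assumed here as one way to combine the grouping variables.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # make the random selection reproducible

# Hypothetical toy data: 2 cell lines x 3 conditions, 100 cells per combination
cells <- expand.grid(
  cell_line = c("A", "B"),
  condition = c("ctrl", "drug1", "drug2"),
  replicate = 1:100,
  stringsAsFactors = FALSE
) %>%
  mutate(cell_id = paste0("cell_", row_number()))

across <- c("cell_line", "condition")
n_by_group <- 20

subset_cells <- cells %>%
  # unite the grouping variables into one single new variable
  unite(col = "group", all_of(across), sep = "_", remove = FALSE) %>%
  group_by(group) %>%
  # randomly keep n_by_group cells in every group
  slice_sample(n = n_by_group) %>%
  ungroup()

# one row per group with the number of cells that were kept
count(subset_cells, group)
```

With 6 groups and n_by_group = 20 the toy subset holds 120 cells, mirroring the 6 x 1000 example above at a smaller scale.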
If specified with n_total: The numeric value passed to argument n of dplyr::slice_sample() is calculated like this:
n = n_total / number of groups
E.g. across = 'condition' and n_total = 10000: if the cypro object contains 4 different conditions, 2500 cells of each condition will be in the returned object.
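The n_total calculation can be sketched the same way. The toy data frame `cells` is hypothetical, and the example deliberately uses a total that divides evenly, since how a non-integer result would be rounded is not stated here.

```r
library(dplyr)

set.seed(1)  # make the random selection reproducible

# Hypothetical toy data: 4 conditions with 5000 cells each
cells <- data.frame(
  cell_id = paste0("cell_", 1:20000),
  condition = rep(c("a", "b", "c", "d"), each = 5000)
)

n_total <- 10000
n_groups <- n_distinct(cells$condition)

# n = n_total / number of groups
n <- n_total / n_groups

subset_cells <- cells %>%
  group_by(condition) %>%
  # every group contributes the same number of randomly selected cells
  slice_sample(n = n) %>%
  ungroup()

count(subset_cells, condition)
```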
If you want to keep the distribution across a grouping variable as is, set argument weighted to TRUE. In this case every group's proportion of cells is computed and the number of cells drawn from each group is adjusted accordingly.
E.g. across = 'condition' and n_total = 10000: if the cypro object contains 4 different conditions and condition a represents 40% of all cells while conditions b-d each represent 20%, the returned cypro object contains 4000 cells of condition a and 2000 cells each of conditions b-d.
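A rough sketch of the weighted case, under the assumption that each group's target size is its share of all cells multiplied by n_total and rounded. The toy data, the helper table `targets`, and the rounding step are hypothetical and not taken from cypro's source.

```r
library(dplyr)

set.seed(1)  # make the random selection reproducible

# Hypothetical toy data: condition a holds 40% of all cells, b-d hold 20% each
cells <- data.frame(
  cell_id = paste0("cell_", 1:50000),
  condition = rep(c("a", "b", "c", "d"), times = c(20000, 10000, 10000, 10000))
)

n_total <- 10000

# per-group target size: the group's proportion of cells times n_total
targets <- cells %>%
  count(condition, name = "n_cells") %>%
  mutate(n_keep = round(n_cells / sum(n_cells) * n_total))

subset_cells <- cells %>%
  left_join(targets, by = "condition") %>%
  group_by(condition) %>%
  # draw n_keep row positions per group without replacement
  slice(sample.int(n(), size = first(n_keep))) %>%
  ungroup() %>%
  select(cell_id, condition)

count(subset_cells, condition)
```

The group sizes in the result keep the 40/20/20/20 split of the input, which is the behavior the weighted option is described as providing.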
In case of experiment setups with multiple phases:
Because creating subsets of your data affects downstream analysis results, you have to manually specify the phase you are referring to.
The output object contains data for all phases but only for those cells that resulted from the random selection.