Create data subset by reducing the number of cells

Subset functions let you conveniently split your data. subsetByNumber() does not subset by a specific criterion but simply reduces the number of cells in the object by random selection across the specified grouping variables. This can be useful if the number of cells is too high for certain machine learning algorithms such as clustering and correlation. See details for more information.

subsetByNumber(
  object,
  new_name,
  across = c("cell_line", "condition"),
  n_by_group = NA,
  n_total = NA,
  weighted = FALSE,
  phase = NULL,
  verbose = NULL
)

Arguments

object	A valid cypro object.
new_name	Character value. Denotes the name of the output object. If set to NULL, the name of the input object is taken and suffixed with '_subset'.
across	Character vector. The grouping variables across which to reduce the cell number. This ensures that the randomly selected cells are equally distributed across certain groups. Defaults to 'cell_line' and 'condition'.
n_by_group	Numeric value or NA. If numeric, denotes the number of cells that are randomly selected from every group.
n_total	Numeric value or NA. If numeric, denotes the final number of cells that the subsetted object is supposed to contain. The number of cells that are randomly selected by group is calculated accordingly.
weighted	Logical value. If set to TRUE and the object is subsetted according to `n_total`, the proportion of cells that each group specified in argument `across` represents is preserved. See details for more.
verbose	Logical. If set to TRUE informative messages regarding the computational progress will be printed. (Warning messages will always be printed.)

Value

A cypro object that contains the data for the subsetted cells.

Details

Creating subsets of your data affects analysis results such as clustering and correlation, which is why these results are reset in the subsetted object and must be computed again. To prevent inadvertent overwriting, the default directory is reset as well. Make sure to set a new one via setDefaultDirectory().

The mechanism with which you create the subset is stored in the output object. Use printSubsetHistory() to reconstruct the way from the original object to the current one.

subsetByNumber() first unites all grouping variables across which the number of cells is supposed to be reduced into a single new variable. Cell IDs are then grouped by this variable via dplyr::group_by(). The number of cell IDs is then reduced via dplyr::slice_sample(). The exact number of remaining cells can be specified in two different ways, using either argument n_by_group or argument n_total:

If specified with n_by_group: The numeric value is given to argument n of dplyr::slice_sample(). E.g. with across = 'condition' and n_by_group = 1000, if the cypro object contains 6 different conditions, the returned object contains 6000 randomly selected cells, 1000 of each condition.
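The grouping and sampling steps described above can be sketched with plain dplyr. This is a minimal illustration, not the package's internal code: the data frame `cells`, its columns, and the helper variable `group` are hypothetical stand-ins for the cell-level data of a cypro object, and the uniting step is mimicked with paste(), which is an assumption.

```r
library(dplyr)

set.seed(1)  # make the random selection reproducible

# Hypothetical stand-in for the cell-level data of a cypro object:
# 2 cell lines x 3 conditions, 100 cells in each combination
cells <- tibble(
  cell_id   = paste0("cell_", 1:600),
  cell_line = rep(c("A", "B"), each = 300),
  condition = rep(c("ctrl", "drug1", "drug2"), times = 200)
)

n_by_group <- 50

subset_ids <- cells %>%
  # unite the grouping variables into one single new variable
  mutate(group = paste(cell_line, condition, sep = "_")) %>%
  group_by(group) %>%
  # n_by_group is handed to argument n of slice_sample()
  slice_sample(n = n_by_group) %>%
  ungroup()

# one row per united group, each with n_by_group cells
count(subset_ids, group)
```

With six united groups and n_by_group = 50, the sketch keeps 300 cell IDs in total.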

If specified with n_total: The numeric value given to argument n of dplyr::slice_sample() is calculated like this:

n = n_total / number of groups

E.g. with across = 'condition' and n_total = 10000, if the cypro object contains 4 different conditions, 2500 cells of each condition will be in the returned object.
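The n_total calculation can be reproduced in a few lines of dplyr. The sketch below uses a hypothetical data frame `cells` in place of a cypro object and picks an n_total that divides evenly by the number of groups; how subsetByNumber() handles a non-integer quotient is not stated here, so that case is left out.

```r
library(dplyr)

set.seed(1)  # make the random selection reproducible

# Hypothetical cell-level data: 4 conditions of unequal size
cells <- tibble(
  cell_id   = paste0("cell_", 1:20000),
  condition = rep(c("a", "b", "c", "d"), times = c(8000, 4000, 4000, 4000))
)

n_total  <- 10000
n_groups <- n_distinct(cells$condition)

# n = n_total / number of groups
n_per_group <- n_total / n_groups

subset_ids <- cells %>%
  group_by(condition) %>%
  slice_sample(n = n_per_group) %>%
  ungroup()

# every condition contributes the same number of cells
count(subset_ids, condition)
```

Note that the unweighted variant equalizes group sizes: condition a made up 40% of the input but only 25% of the subset.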

If you want to keep the distribution across a grouping variable as it is, set argument weighted to TRUE. In this case, every group's proportion of cells is computed and the number of cells selected from each group is adjusted accordingly.

E.g. with across = 'condition' and n_total = 10000, if the cypro object contains 4 different conditions and condition a represents 40% of all cells while conditions b-d each represent 20%, the returned cypro object contains 4000 cells of condition a and 2000 cells each of conditions b-d.
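The weighted example can be checked with a short dplyr sketch. It is an illustration under assumptions, not the package's implementation: `cells` and the table `targets` are hypothetical, and sampling the same fraction from every group via argument prop of dplyr::slice_sample() is just one way to preserve the group proportions.

```r
library(dplyr)

set.seed(1)  # make the random selection reproducible

# Hypothetical cell-level data: condition a holds 40% of all cells,
# conditions b-d hold 20% each
cells <- tibble(
  cell_id   = paste0("cell_", 1:20000),
  condition = rep(c("a", "b", "c", "d"), times = c(8000, 4000, 4000, 4000))
)

n_total <- 10000

# proportion of each group and the resulting target number of cells
targets <- cells %>%
  count(condition, name = "n_cells") %>%
  mutate(
    proportion = n_cells / sum(n_cells),
    n_target   = round(proportion * n_total)
  )

# sampling the same fraction from every group keeps the proportions
subset_ids <- cells %>%
  group_by(condition) %>%
  slice_sample(prop = n_total / nrow(cells)) %>%
  ungroup()

targets
count(subset_ids, condition)
```

Both tables agree on 4000 cells for condition a and 2000 cells for each of conditions b-d.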

Note

In case of experimental setups with multiple phases:

As creating subsets of your data affects downstream analysis results, you have to manually specify the phase you are referring to.

The output object contains data for all phases but only for those cells that resulted from the random selection.