%X It is common to measure large numbers of features in order to identify those that differ between experimental conditions, for example using RNA-Seq to search for differentially expressed genes. Ranking by p-value allows for statistical control but has well-known problems: p-values are unreliable without many replicates, and biologically irrelevant effect sizes can reach significance. As a result, prioritization is typically performed in conjunction with an effect size, the canonical one being “fold-change”. However, fold-change has several problems of its own: division by zero, sensitivity to small values in the denominator, and insensitivity to magnitude (1 over 2 gives the same fold-change as 100 over 200). A widely used heuristic for mitigating these problems is to add 1 to all values. Using real and simulated data, we show that this choice is typically highly sub-optimal, while the value 20 is nearly optimal in all cases. From another point of view, adding a fixed “pseudocount” to all values essentially redefines effect size from fold-change to something else. To explore this further, we axiomatize the concept of effect size and use the resulting mathematical framework to study the problem in general. We also present the finding that pseudocounts strike a balance between sorting by fold-change and sorting by difference, so optimizing the pseudocount is equivalent to finding the best balance between these two extremes. Lastly, the framework is illustrated on a fundamentally different type of problem, that of ranking di-codons by their differential abundance in the ORFeomes of different species, where p-values are unavailable and the problem must be solved directly with effect sizes.
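
The balance between fold-change and difference can be illustrated with a minimal sketch. It assumes the pseudocount-adjusted effect size is the ratio (a + c) / (b + c) for a pseudocount c > 0; the function names and the three feature values are hypothetical and chosen only for illustration, not taken from the paper's data. Since (a + c) / (b + c) - 1 = (a - b) / (b + c), a very small c ranks features by fold-change, while a very large c ranks them approximately by the difference a - b.

```python
def pseudo_fold_change(a, b, c):
    """Ratio (a + c) / (b + c) for non-negative a, b and pseudocount c > 0."""
    return (a + c) / (b + c)


def rank(pairs, c):
    """Return feature names sorted by decreasing pseudocount-adjusted ratio."""
    return sorted(
        pairs,
        key=lambda name: pseudo_fold_change(pairs[name][0], pairs[name][1], c),
        reverse=True,
    )


# Hypothetical (condition A, condition B) values for three features.
pairs = {
    "low": (4.0, 1.0),         # fold-change 4.0, difference 3
    "mid": (60.0, 20.0),       # fold-change 3.0, difference 40
    "high": (1500.0, 1000.0),  # fold-change 1.5, difference 500
}

for c in (0.001, 20.0, 1e6):
    print(c, rank(pairs, c))
```

With these made-up values, the tiny pseudocount reproduces the fold-change ordering, the huge pseudocount reproduces the ordering by difference, and an intermediate value such as 20 gives an ordering that agrees with neither extreme.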
