Background. Pay-for-performance (P4P) incentive schemes are increasingly used worldwide to improve health system performance, but the results of evaluations vary considerably. A systematic analysis of this variation in the effects of P4P schemes is needed. Methods. Evaluations of P4P schemes from any country were identified by searching four bibliographic databases for systematic reviews of P4P schemes in health care and updating them. Outcomes reported using different measures of effect were converted into standardised effect sizes, and each study was categorised according to whether or not it found a positive effect. Subgroup analysis, meta-regression and multilevel logistic regression were used to investigate factors explaining heterogeneity. Random-effects models were used because they take into account heterogeneity that is likely to be due to differences between studies rather than chance alone. Sensitivity analysis was used to test the effect of different assumptions. Findings. 96 primary studies were identified; 37 were included in the meta-analysis and meta-regression and all 96 in the logistic regression. The proportion of observed variation in study results attributable to true heterogeneity (I²) was 99∙9%. Estimates of the effect of P4P schemes were lower in evaluations using randomised controlled trials (SMD = 0∙08; 95% CI: 0∙01 to 0∙15) than in those with no controls (0∙15; 95% CI: 0∙09 to 0∙21), and lower for those measuring outcomes (e.g. smoking cessation) (SMD = 0∙0; 95% CI: -0∙01 to 0∙01) than for those measuring processes (e.g. giving cessation advice) (0∙18; 95% CI: 0∙06 to 0∙31). Adjusting for other design features and the evaluation method, the odds of showing a positive effect were more than three times as high for schemes with larger incentives (>5% of salary/usual budget) (OR = 3∙38; 95% CI: 1∙07 to 10∙64).
There were non-statistically significant increases in the odds of success if the incentive was paid to individuals (as opposed to groups) (OR = 2∙0; 95% CI: 0∙62 to 6∙56) and if there was a lower perceived risk of not earning the incentive (OR = 2∙9; 95% CI: 0∙78 to 10∙83). Schemes evaluated using less rigorous designs had 24 times the odds of showing a positive estimate of effect compared with those evaluated using randomised controlled trials (OR = 24; 95% CI: 6∙3 to 92∙8). Interpretation. Estimates of the effectiveness of incentive schemes on health outcomes are probably inflated because of poorly designed evaluations and a focus on process measures rather than health outcomes. Larger incentives and a lower perceived risk of non-payment may increase the effect of these schemes on provider behaviour.
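Illustration. The random-effects pooling and the I² statistic referred to in the Methods and Findings can be sketched as follows. This is a minimal sketch that assumes the DerSimonian-Laird estimator of between-study variance; the abstract does not state which estimator was used. The function name `random_effects_pool` and the input values are hypothetical and are not data from the included studies.

```python
import math


def random_effects_pool(effects, variances):
    """Pool standardised effect sizes with a random-effects model.

    Assumption: DerSimonian-Laird estimator of the between-study
    variance (tau^2). `effects` are per-study SMDs and `variances`
    their within-study sampling variances.
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: share of observed variation attributed to heterogeneity (%).
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each study's variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical example: two equally precise studies that disagree.
result = random_effects_pool([0.0, 0.5], [0.01, 0.01])
print(result)
```

With widely disagreeing studies, most of the observed variation is attributed to heterogeneity rather than chance, which is the situation an I² close to 100% describes.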