Big data also need big concepts

Big data

In biology, data on species abundance, diversity and traits are collected within large, international collaborative projects, in citizen science projects, and at permanent monitoring stations. These data are made openly available in big biodiversity databases: big data. Below I highlight some problems that big data approaches can have, which are particularly worrying if analysis outcomes are used to inform (inter)national policies on conservation strategies.

Problem 1: no mechanistic underpinning

An example: together with Matty Berg (Vrije Universiteit Amsterdam), I recently used a mechanistic approach to examine the functional pathways by which a typical fast life history species with high fecundity but a short life span (the beach hopper Orchestia gammarellus) and a typical slow life history species with low fecundity but a long life span (the reef manta ray Manta alfredi) differ in their sensitivity to environmental change [1]. Contrary to big data results, we found that the fast life history species was sensitive to the frequency of good-food conditions, whereas the slow life history species was sensitive to temporal autocorrelation in environmental conditions. Importantly, we could also show that differences in physiology explain why these two species respond differently.

Problem 2: extrapolation

Big data analysis without a mechanistic representation of the biological processes that cause observed variation complicates extrapolation beyond the existing data range [2, 3]. Without a theoretical underpinning of data patterns, there is little basis for trusting predictions outside that range. Yet extrapolation is exactly what is needed when asking how species respond to novel conditions, like those imposed by climate change [4, 5].
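To make the extrapolation problem concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration and not taken from any of the cited studies: I assume a saturating response to some environmental driver as the "true" mechanism, sample it only over a narrow range, and describe those samples with a purely correlative straight line. The function names and parameter values (`saturating_response`, `r_max`, `half_sat`) are made up for this example.

```python
import numpy as np

def saturating_response(x, r_max=1.0, half_sat=2.0):
    """Hypothetical 'true' mechanism: the response levels off at r_max."""
    return r_max * x / (half_sat + x)

# Observations cover only a narrow range of the driver (0.1 to 2.0).
x_obs = np.linspace(0.1, 2.0, 20)
y_obs = saturating_response(x_obs)

# Purely correlative description: a straight line fitted to the observations.
slope, intercept = np.polyfit(x_obs, y_obs, deg=1)

def linear_prediction(x):
    """Prediction of the fitted line, which knows nothing about saturation."""
    return slope * x + intercept

# Compare the fit inside the data range with a prediction far outside it.
x_new = 10.0
inside_error = np.max(np.abs(linear_prediction(x_obs) - y_obs))
outside_error = abs(linear_prediction(x_new) - saturating_response(x_new))

print(f"largest error inside the data range: {inside_error:.3f}")
print(f"error at x = {x_new}: {outside_error:.3f}")
print(f"line predicts {linear_prediction(x_new):.2f}, "
      f"mechanism gives {saturating_response(x_new):.2f}")
```

Within the sampled range the line looks like a reasonable description, but outside it the prediction exceeds the maximum response that the assumed mechanism allows. A model that encodes the saturation would not make that mistake, which is the point of asking for a mechanistic underpinning.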

Problem 3: losing sight of the empirical cycle

Big data attract a lot of interest, almost giving the impression that big data approaches can solve (big) problems in ecology without the need for conventional scientific methods of inquiry [6]. Yet we should not disregard conventional scientific methods, and we should keep our focus on the mechanistic underpinnings of biological variation. The empirical cycle starts with collecting data, but the purpose of collecting data is to inform theories; it is not a method in itself.

Problem 4: collecting the right data

What are the data collected for? In other words, what is the research question? With big data, the research question is often formulated independently of the motivation to collect the data. The challenge then lies in pre-processing: extracting and transforming the data. Experimental data, on the other hand, are typically scrutinised extensively, often manually, which is often not done with big data.

Problem 5: reliability of the data

How reliable are the data collected in big databases? Kendall et al. [7] recently tackled this question with regard to the parameterisation of matrix population models [8]. Published matrix population models are collected in the COMADRE animal and COMPADRE plant matrix databases [9, 10]. Kendall et al. [7], however, identified three significant errors commonly encountered in published matrix population models. They conclude that many studies based on such models may need to be re-examined [7].
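For readers unfamiliar with matrix population models, the sketch below shows what such a model looks like and why its parameterisation matters. It is a minimal, hypothetical two-stage example (juveniles and adults) with made-up vital rates; it is not taken from COMADRE, COMPADRE or Kendall et al. [7]. The plausibility check at the end is a generic sanity check of my own choosing, not one of the three errors that Kendall et al. identified.

```python
import numpy as np

# Hypothetical vital rates for a two-stage life cycle (juvenile, adult).
juvenile_survival = 0.3   # probability a juvenile survives and matures
adult_survival = 0.5      # probability an adult survives to the next year
fecundity = 2.0           # offspring per adult per year

# Columns are the stage this year, rows the stage next year.
projection_matrix = np.array([
    [0.0, fecundity],
    [juvenile_survival, adult_survival],
])

def growth_rate(matrix):
    """Asymptotic population growth rate: the dominant eigenvalue."""
    return float(np.max(np.abs(np.linalg.eigvals(matrix))))

def survival_is_plausible(survival_matrix):
    """Generic check: survival out of each stage cannot exceed 1."""
    column_sums = survival_matrix.sum(axis=0)
    return bool(np.all(survival_matrix >= 0) and np.all(column_sums <= 1.0))

# Survival and transition part of the model, without reproduction.
survival_part = np.array([
    [0.0, 0.0],
    [juvenile_survival, adult_survival],
])

lam = growth_rate(projection_matrix)
print(f"population growth rate: {lam:.3f}")
print(f"survival entries plausible: {survival_is_plausible(survival_part)}")
```

Because the growth rate is computed directly from the matrix entries, any error made when translating field data into those entries carries straight through to the conclusions, and is invisible to someone who only downloads the matrix from a database.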

Now what?

The empirical cycle starts with collecting data to inform theories and conceptual models, which in turn help guide the collection and interpretation of data, a process easily sidestepped in big data approaches [6]. And I haven’t even touched upon the issue that correlation patterns found in big data analyses do not say anything about causation. Most importantly, however, I hope that (inter)national policies on conservation strategies will not be misinformed due to any of the problems highlighted above.


  1. Smallegange IM, Berg M. 2019. A functional trait approach to identifying life history patterns in stochastic environments. Ecology and Evolution 9: 9350-9361.
  2. Smallegange IM, Deere JA, Coulson T. 2014. Correlative changes in life-history variables in response to environmental change in a model organism. American Naturalist 186: 784-797.
  3. Smallegange IM, Caswell H, Toorians MEM, de Roos AM. 2017. Mechanistic description of population dynamics using dynamic energy budget theory incorporated into integral projection models. Methods in Ecology and Evolution 8: 146-154.
  4. Hampton SE, Strasser CA, Tewksbury JJ, Gram WK, Budden AE, Batcheller AL, Duke CS, Porter JH. 2013. Big data and the future of ecology. Frontiers in Ecology and the Environment 11: 156-162.
  5. Kissling WD, et al. 2018. Towards global data products of Essential Biodiversity Variables on species traits. Nature Ecology & Evolution 2: 1531-1540.
  6. Coveney PV, Dougherty ER, Highfield RR. 2016. Big data need big theory too. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374.
  7. Kendall BE, Fujiwara M, Diaz-Lopez J, Schneider S, Voigt J, Wiesner S. 2019. Persistent problems in the construction of matrix population models. Ecological Modelling 406: 33-43.
  8. Caswell H. 2001. Matrix population models: Construction, analysis, and interpretation, 2nd edition. Sinauer, Sunderland, MA.
  9. Salguero-Gómez R, et al. 2015. The COMPADRE Plant Matrix Database: an open online repository for plant demography. Journal of Ecology 103: 202-218.
  10. Salguero-Gómez R, et al. 2016. COMADRE: a global data base of animal demography. Journal of Animal Ecology 85: 371-384.
