Thesis Epilogue

A few reflections from someone who just finished their PhD

On the 18th of March 2022, I defended my PhD thesis: “Computational methods for analysis of spatial transcriptomics data”. About a month before this date, I finalized the actual document to be defended. Having spent a considerable amount of time writing the thesis, I had two realizations upon submission: (1) aside from my supervisor, the opponent, and my thesis committee, at most a handful of people would ever read this thing I’d worked so hard to compile; (2) the part I deemed most valuable was the Epilogue, where I took the liberty to share some personal insights from these years.

Hence, I’ve decided to post an excerpt from the Epilogue here on the website. I like it not because it’s a literary masterpiece (it most certainly is not), but because it contains advice I wish I had known when starting my PhD. It consists of three parts:

  • What I’ve learnt: lessons learnt from my time as a PhD student
  • What I predict: some humble predictions about the future
  • What I hope: hopes for the future

For anyone interested in the full thesis, it can be found here.


What I’ve learnt

The “end of history illusion” is a phenomenon in psychology where individuals agree that, up until the current point in time, they’ve experienced continuous and significant growth, but believe that, henceforth, they will not change by any considerable amount. This illusion is persistent across all ages, and has repeatedly been proven incorrect. We humans are malleable and never seem to solidify. No matter where in life we are, we continue to develop, change, and grow.

I was convinced that I’d learn a lot during my PhD, scientifically – but would I be affected on a personal level? Most likely not. Despite being aware of the aforementioned illusion, I was impervious to the idea that this experience would leave much of an imprint on me. I guess this is at best described as arrogance and at worst as stupidity.

Starting my PhD on the 12th of June 2019, I’ve spent exactly 1010 days – or 2 years, 9 months, and 6 days – pursuing my degree. This time has been nothing short of transformative. Admittedly, approximately three years is not a huge amount of time, but these years have been densely packed with new experiences, encounters, and impressions. I’ve acquired many new skills, but I also leave this era of my life as a very different person than the one who entered it. Below follows a curated list of insights that I’ve collected over the course of my PhD, relating to science as well as personal topics.

  • A high level of complexity does not equal a high level of success. Among computational methods, it’s rarely the most advanced ones that surface as the most popular. If you desire spread and impact, study the field and seek questions that are frequently asked but rarely answered; then tailor your method towards these. Never develop a method and then invent a question for it to address.

  • Develop for your audience, not yourself. If you’re capable of formulating a statistical or mathematical model and then implementing it in code, you are likely more proficient in these areas than the average user of your tool. Therefore, if you want people to use your software, make the interface intuitive and provide a layman’s explanation of how it works. Good documentation with loads of examples is key to success. If possible, integrate your method into already existing frameworks; this makes it easy for users to explore without having to learn a new syntax. From my experience, methods that are easy to operate are often favored over less user-friendly ones, even though the latter might have much better performance.

  • Listen to people when they complain. If someone expresses that they are struggling with something, they are likely not alone. Embrace the opportunity and be the one to deliver the solution. This is one of the easiest ways to identify areas where you can make a useful contribution.

  • Seek diversity and honor others’ expertise. The best collaborations are those where the people involved have complementary strengths and show mutual respect for each other’s skills. There’s a difference between being proud of your expertise and being arrogant about it. A project thrives when the members don’t consider their own contribution more (or less) important than anyone else’s, but acknowledge that everyone is essential for the process to move forward.

  • Time spent planning is often doubly rewarded. I’m addicted to fast progress, but have learnt that a short pause can save plenty of time. Making informed design choices, rather than blindly throwing yourself at the first idea, almost always results in a more pleasant and faster overall process. A quick fix for the situation at hand might seem tempting, but opting for general solutions usually pays off in the end.

  • Garbage data will give you garbage results. You wouldn’t pick up roadkill, cook it, and then expect it to taste like a dish from a Michelin-starred restaurant. The same holds true for data; one needs to have reasonable expectations about what information can be derived from it. There’s a difference between a bioinformatician and a magician: the latter can turn nothing into something, the former cannot. Sometimes, the data is just not good enough to answer certain questions; if such is the case, there are only two reasonable options: (i) ask a different question, or (ii) generate new data.

  • Don’t bring nuclear weapons to a gun fight. Sometimes, enthusiasm and excitement about new powerful methods make us blind to the fact that the problem at hand could likely be solved with simpler means. For some questions, a simple regression model will do just as well as – and possibly even better than – a fancy deep learning model. It’s easy to be caught up in the storm of buzzwords, but take some time to contemplate what level of complexity your problem actually requires.

  • Aim to be the dumbest person in the room. The best way to grow is to position yourself in an environment where people are more skilled than you; it accelerates learning and forces you to be alert. Comfort is truly the enemy of improvement.

  • Don’t set yourself on fire to keep others warm. I believe we should always strive to help others when we can, but at some point, doing so can become problematic. If you are consistently the one who does the extra work, covers for others, and stays late – then you’re not helping, you’re being taken advantage of. We’re all familiar with the airplane safety instructions telling us to put on our own masks before helping someone else; this is equally applicable to the workplace. If you want to have a positive impact on the people around you, the most important thing is that you feel good about your own situation.

  • Never compromise on health. In January 2021 I experienced something close to a physical collapse; my body simply quit on me. I could barely walk for two months, and for six more months, every day of my life felt like a living hell – I did not enjoy living. Every morning, I set an alarm that counted down the hours I had left to be awake and aware of my situation. Still, when night came, I barely slept. Instead, I woke up multiple times struggling to breathe or in a state of complete sleep paralysis. A combination of bad nutrition, an extreme (according to some people) amount of exercise, and working ten to twelve hours a day (including weekends) put me in a state of severe exhaustion. It was not until I became a prisoner of my own body that I realized how much my previous freedom meant to me. It’s hard realizing that you’re not an exception, but just as human as everyone else. In the end, however, this realization is healthy. If there’s one thing I will bring with me from these years, it’s that nothing is worth sacrificing one’s well-being or health for.

  • Perspective is everything. There’s a quote from the truly awful “Pirates of the Caribbean” series that reads: “The problem is not the problem. The problem is your attitude about the problem.” Even though I cringe just thinking about Captain Jack Sparrow, these words have stayed with me. I’ve experienced firsthand how you can’t plan every aspect of life. Unexpected things can, and will, happen. Our attitude determines how we experience these events – whether they become a tragedy or a lesson. I’ve tried to adopt more of a “gratitude mindset”; instead of being frustrated when things don’t go my way, I try to celebrate what has gone right so far. This attitude is not always easy to maintain, and one is of course allowed to feel anger, but it’s a feeling that becomes toxic if we let it linger for too long. Embracing this mindset has made me a much happier individual and helped me through some really dark times.

What I predict

In 2016, when the Spatial Transcriptomics (ST) technique was published, I had just finished the second year of my bachelor’s and had yet to hear the term “transcriptomics”. Thus, I’m acutely aware that I belong to the younger generation of the transcriptomics field and do not have the same experience as many of my peers. Still, having worked somewhat intensively in the niche of computational method development for spatial transcriptomics, I have a few predictions about the future, which I’ll take the liberty to share here.

  • Deep learning methods will become staple goods. Although I’m fascinated by deep learning (DL) methods, none of my work has so far exploited the power of these architectures – mainly because I’ve felt that the questions I had could be addressed with simpler methods, or because the data wasn’t there. However, the trend towards more data and increasingly sophisticated, user-friendly frameworks – paired with the development of new kinds of models – makes me certain that DL will revolutionize the single cell and spatial transcriptomics fields, just as it has many other aspects of our lives. Currently, a lot of the DL-based methods simply apply existing general models (e.g., taken from the natural language processing field) to a problem in the transcriptomics sphere. However, I believe we’ll migrate from this approach towards bespoke models, where prior information about the biological system is integrated into the model architecture. In the very near future, graph convolutional networks (GCNs), with their aptitude for irregular data, will likely become a popular element in many methods for analysis of spatial transcriptomics data (a minimal sketch of one such graph convolution follows after this list).

  • Emergence of perturbation studies. The majority of publications and projects that include spatial transcriptomics data have so far been observational: a sample is collected, analyzed, and relevant observations presented. On rare occasions, samples representing case and control exist, but usually with limited metadata and no control over confounding variables. While interesting, this setup mainly permits exploratory data analysis (EDA) and does not lend itself well to inferring causal relationships. To go beyond mere associations or correlations, an intervention or perturbation of the system is necessary. Thus, I’m certain that it’s just a question of time until techniques like Perturb-seq are combined with spatial assays. With the introduction of perturbations, we’ll be able to deduce how gene expression impacts spatial structure, and potentially also the reciprocal relationship. With access to such data, causal inference will likely become an essential tool for modeling and understanding causative effects. This is something I’m genuinely excited about.

  • Preference for generative models. Many of the models we currently employ are of a discriminative nature, but I anticipate a shift towards generative models. Discriminative models assume some functional form of the posterior $p(y \mid x)$; generative models, in contrast, learn the joint probability distribution $p(x, y)$ over all variables. Generative models are more amenable to the incorporation of prior information about the systems being studied, and better at representing causal relationships. Thus, they neatly tie together the two previous statements about the need for bespoke models and causal links.

  • Challenges of multimodal analysis. To me, the trend in technology development can best be summarized with the Pokémon slogan: “Gotta Catch ‘Em All.” The transcriptome, epigenome, proteome, and metabolome – we want them all, at the same time, from the same cell. Indeed, 10x Genomics already has an assay where RNA-seq and ATAC-seq data can be obtained from the same cell, as well as a second assay where spatial RNA-seq information and protein abundance are collected simultaneously. Apart from an increased ability to resolve cell types and states, very few examples where multimodal data is superior to unimodal data have so far been presented, but there’s no lack of ideas.

    One of the commonly mentioned aspirations is to learn relationships between the different modalities, which could be used to predict one modality from another – for example, deducing protein levels from gene expression. Here, I will take a somewhat controversial and conservative stance: prediction of one modality from another will prove to be more challenging than many expect. I base this statement on two concepts: temporal delays and missing information. I’ll elaborate on both of these issues below.

    Changes to one part of the central dogma usually don’t manifest immediately in other parts; some form of delay tends to be present. Thus, data ($x_t$) collected from one modality at time $t$ isn’t necessarily informative about the feature values ($y_t$) of a different modality at the same time point. Instead – due to the lag – $x_t$ relates to the values ($y_{t'}$) at a later point $t'$. This discrepancy causes an issue in learning, because the two modalities are related according to: $$y_{t'} = f(x_t)$$ However, in most multimodal assays, we observe $(x_t, y_t)$, meaning the data required to learn $f$ is not available. Potentially, $y_{t'}$ could be inferred from $y_t$ by learning a second map $g$ such that $y_{t'} = g(y_t)$; $f$ could then be learnt by first transforming $y_t$ through $g$. Now, to find $g$, the derivative $\partial y_t/\partial t$ must likely be deduced, and estimating it requires at least one more data point close in time (w.r.t. protein turnover timescales). Unfortunately, experimental assays only capture a single snapshot of the system at a particular time; as a consequence, estimation of such derivatives is usually infeasible. The dilemma described above is what I refer to as temporal delay (a toy simulation of it is sketched after this list).

    Next, I’ll address the second caveat, that of missing information. The path from one modality to another often involves several steps and regulatory mechanisms, not exclusively relying on elements of the observed modality. Thus, the previous equation should be updated to: $$y_{t'} = f(x_t, u_t)$$ where $u_t$ represents entities with an influence over the regulatory mechanisms (e.g., enzyme levels or metabolite concentrations). Note that it’s possible that $u_t$ and $y_t$ overlap. Assuming that the above equation holds, data must also be collected on $u_t$ for predictions about $y_{t'}$ to be made; relying solely on $x_t$ is not sufficient, since $x_t$ does not contain all the information needed to predict $y_{t'}$. Of course, if $t \approx t'$ and $f(x_t, u_t) \approx f(x_t)$, the problem is reduced to a much simpler one. Still, when such is not the case, we should accept that the prediction task is challenging. I definitely don’t think it’s beyond our capabilities, but while I expect methods for integration of different data modalities to emerge soon after the experimental technologies, general methods to model intermodal relationships will take more time to mature.

  • The group before the individual. As mentioned in the background, both internal and external factors influence a cell’s state. In my opinion, there’s still a need for general methods that try to model how the local environment of a cell affects its behavior. Conditional models for gene expression already exist, one example being those that condition on cell type, often resulting in sets of marker genes or gene signatures. These models could be expanded to also condition on the local environment of a cell, for example, the proportions of different cell types in its neighborhood. Such models add a new, interconnected, layer of information to our understanding of how cells operate in biological systems. Indeed, early attempts to construct models of this kind have already been made (e.g., node-centric expression modeling by the Theis Lab), and I dare predict an abundance of them a couple of years from now (a minimal sketch of such a conditional model is included below).
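
To make the GCN remark in the first bullet above concrete, here is a minimal sketch of a single graph convolution acting on spatial transcriptomics spots. It is illustrative only – the neighborhood radius, layer sizes, and random data are all made up for the example, and no specific published method is being reproduced.

```python
import numpy as np

def normalized_adjacency(coords: np.ndarray, radius: float) -> np.ndarray:
    """Symmetrically normalized adjacency (with self-loops) built from spot
    coordinates, connecting spots that lie within `radius` of each other."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dists < radius).astype(float)  # diagonal (self-loops) included, since d(i, i) = 0
    deg_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]

def gcn_layer(adj_norm: np.ndarray, features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One graph convolution: aggregate over neighbors, transform, apply ReLU."""
    return np.maximum(adj_norm @ features @ weights, 0.0)

# Toy data: 100 spots on a 10x10 grid, 50 genes, 16 hidden units.
rng = np.random.default_rng(0)
coords = np.stack(np.meshgrid(np.arange(10), np.arange(10)), -1).reshape(-1, 2).astype(float)
expression = rng.poisson(2.0, size=(100, 50)).astype(float)
w = rng.normal(scale=0.1, size=(50, 16))

adj_norm = normalized_adjacency(coords, radius=1.5)
hidden = gcn_layer(adj_norm, expression, w)  # (100, 16): spot embeddings informed by neighbors
```

The aptitude for irregular data comes for free here: the adjacency matrix can encode any neighborhood structure, not just the regular grid used in this toy example.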
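
The temporal-delay argument from the multimodal bullet can also be made tangible with a toy simulation: generate a lagged linear relation $y_{t'} = f(x_t)$, then fit against the contemporaneous pairs $(x_t, y_t)$ that a snapshot assay would actually deliver. All numbers below are synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, lag = 500, 5
true_coef = 2.0  # true lagged relation: y[t + lag] = true_coef * x[t] + noise

# Autocorrelated "mRNA" signal x, and a delayed "protein" response y.
x = np.zeros(n_steps)
for t in range(1, n_steps):
    x[t] = 0.9 * x[t - 1] + rng.normal(scale=0.5)
y = np.zeros(n_steps)
y[lag:] = true_coef * x[:-lag] + rng.normal(scale=0.2, size=n_steps - lag)

# A snapshot assay observes (x_t, y_t); fitting y_t ~ x_t gives a biased slope,
# attenuated by how strongly x is autocorrelated across the lag.
snapshot_coef = np.polyfit(x[lag:], y[lag:], 1)[0]
# Fitting against the correctly lagged pairs recovers the true relation.
lagged_coef = np.polyfit(x[:-lag], y[lag:], 1)[0]

print(f"true coefficient:     {true_coef:.2f}")
print(f"snapshot estimate:    {snapshot_coef:.2f}")  # noticeably below 2.0
print(f"correctly lagged fit: {lagged_coef:.2f}")    # close to 2.0
```

The snapshot fit only recovers the true map to the extent that $x_t$ predicts $x_{t - \text{lag}}$ – which is exactly the dilemma described in the bullet above.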
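
Finally, to illustrate the last bullet: a minimal sketch of a model that conditions a gene’s expression on both the cell’s own type and the cell type composition of its neighborhood. The Poisson regression, the gradient-ascent fit, and all variable names are hypothetical simplifications of the idea – not the Theis Lab’s actual formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_types = 300, 3

# Synthetic data: each cell has a type and a neighborhood cell type composition.
cell_type = rng.integers(0, n_types, size=n_cells)
own = np.eye(n_types)[cell_type]                       # one-hot: the cell's own type
neigh = rng.dirichlet(np.ones(n_types), size=n_cells)  # cell type proportions among neighbors
X = np.hstack([own, neigh])                            # design: intrinsic + environmental terms

# Ground truth: expression depends on own type AND on neighborhood composition.
beta_true = np.array([0.5, 1.5, 0.2, 1.0, -0.5, 0.3])
counts = rng.poisson(np.exp(X @ beta_true))

# Fit the Poisson regression, log E[counts] = X @ beta, by plain gradient ascent.
beta = np.zeros(X.shape[1])
for _ in range(2000):
    grad = X.T @ (counts - np.exp(X @ beta))  # gradient of the Poisson log-likelihood
    beta += 1e-4 * grad

print(np.round(beta, 2))  # should roughly recover beta_true
```

If the neighborhood coefficients (the last three entries of beta) come out non-zero, the local environment carries information about expression beyond the cell’s own identity – precisely the interconnected layer of information referred to above.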

What I hope

Having outlined the lessons I’ve learnt and my predictions for the future, only one thing remains: listing some of the things I hope for, but am less certain of.

  • Revised educational programs. In genomics, almost every new technological method is accompanied by a suite of computational tools to analyze the data. Ever more frequently, high-impact journals publish purely computational methods designed to unveil previously occluded insights that only emerge through clever modeling of the data. Thus, it’s evident that computational expertise is just as important to advancing life science as biological and technical knowledge. If further proof is needed: in 2021, SciLifeLab and the Wallenberg National Program announced several DDLS (data-driven life science) fellowships, acknowledging the importance of computational competence. Still, the essential skills needed in computational biology – such as statistics, mathematics, probability theory, modeling, and programming – are severely underrepresented in many of the biotechnology programs at Swedish universities. We need to step up our game if we want to maintain our status within the life sciences as an innovative and leading nation, and remain competitive with international institutions like the Broad or the Wellcome Trust Sanger Institute. The foundation must be laid early on; educating PhD students is not good enough. Computational biology tracks should be instituted already at the Master’s level, and potentially even seep into the bachelor’s programs. I sincerely hope that the educational programs will be updated to prepare interested students for the challenges a computational biologist faces.

  • Increased diversity. If there’s one thing I’m not stoked about, it’s gender quotas and female-exclusive events; to me, they have the opposite effect of their intended purpose. These actions belittle women’s competence and give the impression that we need extra help or special rules to succeed. However, women are clearly underrepresented in the computational field; at many hackathons or meetings, I’ve found myself – as a woman – in a very small minority, and am often assumed to be someone representing the wet-lab side. I’m not upset by this, and have never been met with anything but respect when correcting people, but I don’t think it has to be like this. Girls and young women should be just as encouraged to pursue STEM subjects as their male counterparts, and all of us – me included – should probably revise or abolish some of our stereotypes. So, I dearly hope for a future where the computational fields become more diverse and inclusive. Of course, diversity extends beyond gender; the same arguments can – and should – be made about ethnicity, age, religion, sexual identity, etc. Being a white woman living in Sweden, I fully acknowledge my privileges, and that my encounters with prejudice are probably dwarfed by those of other – less fortunate – groups. Still, I can only speak of my own experiences and observations.

  • Breaking the limit. My third, and final, wish for the future is to pass the qualifying time for the Boston Marathon. To then – of course – complete the race.
