The building blocks of life: identifying atrophy in protein domains

Can studying protein domains help us understand protein evolution? Alex Bateman and William Pearson are authors on two studies recently published in Genome Biology. We asked them to tell us more about their findings.

Can you explain why the study of protein domains is so significant to biologists?

Protein domains are the common currency of protein structure and function between different proteins. Nature has recombined protein domains in different ways, like Lego bricks, to make new proteins. If we can understand the functions of every domain, then we will be a long way toward understanding how all proteins work.

Indeed, one might go a step further, stretching the analogy from Lego bricks to the protein equivalent of chemical elements – the atomic (indivisible) units from which proteins are constructed. Using this analogy, just as chemical atoms combine to build molecules, domains mix and match to form proteins.

Atoms, of course, can be combined in many more ways than the domain Lego blocks – protein domain blocks can only be combined end-to-end. But there are thousands to tens of thousands of protein domain ‘bricks’, rather than just one hundred or so elements.

What is ‘domain atrophy’ and why did you choose to study this?

The term domain atrophy emerged from a conversation between Ananth Prakash and Alex Bateman when discussing suitable PhD projects. They had both noticed proteins in which domains appeared to have been degraded. This led us to hunt for these events in a more systematic way. Domain atrophy does not appear to have been systematically studied, with only a few sporadic reports in the literature.

As both of your articles were submitted together, can you describe how they complement each other?

At the poster session at ISMB (the Intelligent Systems for Molecular Biology conference), we discovered that both groups were working along similar lines. Both of us were interested in the question: since protein domains are often thought of as building blocks with characteristic lengths, what do partial domains look like? Though both groups were asking similar questions, we examined the problem from different perspectives.

Downstream domain-bounded atrophy
Prakash & Bateman, 2015

As Triant and Pearson tried to find ‘true’ partial domains, they encountered artifacts that made it very difficult to identify genuine partial domains (though they found a few), so they ended up focusing on the bioinformatic causes of domain artifacts.

In contrast, Prakash and Bateman pursued a strategy that quickly focused on the highest quality protein annotations, though even with this focus, a lot of manual analysis was required. The approaches used were quite similar and so we tried to use a common description of the types of partial domains such as ‘end bounded atrophy’ to describe where a partial domain was found at the N- or C-terminus of the protein.
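As a rough illustration of the ‘end bounded’ idea, the sketch below labels a partial domain by whether it sits at the N- or C-terminus of a protein. It is not the code used in either study: the function name, the 1-based inclusive coordinates, and the `tolerance` parameter are all assumptions made for this example.

```python
def end_bounded_label(seq_start, seq_end, protein_length, tolerance=5):
    """Label where a partial domain sits relative to the protein termini.

    Coordinates are 1-based and inclusive. `tolerance` is a hypothetical
    parameter: how many residues from a terminus still counts as "at" it.
    """
    if not (1 <= seq_start <= seq_end <= protein_length):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")

    at_n_terminus = (seq_start - 1) <= tolerance
    at_c_terminus = (protein_length - seq_end) <= tolerance

    if at_n_terminus and at_c_terminus:
        return "both termini"
    if at_n_terminus:
        return "N-terminal end bounded"
    if at_c_terminus:
        return "C-terminal end bounded"
    return "internal"


# Made-up coordinates for a 300-residue protein.
print(end_bounded_label(1, 80, 300))     # N-terminal end bounded
print(end_bounded_label(220, 300, 300))  # C-terminal end bounded
print(end_bounded_label(100, 180, 300))  # internal
```

A label like this only records position. Deciding whether the partial domain is real or an artifact still needs the manual checks described in the interview.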

What were the challenges in carrying out these studies?

As pointed out in the Triant and Pearson manuscript, partial domains found in sequence databases are largely caused by bioinformatics artifacts. Even after trying to filter these out, Prakash and Bateman found many proteins where the apparent partial domains were caused by various problems with the data.

Cases where the structure of the genes is incorrect, or the definition of the Pfam domain boundaries is incorrect, or there are nested domains or circular domain permutations, can all look like domain atrophy. It was like panning for gold: after filtering and manually checking all the results, Prakash and Bateman were able to find some real cases of domain atrophy.

Likewise, Triant and Pearson found a small number of cases (18 out of 136) where partial domains were not artifacts. In these cases, Pfam had built a domain from two smaller domains, which could sometimes be found in different sequence contexts.

But again, these instances had to be verified manually. Simply looking for shorter homology was not enough, nor was looking for compact structures. But in some cases, the combination of different alignment patterns and multiple structural domains suggested that smaller mobile structural domains existed.

What were the main findings, relating to the annotation and characterization of protein domains?

Triant and Pearson found that 4% of domain annotations suggested that half or more of the domain is missing (almost a million annotations). Of these, 80-90% are likely to be either alignment errors, which we termed ‘split domains’, or mistakes based on incorrect genome assemblies or gene models (non-split partial domains are almost non-existent in well-annotated genomes like human).
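The ‘half or more of the domain is missing’ criterion can be sketched as a coverage calculation over the domain model. This is an illustrative assumption about how such a filter might look, not the procedure from the paper: the `DomainHit` fields, the 0.5 threshold as coded, and the example hits are all invented here.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    domain: str        # domain identifier (made-up values below)
    model_from: int    # first domain-model position aligned, 1-based
    model_to: int      # last domain-model position aligned, inclusive
    model_length: int  # full length of the domain model


def model_coverage(hit):
    """Fraction of the domain model covered by the alignment."""
    return (hit.model_to - hit.model_from + 1) / hit.model_length


def is_partial(hit, max_coverage=0.5):
    """True when half or more of the domain model is missing."""
    return model_coverage(hit) <= max_coverage


def partial_fraction(hits):
    """Fraction of annotations flagged as partial domains."""
    hits = list(hits)
    if not hits:
        return 0.0
    return sum(is_partial(h) for h in hits) / len(hits)


# Invented annotations, for illustration only.
hits = [
    DomainHit("domA", 1, 100, 100),   # full-length match
    DomainHit("domA", 60, 100, 100),  # covers 41 of 100 positions
    DomainHit("domB", 1, 120, 250),   # covers 120 of 250 positions
    DomainHit("domB", 10, 240, 250),  # covers 231 of 250 positions
]
print(partial_fraction(hits))  # 0.5
```

A hit flagged this way is only a candidate. As the interview notes, most such candidates turn out to be alignment or gene-model artifacts.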

Of the remaining true partials, some are Pfam domains that should be sub-divided, but Prakash and Bateman found that, in very rare cases, other partial domains are able to tolerate mutations that would be expected to be highly disruptive.

Atrophied domains were both unexpected and exciting. Mutations that cause atrophy can lead to proteins with novel functions, such as the non-fluorescent protein, which, although not fluorescent, is hypothesized to buffer an inhibitory side product of the LuxA/LuxB reaction.

Do your results tell us anything new about protein evolution?

Both sets of results strongly support the traditional ‘Lego block’ view of protein domains as indivisible structural and functional building blocks. But domain atrophy is new and exciting, because it raises the possibility that proteins can tolerate very disruptive mutations, which will have implications for understanding disease mutations.

What is next for your research?

It would be fascinating to try to solve the structures of the potential atrophy cases that were identified, to help illuminate how proteins can tolerate extreme mutations. Another avenue would be to try to induce domain atrophy experimentally, to understand the thermodynamics and folding in these cases.

We also hope that a better understanding of domain annotation artifacts can lead to more accurate domain annotations – partial domains should be viewed with suspicion. More accurate annotations may in turn provide more effective strategies for identifying very distant evolutionary relationships from sequence comparisons.

Barbara Cheifet

Editor at BioMed Central
Barbara received her Ph.D. from Yale’s Department of Molecular, Cellular, and Developmental Biology in 2014, with a background in skin stem cell research. She joined BioMed Central as an assistant editor for Genome Biology in the summer of 2014.
