A team of researchers at The University of Texas at Austin has demonstrated a new way to sequence proteins that is much more sensitive than existing technology, identifying individual protein molecules rather than requiring millions of molecules at a time. The advance could have a major impact on biomedical research, making it easier to reveal new biomarkers for the diagnosis of cancer and other diseases and to enhance our understanding of how healthy cells function.
The team published the results of their proof-of-concept study today in the journal Nature Biotechnology.
"We have created, essentially, a DNA-sequencing-like technology to study proteins," said Edward Marcotte, professor of molecular biosciences and co-inventor of the new technology.
Work on this project began more than six years ago when Marcotte and colleagues first envisioned adapting the methods of next-generation gene sequencing to protein sequencing. Next-generation gene sequencing is a set of techniques that have made sequencing the entire genome of any living organism fast, accurate and affordable, accelerating biological research — and for the rest of us, enabling at-home genetic testing for ancestry and disease.
In the same way that these earlier advances provided quick and comprehensive information about thousands of genes that influence human health, the new technology provides rapid and comprehensive information about the tens of thousands of proteins that play a role in healthy functioning or in disease. In many disorders — such as cancer, Alzheimer's, heart failure and diabetes — cells produce proteins and other substances that act as unique biomarkers, akin to fingerprints. Better detection of these biomarkers would help researchers understand the causes of disorders or provide earlier, more accurate diagnoses for patients.
The current laboratory standard for sequencing proteins, a tool called mass spectrometry, is not sensitive enough for many applications — it can detect a protein only if there are about a million copies of it. It also has a "low throughput," meaning it can detect only a few thousand distinct protein types in a single sample.
With this new method, called single molecule fluorosequencing, researchers can now sequence millions of individual protein molecules simultaneously in a single sample. Marcotte believes that, with future refinements, the number of molecules detected in a sample could reach into the billions. With higher throughput and much greater sensitivity than existing technology, the tool should allow for better detection of disease biomarkers and would also make it possible to study diseases such as cancer in a whole new way. For example, researchers could look, cell by cell, to understand how a tumor evolves from a small mass of identical cells to a soup of genetically divergent cells, each with its own strengths and weaknesses. Such insights could inspire novel ways to attack cancer.
The new protein sequencer was also featured by the NSF in a "Four Awesome Discoveries You Probably Didn't Hear About This Week" video.
Graduate student Jagannath Swaminathan joined the research team early on and devoted his entire Ph.D. to leading the effort. Other major contributors included research scientist Angela Bardo, who adapted an imaging technique used in next-generation gene sequencing, called TIRF microscopy. Because the method involves attaching chemical tags to the proteins to be analyzed, Marcotte brought on board UT Austin chemist Eric Anslyn. A former graduate student in the Anslyn lab, Erik Hernandez, developed the chemical process for attaching a sequence of fluorescent tags in various colors to different amino acids in proteins, such that the tags can survive the harsh chemical steps used to remove amino acids one at a time.
The researchers used supercomputer clusters at UT Austin's Texas Advanced Computing Center for their analyses.
The researchers have founded a startup company called Erisyon Inc. to commercialize the technology. Three team members (Swaminathan, Marcotte, and Anslyn) share a patent with UT Austin on the underlying technology and have filed applications for several related patents.
Erisyon Inc. has four co-founders and shareholders from UT Austin (Swaminathan, Bardo, Marcotte and Anslyn), plus two from outside the university (Zack Simpson and Talli Somekh). UT Austin and Erisyon have signed a licensing agreement to commercialize the technology.
The paper's other co-authors are Alexander Boulgakov, James Bachman, Joseph Marotta and Amber Johnson.
This work was supported by fellowships from the Howard Hughes Medical Institute and the National Science Foundation, and by grants from the Defense Advanced Research Projects Agency, the National Institutes of Health, the Welch Foundation and the Cancer Prevention and Research Institute of Texas.
The University of Texas at Austin is committed to transparency and disclosure of all potential conflicts of interest. University investigators involved in this research have submitted required financial disclosure forms with the university. UT Austin filed patent applications on the technology described in this news release—one was approved in April 2017 (U.S. patent #9,625,469) and a second one is currently pending—in conjunction with the formation of Erisyon Inc., a biotech startup in which Marcotte, Anslyn, Bardo and Swaminathan have equity ownership.
How single molecule fluorosequencing works and another application
Proteins are strings of amino acids, like tiny beaded necklaces. There are 20 types of amino acids from which living things can build proteins, and by arranging them in different sequences, cells produce tens of thousands of different proteins. You can also think of those 20 amino acid types as letters from which messages can be composed. The goal of protein sequencing is to read the message that each protein encodes.
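To make the alphabet analogy concrete, here is a throwaway Python snippet (mine, not from the paper) counting how many distinct "messages" even a short protein could spell:

```python
# The 20 standard amino acids, written as their one-letter codes.
# Proteins are strings over this alphabet, so even short sequences
# allow an astronomical number of distinct "messages."
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

print(len(AMINO_ACIDS))        # 20
print(len(AMINO_ACIDS) ** 10)  # 10240000000000 possible 10-residue peptides
```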
In single molecule fluorosequencing, some of the amino acids in the sample's proteins are first labeled with fluorescent tags. The sample is then placed under a microscope, where a laser causes the tags to glow, and the amino acids are chemically snipped off one by one. Each time an amino acid is removed, a photo is snapped. This cycle repeats a dozen or more times: snip-snap, snip-snap. The photos look like star fields, each dot representing a single protein molecule. Over time, the color and intensity of each dot in the microscope's field of view changes in a way that reveals which amino acids were removed at each step for each protein molecule. It's like reading out a code written on millions of strips of paper by snipping off one letter at a time. By comparing the observed amino acid sequences to an existing database of protein sequences, the researchers can pinpoint the identity of each protein in the sample.
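A minimal sketch of this snip-and-image cycle, written as a toy Python model rather than the authors' actual pipeline: it assumes, purely for illustration, that only cysteine (C) and lysine (K) carry dyes and that every chemical step works perfectly.

```python
# Toy model of one fluorosequencing read: tag a subset of amino acids,
# then remove residues one at a time and record the cycle at which each
# dye color is lost. The residue choices (C, K) and perfect chemistry
# are illustrative assumptions, not the published protocol.

TAGGED = {"C": "red", "K": "green"}  # dye color per tagged amino acid

def fluorosequence(peptide, n_cycles=15):
    """Return {color: [cycles at which a dye of that color was lost]}."""
    drops = {color: [] for color in TAGGED.values()}
    for cycle, residue in enumerate(peptide[:n_cycles], start=1):
        # Snip the N-terminal residue, then "snap a photo": if the
        # removed residue carried a dye, that dot dims by one step.
        if residue in TAGGED:
            drops[TAGGED[residue]].append(cycle)
    return drops

print(fluorosequence("ACDKECK"))
# {'red': [2, 6], 'green': [4, 7]}  -> a partial "fingerprint" of the peptide
```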
Along the way, the researchers discovered they didn't have to tag all 20 possible amino acids in a protein to accurately identify it. It takes only two to four different tags to produce a partial sequence that provides enough information for a supercomputer to identify the protein. It's as if you were a contestant on "Wheel of Fortune," and you bought some vowels. You'd still have blanks left in the puzzle, but based on your knowledge of words and phrases in the English language, you can figure out what the missing letters must be.
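The matching step can be sketched the same way: reduce every protein in a reference database to the partial pattern the microscope would see, then look up the observed pattern. The tiny database and helper names below are hypothetical, for illustration only.

```python
# Minimal sketch of the matching step: every database protein is reduced
# to the same partial pattern a read produces, so an observed pattern can
# be looked up directly. The three-entry "database" is made up.

DATABASE = {
    "protein_A": "ACDKECKL",
    "protein_B": "MKWVTFIS",
    "protein_C": "GCTKAAAC",
}

def partial_pattern(seq, tagged="CK", n_cycles=15):
    """Positions (1-based) of tagged residues within the first n_cycles."""
    return tuple((i, aa) for i, aa in enumerate(seq[:n_cycles], start=1)
                 if aa in tagged)

index = {}
for name, seq in DATABASE.items():
    index.setdefault(partial_pattern(seq), []).append(name)

observed = partial_pattern("ACDKECKL")  # what one molecule "read" as
print(index.get(observed))              # ['protein_A']
```

Even though only two of the 20 amino acid types are visible, the pattern of their positions is distinctive enough here to single out one protein, just as a few revealed vowels can give away a puzzle phrase.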
In addition to identifying biomarkers of disease and tracking cancer's evolution, another application for the technology could be in discovering proteins that have never been seen before. Thousands of proteins, collectively called the "dark proteome," should be produced by the body, based on recipes written in our genetic code, the human genome, yet have never actually been identified in nature. Maybe they're produced only in very rare situations. Maybe the body modifies them or destroys them quickly. Scientists would like to understand what role they play in healthy functions, as well as in disorders.
Comments (2)
Hey Marc. Great read. It's amazing to see other researchers at the university advancing the field of proteomics. I wasn't aware of Dr. Marcotte's research group on campus. I'm a researcher in the Brodbelt group here on campus using ultraviolet lasers to fragment proteins and analyze them with mass spectrometry, and it's fascinating to see how other groups are approaching the problem. I want to say one thing about the article though. You mentioned, "It also has a 'low throughput,' meaning it can detect only a few thousand distinct protein types in a single sample." This isn't low throughput. High throughput would be the ability to automate analysis to ultimately decrease the time needed to analyze a sample, and hence low throughput would mean either that the sample is too complicated to program a simple sequence for the instrument and/or that it takes too long to run the analysis. This is more of a fragmentation and bioinformatics issue than it is a throughput one.
Throughput can indeed be defined in different ways, and perhaps the most relevant measure here is the number of distinct observations made in an experiment (i.e., the number of mass spectra collected on a mass spectrometer versus the number of sequencing reads on a DNA sequencer) that are each capable of uniquely identifying a specific molecule or entity in the sample.
It's not unusual for contemporary mass spectrometry "shotgun-style" proteomics experiments to collect hundreds of thousands of mass spectra from a single sample, very high-throughput indeed compared to historic proteomics experiments. However, this is still many orders of magnitude lower throughput than DNA sequencers, on which researchers can routinely collect billions of sequencing reads per experiment.
Let's put some numbers on this to get a better sense of scale: In 2016, when these numbers were last reported (Nucleic Acids Res. (2016) 44(D1):D447-D456), the sum total of all protein mass spectrometry datasets in the PRIDE database (one of the premier protein/peptide mass spectrometry data repositories) added up to 0.7 billion peptide mass spectra. This is obviously only a rough proxy and only captures data put into the repository, but it gives a sense of the magnitude of data collected by the entire proteomics community around the world up to that point. In contrast, just one experiment on just one DNA sequencing machine, such as an Illumina NovaSeq, can produce 20 billion reads in 1-2 days, capturing 6 terabases of DNA sequence.
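A quick back-of-the-envelope check, using only the figures quoted above (the variable names are mine):

```python
# Figures quoted above: total PRIDE peptide spectra through 2016 vs.
# reads from a single Illumina NovaSeq run.
pride_spectra_2016 = 0.7e9
novaseq_reads_per_run = 20e9

print(novaseq_reads_per_run / pride_spectra_2016)  # ~28.6
# One sequencer run yields roughly 30x more reads than the entire
# community archive of peptide spectra collected through 2016.
```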
The scale-up in throughput offered by sequencing technologies is really nothing short of astonishing. It remains to be seen if protein and peptide sequencing will indeed scale in a similar fashion, but fundamental similarities in the underlying technology platforms suggest this should be possible.