Assignment 2
Parsing Biological Data
Implementing the Protein Class
Your task is to create a Protein class that holds important protein data and performs basic
operations on proteins. Below is the same uniprot sample from the first assignment:
ID HBA_HUMAN Reviewed; 142 AA.
AC P69905; P01922; Q1HDT5; Q3MIF5; Q53F97; Q96KF1; Q9NYR7; Q9UCM0;
DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT 23-JAN-2007, sequence version 2.
DT 28-JUL-2009, entry version 75.
DE RecName: Full=Hemoglobin subunit alpha;
DE AltName: Full=Hemoglobin alpha chain;
DE AltName: Full=Alpha-globin;
GN Name=HBA1;
GN and
GN Name=HBA2;
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC Catarrhini; Hominidae; Homo.
OX NCBI_TaxID=9606;
....
SQ SEQUENCE 142 AA; 15258 MW; 15E13666573BBBAE CRC64;
MVLSPADKTN VKAAWGKVGA HAGEYGAEAL ERMFLSFPTT KTYFPHFDLS HGSAQVKGHG
KKVADALTNA VAHVDDMPNA LSALSDLHAH KLRVDPVNFK LLSHCLLVTL AAHLPAEFTP
AVHASLDKFL ASVSTVLTSK YR
//
In addition to the primary accession number and sequence, the Protein class will hold the
description of the protein as well as the source organism. In the sample above, the
description is "Hemoglobin subunit alpha" and the source organism is "Homo sapiens (Human)".
To allow this information to be read and modified, your class must implement getter and
setter methods for each private instance variable. In addition to these methods, you must
implement the following instance methods:
public String toString(): Implementing this method will allow us to directly print
useful information about proteins using System.out.println(protein). This method
should return a String formatted as follows:
AC: <the primary accession number>
DE: <the description>
OS: <the source organism>
SQ: <the sequence>
public boolean equals(Protein otherProtein): When the equals method is called
(e.g. myProtein.equals(otherProtein)), it should return true only if myProtein
and otherProtein are the same protein. Implementing this method correctly will require
us to use the keyword this, which we will discuss in lab.
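The two methods above could be sketched roughly as follows. This is only an illustration under stated assumptions: the field names, the constructor, and the choice to treat two proteins as "the same" when their primary accession numbers match are not specified by the assignment. Only one getter/setter pair is shown; the others would follow the same pattern.

```java
// A minimal sketch, not a reference solution. Field names and the
// constructor are assumptions made for illustration.
class Protein {
    private String accession;   // primary accession number, e.g. "P69905"
    private String description; // e.g. "Hemoglobin subunit alpha"
    private String organism;    // e.g. "Homo sapiens (Human)"
    private String sequence;    // amino-acid sequence with spaces removed

    Protein(String accession, String description, String organism, String sequence) {
        this.accession = accession;
        this.description = description;
        this.organism = organism;
        this.sequence = sequence;
    }

    // One getter/setter pair; the remaining fields follow the same pattern.
    public String getAccession() { return accession; }
    public void setAccession(String accession) { this.accession = accession; }

    // Produces the four-line AC/DE/OS/SQ format described above.
    @Override
    public String toString() {
        return "AC: " + accession + "\n"
             + "DE: " + description + "\n"
             + "OS: " + organism + "\n"
             + "SQ: " + sequence;
    }

    // Assumption: "same protein" means "same primary accession number".
    public boolean equals(Protein otherProtein) {
        return otherProtein != null
            && this.accession != null
            && this.accession.equals(otherProtein.accession);
    }
}
```

One point worth keeping in mind: because its parameter type is Protein rather than Object, equals(Protein) overloads Object's equals method instead of overriding it, so library methods such as ArrayList.contains will not call it.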
Extending the Parser class
Your next task is to enhance the parser so that it can read in every uniprot file that lives in
some directory (folder). In particular, you must implement the following:
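import java.util.ArrayList;

public class Parser{
//class-level variable that stores all of the proteins; it is static so that
//the static methods parseAll and main can access it.
private static ArrayList<Protein> protein_database = new ArrayList<Protein>();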
.
.
.
/*
fills the protein_database arraylist with all the proteins
defined in the uniprot files that exist inside the directoryName folder.
*/
public static void parseAll(String directoryName){
//CODE GOES HERE
}
You will need Java's ArrayList and File classes to correctly implement the recursive method
parseAll. File methods such as isDirectory, exists, list, and listFiles will be quite
helpful. In a nutshell, an ArrayList is an expandable array that allows us to add elements
without knowing the size of the array in advance.
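The recursive use of File and ArrayList described above might look something like the sketch below. The class name DirectoryWalkSketch and the helper collectDatFiles are hypothetical and not part of the required interface; the sketch only gathers the matching files and leaves the actual parsing out.

```java
import java.io.File;
import java.util.ArrayList;

// Hypothetical helper class, for illustration only.
class DirectoryWalkSketch {
    // Adds every file whose name ends in ".dat" to 'found', descending
    // recursively into any subdirectories of 'dir'.
    static void collectDatFiles(File dir, ArrayList<File> found) {
        if (!dir.exists() || !dir.isDirectory()) {
            return; // nothing to do for a missing path or a plain file
        }
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // listFiles returns null if the directory cannot be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collectDatFiles(entry, found); // recursive step
            } else if (entry.getName().endsWith(".dat")) {
                found.add(entry); // the list grows as needed
            }
        }
    }
}
```

Because the ArrayList grows on demand, the method does not need to count the files before collecting them.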
Since we are no longer concerned with outputting to a file, the main method of your Parser
class must be as follows:
public static void main(String[] args){
//outputs a list of all the proteins defined in the uniprot files
//that reside in the folderName folder
String folderName = args[0];
parseAll(folderName);
System.out.println(protein_database);
}
Specifications, notes, and hints
The names of the source code files must be Parser.java and Protein.java. The program
should be executable as: java Parser uniprotDirectory where uniprotDirectory is
the name of the directory (folder) that contains the uniprot files.
Since we are now interested in holding more protein data (other than AC and SQ),
you will need to adjust your parse method to extract this new data.
Do not assume that only uniprot files live in uniprotDirectory. You may assume that
all uniprot files will end in .dat.
ArrayLists will allow you to make your parse method simpler.
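As a rough illustration of extracting the additional fields, the sketch below pulls the primary accession number, description, source organism, and sequence out of the lines of a single entry. The class EntryFieldsSketch and the method extractFields are hypothetical names. The sketch also assumes that the description is the RecName: Full= value on a DE line and that the OS value fits on one line, which holds for the sample above but is not guaranteed for every entry.

```java
import java.util.ArrayList;

// Hypothetical helper class, for illustration only.
class EntryFieldsSketch {
    // Returns {accession, description, organism, sequence} for one entry.
    static String[] extractFields(ArrayList<String> lines) {
        String accession = null, description = null, organism = null;
        StringBuilder sequence = new StringBuilder();
        boolean inSequence = false;
        for (String line : lines) {
            if (line.startsWith("//")) break;          // end of entry
            if (inSequence) {                           // lines after SQ hold residues
                sequence.append(line.replaceAll("\\s+", ""));
                continue;
            }
            if (line.length() < 2) continue;
            String code = line.substring(0, 2);         // two-letter line code
            String rest = line.substring(2).trim();
            if (code.equals("AC") && accession == null) {
                accession = rest.split(";")[0].trim();  // first AC entry is primary
            } else if (code.equals("DE") && description == null
                    && rest.startsWith("RecName: Full=")) {
                description = stripLast(rest.substring("RecName: Full=".length()), ';');
            } else if (code.equals("OS") && organism == null) {
                organism = stripLast(rest, '.');
            } else if (code.equals("SQ")) {
                inSequence = true;
            }
        }
        return new String[] { accession, description, organism, sequence.toString() };
    }

    // Removes a single trailing character c from s, if present.
    private static String stripLast(String s, char c) {
        s = s.trim();
        return (!s.isEmpty() && s.charAt(s.length() - 1) == c)
                ? s.substring(0, s.length() - 1) : s;
    }
}
```

Collecting an entry's lines in an ArrayList<String> first, as assumed here, keeps the field extraction separate from the file reading.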