Hi All,

I am new to this forum and to Java, and this is my first thread. My requirement is as follows:

I have some input text like this:

[NP The/DT U/NNP ]
 
 P/.
 
 [NP Workers/NNPS April/NNP skip/NN ] [PP to/TO ] [NP main/JJ skip/NN ] [PP to/TO ] [NP sidebar/NN ] [NP The/DT U/NNP ]
 
 P/.
 
 Workers/NNPS [NP This/DT site/NN ] [VP is/VBZ ] [ADJP open/JJ ] [PP for/IN ] [NP posting/VBG and/CC comments/NNS ] [PP by/IN ] [NP all/DT rank/NN and/CC file/NN administrative/JJ employees/NNS ] [PP of/IN ] [NP the/DT University/NNP ] [PP of/IN ] [NP the/DT Philippines/NNPS ] and/CC [NP the/DT Philippine/NNP General/NNP Hospital/NNP The/NNP National/NNP University/NNP Hospital/NNP ] [ADVP especially/RB ] [NP the/DT officers/NNS and/CC members/NNS ] [PP of/IN ] [NP the/DT All/NNP U/NNP ]
 
 P/.
 
 [NP Workers/NNPS Union/NNP ]
 
 [NP Friday/NNP April/NNP Stop/NNP Paying/NNP Nuke/NNP Plant/NNP Debt/NNP SC/NNP Justice/NNP Urges/NNPS Gov't/NNP ] [VP Posted/VBD pm/VBN ] [NP Mla/NNP time/NN April/NNP By/NNP Vincent/NNP Cabreza/NNP Inquirer/NNP News/NNP Service/NNP Published/NNP ] [PP on/IN ] [NP page/NN A/NNP ] [PP of/IN ] [NP the/DT Apr/NNP ]
 
But/CC [NP Puno/NNP ] [VP points/VBZ ] [PRT out/RP ] [SBAR that/IN ] [NP the/DT US/NNP law/NN ] [VP bars/VBZ ] [NP the/DT towns/NNS ] [PP from/IN ] [VP issuing/VBG ] [NP new/JJ taxes/NNS ] [VP to/TO pay/VB ] [PP for/IN ] [NP their/PRP$ debts/NNS ] unsafe/JJ
 
www/WRB
 
 
 
-----etc-----------------

I need to convert the text into the following format (this is the desired output):
The	DT	B-NP
U	NNP	I-NP
 
P
 
Workers	NNPS	B-NP
April	NNP	I-NP
skip	NN	I-NP
to	TO	B-PP
main	JJ	B-NP
skip	NN	I-NP
to	TO	B-PP
sidebar	NN	B-NP
The	DT	B-NP
U	NNP	I-NP
 
P
Workers  NNPS
.........
etc
.......

I have written code to do this transformation, but its output does not match the format above, so the requirement is not met.

I am using a regex to solve the problem:

Pattern p = Pattern
            .compile("\\[(\\p{Alpha}+) +(\\p{Graph}+)/(\\p{Alpha}+)(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))? ]+(?:(\\./. |\\./.$))?(?: +(\\./. |\\./.$))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?:(\\p{Alnum}+)/(\\p{Alpha}+))?",Pattern.MULTILINE);

I print the output like this:
while (matcher.find()) {
    //System.out.println();
    System.out.println("For: " + matcher.group());
    // The first word of the chunk gets the B- prefix.
    System.out.println(matcher.group(2) + "\t" + matcher.group(3)
            + "\tB-" + matcher.group(1));

    // The second word, if present, gets the I- prefix.
    if (matcher.group(4) != null) {
        System.out.println(matcher.group(4) + "\t" + matcher.group(5)
                + "\tI-" + matcher.group(1));

    }
-------etc---------------------------------------
The regex is long because I built it to capture every kind of word sequence that can appear inside the brackets []. However, it produces no output for a token outside the brackets such as "But/CC". For a trailing token such as "unsafe/JJ", on the other hand, it does produce output.
So my current (wrong) output looks like this, with no blank lines between sentences:

The	DT	B-NP
U	NNP	I-NP
Workers	NNPS	B-NP
April	NNP	I-NP
skip	NN	I-NP
to	TO	B-PP
main	JJ	B-NP
skip	NN	I-NP
to	TO	B-PP
sidebar	NN	B-NP
The	DT	B-NP
U	NNP	I-NP
This	DT	B-NP
site	NN	I-NP
is	VBZ	B-VP
 
-------


As you can see, some words have been omitted entirely.
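On the length of the pattern: it has a fixed number of optional word/POS groups, so a chunk with more words than that (the long NP ending in "University/NNP Hospital/NNP", for example) appears unable to match at all. A minimal sketch of an alternative, assuming it is acceptable to match the chunk with one pattern and then loop over its words in plain Java. The class name ChunkWords and the method rowsForChunks are made up for illustration, and this sketch deliberately handles only bracketed chunks:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from the original code.
class ChunkWords {
    // Outer pattern: "[TAG body ]", where body is everything up to the closing bracket.
    private static final Pattern CHUNK =
            Pattern.compile("\\[(\\p{Alpha}+)\\s+([^\\]]*)\\]");

    /** Returns one "word<TAB>POS<TAB>B-/I-TAG" row per word of every bracketed chunk. */
    static List<String> rowsForChunks(String text) {
        List<String> rows = new ArrayList<>();
        Matcher chunk = CHUNK.matcher(text);
        while (chunk.find()) {
            String tag = chunk.group(1);
            String prefix = "B-"; // first word of the chunk
            for (String pair : chunk.group(2).trim().split("\\s+")) {
                int slash = pair.lastIndexOf('/');
                if (slash <= 0) {
                    continue; // skip anything that is not word/POS
                }
                rows.add(pair.substring(0, slash) + "\t"
                        + pair.substring(slash + 1) + "\t" + prefix + tag);
                prefix = "I-"; // every later word in the same chunk
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String line = "[NP the/DT Philippine/NNP General/NNP Hospital/NNP The/NNP "
                + "National/NNP University/NNP Hospital/NNP ] [ADVP especially/RB ]";
        for (String row : rowsForChunks(line)) {
            System.out.println(row);
        }
    }
}
```

Because the inner loop splits on whitespace, the number of words per chunk no longer has to be known in advance, and tags such as PRP$ or words such as Gov't need no special character class.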

So I have two questions:

1. How can I capture a token such as "But/CC" that is not inside brackets?
2. In the input text, every sentence or pattern is followed by a blank line. I need the output to have a line break in the same places, including after P/., as in the input.
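Both points can be sketched with one small tokenizing pattern applied line by line, instead of one large pattern per chunk. This is only a sketch under some assumptions: the class name ChunkToColumns and its methods are made up for illustration, a token outside brackets is printed with two columns (as in "Workers  NNPS" in the desired output), and P/. therefore comes out as "P" plus a tab and ".", which may need adjusting. Each blank input line becomes one blank output line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from the original code.
class ChunkToColumns {
    // A token is "[TAG" (chunk start), "]" (chunk end), or "word/POS".
    private static final Pattern TOKEN =
            Pattern.compile("\\[(\\p{Alpha}+)|(\\])|(\\S+)/(\\S+)");

    /** Converts one non-blank input line into output rows. */
    static List<String> convertLine(String line) {
        List<String> rows = new ArrayList<>();
        String chunk = null;   // current chunk tag, null when outside brackets
        boolean first = false; // true until the first word of a chunk is emitted
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {        // "[NP", "[VP", ...
                chunk = m.group(1);
                first = true;
            } else if (m.group(2) != null) { // "]"
                chunk = null;
            } else {                         // "word/POS", inside or outside a chunk
                String row = m.group(3) + "\t" + m.group(4);
                if (chunk != null) {
                    row += "\t" + (first ? "B-" : "I-") + chunk;
                    first = false;
                }
                rows.add(row);
            }
        }
        return rows;
    }

    /** Converts all lines; a blank input line yields an empty output line. */
    static List<String> convert(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                out.add("");
            } else {
                out.addAll(convertLine(line));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[NP The/DT U/NNP ]",
                " ",
                "But/CC [NP Puno/NNP ] [VP to/TO pay/VB ] unsafe/JJ");
        for (String row : convert(sample)) {
            System.out.println(row);
        }
    }
}
```

The chunk tag is kept in a variable rather than in the pattern, so "But/CC" and "unsafe/JJ" are matched by the same word/POS alternative as the words inside brackets; they simply get no B-/I- column because no chunk is open at that point.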

Please refer to the desired output shown earlier in this thread. I need a regex-based solution for this; please help me modify my pattern or write a new one.

Thanks!