Java2Go
Overview
Java2Go is my master’s thesis project: a transpiler that converts Java 8 code into equivalent Go code, built in Kotlin, and enhanced with a T5-based deep neural network for automatic documentation generation.
The goal was to simplify migration for Java developers exploring Go, given the syntactic and semantic differences between the two languages, and to provide a supportive tool that eases the transition.
Why?
While Java remains more popular than Go (Stack Overflow 2022 survey), Go’s performance and simplicity make it an attractive alternative, especially for server-side applications. However, the learning curve can be steep.
Java2Go addresses this by automating translation of:
- mathematical expressions
- conditionals
- loops (for, while, do-while)
- methods and classes
- static vs instance contexts
…making it easier to grasp Go via a side-by-side equivalent translation.
How It Works
The system uses ANTLR4 to parse Java source code into an abstract syntax tree (AST), which is then traversed and restructured into Go code. Key components include:
- Lexer & Parser: Based on the Java 8 grammar definition.
- AST Traversal: Captures instructions, blocks, and expressions.
- Code Generator: Converts the AST into idiomatic Go syntax.
- Web UI: Built with Angular for an interactive experience.
- Infrastructure: Kubernetes + GitHub Actions + ArgoCD for CI/CD and scalability.
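To make the traversal and generation steps more concrete, here is a minimal sketch written in Go (the actual tool is written in Kotlin on top of ANTLR4). The node types `ForLoop` and `PrintStmt` and the `Emit` method are hypothetical names invented for this illustration; they are not Java2Go's real classes, and a real generator would build these nodes from the parser output rather than by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical AST node that can render itself as Go source.
type Node interface {
	Emit(indent int) string
}

// PrintStmt models a Java System.out.println(arg) call.
type PrintStmt struct{ Arg string }

// ForLoop models a Java counting loop: for (int v = start; cond; post) { body }.
type ForLoop struct {
	Var, Start, Cond, Post string
	Body                   []Node
}

// pad returns n tab characters for indentation.
func pad(n int) string { return strings.Repeat("\t", n) }

// Emit maps println to fmt.Println.
func (p PrintStmt) Emit(indent int) string {
	return fmt.Sprintf("%sfmt.Println(%s)\n", pad(indent), p.Arg)
}

// Emit writes the loop header, then each child one level deeper.
func (f ForLoop) Emit(indent int) string {
	var sb strings.Builder
	fmt.Fprintf(&sb, "%sfor %s := %s; %s; %s {\n", pad(indent), f.Var, f.Start, f.Cond, f.Post)
	for _, child := range f.Body {
		sb.WriteString(child.Emit(indent + 1))
	}
	sb.WriteString(pad(indent) + "}\n")
	return sb.String()
}

func main() {
	loop := ForLoop{
		Var: "i", Start: "0", Cond: "i < 10", Post: "i++",
		Body: []Node{PrintStmt{Arg: "i"}},
	}
	fmt.Print(loop.Emit(0))
}
```

Each node type owns its own rendering rule, so supporting a new Java construct in this sketch means adding one more type with an `Emit` method.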
Key Features
✅ Transpilation Examples
A few examples of converted code:
Simple loop:
// Java
for (int i = 0; i < 10; i++) {
    System.out.println(i);
}
// Go
for i := 0; i < 10; i++ {
    fmt.Println(i)
}
Simple static method call:
// Java
class A {
    static void print() {
        System.out.println("1111");
    }
}
class Main {
    public static void main(String[] args) {
        A.print();
    }
    static void print() {
        System.out.println("0000");
    }
}
// Go
package main
import "fmt"
func A_print() {
    fmt.Println("1111")
}
func main() {
    A_print()
}
func Main_print() {
    fmt.Println("0000")
}
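The naming pattern visible in this example (class name, underscore, method name, with the entry point left as `main`) can be sketched as a small helper. The function `goFuncName` is a hypothetical name for this illustration, and the special case for `Main.main` is an assumption read off the example output, not a documented rule of the tool.

```go
package main

import "fmt"

// goFuncName maps a Java static method to a Go function name.
// Assumption: the program entry point Main.main stays "main"; every other
// static method becomes Class_method so same-named methods do not collide.
func goFuncName(class, method string) string {
	if class == "Main" && method == "main" {
		return "main"
	}
	return class + "_" + method
}

func main() {
	fmt.Println(goFuncName("A", "print"))    // A_print
	fmt.Println(goFuncName("Main", "print")) // Main_print
	fmt.Println(goFuncName("Main", "main"))  // main
}
```

Prefixing with the class name is what lets `A.print()` and `Main.print()` coexist as two top-level functions in a single Go package.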
Nested do-while loops:
// Java
class Main {
    public static void main(String[] args) {
        int outerCounter = 1;
        do {
            int innerCounter = 1;
            do {
                int res = outerCounter * innerCounter;
                System.out.println(res);
                innerCounter++;
            } while (innerCounter <= 3);
            outerCounter++;
        } while (outerCounter <= 2);
    }
}
// Go
package main
import "fmt"
func main() {
    outerCounter := 1
Loop0:
    for {
        innerCounter := 1
    Loop1:
        for {
            res := outerCounter * innerCounter
            fmt.Println(res)
            innerCounter++
            if innerCounter <= 3 {
                continue Loop1
            }
            break
        }
        outerCounter++
        if outerCounter <= 2 {
            continue Loop0
        }
        break
    }
}
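Go has no do-while statement, so the translation above wraps the body in an infinite `for`, re-enters it with a labeled `continue` while the condition holds, and otherwise falls through to `break`. A minimal sketch of a generator for that shape follows. The function `emitDoWhile` and its string-based parameters are hypothetical simplifications; a real generator would work on AST nodes and handle nested indentation.

```go
package main

import (
	"fmt"
	"strings"
)

// emitDoWhile renders a Java "do { body } while (cond);" as a labeled Go
// for-loop with the same shape as the translated example above.
// label, body, and cond are plain strings here to keep the sketch short.
func emitDoWhile(label string, body []string, cond string) string {
	var sb strings.Builder
	sb.WriteString(label + ":\n")
	sb.WriteString("for {\n")
	for _, line := range body {
		sb.WriteString("\t" + line + "\n")
	}
	// The condition is tested after the body, so the body runs at least once.
	sb.WriteString("\tif " + cond + " {\n")
	sb.WriteString("\t\tcontinue " + label + "\n")
	sb.WriteString("\t}\n")
	sb.WriteString("\tbreak\n")
	sb.WriteString("}\n")
	return sb.String()
}

func main() {
	fmt.Print(emitDoWhile("Loop0", []string{"fmt.Println(n)", "n++"}, "n <= 3"))
}
```

Giving every loop its own label keeps the `continue` unambiguous when do-while loops are nested, as in the `Loop0` / `Loop1` example.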
LLM-Generated Documentation
The second major feature is automatic documentation generation using a fine-tuned T5 Transformer model from Google. It generates an English comment from a method header. For example, the Java header
int max(int a, int b);
becomes
// Returns the maximum of two integers.
func max(a int, b int) int {
    ...
}
Training details:
- A dataset of 3,798 method-comment pairs was created (partially using GPT assistance).
- It was expanded with synthetic variants to ~6,800 samples.
- Three T5 variants (Tiny, Mini, Small) were fine-tuned.
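- The Mini and Small variants reached ROUGE-1 F1 scores of about 0.80; Tiny scored slightly lower, at roughly 0.77 to 0.78 (see the evaluation table below).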
Synthetic Example Generation
To improve the coverage and generalization of the T5-based documentation model, the original hand-crafted dataset (3,798 examples) was automatically expanded with synthetic examples. Here, a “synthetic example” is created by taking an existing Java method signature and systematically replacing its parameter names with random identifiers. The natural-language comment (the label) is left unchanged, and a variant is kept only if the new identifiers do not already appear in that comment. The result is a larger set of (method signature, comment) pairs that teaches the model to focus on types and method semantics rather than memorizing specific variable names.
Why Synthetic Examples?
Diversity: The manually created dataset tends to reuse common parameter names like a, b, n, etc. Randomizing these names forces the model to learn that:
- The meaning of a method (e.g., “returns the maximum of two integers”) depends on its types and structure, not on specific variable names.
- Comments should generalize across any valid identifier.
Scalability: Manually writing thousands of distinct (signature, comment) pairs is time‐consuming. Synthetic generation automates that expansion with minimal human intervention.
Noise Reduction: By only accepting a synthetic variant if the new identifier does not accidentally appear in the original comment, we avoid “label leakage” (where a variable name appears verbatim in its own description).
How It Worked
Input Dataset
We started with a curated table of Java method signatures and corresponding English comments. For example:
Method Declaration | Comment |
---|---|
int max(int a, int b) | Returns the maximum of two integers. |
boolean isEven(int n) | Checks if an integer is even. |
int countDigits(int n) | Counts the number of digits in an integer. |
int reverseInteger(int x) | Reverses the digits of an integer. |
Identifier Pool
A pool of random identifiers (e.g., xlqdf, moauo, eoqfc, etc.) was generated. Each identifier is guaranteed to:
- Be a valid Java variable name (letters only, starting with a lowercase letter).
- Not collide with Java keywords (e.g., int, for, while).
Replacement Algorithm
For every original example:
- Parse the method signature to extract each parameter’s type and name.
- For each parameter, pick a random replacement name from the pool.
- Form a new “synthetic” method signature by substituting the old names with the new random names.
- Check: If none of the newly chosen names appear anywhere in the original comment text, accept this synthetic variant; otherwise, discard it (to avoid leaking an identifier that could give away the comment’s content).
- Repeat until each original example has at least one or two valid synthetic variants (or up to a predetermined limit).
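The parse, substitute, and check steps above can be sketched as a single function. This is an illustration, not the thesis code: `renameParams` and `sigRe` are hypothetical names, the regular expression only handles simple signatures of the form `type name(type a, type b)` (no generics, arrays, or annotations), and the comment check is a case-insensitive substring test, which is one reasonable reading of "appear anywhere in the original comment text".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sigRe splits a simple signature into "head(", the parameter list, and ")tail".
var sigRe = regexp.MustCompile(`^(.*\()(.*)(\).*)$`)

// renameParams replaces each parameter name with the matching entry of
// newNames. It returns the synthetic signature and true when the variant is
// accepted, or "" and false when it must be discarded.
func renameParams(signature, comment string, newNames []string) (string, bool) {
	m := sigRe.FindStringSubmatch(signature)
	if m == nil || strings.TrimSpace(m[2]) == "" {
		return "", false // not parseable, or no parameters to rename
	}
	params := strings.Split(m[2], ",")
	if len(newNames) < len(params) {
		return "", false
	}
	lowerComment := strings.ToLower(comment)
	for i, p := range params {
		fields := strings.Fields(p) // expected: [type, name]
		if len(fields) != 2 {
			return "", false
		}
		// Discard the variant if the new name occurs in the comment text.
		if strings.Contains(lowerComment, newNames[i]) {
			return "", false
		}
		params[i] = fields[0] + " " + newNames[i]
	}
	return m[1] + strings.Join(params, ", ") + m[3], true
}

func main() {
	sig, ok := renameParams("int max(int a, int b)",
		"Returns the maximum of two integers.", []string{"zampl", "eoqfc"})
	fmt.Println(sig, ok) // int max(int zampl, int eoqfc) true

	// "even" occurs in the comment, so this variant is rejected.
	_, ok = renameParams("boolean isEven(int n)",
		"Checks if an integer is even.", []string{"even"})
	fmt.Println(ok) // false
}
```

A caller would loop over the dataset, draw fresh names from the identifier pool on each rejection, and stop once the per-example limit is reached.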
Resulting Synthetic Examples
Below is a small excerpt showing how synthetic examples were derived:
Original Signature | Original Comment | Synthetic Signature |
---|---|---|
int max(int a, int b) | Returns the maximum of two integers. | int max(int zampl, int eoqfc) |
boolean isEven(int n) | Checks if an integer is even. | boolean isEven(int qljor) |
int countDigits(int n) | Counts the number of digits in an integer. | int countDigits(int moauo) |
int reverseInteger(int x) | Reverses the digits of an integer. | int reverseInteger(int xlqdf) |
Notice that, for example, zampl or eoqfc do not appear in “Returns the maximum of two integers,” so we can safely pair them.
Benefits Observed
Higher Generalization: During validation, the T5 models trained on the combined (original + synthetic) set achieved nearly the same ROUGE-1 F1 scores as the models trained on only the original examples, despite being exposed to more “noisy” identifier choices. This suggests that the model learned to ignore arbitrary variable names and focus on:
- Parameter types (e.g., int, String).
- Method structure (e.g., two-parameter comparison vs. single-parameter check).
- Common naming patterns (e.g., methods starting with is usually yield boolean checks).
Reduced Overfitting: With more diverse signatures, the model was less likely to overfit on specific combinations of parameter names seen in the small manual set.
Fine-tuning evaluation results
Model variant | Sample count | Epochs | Steps | Error | ROUGE-1 F1 | ROUGE-2 F1 |
---|---|---|---|---|---|---|
Tiny | 3798 | 30 | 51270 | 0.305094 | 0.783715 | 0.604254 |
Tiny | 6827 | 17 | 52224 | 0.261749 | 0.772460 | 0.630675 |
Mini | 3798 | 7.96 | 13600 | 0.283580 | 0.800067 | 0.627852 |
Mini | 6827 | 5.6 | 17200 | 0.220322 | 0.800122 | 0.666803 |
Small | 3798 | 3.51 | 6000 | 0.275755 | 0.800837 | 0.627314 |
Small | 6827 | 2.54 | 7800 | 0.209636 | 0.804963 | 0.672517 |
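For readers unfamiliar with the metric, ROUGE-1 F1 is the harmonic mean of unigram precision and recall between a generated comment and its reference. The sketch below shows the basic computation. It assumes lowercase whitespace tokenization with no stemming, which may differ from the ROUGE implementation used to produce the table, and `rouge1F1` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// unigramCounts lowercases text, splits it on whitespace, and counts tokens.
func unigramCounts(text string) (map[string]int, int) {
	counts := map[string]int{}
	tokens := strings.Fields(strings.ToLower(text))
	for _, t := range tokens {
		counts[t]++
	}
	return counts, len(tokens)
}

// rouge1F1 returns the F1 score of the clipped unigram overlap between a
// reference comment and a generated candidate.
func rouge1F1(reference, candidate string) float64 {
	ref, refLen := unigramCounts(reference)
	cand, candLen := unigramCounts(candidate)
	if refLen == 0 || candLen == 0 {
		return 0
	}
	overlap := 0
	for tok, c := range cand {
		// Clip each token's count to how often it occurs in the reference.
		if r := ref[tok]; r < c {
			overlap += r
		} else {
			overlap += c
		}
	}
	if overlap == 0 {
		return 0
	}
	precision := float64(overlap) / float64(candLen)
	recall := float64(overlap) / float64(refLen)
	return 2 * precision * recall / (precision + recall)
}

func main() {
	ref := "Returns the maximum of two integers."
	cand := "Returns the largest of two integers."
	fmt.Printf("%.3f\n", rouge1F1(ref, cand)) // 0.833
}
```

In this example five of the six tokens match, so precision and recall are both 5/6 and the F1 score is about 0.833. ROUGE-2 applies the same idea to pairs of adjacent tokens.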
Tech Stack
- Frontend: Angular
- Backend: Kotlin + ANTLR4
- AI: T5 Transformer (HuggingFace), trained with PyTorch
- Infrastructure: Kubernetes, GitHub Actions, ArgoCD, Docker, Prometheus