Transforming natural language descriptions into executable Python code using the powerful CodeT5 model. This project implements a robust, scalable solution for text-to-code generation using TensorFlow, Hugging Face Transformers, and the Mostly Basic Python Programming (MBPP) dataset.
This project focuses on fine-tuning CodeT5, a transformer-based model specialized for code generation tasks, to convert natural language prompts into Python code. It leverages modern deep-learning techniques such as mixed precision training, distributed strategies, and custom TensorFlow training loops.
- Advanced Code Generation: Converts natural language prompts into executable Python code.
- Custom Training Loops: Fine-tunes the CodeT5 model using a customized TensorFlow training pipeline within a Jupyter Notebook.
- Mixed Precision & XLA Optimization: Accelerates training and reduces memory usage using hardware-specific optimizations.
- Multi-Device Strategy: Supports single-GPU, multi-GPU, and TPU training with TensorFlow's `tf.distribute` API.
- Dynamic Inference: Performs inference on real-time inputs with flexible decoding methods such as top-p sampling (see the sketch after this list).
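As a minimal sketch of what top-p (nucleus) decoding looks like through Hugging Face's `generate` API (the prompt and parameter values here are illustrative, not taken from the notebook):

```python
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = TFT5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

input_ids = tokenizer("Generate Python: reverse a string", return_tensors="tf").input_ids
generated = model.generate(
    input_ids,
    max_length=256,
    do_sample=True,  # sample instead of greedy decoding
    top_p=0.95,      # nucleus sampling: keep the smallest token set covering 95% probability mass
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```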
- Based on T5's encoder-decoder architecture.
- Pre-trained on CodeSearchNet across multiple programming languages.
- Fine-tuned for Python code generation tasks.
- Contains roughly 1,000 crowd-sourced Python problems for evaluating code generation models.
- Includes natural language descriptions, Python code solutions, and automated test cases.
- This project uses the hand-verified subset of 426 problems for higher data quality.
- Ensure proper environment setup with TensorFlow and Hugging Face libraries.
- Supports mixed precision and XLA optimization for faster execution.
```bash
pip install tensorflow transformers datasets
```
- Download and preprocess the MBPP dataset using Hugging Face's `datasets` library (a sketch follows this list).
- Prepare input-output pairs for model training within the notebook.
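A minimal sketch of both steps, assuming the `sanitized` MBPP configuration on the Hugging Face Hub (the `prompt` and `code` field names follow that configuration):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
mbpp = load_dataset("mbpp", "sanitized")  # hand-verified subset

def to_features(example):
    # Pair the task-prefixed description with the reference solution as labels.
    inputs = tokenizer("Generate Python: " + example["prompt"],
                       max_length=256, truncation=True, padding="max_length")
    labels = tokenizer(example["code"],
                       max_length=256, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

train_features = mbpp["train"].map(to_features, remove_columns=mbpp["train"].column_names)
```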
- Implements a custom training loop with TensorFlow's `GradientTape` for precise control (sketched below).
- Supports learning-rate warm-up and the AdamW optimizer.
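A minimal sketch of one such training step, assuming `model` is the CodeT5 instance loaded below and `batch` is a dict of tokenized tensors (a constant learning rate stands in for the warm-up schedule; `tf.keras.optimizers.AdamW` requires TensorFlow 2.11+):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.AdamW(learning_rate=5e-5, weight_decay=0.01)

@tf.function
def train_step(batch):
    # Record the forward pass so gradients can be computed explicitly.
    with tf.GradientTape() as tape:
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"],
                        training=True)
        loss = outputs.loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```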
```python
from transformers import TFT5ForConditionalGeneration

model = TFT5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base')
```
- Mixed Precision: Uses `float16` for a performance boost.
- XLA (Accelerated Linear Algebra): Enables faster graph execution.
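Both optimizations are one-liners in TensorFlow; a minimal sketch:

```python
import tensorflow as tf

# Compute in float16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Turn on XLA JIT compilation; it can also be enabled per-function
# with @tf.function(jit_compile=True).
tf.config.optimizer.set_jit(True)
```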
- Implemented entirely within the Jupyter Notebook using TensorFlow's `tf.function`.
- Distributed across multiple GPUs using `tf.distribute.MirroredStrategy()` (see the sketch below).
- Includes detailed logging and checkpoint management.
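Creating the strategy and sharding the input pipeline takes only a few lines; a minimal sketch (`train_dataset` is an illustrative name for the batched `tf.data.Dataset`, not from the notebook):

```python
import tensorflow as tf

# One replica per visible GPU; falls back to a single device otherwise.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Shard the batched input pipeline across replicas.
dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
```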
```python
with strategy.scope():
    model = create_model()
    optimizer = create_optimizer()
```
- Supports both dataset-based and custom-input inference directly from the notebook.
```python
def predict_from_text(text):
    # Prefix the prompt with the same task tag used during fine-tuning.
    query = "Generate Python: " + text
    input_ids = tokenizer(query, return_tensors="tf").input_ids
    generated_code = model.generate(input_ids, max_length=256)
    print(tokenizer.decode(generated_code[0], skip_special_tokens=True))
```
- Training Efficiency: Leveraged mixed precision and distributed training to improve speed and reduce memory overhead.
- Evaluation: Achieved low validation loss and high generation accuracy.
- Clone the repository and set up the environment:
```bash
git clone https://github.com/Shreyash-Gaur/TensorFlow_Python_Code_Generation.git
cd TensorFlow_Python_Code_Generation
pip install -r requirements.txt
```
- Open the Jupyter Notebook and execute the cells step by step:
```bash
jupyter notebook Python_Code_Generation.ipynb
```
- Perform inference using the provided methods:
```python
predict_from_text("Write a function to concatenate two dictionaries.")
```
- Custom TensorFlow Loops: Provides granular control over training and debugging.
- Task Prefixing: Improves generalization by prefixing inputs with a task-specific prompt (e.g., "Generate Python: ").
- Scalability: Compatible with multi-GPU and TPU environments (see the sketch below).
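For TPUs, `MirroredStrategy` would be swapped for `TPUStrategy`; a minimal sketch for a TPU runtime such as Colab (not taken from the notebook):

```python
import tensorflow as tf

# Connect to the TPU cluster and initialize it before building the model.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
```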
- Enhance generation quality with larger CodeT5 models.
- Explore alternative code benchmarks (e.g., HumanEval, APPS).
- Integrate with real-world AI coding assistants.
Contributions are welcome! Feel free to fork the repository and submit pull requests.
This project is licensed under the MIT License.
🌟 If you found this project useful, give it a star!