GitHub released the GitHub Multilingual Repositories Dataset under CC0-1.0, covering over 80 million classification rows across more than 40 million public repositories. The dataset does not contain repository content. It contains metadata: language classifications for each repository's README, its most-commented issue, and its most-commented pull request. Each text is sampled at 150 characters and run through three separate classifiers (fastText, gcld3, and lingua-py), and each reported classification carries a confidence score above 0.5. Portuguese leads non-English READMEs with over 3 million repositories. Korean is the most common non-English language in issues but ranks fifth in READMEs. These gaps between the language of documentation and the language of discussion are the point.
The decision not to collapse three classifiers into a single label is the most interesting technical choice here. Each classifier has different coverage and confidence calibration, particularly for lower-resource languages. Researchers can require all three to agree for high-precision subsets or accept a label from any single classifier for broader recall. That flexibility matters when the goal is building evaluation sets for AI coding tools that need to perform across languages, not just English.
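The precision/recall trade-off described above can be sketched as a simple vote count over the three classifier outputs. This is a minimal illustration, not the dataset's actual schema or GitHub's method: the field names (`repo`, `source`, `fasttext`, `gcld3`, `lingua`), the sample rows, and the helpers `consensus_label` and `select` are all hypothetical, and a missing label is assumed to mean the classifier produced nothing above the confidence threshold.

```python
from collections import Counter

# Hypothetical rows. The field names and values are assumptions for
# illustration only; they do not reflect the dataset's real schema.
ROWS = [
    {"repo": "a/one", "source": "readme", "fasttext": "pt", "gcld3": "pt", "lingua": "pt"},
    {"repo": "b/two", "source": "issue", "fasttext": "ko", "gcld3": "ko", "lingua": None},
    {"repo": "c/three", "source": "readme", "fasttext": "es", "gcld3": "pt", "lingua": "es"},
    {"repo": "d/four", "source": "pull_request", "fasttext": None, "gcld3": None, "lingua": "ja"},
]

CLASSIFIERS = ("fasttext", "gcld3", "lingua")


def consensus_label(row, min_votes):
    """Return the most-voted language if at least `min_votes` classifiers chose it."""
    # None is treated as "this classifier gave no usable label".
    votes = Counter(row[name] for name in CLASSIFIERS if row[name] is not None)
    if not votes:
        return None
    # On a tie, most_common keeps the first label encountered.
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


def select(rows, min_votes):
    """Map repo -> agreed language for rows that meet the vote threshold."""
    selected = {}
    for row in rows:
        label = consensus_label(row, min_votes)
        if label is not None:
            selected[row["repo"]] = label
    return selected


high_precision = select(ROWS, min_votes=3)  # all three classifiers agree
majority = select(ROWS, min_votes=2)        # at least two agree
broad_recall = select(ROWS, min_votes=1)    # any single classifier suffices
```

Raising `min_votes` shrinks the subset but makes each label more trustworthy; lowering it keeps rows where only one classifier was willing to commit, which is where lower-resource languages are more likely to appear.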
The release ties directly to Microsoft's European Digital Commitments, announced July 20, 2025. GitHub will present the dataset at the Open Innovation Dialogue Hub in Strasbourg on June 16. The dataset is a discovery tool, not a ground-truth benchmark, and the documentation is explicit about its limits: short samples, mixed-language content, and classifier variance mean it should inform research workflows, not replace them. Read the full release for the breakdown of caveats and the specific use cases GitHub is targeting.