Skip to content

Conversation

@mbg
Copy link
Member

@mbg mbg commented Nov 19, 2025

This PR adds experimental support for excluding files that are marked as linguist-generated=true in a .gitattributes file from analysis.

Risk assessment

For internal use only. Please select the risk level of this change:

  • Low risk: Changes are fully under feature flags, or have been fully tested and validated in pre-production environments and are highly observable, or are documentation or test only.

Which use cases does this change impact?

Workflow types:

  • Managed - Impacts users with dynamic workflows (Default Setup, CCR, ...).

Products:

  • CCR - The changes impact analyses for Copilot Code Reviews.

Environments:

  • Dotcom - Impacts CodeQL workflows on github.com.

How did/will you validate this change?

  • Test repository - This change will be tested on a test repository before merging.
  • Unit tests - I am depending on unit test coverage (i.e. tests in .test.ts files).
  • End-to-end tests - I am depending on PR checks (i.e. tests in pr-checks).

If something goes wrong after this change is released, what are the mitigation and rollback strategies?

  • Feature flags - All new or changed code paths can be fully disabled with corresponding feature flags.

How will you know if something goes wrong after this change is released?

  • Telemetry - I rely on existing telemetry or have made changes to the telemetry.
    • Dashboards - I will watch relevant dashboards for issues after the release. Consider whether this requires this change to be released at a particular time rather than as part of a regular release.
    • Alerts - New or existing monitors will trip if something goes wrong with this change.

Are there any special considerations for merging or releasing this change?

  • No special considerations - This change can be merged at any time.

Merge / deployment checklist

  • Confirm this change is backwards compatible with existing workflows.
  • Consider adding a changelog entry for this change.
  • Confirm the readme and docs have been updated if necessary.

@github-actions github-actions bot added the size/M Should be of average difficulty to review label Nov 19, 2025
@mbg mbg force-pushed the mbg/ignore-generated branch from 2fac308 to b4db382 Compare November 19, 2025 19:42
@mbg mbg marked this pull request as ready for review November 19, 2025 19:44
@mbg mbg requested a review from a team as a code owner November 19, 2025 19:44
Copilot AI review requested due to automatic review settings November 19, 2025 19:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds experimental support for excluding generated files (marked with linguist-generated=true in .gitattributes) from CodeQL analysis. The feature is controlled by the ignore_generated_files feature flag and is automatically enabled for Copilot Code Reviews (CCR).

Key Changes

  • New git utility functions (listFiles, getGeneratedFiles) to identify generated files via git attributes
  • New isCCR() helper function to detect CCR execution context
  • Integration into config initialization to add generated files to paths-ignore

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
src/git-utils.ts Adds functions to list tracked files and identify files marked as generated via linguist-generated attribute
src/git-utils.test.ts Adds unit tests for the new git utility functions
src/feature-flags.ts Defines the IgnoreGeneratedFiles feature flag with environment variable CODEQL_ACTION_IGNORE_GENERATED_FILES
src/config-utils.ts Integrates generated file detection into config initialization, adding them to paths-ignore when feature is enabled or in CCR
src/actions-util.ts Adds isCCR() detection function and modifies isDefaultSetup() to exclude CCR scenarios
src/actions-util.test.ts Adds tests for the new workflow detection functions
lib/*.js Auto-generated JavaScript code mirroring TypeScript changes (not reviewed per guidelines)

["ls-files"],
"Unable to list tracked files.",
);
return stdout.split(os.EOL);
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting on os.EOL will include an empty string at the end of the array if the git output ends with a newline (which is typical). This empty string will then be passed to git check-attr in getGeneratedFiles. Consider filtering out empty strings: return stdout.split(os.EOL).filter((line) => line.length > 0);

Suggested change
return stdout.split(os.EOL);
return stdout.split(os.EOL).filter((line) => line.length > 0);

Copilot uses AI. Check for mistakes.

const generatedFiles: string[] = [];
const regex = /^([^:]+): linguist-generated: true$/;
for (const result of stdout.split(os.EOL)) {
Copy link

Copilot AI Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting on os.EOL will include an empty string at the end of the array if the git output ends with a newline (which is typical). The regex won't match empty strings, but it's cleaner to filter them out explicitly: for (const result of stdout.split(os.EOL).filter((line) => line.length > 0)). This pattern is used elsewhere in the codebase (e.g., line 284 checks if (line) before processing).

Suggested change
for (const result of stdout.split(os.EOL)) {
for (const result of stdout.split(os.EOL).filter((line) => line.length > 0)) {

Copilot uses AI. Check for mistakes.
}

/** Determines whether we are running in CCR. */
export function isCCR(): boolean {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about introducing an environment variable we set in CCR, rather than relying on the analysis key?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed elsewhere, that is sort of what we are doing here already (and we decide on whether are in CCR in other places in the same way currently).

A better solution would be to add a new analysis kind (or similar) for CCR, but that would be more work and currently the same as code-quality in essentially everything but name.

I'd suggest we should stick with this for the moment (since we also use the same approach elsewhere) and look at improving it longer-term.

export async function getGeneratedFiles(
workingDirectory: string,
): Promise<string[]> {
const files = await listFiles(workingDirectory);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially be a very large number of files, too many to pass on the command line.

If we're mainly interested in CCR, could we filter down to just the diff here?

Alternatively, we could parse globs from the .gitattributes file rather than finding all files that match. That would be more likely to contain one entry for a large directory rather than potentially hundreds.

Or we could add a limit on the number of files on which we'll run check-attr.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could potentially be a very large number of files, too many to pass on the command line.

True. The files could be passed in via stdin as well I think, which might be a more robust solution.

If we're mainly interested in CCR, could we filter down to just the diff here?

Potentially. We should first look at whether there is actually enough of a performance hit for large repos for this to make sense though.

Alternatively, we could parse globs from the .gitattributes file rather than finding all files that match. That would be more likely to contain one entry for a large directory rather than potentially hundreds.

I considered this, but I'd like to avoid it if we can get the information from git itself so we don't have to try and duplicate what it does.

Comment on lines +423 to +425
export async function getGeneratedFiles(
workingDirectory: string,
): Promise<string[]> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you measured how long this operation takes overall on a large mono-repo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it should be relatively quick, since Git caches this information IIRC. That said, I haven't explicitly tested it on a large repo, but I can do that.

@mbg mbg requested a review from henrymercer January 19, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Should be of average difficulty to review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants