My pull request on GIT contains several commits done by others. How to prevent this behavior?

Question

I created a branch (branch B) from branch A and did my changes and pushed it to branch B. Then I tried to make a pull request to branch A. But, my pull request contains several previous commits done by others (other team members). Why is this happening? How can I raise a clean PR which contains only my commits?

Edit: The hosting site I am using is Bitbucket.

torek · Accepted Answer

Preliminary note: I don't actually use Bitbucket and some of my terminology may be slightly off. Specifically, I may be using some terms found only on GitHub here. But Bitbucket and GitHub handle forks and pull requests (PRs) very similarly, fortunately, and it seems that about 6 or 7 years ago Bitbucket updated their PR machinery to match GitHub's (see How to update a pull request on bitbucket?).

I'm also not sure if you're using a Bitbucket fork, or just working directly in a shared repository with separate branches. I'm going to assume a fork but this should probably work the same in a shared repository (it does on GitHub).

I created a branch (branch B) from branch A and did my changes and pushed it to branch B. Then I tried to make a pull request to branch A. But, my pull request contains several previous commits done by others (other team members). Why is this happening?

At the Git level, what we're really concerned with here are commits. These are the entities that Git deals with, one whole commit at a time. There are three or four things to know about here with each commit, depending on how you count these items:

1. Every commit has a unique hash ID. This is the big ugly (40 characters, for SHA-1) hexadecimal number you see in git log output. This hash ID is, in effect, the "true name" of the commit. Any branch or tag or other names we use are just ways to let some Git repository find the correct hash ID. Each Git repository (each fork or clone) has its own private names. The fork mechanism on the hosting site provides a way to connect one hosted repository to another and view some or all names out of the other hosted repository, but it's always the hash IDs that really matter.
2. Commits are made up of snapshots and metadata. Technically, the snapshot is reached through the metadata in the commit itself (the commit records the hash ID of a saved tree), but let's consider the snapshot first. Each snapshot holds a full copy of every file that goes with that particular commit. There are no changes here, just full snapshots. This isn't too important for any given PR, but is important when we go to make copies of commits.
3. The metadata in any given commit give things like the author and committer (name, email, and date-and-time tuples) and the log message. That is, the metadata contains the description of the commit. The metadata in any one commit also has a list of parent commit hash IDs. Most commits (the ones Git calls ordinary commits, if it bothers to give them any adjective at all) have exactly one parent hash ID. This is the (single) earlier commit that comes just before the commit we're looking at now.
4. All parts of any commit are fully read-only. This is how the hash IDs can work: the hash ID is simply a cryptographic checksum of the full contents of the commit. The date-and-time-stamp usually makes a commit unique all by itself, but even if you manage to make multiple commits within a single second, or fake the timestamps, the parent commit hash ID(s) stored in the metadata of this commit differ from the parent commit hash ID(s) stored in the parent commit, and that alone serves to give this commit a unique hash ID.

(Item 4 could be part of item 1 depending on how you want to treat it. Note that Git finds the commit by its hash ID, then verifies that the hash ID matches the checksum of the stored data as it extracts the commit object from the database. This detects even a single bit of change as a corrupted commit, so that no part of any commit can ever change.)
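To make the points above concrete, here is a small sketch that builds a throwaway repository, prints one raw commit, and recomputes its hash ID from its contents. The temporary directory, file name, and author identity are made up for the illustration.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # use the branch name from the diagrams
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git commit -q -am "second"

# Raw commit: a tree line (the snapshot), a parent line, author, committer, message.
git cat-file -p HEAD

# The hash ID is a checksum over exactly that content (plus a small header),
# so re-hashing the raw commit reproduces the same ID.
computed=$(git cat-file commit HEAD | git hash-object -t commit --stdin)
actual=$(git rev-parse HEAD)
echo "computed: $computed"
echo "actual:   $actual"
```

The two printed IDs should be identical. Changing anything in the commit, including its parent line, would give a different ID.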

The parent hash IDs wind up stringing commits together as backwards-looking chains. We can draw this fairly simply, assuming ordinary commits, by using uppercase letters to stand in for the actual hash IDs. If we put the latest commit, H, on the right, we get a drawing that looks like this:

... <-F <-G <-H

Commit H contains, as its snapshot, all the files. Its metadata contains, as its (single) parent hash ID, the actual hash ID of earlier commit G. Git can thus use H's metadata to find commit G. Commit G has a full snapshot as well, and by comparing the G and H snapshots, Git will find what we changed in commit H. By using the metadata in G, Git can work its way back one more step, to commit F.

Commit F has a snapshot too, of course, and has its own metadata with yet another parent hash ID. By following each commit backwards, one hop at a time, Git can work its way back to the very first commit anyone ever made for this repository. That commit is special in that it has no parent hash IDs in it, which makes it what Git calls a root commit. Git can now stop going backwards, having visited every commit leading to commit H. This is the history in the repository, at least for the branch whose last commit is commit H.
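The backwards walk can be watched directly. This is a minimal sketch in a throwaway repository with three commits named F, G, and H after the drawing. The `commit` helper and the identity are hypothetical.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"
# Hypothetical helper: one file and one commit per letter.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
commit G
commit H

# Each line: this commit's hash, its parent's hash (empty for the root), its subject.
git log --format='%h parent=%p %s'

# The root commit is the one with no parents; the walk stops there.
root=$(git rev-list --max-parents=0 HEAD)
git log -1 --format='root commit: %h %s' "$root"
```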

Git needs the hash ID of commit H to do all of this. To get that hash ID, Git can use your branch name. For instance, if H is the last commit of branch master, we might draw it like this:

...--G--H   <-- master

The name master points to (locates) commit H in the database; commit H points back to G; and so on. I stop drawing arrows at this point because of the way I draw branches:

...--G--H   <-- master
         \
          I--J   <-- develop

Here, the name develop points to J; J points to I; and I points to H. Commits up through and including J are on develop, and commits up through and including H are on master. This means many commits are on both branches. That's a weird thing about Git, if you're used to other version control systems: commits appear and disappear from branches depending on where you put the branch name.

It's not the branch names that matter! It's the commits. The branch names are not completely irrelevant, since we use them to find the commits, but they're mostly irrelevant because we can change them around all we like. For instance, GitHub now use main instead of master, and if you are a fan of this, you simply rename your master to main and now all the master commits are main commits instead. In this particular example, we can even just remove master entirely, leaving only develop. That's good enough because commit I leads backwards to commit H.
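Here is a rough sketch of the master/develop drawing, again in a throwaway repository with a hypothetical `commit` helper. It asks Git which branches contain commit H, then deletes the name master and shows that no commits were lost.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G
commit H
git checkout -q -b develop
commit I
commit J

h=$(git rev-parse master)
git branch --contains "$h"       # H is on both develop and master

git branch -q -D master          # remove the name; we are on develop
git log --format=%s develop      # G and H are still found, via I's parent link
```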

This isn't the way humans usually think of branches

When we look at a diagram like this:

          I--J   <-- branch1
         /
...--G--H   <-- master
         \
          K   <-- branch2
           \
            L   <-- branch3

a lot of humans will say that commits through H are on master, I-J are the commits that are on branch1, K is the only commit on branch2, and L is the only commit on branch3.

This is not how Git treats them. Commits through H are on all four branches, and commit K is on two branches. The three remaining commits are each on just one branch. To get Git and humans to agree on these things, what we end up doing is using exclusion rules of the form:

master..branch1

which really means: the set of all commits reachable from J, minus the set of all commits reachable from H. This gives us the I-J pair. Likewise master..branch2 gives us just commit K, and branch2..branch3 gives us just commit L.
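The exclusion rules can be tried out on a reconstruction of the four-branch drawing. This is only a sketch: the repository is a throwaway one and the `commit` helper is hypothetical.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G
commit H
git checkout -q -b branch1 master
commit I
commit J
git checkout -q -b branch2 master
commit K
git checkout -q -b branch3 branch2
commit L

git log --format=%s master..branch1     # the I-J pair
git log --format=%s master..branch2     # just K
git log --format=%s branch2..branch3    # just L
git branch --contains branch2           # K is on branch2 and branch3
```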

Branch names are not the only kinds of names

Besides branch names, Git can find commits by lots of other kinds of names. On GitHub, for instance, pull requests cause names of the form refs/pull/number/head to appear in the repository to which the PR is made. This particular name is linked, on GitHub, to some branch name in some repository—your branch name in your fork, for instance, or your branch name in the same repository via sharing.

(Bitbucket uses slightly different names, but the concepts match up.)
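As a loose illustration of GitHub-style names, the sketch below fakes a hosted repository with a local bare repository and creates the refs/pull/1/head name by hand. On a real host the site creates that name itself. The PR number 1, the directories, and the `commit` helper are all assumptions, and Bitbucket's names will differ.

```shell
set -eu
hosted=$(mktemp -d)                        # stands in for the hosted repository
repo=$(mktemp -d)                          # your clone
git init -q --bare "$hosted"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G
commit H
git checkout -q -b feature
commit K
git push -q "$hosted" master feature

# Simulated: what the hosting site would do when PR number 1 is opened.
git -C "$hosted" update-ref refs/pull/1/head "$(git rev-parse feature)"

# A PR name is just another name that finds a commit hash ID.
git ls-remote "$hosted" 'refs/pull/*/head'
git fetch -q "$hosted" refs/pull/1/head:pr-1
git log --format=%s master..pr-1           # the commits the PR would add
```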

Your situation

We can't see the various names in the various repositories, but we can approximate them. You yourself can only see some of these names sometimes, depending on many things (including whether you made a fork, and the rules of the hosting site). But we know, from your complaint that someone else's commits are contained in your PR, that the actual situation in the repository in which the PR is asking some human to add commits, looks a bit like this one:

...--G--H   <-- master
         \
          I--J   <-- someone-elses-work
              \
               K   <-- your-PR

The one commit you made, commit K, comes after the two commits that someone else made: I-J. Your pull request asks that they (whoever "they" are) incorporate your commit K into their branch master. GitHub call this the base branch of the pull request.

Because commits can't be changed, it's only possible for them to add commit K by adding all three commits. So your PR asks for all these commits to be added.
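You can preview what a PR would carry before raising it. The sketch below rebuilds the situation in a throwaway repository, with the author names "Me" and "Teammate" invented for the example, and lists every commit the PR would add along with who wrote it.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Committer"   # hypothetical identity
git config user.email "committer@example.com"
# Hypothetical helper: commit NAME AUTHOR
commit() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1" --author="$2 <dev@example.com>"
}

commit G Me
commit H Me
git checkout -q -b someone-elses-work
commit I Teammate
commit J Teammate
git checkout -q -b your-PR
commit K Me

# Everything reachable from your-PR but not from master: this is what the PR asks for.
git log --format='%h %an: %s' master..your-PR
```

In a real clone you would more likely compare against the remote-tracking name after fetching, for instance `git log origin/master..HEAD`, assuming the remote is called origin and the base branch is master.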

How can I raise a clean PR which contains only my commits?

You must make new commits.

[According to the accepted answer to https://stackoverflow.com/q/14034718/1256452], I have to create a new branch, cherry-pick commits from my old branch, and raise a [new] pr

These particular steps are not necessary, and in particular the "raise a new PR" is no longer required at Bitbucket (but was back in 2014). The overall idea, however, is generally right.

Assuming the diagram I drew is accurate (or close enough), commit K is the problem. It adds on to commit J. You want instead a commit that adds on to commit H. To achieve this, you must make a new and improved commit—a cherry-pick, in other words.

Let's re-draw the branch diagram in your repository, though, like this:

...--G--H   <-- master
         \
          I--J   <-- branch-X
              \
               K   <-- feature (HEAD)

(the extra HEAD, in parentheses, indicates that your current branch name is feature, so that K is your current commit).

You could create a new branch name pointing to commit H, and get onto that branch:

...--G--H   <-- feature2 (HEAD), master
         \
          I--J   <-- branch-X
              \
               K   <-- feature

and then cherry-pick K to a copy, which we can call K':

          K'  <-- feature2 (HEAD)
         /
...--G--H   <-- master
         \
          I--J   <-- branch-X
              \
               K   <-- feature

Then you could delete feature entirely, rename feature2 to feature, and use git push -f to update your Bitbucket fork, or the shared Bitbucket repository, which would cause Bitbucket to automatically update your PR.
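Spelled out as commands, the cherry-pick route might look like the sketch below. It runs against a throwaway reconstruction of the drawing, with a hypothetical `commit` helper. The final push is left as a comment because no remote exists here.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G
commit H
git checkout -q -b branch-X
commit I
commit J
git checkout -q -b feature
commit K

git checkout -q -b feature2 master     # new name pointing at H
git cherry-pick feature                # copy K to K', whose parent is H
git branch -q -D feature               # drop the old name; K is abandoned
git branch -m feature2 feature         # reuse the original name
git log --format=%s master..feature    # now only K

# git push -f origin feature           # as described above; not run here (no remote)
```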

Or, you can simply run:

git rebase --onto master branch-X

(without ever creating feature2). Your Git will then:

1. use detached HEAD mode, like this, to get to commit H:

...--G--H   <-- HEAD, master
         \
          I--J   <-- branch-X
              \
               K   <-- feature

2. having listed out commit K's hash ID (using branch-X..feature internally¹), cherry-pick commit K to get K'; and finally
3. force the name feature to point to K'.

The final result is:

          K'  <-- feature
         /
...--G--H   <-- master
         \
          I--J   <-- branch-X
              \
               K   [abandoned]

but without as much motion on your part—no separate git checkout and git cherry-pick operations, for instance—and without the minor pain of having to delete the old feature and then rename the temporary branch name, and the associated headaches of losing any associated upstream, and so on. So this is definitely a more user-efficient way to get where you want to go.
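For comparison, here is the same throwaway setup with the single rebase command. The `commit` helper, the identity, and the variable names `old` and `new` are only there for the demonstration.

```shell
set -eu
repo=$(mktemp -d)                          # throwaway repository (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"      # hypothetical identity
git config user.email "author@example.com"
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G
commit H
git checkout -q -b branch-X
commit I
commit J
git checkout -q -b feature
commit K

old=$(git rev-parse feature)           # hash ID of the original K
git rebase -q --onto master branch-X   # copy branch-X..feature on top of master
new=$(git rev-parse feature)           # hash ID of the copy K'

git log --format=%s master..feature    # now only K
echo "old K:  $old"
echo "new K': $new"

# git push --force-with-lease origin feature   # a more cautious variant of -f; not run here
```

The old commit K is not destroyed right away. It simply has no name any more, and it normally remains findable through the reflog for a while.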

¹Actually, the internals for git rebase are much more complicated than this. There's the fork-point magic, the --no-merges part, and the symmetric difference and cherry-mark / git cherry upstream-equivalent elision. But the stop..start stuff is where to start, and for this case, none of the complications should get in the way. Still, that's why tutorials should start with cherry-pick instead of rebase!

The "base branch" is important

I don't know what Bitbucket call it, but when raising a PR, the "base branch" is how GitHub determine the range of commits you're actually asking to merge (by specifying the merge target branch name). GitHub's mechanism for this is overly complicated and they refuse to show the actual commit graph, which makes this quite tricky. You are allowed to change the base branch of a PR; the effects of this are also tricky: it's not clear to me, for instance, what happens if you move to a descendant commit, then back to the original commit, or even one of its ancestors.
