Entropy is the expected amount of information you get from a sample of a distribution. You can think of it as "how surprised are you, on average, when you see a sample from that distribution".
An unbiased coin toss has one bit of entropy, as either outcome is equally likely, but a toss of a coin where both sides are heads has entropy 0, as you're never surprised by its results.
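Here's a tiny Python sketch to make those two numbers concrete (the `entropy` helper is just an illustration, not a library function):

```python
import math

def entropy(probs):
    # Shannon entropy in bits: the average "surprise" of a sample,
    # where an outcome with probability p contributes -log2(p) bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))  # two-headed coin: 0.0 bits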
Entropy can also be seen as a constant minus the information gain of the distribution over the uniform distribution: for a distribution p over n outcomes, H(p) = log2(n) − D(p ‖ uniform). And this divergence, as you know, is the number of bits you "save" when you use a code based on the actual distribution instead of one based on the uniform distribution.
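You can check that identity numerically with another small sketch (again my own helpers, defined the usual way, not from any library):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def kl_divergence(p, q):
    # D(p || q) in bits: the extra coding cost of using a code built
    # for q when samples actually come from p.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
n = len(p)
uniform = [1 / n] * n

print(entropy(p))                                # ~1.157 bits
print(math.log2(n) - kl_divergence(p, uniform))  # same value
```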
It's best to get used to it by seeing how it works in theorems and algorithms, and for this I recommend David MacKay's book, Information Theory, Inference, and Learning Algorithms; you can get the PDF from here.