• Stack Overflow for Teams Where developers & technologists share private knowledge with coworkers
  • Advertising & Talent Reach devs & technologists worldwide about your product, service or employer brand
  • OverflowAI GenAI features for Teams
  • OverflowAPI Train & fine-tune LLMs
  • Labs The future of collective knowledge sharing
  • About the company Visit the blog

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Get early access and see previews of new features.

How exactly are data types represented in a computer?

I'm a beginning programmer reading K&R, and I feel as if the book assumes a lot of previous knowledge. One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable? I'm not too sure of how to word this question... but I'll ask a few questions and perhaps someone can come up with a coherent answer for me.

When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values. Since we may need the variable to hold the EOF value, we will need more than 256 or the EOF value will overlap with one of the 256 characters. In my mind, I view this as a bunch of boxes with empty holes. Could someone give me a better representation? Do these "boxes" have index numbers? When EOF overlaps with a value in the 256 available values, can we predict which value it will overlap with?

Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?

Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times? Is it to save "memory" (I use quotes as I do not actually how "memory" exactly works).

Lastly, how exactly is the 256 available values of type char obtained? I read something about modulo 2^n, where n = 8, but why does that work (something to do with binary?). What is the modulo portion of "modulo 2^n" mean (if it has any relevance to modular arithmetic, I can't see the relation...)?

  • kernighan-and-ritchie

Brian Tompsett - 汤莱恩's user avatar

  • "when we definitely know that we will only have 256 possible ASCII characters?" nit-pick: There's only 128 characters in ASCII. –  kusma Commented Jan 9, 2010 at 17:29
  • 1 there is more than just "int".. there is unsigned int (0-65535) and signed int (-32767 to 32767)... plain char in most implementations is 0 to 255 in unsigned. You also have short and long. short is two bytes, int is 4 bytes, and long is 8 bytes. See here: home.att.net/~jackklein/c/inttypes.html –  user195488 Commented Jan 9, 2010 at 17:29
  • Sorry :s Then could I ask why we cannot use the 256 available values of type "char" if we are also using the function getchar() and expecting an EOF at some point? –  withchemicals Commented Jan 9, 2010 at 17:30
  • 1 withchemicals: Because "char" doesn't really imply "ASCII". In C, "char" implies "smallest addressable piece of memory", otherwise known as "byte". getchar() simply gives you a byte from a stream (or EOF). –  kusma Commented Jan 9, 2010 at 17:41
  • 2 Many of the answers below assume two's complement representation of numbers. C makes no such guarantees. It can run (and does run) on ones' complement machines, and sign-magnitude machines (or any other weird encodings there are!). –  Alok Singhal Commented Jan 10, 2010 at 2:51

11 Answers 11

Great questions. K&R was written back in the days when there was a lot less to know about computers, and so programmers knew a lot more about the hardware. Every programmer ought to be familiar with this stuff, but (understandably) many beginning programmers aren't.

At Carnegie Mellon University they developed an entire course to fill in this gap in knowledge, which I was a TA for. I recommend the textbook for that class: "Computer Systems: A Programmer's Perspective" http://amzn.com/013034074X/

The answers to your questions are longer than can really be covered here, but I'll give you some brief pointers for your own research.

Basically, computers store all information--whether in memory (RAM) or on disk--in binary, a base-2 number system (as opposed to decimal, which is base 10). One binary digit is called a bit. Computers tend to work with memory in 8-bit chunks called bytes.

A char in C is one byte. An int is typically four bytes (although it can be different on different machines). So a char can hold only 256 possible values, 2^8. An int can hold 2^32 different values.

For more, definitely read the book, or read a few Wikipedia pages:

  • http://en.wikipedia.org/wiki/Binary_numeral_system
  • http://en.wikipedia.org/wiki/Twos_complement

Best of luck!

UPDATE with info on modular arithmetic as requested:

First, read up on modular arithmetic: http://en.wikipedia.org/wiki/Modular_arithmetic

Basically, in a two's complement system, an n-bit number really represents an equivalence class of integers modulo 2^n.

If that seems to make it more complicated instead of less, then the key things to know are simply:

  • An unsigned n-bit number holds values from 0 to 2^n-1. The values "wrap around", so e.g., when you add two numbers and get 2^n, you really get zero. (This is called "overflow".)
  • A signed n-bit number holds values from -2^(n-1) to 2^(n-1)-1. Numbers still wrap around, but the highest number wraps around to the most negative, and it starts counting up towards zero from there.

So, an unsigned byte (8-bit number) can be 0 to 255. 255 + 1 wraps around to 0. 255 + 2 ends up as 1, and so forth. A signed byte can be -128 to 127. 127 + 1 ends up as -128. (!) 127 + 2 ends up as -127, etc.

jasoncrawford's user avatar

  • Thanks! Could you explain the "modulo" portion of 2^n? –  withchemicals Commented Jan 9, 2010 at 17:51
  • I would rather have said "back in the days when programming was a lot lower level, closer to the hardware, so learning programming quickly required (and resulted in) a good basic understanding of the underlying hardware". –  L. Cornelius Dol Commented Jan 9, 2010 at 18:55
  • Software Monkey: well-said, I think that's more exact than what I wrote. –  jasoncrawford Commented Jan 10, 2010 at 2:25
  • withchemicals: Updated answer with some info on modular arithmetic; hope this helps. –  jasoncrawford Commented Jan 10, 2010 at 2:38
  • BTW, on some platforms there is at least one character in the C character set whose value exceeds the maximum value for a signed char. For example, in EBCDIC, '0' is 0xF0. On such machines, 'char' must be unsigned. On some other platforms (e.g. some DSP's), sizeof(char)==sizeof(int), and both are able to hold values -32767..32767 (and perhaps other values as well). On such machines, 'char' must be signed. Note further that on such machines it would be possible for -1 to be a valid character value, and for EOF to be some value other than -1. –  supercat Commented Feb 15, 2011 at 20:41
One aspect that confuses me is the actual representation, or should I say existence, of variables in memory. What exactly does a data type specify for a variable?

At the machine level, the difference between int and char is only the size, or number of bytes, of the memory allocated for it by the programming language. In C, IIRC, a char is one byte while an int is 4 bytes. If you were to "look" at these inside the machine itself, you would see a sequence of bits for each. Being able to treat them as int or char depends on how the language decides to interpret them (this is also why its possible to convert back and forth between the two types).

When using getchar(), I was told that it is better to use type "int" than type "char" due to the fact that "int" can hold more values while "char" can hold only 256 values.

This is because there are 2^8, or 256 combinations of 8 bits (because a bit can have two possible values), whereas there are 2^32 combinations of 32 bits. The EOF constant (as defined by C) is a negative value, not falling within the range of 0 and 255. If you try to assign this negative value to a char (this squeezing its 4 bytes into 1), the higher-order bits will be lost and you will end up with a valid char value that is NOT the same as EOF. This is why you need to store it into an int and check before casting to a char.

Also, does this mean that the data type "char" is only fine to use when we are simply assigning a value to a variable manually, such as 0char c = 'a', when we definitely know that we will only have 256 possible ASCII characters?

Yes, especially since in that case you are assigning a character literal.

Also, what is the actual important difference between "char" and "int"? If we can use "int" type instead of "char" type, why do we decide to use one over the other at certain times?

Most importantly, you would pick int or char at the language level depending on whether you wanted to treat the variable as a number or a letter (to switch, you would need to cast to the other type). If you wanted an integer value that took up less space, you could use a short int (which I believe is 2 bytes), or if you were REALLY concerned with memory usage you could use a char , though mostly this is not necessary.

Edit : here's a link describing the different data types in C and modifiers that can be applied to them. See the table at the end for sizes and value ranges.

danben's user avatar

  • Nitpick: to handle characters you'd stay the hell away from char and use a higher-level abstraction from a library like GLib. –  Tobu Commented Jan 9, 2010 at 17:34
  • 2 Sure, but I still think its important to understand what's actually going on at the lower levels. –  danben Commented Jan 9, 2010 at 17:36
  • 2 In C, an int can be 4 bytes, or more, or less. int must be able to represent values between -32767 and +32767 . –  Alok Singhal Commented Jan 9, 2010 at 17:40
  • 3 int is not necessary 4 bytes. All C says is: short <= int <= long and short >= 2 bytes and long >= 4 bytes. See "The C Programming Language", ANSI C Version, by K&R, page 36. –  Dave O. Commented Jan 9, 2010 at 17:49
  • 2 Also, in C, char may be signed, in which case it can store EOF , but of course char may be unsigned as well, and that's why we use int in this case. –  Alok Singhal Commented Jan 9, 2010 at 17:50

Basically, system memory is one huge series of bits, each of which can be either "on" or "off". The rest is conventions and interpretation.

First of all, there is no way to access individual bits directly; instead they are grouped into bytes, usually in groups of 8 (there are a few exotic systems where this is not the case, but you can ignore that for now), and each byte gets a memory address. So the first byte in memory has address 0, the second has address 1, etc.

A byte of 8 bits has 2^8 possible different values, which can be interpreted as a number between 0 and 255 (unsigned byte), or as a number between -128 and +127 (signed byte), or as an ASCII character. A variable of type char per C standard has a size of 1 byte.

But bytes are too small for a lot of things, so other types have been defined that are larger (i.e. they consist of multiple bytes), and CPUs support these different types through special hardware constructs. An int is typically 4 bytes nowadays (though the C standard does not specify it and ints can be smaller or bigger on different systems) because 4 bytes are 32 bits, and until recently that was what mainstream CPUs supported as their "word size".

So a variable of type int is 4 bytes large. That means when its memory address is e.g. 1000, then it actually covers the bytes at addresses 1000, 1001, 1002, and 1003. In C, it is possible to address those individual bytes as well at the same time, and that is how variables can overlap.

As a sidenote, most systems require larger types to be "word-aligned", i.e. their addresses have to be multiples of the word size, because that makes things easier for the hardware. So it is not possible to have an int variable start at address 999, or address 17 (but 1000 and 16 are OK).

Michael Borgwardt's user avatar

  • 1 Again, int may be 4 bytes or 2 or even 1, or anything. It must be able to represent the range +-32767. –  Alok Singhal Commented Jan 9, 2010 at 17:52
  • wouldn't it be 2^7 and not 2^8? –  user195488 Commented Jan 9, 2010 at 17:57
  • @Alok, yes that's what I say one paragraph higher. @Roboto: Nope. 8 bits means 2^8 different values. One bit has 2 values (2^1), each additional bit doubles this. –  Michael Borgwardt Commented Jan 9, 2010 at 18:06

I'm not going to completely answer Your question, but I would like to help You understand variables, as I had the same problems understanding them, when I began to program by myself.

For the moment, don't bother with the electronic representation of variables in memory. Think of memory as a continuous block of 1-byte-cells, each storing an bit-pattern (consisting of 0s and 1s).

By solely looking at the memory, You can't determine, what the bits in it represent! They are just arbitrary sequences of 0s and 1s. It is YOU, who specifies, HOW to interpret those bit patterns! Take a look at this example:

You could have written the following as well:

In both cases, the variables a, b and c are stored somewhere in the memory (and You can't tell their type). Now, when the compiler compiles Your code (that is translating Your program into machine instructions), it makes sure, to translate the "+" into integer_add in the first case and float_add in the second case, thus the CPU will interpret the bit patterns correctly and perform, what You desired.

Variable types are like glasses , that let the CPU look at a bit patterns from different perspectives.

Dave O.'s user avatar

To go deeper, I'd highly recommend Charles Petzold's excellent book " Code "

It covers more than what you ask, all of which leads to a better understanding of what's actually happening under the covers.

Rob Wells's user avatar

Really, datatypes are an abstraction that allows your programming language to treat a few bytes at some address as some kind of numeric type. Consider the data type as a lens that lets you see a piece of memory as an int, or a float. In reality, it's all just bits to the computer.

dicroce's user avatar

  • The OP framed it in terms of hardware, but I agree with you; the individual questions were all better answered with a short introduction to type theory. –  Tobu Commented Feb 27, 2011 at 17:04
  • In C, EOF is a "small negative number".
  • In C, char type may be unsigned, meaning that it cannot represent negative values.
  • For unsigned types, when you try to assign a negative value to them, they are converted to an unsigned value. If MAX is the maximum value an unsigned type can hold, then assigning -n to such a type is equivalent to assigning MAX - (n % MAX) + 1 to it. So, to answer your specific question about predicting, "yes you can". For example, let's say char is unsigned, and can hold values 0 to 255 inclusive. Then assigning -1 to a char is equivalent to assigning 255 - 1 + 1 = 255 to it.

Given the above, to be able to store EOF in c , c can't be char type. Thus, we use int , because it can store "small negative values". Particularly, in C, int is guaranteed to store values in the range -32767 and +32767 . That is why getchar() returns int .

If you are assigning values directly, then the C standard guarantees that expressions like 'a' will fit in a char . Note that in C, 'a' is of type int , not char, but it's okay to do char c = 'a' , because 'a' is able to fit in a char type.

About your question as to what type a variable should hold, the answer is: use whatever type that makes sense. For example, if you're counting, or looking at string lengths, the numbers can only be greater than or equal to zero. In such cases, you should use an unsigned type. size_t is such a type.

Note that it is sometimes hard to figure out the type of data, and even the "pros" may make mistakes. gzip format for example, stores the size of the uncompressed data in the last 4 bytes of a file. This breaks for huge files > 4GB in size, which are fairly common these days.

You should be careful about your terminology. In C, a char c = 'a' assigns an integer value corresponding to 'a' to c , but it need not be ASCII. It depends upon whatever encoding you happen to use.

About the "modulo" portion, and 256 values of type char : if you have n binary bits in a data type, each bit can encode 2 values: 0 and 1. So, you have 2*2*2...*2 ( n times) available values, or 2 n . For unsigned types, any overflow is well-defined, it is as if you divided the number by (the maximum possible value+1), and took the remainder. For example, let's say unsigned char can store values 0..255 (256 total values). Then, assigning 257 to an unsigned char will basically divide it by 256, take the remainder (1), and assign that value to the variable. This relation holds true for unsigned types only though. See my answer to another question for more.

Finally, you can use char arrays to read data from a file in C, even though you might end up hitting EOF , because C provides other ways of detecting EOF without having to read it in a variable explicitly, but you will learn about it later when you have read about arrays and pointers (see fgets() if you're curious for one example).

Community's user avatar

According to "stdio.h" getchars() return value is int and EOF is defined as -1. Depending on the actual encoding all values between 0..255 can occur, there for unsigned char is not enough to represent the -1 and int is used. Here is a nice table with detailed information http://en.wikipedia.org/wiki/ISO/IEC_8859

stacker's user avatar

The beauty of K&R is it's conciseness and readability, writers always have to make concessions for their goals; rather than being a 2000 page reference manual it serves as a basic reference and an excellent way to learn the language in general. I recommend Harbinson and Steele "C: A Reference Manual" for an excellent C reference book for details, and the C standard of course.

You need to be willing to google this stuff. Variables are represented in memory at specific locations and are known to the program of which they are a part of within a given scope. A char will typically be stored in 8 bits of memory (on some rare platforms this isn't necessarily true). 2^8 represents 256 distinct posibilities for variables. Different CPU/compilers/etc represent the basic types int, long of varying sizes. I think the C standard might specify minimum sizes for these, but not maximum sizes. I think for double it specifies at least 64 bits, but this doesn't preclude intel from using 80 bits in a floating point unit. Anyway, typical sizes in memory on 32bit intel platforms would be 32 bits (4 bytes) for unsigned/signed int and float, 64 bits (8 bytes) for double, 8 bits for char (signed/unsigned). You should also look up memory alignment if you are really interested on the topic. You can also at the exact layout in your debugger by getting the address of your variable with the "&" operator and then peeking at that address. Intel platforms may confuse you a little when looking at values in memory so please look up little endian/big endian as well. I am sure stack overflow has some good summaries of this as well.

dudez's user avatar

All of the characters needed in a language are respresented by ASCII and Extended ASCII. So there is no character beyond the Extended ASCII.

While using char, there is probability of getting garbage value as it directly stores the character but using int, there is less probability of it as it stores the ASCII value of the character.

Fahad Naeem's user avatar

For your last question about modulo:

Lastly, how exactly is the 256 available values of type char obtained? I read something about modulo 2^n, where n = 8, but why does that work (something to do with binary?). What is the modulo portion of "modulo 2^n" mean (if it has any relevance to modular arithmetic, I can't see the relation...)?

Think about modulo as a clock, where adding hours eventually results in you starting back at 0. Adding an hour for each step, you go from 00:00 to 01:00 to 02:00 to 03:00 to ... to 23:00 and then add one more hour to get back to 00:00. The "wrap-around" or "roll-over" is called modulo, and in this case is modulo 24.

With modulo, that largest number is never reached; as soon as you "reach" that number, the number wraps around to the beginning (24:00 is really 00:00 in the time example).

As another example, modern humanity's numbering system is Base 10 (i.e., Decimal), where we have digits 0 through 9. We don't have a singular digit that represents value 10. We need two digits to store 10.

Let's say we only have a one-digit adder, where the output can only store a single digit. We can add any two single-digit numbers together, like 1+2 , or 5+4 . 1+2=3 , as expected. 5+4=9 , as expected. But what happens if we add 5+5 or 9+1 or 9+9? To calculate 5+5 , our machine computes 10 , but it can't store the 1 due to its lack of memory capabilities, so the computer treats the 1 as an "overflow digit" and throws it away, only storing the 0 as the result. So, looking at your output for the computation 5+5 , you see the result is 0 , which probably isn't what you were expecting. To calculate 9+9 , your single-digit adding machine would correctly calculate 18 , but, it again, due to hardware memory limitations of storing a maximum of one digit, doesn't have the ability to store the 1 , so it throws it away. The adder CAN however store the 8 , so your result of 9+9 produces 8 .Your single-digit adder is modulo'ing by 10. Notice how you can never reach the number 10 in your output, even when your result should be 10 or bigger. The same issue occurs in binary, but with different modulo values.

As an aside, this "overflow" issue is especially bad with multiplication since you need twice the length of your biggest input to multiply two numbers (whether the numbers are binary or decimal or some other standard base) with all the result's digits intact. I.e., if you're multiplying a 32-bit number by another 32-bit number, your result might take 64 bits of memory instead of a convenient 32! E.g., 999 (3 digit input) times 999 (3 digit input) = 998,001 (6 digit output). Notice how the output requires double the number of digits of storage compared to one of the inputs' lengths (number of digits).

Back to binary modulo, A char in the programming language C is defined as the smallest accessible unit (source: I made it up based off what I've been told a hundred times). AFAIK, a single char is always the length of a single Byte. A byte is 8 bits, which is to say, a byte is an ordered group of eight 1s and/or 0s. E.g., 11001010 is a byte. Again, the order matters, meaning that 01 is not the same as 10, much like how 312 is not the same as 321 in Base 10.

Each bit you add gives you twice as many possible states. With 1 bit, you have 2^ 1 = 2 possible states (0,1). With 2 bits, you have 2^ 2 = 4 states (00,01,10,11), With 3 bits, you have 2^ 3 = 8 states (000,001,010,011,100,101,110,111). With 4 bits, you have 2^4 = 16 states, etc. With 8 bits (the length of a Byte, and also the length of a char), you have 2^ 8 = 256 possible states.

The largest value you can store in a char is 255 because a char has only 8 bits, meaning you can store all 1s to get the maximum value, which will be 11111111(bin) = 255(dec). As soon as you try to store a larger number, like by adding 1, we get the same overflow issue mentioned in the 1-digit adder example. 255+1 = 256 = 1 0000 0000 (spaces added for readability). 256 takes 9 bits to represent, but only the lower 8 bits can be stored since we're dealing with chars, so the most significant bit (the only 1 in the sequence of bits) gets truncated and we're left with 0000 0000 = 0. We could've added any number to the char, but the resulting char will always be between values 0 (all bits are 0s) and 255 (all bits are 1s).

Since a maximum value of 255 can be stored, we can say that operations that output/result in a char are mod256 (256 can never be reached, but everything below that value can (see the clock example)). Even if we add a million to a char, the final result will be between 0 and 255 (after a lot of truncation happens). Your compiler may give you a warning if you do a basic operation that causes overflow, but don't depend on it.

I said earlier that char can store up to 256 values, 0 through 255 - this is only partially true. You might notice that you get strange numbers when you try to do operations like char a = 255; .Printing out the number as an integer ( char a = 128; printf("%d",a); ) should tell you that the result is -128. Did the computer think I added a negative by accident? No. This happened because a char is naturally signed, meaning it's able to be negative. 128 is actually overflow because the range of 0 to 255 is split roughly in half, from -128 to +127. The maximum value being +127, and then adding 1 to reach 128 (which makes it overflow by exactly 1) tells us that the resulting number of char a = 128; will be the minimum value a char can store, which is -128. If we had added 2 instead of 1 (like if we tried to do char a = 129; ), then it would overflow by 2, meaning the resulting char would have stored -127. The maximum value will always wrap around to the minimum value in non-floating point numbers.

  • Floating-point numbers still work based on place values (but it's called a mantissa and functions differently) like ints, shorts, chars, and longs, but floating-point numbers also have a dedicated sign bit (which has no additive value) and an exponent.

If you choose to look at the raw binary when setting variables equal to literal values like 128 or -5000

For signed non-floating-point numbers, the largest place value get assigned a 1 when the overall number is negative, and that place value gets treated as a negative version of that typical place value. E.g., -5 (Decimal) would be 1xx...x in binary (where each x is a placeholder for either a 0 or 1). As another example, instead of place values being 8,4,2,1 for an unsigned number, they become -8,4,2,1 for a signed number, meaning you now have a "negative 8's place".

2's Complement to switch between + and - values : Flip (i.e., "Complement") all bits (i.e., each 1 gets flipped to a 0, and, simultaneously, each 0 gets flipped to a 1)(e.g., -12 = -16 + 4 = 10100 -> 01011). After flipping, add value 1 (place value of 1). (e.g., 01011 + 1 = 01100 = 0+8+4+0+0 = +12). Summary: Flip bits, then add 1.

Examples of using 2's Complement to convert binary numbers into EQUIVALENT signed decimal numbers :

  • 11 = (-2)+1 = -1
  • 111 = (-4)+2+1 = -1
  • 1111 = (-8)+4+2+1 = -1
  • 1000 = (-8)+0+0+0 = -8
  • 0111 = 0+4+2+1 = 7
  • Notice how 0111 and 111 are treated differently when they're designated as "signed" numbers (instead of "unsigned" (always positive) numbers). Leading 0s are important.

If you see the binary number 1111, you might think, "Oh, that's 8+4+2+1 = 15". However, you don't have enough information to assume that. It could be a negative number. If you see "(signed) 1111", then you still don't know the number for certain due to One's Complement existing, but you can assume it means "(signed 2's Complement) 1111", which would be (-8)+4+2+1 = -1. The same sequence of bits, 1111, can be interpreted as either -1 or 15, depending on its signedness. This is why the unsigned keyword in unsigned char is important. When you write char , you are implicitly telling the computer that you want a signed char.

  • unsigned char - Can store numbers between 0 and 255
  • (signed) char - Can store number between -128 and +127 (same span, but shifted to allow negatives)

Stev's user avatar

Your Answer

Reminder: Answers generated by artificial intelligence tools are not allowed on Stack Overflow. Learn more

Sign up or log in

Post as a guest.

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy .

Not the answer you're looking for? Browse other questions tagged c types kernighan-and-ritchie kr-c or ask your own question .

  • The Overflow Blog
  • Battling ticket bots and untangling taxes at the frontiers of e-commerce
  • Ryan Dahl explains why Deno had to evolve with version 2.0
  • Featured on Meta
  • We've made changes to our Terms of Service & Privacy Policy - July 2024
  • Bringing clarity to status tag usage on meta sites
  • Feedback requested: How do you use tag hover descriptions for curating and do...

Hot Network Questions

  • Momentary solo circuit
  • Time travel, with compensation for planetary position and velocity
  • The meaning of “manage” in this context
  • Did anyone ever ask Neil Armstrong whether he said "for man" or "for a man?"
  • Homotopy groups of the space of diffeomorphisms
  • How did this zucchini plant cling to the zip tie?
  • Traveling to Montenegro with Single-Entry Schengen Visa
  • Should I be worried about this giant crack?
  • How to stop Windows from changing date modified when copying files from network?
  • Arranging people in a jeep.
  • Could a gas giant still be low on the horizon from the equator of a tidally locked moon?
  • Is it possible for a company to dilute my shares to the point they are insignificant
  • Has any spacecraft ever been severely damaged by a micrometeorite?
  • Why is Ganondorf in LOZ Windwaker?
  • Did the French janitor at the University of Hanoi receive a higher base pay than a Vietnamese professor?
  • No Kippa while in a car
  • Can there be clouds of free electrons in space?
  • Seven different digits are placed in a row. The products of the first 3, middle 3 and last 3 are all equal. What is the middle digit?
  • Why does the size of a struct change depending on whether an initial value is used?
  • Fitting the 9th piece into the pizza box
  • Rearrange these numbers and symbols to make a true equation
  • What do all branches of Mathematics have in common to be considered "Mathematics", or parts of the same field?
  • Can mending be used to attach the unbroken ends of a pole together?
  • Choosing a relative large density subsequence from a low density sequence

representation of a data type

TABLE OF CONTENTS (HIDE)

A tutorial on data representation, integers, floating-point numbers, and characters, number systems.

Human beings use decimal (base 10) and duodecimal (base 12) number systems for counting and measurements (probably because we have 10 fingers and two big toes). Computers use binary (base 2) number system, as they are made from binary digital components (known as transistors) operating in two states - on and off. In computing, we also use hexadecimal (base 16) or octal (base 8) number systems, as a compact form for representing binary numbers.

Decimal (Base 10) Number System

Decimal number system has ten symbols: 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , and 9 , called digit s. It uses positional notation . That is, the least-significant digit (right-most digit) is of the order of 10^0 (units or ones), the second right-most digit is of the order of 10^1 (tens), the third right-most digit is of the order of 10^2 (hundreds), and so on, where ^ denotes exponent. For example,

We shall denote a decimal number with an optional suffix D if ambiguity arises.

Binary (Base 2) Number System

Binary number system has two symbols: 0 and 1 , called bits . It is also a positional notation , for example,

We shall denote a binary number with a suffix B . Some programming languages denote binary numbers with prefix 0b or 0B (e.g., 0b1001000 ), or prefix b with the bits quoted (e.g., b'10001111' ).

A binary digit is called a bit . Eight bits is called a byte (why 8-bit unit? Probably because 8=2 3 ).

Hexadecimal (Base 16) Number System

Hexadecimal number system uses 16 symbols: 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , A , B , C , D , E , and F , called hex digits . It is a positional notation , for example,

We shall denote a hexadecimal number (in short, hex) with a suffix H . Some programming languages denote hex numbers with prefix 0x or 0X (e.g., 0x1A3C5F ), or prefix x with hex digits quoted (e.g., x'C3A4D98B' ).

Each hexadecimal digit is also called a hex digit . Most programming languages accept lowercase 'a' to 'f' as well as uppercase 'A' to 'F' .

Computers uses binary system in their internal operations, as they are built from binary digital electronic components with 2 states - on and off. However, writing or reading a long sequence of binary bits is cumbersome and error-prone (try to read this binary string: 1011 0011 0100 0011 0001 1101 0001 1000B , which is the same as hexadecimal B343 1D18H ). Hexadecimal system is used as a compact form or shorthand for binary bits. Each hex digit is equivalent to 4 binary bits, i.e., shorthand for 4 bits, as follows:

Hexadecimal Binary Decimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
A 1010 10
B 1011 11
C 1100 12
D 1101 13
E 1110 14
F 1111 15

Conversion from Hexadecimal to Binary

Replace each hex digit by the 4 equivalent bits (as listed in the above table), for examples,

Conversion from Binary to Hexadecimal

Starting from the right-most bit (least-significant bit), replace each group of 4 bits by the equivalent hex digit (pad the left-most bits with zero if necessary), for examples,

It is important to note that hexadecimal number provides a compact form or shorthand for representing binary bits.

Conversion from Base r to Decimal (Base 10)

Given a n -digit base r number: d n-1 d n-2 d n-3 ...d 2 d 1 d 0 (base r), the decimal equivalent is given by:

For examples,

Conversion from Decimal (Base 10) to Base r

Use repeated division/remainder. For example,

The above procedure is actually applicable to conversion between any 2 base systems. For example,

Conversion between Two Number Systems with Fractional Part

  • Separate the integral and the fractional parts.
  • For the integral part, divide by the target radix repeatably, and collect the remainder in reverse order.
  • For the fractional part, multiply the fractional part by the target radix repeatably, and collect the integral part in the same order.

Example 1: Decimal to Binary

Example 2: Decimal to Hexadecimal

Exercises (Number Systems Conversion)

  • 101010101010

Answers: You could use the Windows' Calculator ( calc.exe ) to carry out number system conversion, by setting it to the Programmer or scientific mode. (Run "calc" ⇒ Select "Settings" menu ⇒ Choose "Programmer" or "Scientific" mode.)

  • 1101100B , 1001011110000B , 10001100101000B , 6CH , 12F0H , 2328H .
  • 218H , 80H , AAAH , 536D , 128D , 2730D .
  • 10101011110011011110B , 1001000110100B , 100000001111B , 703710D , 4660D , 2063D .
  • ?? (You work it out!)

Computer Memory & Data Representation

Computer uses a fixed number of bits to represent a piece of data, which could be a number, a character, or others. A n -bit storage location can represent up to 2^ n distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: 000 , 001 , 010 , 011 , 100 , 101 , 110 , or 111 . Hence, it can represent at most 8 distinct entities. You could use them to represent numbers 0 to 7, numbers 8881 to 8888, characters 'A' to 'H', or up to 8 kinds of fruits like apple, orange, banana; or up to 8 kinds of animals like lion, tiger, etc.

Integers, for example, can be represented in 8-bit, 16-bit, 32-bit or 64-bit. You, as the programmer, choose an appropriate bit-length for your integers. Your choice will impose constraint on the range of integers that can be represented. Besides the bit-length, an integer can be represented in various representation schemes, e.g., unsigned vs. signed integers. An 8-bit unsigned integer has a range of 0 to 255, while an 8-bit signed integer has a range of -128 to 127 - both representing 256 distinct numbers.

It is important to note that a computer memory location merely stores a binary pattern . It is entirely up to you, as the programmer, to decide on how these patterns are to be interpreted . For example, the 8-bit binary pattern "0100 0001B" can be interpreted as an unsigned integer 65 , or an ASCII character 'A' , or some secret information known only to you. In other words, you have to first decide how to represent a piece of data in a binary pattern before the binary patterns make sense. The interpretation of binary pattern is called data representation or encoding . Furthermore, it is important that the data representation schemes are agreed-upon by all the parties, i.e., industrial standards need to be formulated and straightly followed.

Once you decided on the data representation scheme, certain constraints, in particular, the precision and range will be imposed. Hence, it is important to understand data representation to write correct and high-performance programs.

Rosette Stone and the Decipherment of Egyptian Hieroglyphs

RosettaStone

Egyptian hieroglyphs (next-to-left) were used by the ancient Egyptians since 4000BC. Unfortunately, since 500AD, no one could longer read the ancient Egyptian hieroglyphs, until the re-discovery of the Rosette Stone in 1799 by Napoleon's troop (during Napoleon's Egyptian invasion) near the town of Rashid (Rosetta) in the Nile Delta.

The Rosetta Stone (left) is inscribed with a decree in 196BC on behalf of King Ptolemy V. The decree appears in three scripts: the upper text is Ancient Egyptian hieroglyphs , the middle portion Demotic script, and the lowest Ancient Greek . Because it presents essentially the same text in all three scripts, and Ancient Greek could still be understood, it provided the key to the decipherment of the Egyptian hieroglyphs.

The moral of the story is unless you know the encoding scheme, there is no way that you can decode the data.

Reference and images: Wikipedia.

Integer Representation

Integers are whole numbers or fixed-point numbers with the radix point fixed after the least-significant bit. They are contrast to real numbers or floating-point numbers , where the position of the radix point varies. It is important to take note that integers and floating-point numbers are treated differently in computers. They have different representation and are processed differently (e.g., floating-point numbers are processed in a so-called floating-point processor). Floating-point numbers will be discussed later.

Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation schemes for integers:

  • Unsigned Integers : can represent zero and positive integers.
  • Sign-Magnitude representation
  • 1's Complement representation
  • 2's Complement representation

You, as the programmer, need to decide on the bit-length and representation scheme for your integers, depending on your application's requirements. Suppose that you need a counter for counting a small quantity from 0 up to 200, you might choose the 8-bit unsigned integer scheme as there is no negative numbers involved.

n -bit Unsigned Integers

Unsigned integers can represent zero and positive integers, but not negative integers. The value of an unsigned integer is interpreted as " the magnitude of its underlying binary pattern ".

Example 1: Suppose that n =8 and the binary pattern is 0100 0001B , the value of this unsigned integer is 1×2^0 + 1×2^6 = 65D .

Example 2: Suppose that n =16 and the binary pattern is 0001 0000 0000 1000B , the value of this unsigned integer is 1×2^3 + 1×2^12 = 4104D .

Example 3: Suppose that n =16 and the binary pattern is 0000 0000 0000 0000B , the value of this unsigned integer is 0 .

An n -bit pattern can represent 2^ n distinct integers. An n -bit unsigned integer can represent integers from 0 to (2^ n )-1 , as tabulated below:

n Minimum Maximum
8 0 (2^8)-1  (=255)
16 0 (2^16)-1 (=65,535)
32 0 (2^32)-1 (=4,294,967,295) (9+ digits)
64 0 (2^64)-1 (=18,446,744,073,709,551,615) (19+ digits)

Signed Integers

Signed integers can represent zero, positive integers, as well as negative integers. Three representation schemes are available for signed integers:

In all the above three schemes, the most-significant bit (msb) is called the sign bit . The sign bit is used to represent the sign of the integer - with 0 for positive integers and 1 for negative integers. The magnitude of the integer, however, is interpreted differently in different schemes.

n -bit Sign Integers in Sign-Magnitude Representation

In sign-magnitude representation:

  • The most-significant bit (msb) is the sign bit , with value of 0 representing positive integer and 1 representing negative integer.
  • The remaining n -1 bits represents the magnitude (absolute value) of the integer. The absolute value of the integer is interpreted as "the magnitude of the ( n -1)-bit binary pattern".

Example 1 : Suppose that n =8 and the binary representation is 0 100 0001B .    Sign bit is 0 ⇒ positive    Absolute value is 100 0001B = 65D    Hence, the integer is +65D

Example 2 : Suppose that n =8 and the binary representation is 1 000 0001B .    Sign bit is 1 ⇒ negative    Absolute value is 000 0001B = 1D    Hence, the integer is -1D

Example 3 : Suppose that n =8 and the binary representation is 0 000 0000B .    Sign bit is 0 ⇒ positive    Absolute value is 000 0000B = 0D    Hence, the integer is +0D

Example 4 : Suppose that n =8 and the binary representation is 1 000 0000B .    Sign bit is 1 ⇒ negative    Absolute value is 000 0000B = 0D    Hence, the integer is -0D

sign-magnitude representation

The drawbacks of sign-magnitude representation are:

  • There are two representations ( 0000 0000B and 1000 0000B ) for the number zero, which could lead to inefficiency and confusion.
  • Positive and negative integers need to be processed separately.

n -bit Sign Integers in 1's Complement Representation

In 1's complement representation:

  • Again, the most significant bit (msb) is the sign bit , with value of 0 representing positive integers and 1 representing negative integers.
  • for positive integers, the absolute value of the integer is equal to "the magnitude of the ( n -1)-bit binary pattern".
  • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement ( inverse ) of the ( n -1)-bit binary pattern" (hence called 1's complement).

Example 1 : Suppose that n =8 and the binary representation 0 100 0001B .    Sign bit is 0 ⇒ positive    Absolute value is 100 0001B = 65D    Hence, the integer is +65D

Example 2 : Suppose that n =8 and the binary representation 1 000 0001B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 000 0001B , i.e., 111 1110B = 126D    Hence, the integer is -126D

Example 3 : Suppose that n =8 and the binary representation 0 000 0000B .    Sign bit is 0 ⇒ positive    Absolute value is 000 0000B = 0D    Hence, the integer is +0D

Example 4 : Suppose that n =8 and the binary representation 1 111 1111B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 111 1111B , i.e., 000 0000B = 0D    Hence, the integer is -0D

1's complement

Again, the drawbacks are:

  • There are two representations ( 0000 0000B and 1111 1111B ) for zero.
  • The positive integers and negative integers need to be processed separately.

n -bit Sign Integers in 2's Complement Representation

In 2's complement representation:

  • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement of the ( n -1)-bit binary pattern plus one " (hence called 2's complement).

Example 2 : Suppose that n =8 and the binary representation 1 000 0001B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 000 0001B plus 1 , i.e., 111 1110B + 1B = 127D    Hence, the integer is -127D

Example 4 : Suppose that n =8 and the binary representation 1 111 1111B .    Sign bit is 1 ⇒ negative    Absolute value is the complement of 111 1111B plus 1 , i.e., 000 0000B + 1B = 1D    Hence, the integer is -1D

2's complement

Computers use 2's Complement Representation for Signed Integers

We have discussed three representations for signed integers: signed-magnitude, 1's complement and 2's complement. Computers use 2's complement in representing signed integers. This is because:

  • There is only one representation for the number zero in 2's complement, instead of two representations in sign-magnitude and 1's complement.
  • Positive and negative integers can be treated together in addition and subtraction. Subtraction can be carried out using the "addition logic".

Example 1: Addition of Two Positive Integers: Suppose that n=8, 65D + 5D = 70D

Example 2: Subtraction is treated as Addition of a Positive and a Negative Integers: Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D

Example 3: Addition of Two Negative Integers: Suppose that n=8, -65D - 5D = (-65D) + (-5D) = -70D

Because of the fixed precision (i.e., fixed number of bits ), an n -bit 2's complement signed integer has a certain range. For example, for n =8 , the range of 2's complement signed integers is -128 to +127 . During addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.

Example 4: Overflow: Suppose that n=8, 127D + 2D = 129D (overflow - beyond the range)

Example 5: Underflow: Suppose that n=8, -125D - 5D = -130D (underflow - below the range)

The following diagram explains how the 2's complement works. By re-arranging the number line, values from -128 to +127 are represented contiguously by ignoring the carry bit.

signed integer

Range of n -bit 2's Complement Signed Integers

An n -bit 2's complement signed integer can represent integers from -2^( n -1) to +2^( n -1)-1 , as tabulated. Take note that the scheme can represent all the integers within the range, without any gap. In other words, there is no missing integers within the supported range.

n minimum maximum
8 -(2^7)  (=-128) +(2^7)-1  (=+127)
16 -(2^15) (=-32,768) +(2^15)-1 (=+32,767)
32 -(2^31) (=-2,147,483,648) +(2^31)-1 (=+2,147,483,647)(9+ digits)
64 -(2^63) (=-9,223,372,036,854,775,808) +(2^63)-1 (=+9,223,372,036,854,775,807)(18+ digits)

Decoding 2's Complement Numbers

  • Check the sign bit (denoted as S ).
  • If S=0 , the number is positive and its absolute value is the binary value of the remaining n -1 bits.
  • If S=1 , the number is negative. you could "invert the n -1 bits and plus 1" to get the absolute value of negative number. Alternatively, you could scan the remaining n -1 bits from the right (least-significant bit). Look for the first occurrence of 1. Flip all the bits to the left of that first occurrence of 1. The flipped pattern gives the absolute value. For example, n = 8, bit pattern = 1 100 0100B S = 1 → negative Scanning from the right and flip all the bits to the left of the first occurrence of 1 ⇒ 011 1 100B = 60D Hence, the value is -60D

Big Endian vs. Little Endian

Modern computers store one byte of data in each memory address or location, i.e., byte addressable memory. An 32-bit integer is, therefore, stored in 4 memory addresses.

The term"Endian" refers to the order of storing bytes in computer memory. In "Big Endian" scheme, the most significant byte is stored first in the lowest memory address (or big in first), while "Little Endian" stores the least significant bytes in the lowest memory address.

For example, the 32-bit integer 12345678H (305419896 10 ) is stored as 12H 34H 56H 78H in big endian; and 78H 56H 34H 12H in little endian. An 16-bit integer 00H 01H is interpreted as 0001H in big endian, and 0100H as little endian.

Exercise (Integer Representation)

  • What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integer, in "unsigned" and "signed" representation?
  • Give the value of 88 , 0 , 1 , 127 , and 255 in 8-bit unsigned representation.
  • Give the value of +88 , -88 , -1 , 0 , +1 , -128 , and +127 in 8-bit 2's complement signed representation.
  • Give the value of +88 , -88 , -1 , 0 , +1 , -127 , and +127 in 8-bit sign-magnitude representation.
  • Give the value of +88 , -88 , -1 , 0 , +1 , -127 and +127 in 8-bit 1's complement representation.
  • [TODO] more.
  • The range of unsigned n -bit integers is [0, 2^n - 1] . The range of n -bit 2's complement signed integer is [-2^(n-1), +2^(n-1)-1] ;
  • 88 (0101 1000) , 0 (0000 0000) , 1 (0000 0001) , 127 (0111 1111) , 255 (1111 1111) .
  • +88 (0101 1000) , -88 (1010 1000) , -1 (1111 1111) , 0 (0000 0000) , +1 (0000 0001) , -128 (1000 0000) , +127 (0111 1111) .
  • +88 (0101 1000) , -88 (1101 1000) , -1 (1000 0001) , 0 (0000 0000 or 1000 0000) , +1 (0000 0001) , -127 (1111 1111) , +127 (0111 1111) .
  • +88 (0101 1000) , -88 (1010 0111) , -1 (1111 1110) , 0 (0000 0000 or 1111 1111) , +1 (0000 0001) , -127 (1000 0000) , +127 (0111 1111) .

Floating-Point Number Representation

A floating-point number (or real number) can represent a very large value (e.g., 1.23×10^88 ) or a very small value (e.g., 1.23×10^-88 ). It could also represent very large negative number (e.g., -1.23×10^88 ) and very small negative number (e.g., -1.23×10^-88 ), as well as zero, as illustrated:

Representation_FloatingPointNumbers

A floating-point number is typically expressed in the scientific notation, with a fraction ( F ), and an exponent ( E ) of a certain radix ( r ), in the form of F×r^E . Decimal numbers use radix of 10 ( F×10^E ); while binary numbers use radix of 2 ( F×2^E ).

Representation of floating point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1 , 0.5566×10^2 , 0.05566×10^3 , and so on. The fractional part can be normalized . In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×10^2 ; binary number 1010.1011B can be normalized as 1.0101011B×2^3 .

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinite number of real numbers (even within a small range of says 0.0 to 0.1). On the other hand, a n -bit binary pattern can represent a finite 2^ n distinct numbers. Hence, not all the real numbers can be represented. The nearest approximation will be used instead, resulted in loss of accuracy.

It is also important to note that floating number arithmetic is very much less efficient than integer arithmetic. It could be speed up with a so-called dedicated floating-point co-processor . Hence, use integers if your application does not require floating-point numbers.

In computers, floating-point numbers are represented in scientific notation of fraction ( F ) and exponent ( E ) with a radix of 2, in the form of F×2^E . Both E and F can be positive as well as negative. Modern computers adopt IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

IEEE-754 32-bit Single-Precision Floating-Point Numbers

In 32-bit single-precision floating-point representation:

  • The most significant bit is the sign bit ( S ), with 0 for positive numbers and 1 for negative numbers.
  • The following 8 bits represent exponent ( E ).
  • The remaining 23 bits represents fraction ( F ).

float

Normalized Form

Let's illustrate with an example, suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000 , with:

  • E = 1000 0001
  • F = 011 0000 0000 0000 0000 0000

In the normalized form , the actual fraction is normalized with an implicit leading 1 in the form of 1.F . In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D .

The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative number. In this example with S=1 , this is a negative number, i.e., -1.375D .

In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponent. With an 8-bit E, ranging from 0 to 255, the excess-127 scheme could provide actual exponent of -127 to 128. In this example, E-127=129-127=2D .

Hence, the number represented is -1.375×2^2=-5.5D .

De-Normalized Form

Normalized form has a serious problem, with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself on this!

De-normalized form was devised to represent zero and other numbers.

For E=0 , the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction; and the actual exponent is always -126 . Hence, the number zero can be represented with E=0 and F=0 (because 0.0×2^-126=0 ).

We can also represent very small positive and negative numbers in de-normalized form with E=0 . For example, if S=1 , E=0 , and F=011 0000 0000 0000 0000 0000 . The actual fraction is 0.011=1×2^-2+1×2^-3=0.375D . Since S=1 , it is a negative number. With E=0 , the actual exponent is -126 . Hence the number is -0.375×2^-126 = -4.4×10^-39 , which is an extremely small negative number (close to zero).

In summary, the value ( N ) is calculated as follows:

  • For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127) . These numbers are in the so-called normalized form. The sign-bit represents the sign of the number. Fractional part ( 1.F ) are normalized with an implicit leading 1. The exponent is bias (or in excess) of 127 , so as to represent both positive and negative exponent. The range of exponent is -126 to +127 .
  • For E = 0, N = (-1)^S × 0.F × 2^(-126) . These numbers are in the so-called denormalized form. The exponent of 2^-126 evaluates to a very small number. Denormalized form is needed to represent zero (with F=0 and E=0 ). It can also represents very small positive and negative number close to zero.
  • For E = 255 , it represents special values, such as ±INF (positive and negative infinity) and NaN (not a number). This is beyond the scope of this article.

Example 1: Suppose that IEEE-754 32-bit floating-point representation pattern is 0 10000000 110 0000 0000 0000 0000 0000 .

Example 2: Suppose that IEEE-754 32-bit floating-point representation pattern is 1 01111110 100 0000 0000 0000 0000 0000 .

Example 3: Suppose that IEEE-754 32-bit floating-point representation pattern is 1 01111110 000 0000 0000 0000 0000 0001 .

Example 4 (De-Normalized Form): Suppose that IEEE-754 32-bit floating-point representation pattern is 1 00000000 000 0000 0000 0000 0000 0001 .

Exercises (Floating-point Numbers)

  • Compute the largest and smallest positive numbers that can be represented in the 32-bit normalized form.
  • Compute the largest and smallest negative numbers can be represented in the 32-bit normalized form.
  • Repeat (1) for the 32-bit denormalized form.
  • Repeat (2) for the 32-bit denormalized form.
  • Largest positive number: S=0 , E=1111 1110 (254) , F=111 1111 1111 1111 1111 1111 . Smallest positive number: S=0 , E=0000 00001 (1) , F=000 0000 0000 0000 0000 0000 .
  • Same as above, but S=1 .
  • Largest positive number: S=0 , E=0 , F=111 1111 1111 1111 1111 1111 . Smallest positive number: S=0 , E=0 , F=000 0000 0000 0000 0000 0001 .

Notes For Java Users

You can use JDK methods Float.intBitsToFloat(int bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-bit float or double-precision 64-bit double with the specific bit patterns, and print their values. For examples,

IEEE-754 64-bit Double-Precision Floating-Point Numbers

The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:

  • The following 11 bits represent exponent ( E ).
  • The remaining 52 bits represents fraction ( F ).

double

The value ( N ) is calculated as follows:

  • Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023) .
  • Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022) . These are in the denormalized form.
  • For E = 2047 , N represents special values, such as ±INF (infinity), NaN (not a number).

More on Floating-Point Representation

There are three parts in the floating-point representation:

  • The sign bit ( S ) is self-explanatory (0 for positive numbers and 1 for negative numbers).
  • For the exponent ( E ), a so-called bias (or excess ) is applied so as to represent both positive and negative exponent. The bias is set at half of the range. For single precision with an 8-bit exponent, the bias is 127 (or excess-127). For double precision with a 11-bit exponent, the bias is 1023 (or excess-1023).
  • The fraction ( F ) (also called the mantissa or significand ) is composed of an implicit leading bit (before the radix point) and the fractional bits (after the radix point). The leading bit for normalized numbers is 1; while the leading bit for denormalized numbers is 0.

Normalized Floating-Point Numbers

In normalized form, the radix point is placed after the first non-zero digit, e,g., 9.8765D×10^-23D , 1.001011B×2^11B . For binary number, the leading bit is always 1, and need not be represented explicitly - this saves 1 bit of storage.

In IEEE 754's normalized form:

  • For single-precision, 1 ≤ E ≤ 254 with excess of 127. Hence, the actual exponent is from -126 to +127 . Negative exponents are used to represent small numbers (< 1.0); while positive exponents are used to represent large numbers (> 1.0).     N = (-1)^S × 1.F × 2^(E-127)
  • For double-precision, 1 ≤ E ≤ 2046 with excess of 1023. The actual exponent is from -1022 to +1023 , and     N = (-1)^S × 1.F × 2^(E-1023)

Take note that n-bit pattern has a finite number of combinations ( =2^n ), which could represent finite distinct numbers. It is not possible to represent the infinite numbers in the real axis (even a small range says 0.0 to 1.0 has infinite numbers). That is, not all floating-point numbers can be accurately represented. Instead, the closest approximation is used, which leads to loss of accuracy .

The minimum and maximum normalized floating-point numbers are:

Precision Normalized N(min) Normalized N(max)
Single 0080 0000H
0 00000001 00000000000000000000000B
E = 1, F = 0
N(min) = 1.0B × 2^-126
(≈1.17549435 × 10^-38)
7F7F FFFFH
0 11111110 00000000000000000000000B
E = 254, F = 0
N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127
(≈3.4028235 × 10^38)
Double 0010 0000 0000 0000H
N(min) = 1.0B × 2^-1022
(≈2.2250738585072014 × 10^-308)
7FEF FFFF FFFF FFFFH
N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023
(≈1.7976931348623157 × 10^308)

real numbers

Denormalized Floating-Point Numbers

If E = 0 , but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:

  • For single-precision, E = 0 ,     N = (-1)^S × 0.F × 2^(-126)
  • For double-precision, E = 0 ,     N = (-1)^S × 0.F × 2^(-1022)

Denormalized form can represent very small numbers closed to zero, and zero, which cannot be represented in normalized form, as shown in the above figure.

The minimum and maximum of denormalized floating-point numbers are:

Precision Denormalized D(min) Denormalized D(max)
Single 0000 0001H
0 00000000 00000000000000000000001B
E = 0, F = 00000000000000000000001B
D(min) = 0.0...1 × 2^-126 = 1 × 2^-23 × 2^-126 = 2^-149
(≈1.4 × 10^-45)
007F FFFFH
0 00000000 11111111111111111111111B
E = 0, F = 11111111111111111111111B
D(max) = 0.1...1 × 2^-126 = (1-2^-23)×2^-126
(≈1.1754942 × 10^-38)
Double 0000 0000 0000 0001H
D(min) = 0.0...1 × 2^-1022 = 1 × 2^-52 × 2^-1022 = 2^-1074
(≈4.9 × 10^-324)
001F FFFF FFFF FFFFH
D(max) = 0.1...1 × 2^-1022 = (1-2^-52)×2^-1022
(≈4.4501477170144023 × 10^-308)

Special Values

Zero : Zero cannot be represented in the normalized form, and must be represented in denormalized form with E=0 and F=0 . There are two representations for zero: +0 with S=0 and -0 with S=1 .

Infinity : The value of +infinity (e.g., 1/0 ) and -infinity (e.g., -1/0 ) are represented with an exponent of all 1's ( E = 255 for single-precision and E = 2047 for double-precision), F=0 , and S=0 (for +INF ) and S=1 (for -INF ).

Not a Number (NaN) : NaN denotes a value that cannot be represented as real number (e.g. 0/0 ). NaN is represented with Exponent of all 1's ( E = 255 for single-precision and E = 2047 for double-precision) and any non-zero fraction.

Character Encoding

In computer memory, character are "encoded" (or "represented") using a chosen "character encoding schemes" (aka "character set", "charset", "character map", or "code page").

For example, in ASCII (as well as Latin1, Unicode, and many other character sets):

  • code numbers 65D (41H) to 90D (5AH) represents 'A' to 'Z' , respectively.
  • code numbers 97D (61H) to 122D (7AH) represents 'a' to 'z' , respectively.
  • code numbers 48D (30H) to 57D (39H) represents '0' to '9' , respectively.

It is important to note that the representation scheme must be known before a binary pattern can be interpreted. E.g., the 8-bit pattern " 0100 0010B " could represent anything under the sun known only to the person encoded it.

The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for western european characters, and Unicode (ISO/IEC 10646) for internationalization (i18n).

A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit character encoding scheme (such as Latin-x) can represent 256 characters and symbols; whereas a 16-bit encoding scheme (such as Unicode UCS-2) can represents 65,536 characters and symbols.

7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)

  • ASCII (American Standard Code for Information Interchange) is one of the earlier character coding schemes.
  • ASCII is originally a 7-bit code. It has been extended to 8-bit to better utilize the 8-bit computer memory organization. (The 8th-bit was originally used for parity check in the early computers.)
Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
2 SP!" #$%&'()*+,-./
3 0123456789:;<=> ?
4 @ABCDEFGHIJKLMNO
5 PQRSTUVWXYZ[\]^_
6 `abcdefghijklmno
7 pqrstuvwxyz{|}~ 
Dec 0 1 2 3 4 5 6 7 8 9
3     SP ! " # $ % & '
4 ( ) * + , - . / 0 1
5 2 3 4 5 6 7 8 9 : ;
6 < = > ? @ A B C D E
7 F G H I J K L M N O
8 P Q R S T U V W X Y
9 Z [ \ ] ^ _ ` a b c
10 d e f g h i j k l m
11 n o p q r s t u v w
12 x y z { | } ~      
  • Code number 32D (20H) is the blank or space character.
  • '0' to '9' : 30H-39H (0011 0001B to 0011 1001B) or (0011 xxxxB where xxxx is the equivalent integer value )
  • 'A' to 'Z' : 41H-5AH (0101 0001B to 0101 1010B) or (010x xxxxB) . 'A' to 'Z' are continuous without gap.
  • 'a' to 'z' : 61H-7AH (0110 0001B to 0111 1010B) or (011x xxxxB) . 'A' to 'Z' are also continuous without gap. However, there is a gap between uppercase and lowercase letters. To convert between upper and lowercase, flip the value of bit-5.
  • 09H for Tab ( '\t' ).
  • 0AH for Line-Feed or newline (LF or '\n' ) and 0DH for Carriage-Return (CR or 'r' ), which are used as line delimiter (aka line separator , end-of-line ) for text files. There is unfortunately no standard for line delimiter: Unixes and Mac use 0AH (LF or " \n "), Windows use 0D0AH (CR+LF or " \r\n "). Programming languages such as C/C++/Java (which was created on Unix) use 0AH (LF or " \n ").
  • In programming languages such as C/C++/Java, line-feed ( 0AH ) is denoted as '\n' , carriage-return ( 0DH ) as '\r' , tab ( 09H ) as '\t' .
DECHEXMeaningDECHEXMeaning
000NULNull1711DC1Device Control 1
101SOHStart of Heading1812DC2Device Control 2
202STXStart of Text1913DC3Device Control 3
303ETXEnd of Text2014DC4Device Control 4
404EOTEnd of Transmission2115NAKNegative Ack.
505ENQEnquiry2216SYNSync. Idle
606ACKAcknowledgment2317ETBEnd of Transmission
707BELBell2418CANCancel
808BS Back Space 2519EMEnd of Medium
261ASUBSubstitute
271BESCEscape
110BVTVertical Feed281CIS4File Separator
120CFFForm Feed 291DIS3Group Separator
301EIS2Record Separator
140ESOShift Out311FIS1Unit Separator
150FSIShift In        
1610DLEDatalink Escape 127 7F DEL Delete

8-bit Latin-1 (aka ISO/IEC 8859-1)

ISO/IEC-8859 is a collection of 8-bit character encoding standards for the western languages.

ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for western european languages. It has 191 printable characters from the latin script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)), is the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are given as follows:

Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
A NBSP ¡¢ £¤¥¦§¨©ª«¬SHY®¯
B °±²³´µ·¸¹º»¼½¾ ¿
C ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
D ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
E àáâãäåæçèéêëìíîï
F ðñòóôõö÷øùúûüýþÿ

ISO/IEC-8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central European (Polish, Czech, Hungarian, etc), Part 3 for South European (Turkish, etc), Part 4 for North European (Estonian, Latvian, etc), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandon, Part 13 for Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc. Part 16 for South-Eastern European.

Other 8-bit Extension of US-ASCII (ASCII Extensions)

Beside the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each others.

ANSI (American National Standards Institute) (aka Windows-1252 , or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1 with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) were replaced with question marks or some strange symbols. It it because the document is labeled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat charset ISO-8859-1 as Windows-1252 in order to accommodate such mis-labeling.

Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
8   ƒ ˆ Š Œ   Ž  
9     š œ   ž Ÿ

EBCDIC (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.

Unicode (aka ISO/IEC 10646 Universal Character Set)

Before Unicode, no single character encoding scheme could represent characters in all languages. For example, western european uses several encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes are in conflict of each other, i.e., the same code number is assigned to different characters.

Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous. Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org ). Unicode is an ISO/IEC standard 10646.

Unicode is backward compatible with the 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as US-ASCII; and the first 256 characters are the same as Latin-1.

Unicode originally uses 16 bits (called UCS-2 or Unicode Character Set - 2 byte), which can represent up to 65,536 characters. It has since been expanded to more than 16 bits, currently stands at 21 bits. The range of the legal codes in ISO/IEC 10646 is now from U+0000H to U+10FFFFH (21 bits or about 2 million characters), covering all current and ancient historical scripts. The original 16-bit range of U+0000H to U+FFFFH (65536 characters) is known as Basic Multilingual Plane (BMP), covering all the major languages in use currently. The characters outside BMP are called Supplementary Characters , which are not frequently-used.

Unicode has two encoding schemes:

  • UCS-2 (Universal Character Set - 2 Byte): Uses 2 bytes (16 bits), covering 65,536 characters in the BMP. BMP is sufficient for most of the applications. UCS-2 is now obsolete.
  • UCS-4 (Universal Character Set - 4 Byte): Uses 4 bytes (32 bits), covering BMP and the supplementary characters.

DataRep_Unicode.png

UTF-8 (Unicode Transformation Format - 8-bit)

The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if the document contains mainly ASCII characters, because each character occupies two bytes of storage. Variable-length encoding schemes, such as UTF-8, which uses 1-4 bytes to represent a character, was devised to improve the efficiency. In UTF-8, the 128 commonly-used US-ASCII characters use only 1 byte, but some less-commonly characters may require up to 4 bytes. Overall, the efficiency improved for document containing mainly US-ASCII texts.

The transformation between Unicode and UTF-8 is as follows:

Bits Unicode UTF-8 Code Bytes
7 00000000 0xxxxxxx 0xxxxxxx 1 (ASCII)
11 00000yyy yyxxxxxx 110yyyyy 10xxxxxx 2
16 zzzzyyyy yyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx 3
21 000uuuuu zzzzyyyy yyxxxxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx 4

In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters are padded with a leading zero; thus has the same value as ASCII. Hence, UTF-8 can be used with all software using ASCII. Unicode numbers of 128 and above, which are less frequently used, are encoded using more bytes (2-4 bytes). UTF-8 generally requires less storage and is compatible with ASCII. The drawback of UTF-8 is more processing power needed to unpack the code due to its variable length. UTF-8 is the most popular format for Unicode.

  • UTF-8 uses 1-3 bytes for the characters in BMP (16-bit), and 4 bytes for supplementary characters outside BMP (21-bit).
  • The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte. Most European and Middle East characters use a 2-byte sequence, which includes extended Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) use three-byte sequences.
  • All the bytes, except the 128 ASCII characters, have a leading '1' bit. In other words, the ASCII bytes, with a leading '0' bit, can be identified and decoded easily.

Example : 您好 (Unicode: 60A8H 597DH)

UTF-16 (Unicode Transformation Format - 16-bit)

UTF-16 is a variable-length Unicode character encoding scheme, which uses 2 to 4 bytes. UTF-16 is not commonly used. The transformation table is as follows:

Unicode UTF-16 Code Bytes
xxxxxxxx xxxxxxxx Same as UCS-2 - no encoding 2
000uuuuu zzzzyyyy yyxxxxxx
(uuuuu≠0)
110110ww wwzzzzyy 110111yy yyxxxxxx
(wwww = uuuuu - 1)
4

Take note that for the 65536 characters in BMP, the UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP.

For BMP characters, UTF-16 is the same as UCS-2. For supplementary characters, each character requires a pair 16-bit values, the first from the high-surrogates range, ( \uD800-\uDBFF ), the second from the low-surrogates range ( \uDC00-\uDFFF ).

UTF-32 (Unicode Transformation Format - 32-bit)

Same as UCS-4, which uses 4 bytes for each character - unencoded.

Formats of Multi-Byte (e.g., Unicode) Text Files

Endianess (or byte-order) : For a multi-byte character, you need to take care of the order of the bytes in storage. In big endian , the most significant byte is stored at the memory location with the lowest address (big byte first). In little endian , the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number of 60A8H ) is stored as 60 A8 in big endian; and stored as A8 60 in little endian. Big endian, which produces a more readable hex dump, is more commonly-used, and is often the default.

BOM (Byte Order Mark) : BOM is a special Unicode character having code number of FEFFH , which is used to differentiate big-endian and little-endian. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH . Unicode reserves these two code numbers to prevent it from crashing with another character.

Unicode text files could take on these formats:

  • Big Endian: UCS-2BE, UTF-16BE, UTF-32BE.
  • Little Endian: UCS-2LE, UTF-16LE, UTF-32LE.
  • UTF-16 with BOM. The first character of the file is a BOM character, which specifies the endianess. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH .

UTF-8 file is always stored as big endian. BOM plays no part. However, in some systems (in particular Windows), a BOM is added as the first character in the UTF-8 file as the signature to identity the file as UTF-8 encoded. The BOM character ( FEFFH ) is encoded in UTF-8 as EF BB BF . Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted in other system. You can have a UTF-8 file without BOM.

Formats of Text Files

Line Delimiter or End-Of-Line (EOL) : Sometimes, when you use the Windows NotePad to open a text file (created in Unix or Mac), all the lines are joined together. This is because different operating platforms use different character as the so-called line delimiter (or end-of-line or EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).

  • Windows/DOS uses OD0AH (CR+LF or " \r\n ") as EOL.
  • Unix and Mac use 0AH (LF or " \n ") only.

End-of-File (EOF) : [TODO]

Windows' CMD Codepage

Character encoding scheme (charset) in Windows is called codepage . In CMD shell, you can issue command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.

Take note that:

  • The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII , which is different from Latin-1 for code numbers above 127.
  • Codepage 1252 (Windows-1252), is not exactly the same as Latin-1. It assigns code number 80H to 9FH to letters and punctuation, such as smart single-quotes and double-quotes. A common problem in browser that display quotes and apostrophe in question marks or boxes is because the page is supposed to be Windows-1252, but mislabelled as ISO-8859-1.
  • For internationalization and chinese character set: codepage 65001 for UTF8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for chinese characters in GB2312, codepage 950 for chinese characters in Big5.

Chinese Character Sets

Unicode supports all languages, including asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (collectively called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which unfortunately, requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16).

Worse still, there are also various chinese character sets, which is not compatible with Unicode:

  • GB2312/GBK: for simplified chinese characters. GB2312 uses 2 bytes for each chinese character. The most significant bit (MSB) of both bytes are set to 1 to co-exist with 7-bit ASCII with the MSB of 0. There are about 6700 characters. GBK is an extension of GB2312, which include more characters as well as traditional chinese characters.
  • BIG5: for traditional chinese characters BIG5 also uses 2 bytes for each chinese character. The most significant bit of both bytes are also set to 1. BIG5 is not compatible with GBK, i.e., the same code number is assigned to different character.

For example, the world is made more interesting with these many standards:

  Standard Characters Codes
Simplified GB2312 和谐 BACD  D0B3
UCS-2 和谐 548C  8C10
UTF-8 和谐 E5928C  E8B090
Traditional BIG5 和諧 A94D  BFD3
UCS-2 和諧 548C  8AE7
UTF-8 和諧 E5928C  E8ABA7

Notes for Windows' CMD Users : To display the chinese character correctly in CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You can use command " chcp " to display the current code page and command " chcp codepage_number " to change the codepage. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT Raster font).

Collating Sequences (for Ranking Characters)

A string consists of a sequence of characters in upper or lower cases, e.g., "apple" , "BOY" , "Cat" . In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the example would be "BOY" , "apple" , "Cat" because uppercase letters have a smaller code number than lowercase letters. This does not agree with the so-called dictionary order , where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is "10" (ten) at times is ordered in front of "1" to "9" .

Hence, in sorting or comparison of strings, a so-called collating sequence (or collation ) is often defined, which specifies the ranks for letters (uppercase, lowercase), numbers, and special symbols. There are many collating sequences available. It is entirely up to you to choose a collating sequence to meet your application's specific requirements. Some case-insensitive dictionary-order collating sequences have the same rank for same uppercase and lowercase letters, i.e., 'A' , 'a' ⇒ 'B' , 'b' ⇒ ... ⇒ 'Z' , 'z' . Some case-sensitive dictionary-order collating sequences put the uppercase letter before its lowercase counterpart, i.e., 'A' ⇒ 'B' ⇒ 'C' ... ⇒ 'a' ⇒ 'b' ⇒ 'c' ... . Typically, space is ranked before digits '0' to '9' , followed by the alphabets.

Collating sequence is often language dependent, as different languages use different sets of characters (e.g., á, é, a, α) with their own orders.

For Java Programmers - java.nio.Charset

JDK 1.4 introduced a new java.nio.charset package to support encoding/decoding of characters from UCS-2 used internally in Java program to any supported charset used by external devices.

Example : The following program encodes some Unicode texts in various encoding scheme, and display the Hex codes of the encoded byte sequences.

For Java Programmers - char and String

The char data type are based on the original 16-bit Unicode standard called UCS-2. The Unicode has since evolved to 21 bits, with code range of U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is known as the Basic Multilingual Plane ( BMP ). Characters above U+FFFF are called supplementary characters. A 16-bit Java char cannot hold a supplementary character.

Recall that in the UTF-16 encoding scheme, a BMP characters uses 2 bytes. It is the same as UCS-2. A supplementary character uses 4 bytes. and requires a pair of 16-bit values, the first from the high-surrogates range, ( \uD800-\uDBFF ), the second from the low-surrogates range ( \uDC00-\uDFFF ).

In Java, a String is a sequences of Unicode characters. Java, in fact, uses UTF-16 for String and StringBuffer . For BMP characters, they are the same as UCS-2. For supplementary characters, each characters requires a pair of char values.

Java methods that accept a 16-bit char value does not support supplementary characters. Methods that accept a 32-bit int value support all Unicode characters (in the lower 21 bits), including supplementary characters.

This is meant to be an academic discussion. I have yet to encounter the use of supplementary characters!

Displaying Hex Values & Hex Editors

At times, you may need to display the hex values of a file, especially in dealing with Unicode characters. A Hex Editor is a handy tool that a good programmer should possess in his/her toolbox. There are many freeware/shareware Hex Editor available. Try google "Hex Editor".

I used the followings:

  • NotePad++ with Hex Editor Plug-in: Open-source and free. You can toggle between Hex view and Normal view by pushing the "H" button.
  • PSPad: Freeware. You can toggle to Hex view by choosing "View" menu and select "Hex Edit Mode".
  • TextPad: Shareware without expiration period. To view the Hex value, you need to "open" the file by choosing the file format of "binary" (??).
  • UltraEdit: Shareware, not free, 30-day trial only.

Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between Hex and normal view, free, ....

The following Java program can be used to display hex code for Java Primitives (integer, character and floating-point):

System.out.println("Hex is " + Integer.toHexString(i)); // 3039 System.out.println("Binary is " + Integer.toBinaryString(i)); // 11000000111001 System.out.println("Octal is " + Integer.toOctalString(i)); // 30071 System.out.printf("Hex is %x\n", i); // 3039 System.out.printf("Octal is %o\n", i); // 30071 char c = 'a'; System.out.println("Character is " + c); // a System.out.printf("Character is %c\n", c); // a System.out.printf("Hex is %x\n", (short)c); // 61 System.out.printf("Decimal is %d\n", (short)c); // 97 float f = 3.5f; System.out.println("Decimal is " + f); // 3.5 System.out.println(Float.toHexString(f)); // 0x1.cp1 (Fraction=1.c, Exponent=1) f = -0.75f; System.out.println("Decimal is " + f); // -0.75 System.out.println(Float.toHexString(f)); // -0x1.8p-1 (F=-1.8, E=-1) double d = 11.22; System.out.println("Decimal is " + d); // 11.22 System.out.println(Double.toHexString(d)); // 0x1.670a3d70a3d71p3 (F=1.670a3d70a3d71 E=3) } }

In Eclipse, you can view the hex code for integer primitive Java variables in debug mode as follows: In debug perspective, "Variable" panel ⇒ Select the "menu" (inverted triangle) ⇒ Java ⇒ Java Preferences... ⇒ Primitive Display Options ⇒ Check "Display hexadecimal values (byte, short, char, int, long)".

Summary - Why Bother about Data Representation?

Integer number 1 , floating-point number 1.0 character symbol '1' , and string "1" are totally different inside the computer memory. You need to know the difference to write good and high-performance programs.

  • In 8-bit signed integer , integer number 1 is represented as 00000001B .
  • In 8-bit unsigned integer , integer number 1 is represented as 00000001B .
  • In 16-bit signed integer , integer number 1 is represented as 00000000 00000001B .
  • In 32-bit signed integer , integer number 1 is represented as 00000000 00000000 00000000 00000001B .
  • In 32-bit floating-point representation , number 1.0 is represented as 0 01111111 0000000 00000000 00000000B , i.e., S=0 , E=127 , F=0 .
  • In 64-bit floating-point representation , number 1.0 is represented as 0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B , i.e., S=0 , E=1023 , F=0 .
  • In 8-bit Latin-1, the character symbol '1' is represented as 00110001B (or 31H ).
  • In 16-bit UCS-2, the character symbol '1' is represented as 00000000 00110001B .
  • In UTF-8, the character symbol '1' is represented as 00110001B .

If you "add" a 16-bit signed integer 1 and Latin-1 character '1' or a string "1", you could get a surprise.

Exercises (Data Representation)

For the following 16-bit codes:

Give their values, if they are representing:

  • a 16-bit unsigned integer;
  • a 16-bit signed integer;
  • two 8-bit unsigned integers;
  • two 8-bit signed integers;
  • a 16-bit Unicode characters;
  • two 8-bit ISO-8859-1 characters.

Ans: (1) 42 , 32810 ; (2) 42 , -32726 ; (3) 0 , 42 ; 128 , 42 ; (4) 0 , 42 ; -128 , 42 ; (5) '*' ; '耪' ; (6) NUL , '*' ; PAD , '*' .

REFERENCES & RESOURCES

  • (Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".
  • (ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992), "Information technology - 7-bit coded character set for information interchange".
  • (Latin-I Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No. 1".
  • (Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".
  • Unicode Consortium @ http://www.unicode.org .

Last modified: January, 2014

Basic Data Types in Python

Basic Data Types in Python: A Quick Exploration

Table of Contents

Python’s Basic Data Types

Integer literals, integer methods, the built-in int() function, floating-point literals, floating-point numbers representation, floating-point methods, the built-in float() function, complex number literals, complex number methods, the built-in complex() function, regular string literals, escape sequences in strings, raw string literals, f-string literals, string methods, common sequence operations on strings, the built-in str() and repr() functions, bytes literals, the built-in bytes() function, the built-in bytearray() function, bytes and bytearray methods, boolean literals, the built-in bool() function.

Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Basic Data Types in Python

Python has several basic data types that are built into the language. With these types, you can represent numeric values, text and binary data, and Boolean values in your code. So, these data types are the basic building blocks of most Python programs and projects.

In this tutorial, you’ll learn about:

  • Numeric types, such as int , float , and complex
  • The str data type, which represents textual data
  • The bytes and bytearray data types for storing bytes
  • Boolean values with the bool data type

In this tutorial, you’ll learn only the basics of each data type. To learn more about a specific data type, you’ll find useful resources in the corresponding section.

Get Your Code: Click here to download the free sample code that you’ll use to learn about basic data types in Python.

Take the Quiz: Test your knowledge with our interactive “Basic Data Types in Python: A Quick Exploration” quiz. You’ll receive a score upon completion to help you track your learning progress:

Interactive Quiz

Take this quiz to test your understanding of the basic data types that are built into Python, like numbers, strings, bytes, and Booleans.

Python has several built-in data types that you can use out of the box because they’re built into the language. From all the built-in types available, you’ll find that a few of them represent basic objects, such as numbers, strings and characters, bytes, and Boolean values.

Note that the term basic refers to objects that can represent data you typically find in real life, such as numbers and text. It doesn’t include composite data types, such as lists , tuples , dictionaries , and others.

In Python, the built-in data types that you can consider basic are the following:

Class Basic Type
Integer numbers
Floating-point numbers
Complex numbers
Strings and characters
, Bytes
Boolean values

In the following sections, you’ll learn the basics of how to create, use, and work with all of these built-in data types in Python.

Integer Numbers

Integer numbers are whole numbers with no decimal places. They can be positive or negative numbers. For example, 0 , 1 , 2 , 3 , -1 , -2 , and -3 are all integers. Usually, you’ll use positive integer numbers to count things.

In Python, the integer data type is represented by the int class:

In the following sections, you’ll learn the basics of how to create and work with integer numbers in Python.

When you need to use integer numbers in your code, you’ll often use integer literals directly. Literals are constant values of built-in types spelled out literally, such as integers. Python provides a few different ways to create integer literals. The most common way is to use base-ten literals that look the same as integers look in math:

Here, you have three integer numbers: a positive one, a negative one, and zero. Note that to create negative integers, you need to prepend the minus sign ( - ) to the number.

Python has no limit to how long an integer value can be. The only constraint is the amount of memory your system has. Beyond that, an integer can be as long as you need:

For a really, really long integer, you can get a ValueError when converting it to a string:

If you need to print an integer number beyond the 4300-digit limit, then you can use the sys.set_int_max_str_digits() function to increase the limit and make your code work.

When you’re working with long integers, you can use the underscore character to make the literals more readable:

With the underscore as a thousands separator, you can make your integer literals more readable for fellow programmers reading your code.

You can also use other bases to represent integers. You can prepend the following characters to an integer value to indicate a base other than 10 :

Prefix Representation Base
or (Zero + b or B) Binary 2
or (Zero + o or O) Octal 8
or (Zero + x or X) Hexadecimal 16

Using the above characters, you can create integer literals using binary , octal , and hexadecimal representations. For example:

Note that the underlying type of a Python integer is always int . So, in all cases, the built-in type() function returns int , irrespective of the base you use to build the literal.

The built-in int type has a few methods that you can use in some situations. Here’s a quick summary of these methods:

Method Description
Returns a pair of integers whose ratio is equal to the original integer and has a positive denominator
Returns the number of ones in the binary representation of the absolute value of the integer
Returns the number of bits necessary to represent an integer in binary, excluding the sign and leading zeros
Returns the integer represented by the given array of bytes
Returns an array of bytes representing an integer
Returns

When you call the .as_integer_ratio() method on an integer value, you get the integer as the numerator and 1 as the denominator. As you’ll see in a moment, this method is more useful in floating-point numbers.

Note that the int type also has a method called .is_integer() , which always returns True . This method exists for duck typing compatibility with floating-point numbers, which have the method as part of their public interface .

Note: To access an integer method on a literal, you need to wrap the literal in parentheses:

The parentheses are required because the dot character ( . ) also defines floating-point numbers, as you’ll learn in a moment. If you don’t use the parentheses, then you get a SyntaxError .

The .bit_count() and .bit_length() methods can help you when working on digital signal processing . For example, you may want every transmitted signal to have an even number of set bits:

In this toy example, you use .bit_count() to ensure that the received signal has the correct parity. This way, you implement a basic error detection mechanism.

Finally, the .from_bytes() and .to_bytes() methods can be useful in network programming. Often, you need to send and receive data over the network in binary format. To do this, you can use .to_bytes() to convert the message for network transmission. Similarly, you can use .from_bytes() to convert the message back.

The built-in int() function provides another way to create integer values using different representations. With no arguments, the function returns 0 :

This feature makes int() especially useful when you need a factory function for classes like defaultdict from the collections module.

Note: In Python, the built-in functions associated with data types, such as int() , float() , str() , and bytes() , are classes with a function-style name. The Python documentation calls them functions, so you’ll follow that practice in this tutorial. However, keep in mind that something like int() is really a class constructor rather than a regular function.

The int() function is commonly used to convert other data types into integers, provided that they’re valid numeric values:

In these examples, you first use int() to convert a floating-point number into an integer. Then, you convert a string into an integer. Note that when it comes to strings, you must ensure that the input string is a valid numeric value. Otherwise, you’ll get a ValueError exception.

Note: When you use the int() function to convert floating-point numbers, you must be aware that the function just removes the decimal or fractional part.

This function can take an additional argument called base , which defaults to 10 for decimal integers. This argument allows you to convert strings that represent integer values, which are expressed using a different base:

In this case, the first argument must be a string representing an integer value with or without a prefix. Then, you must provide the appropriate base in the second argument to run the conversion. Once you call the function, you get the resulting integer value.

Floating-Point Numbers

Floating-point numbers , or just float , are numbers with a decimal place. For example, 1.0 and 3.14 are floating-point numbers. You can also have negative float numbers, such as -2.75 . In Python, the name of the float class represents floating-point numbers:

In the following sections, you’ll learn the basics of how to create and work with floating-point numbers in Python.

The float type in Python designates floating-point numbers. To create these types of numbers, you can also use literals, similar to what you use in math. However, in Python, the dot character ( . ) is what you must use to create floating-point literals:

In these quick examples, you create floating-point numbers in three different ways. First, you have a literal build using an integer part, the dot, and the decimal part. You can also create a literal using the dot without specifying the decimal part, which defaults to 0 . Finally, you make a literal without specifying the integer part, which also defaults to 0 .

You can also have negative float numbers:

To create a negative floating-point number using a literal, you need to prepend the minus sign ( - ) to the number.

Similar to integer numbers, if you’re working with long floating-point numbers, you can use the underscore character as a thousands separator:

By using an underscore, you can make your floating-point literals more readable for humans, which is great.

Optionally, you can use the characters e or E followed by a positive or negative integer to express the number using scientific notation :

By using the e or E character, you can represent any floating-point number using scientific notation, as you did in the above examples.

Now, you can take a more in-depth look at how Python internally represents floating-point numbers. You can readily use floating-point numbers in Python without understanding them to this level, so don’t worry if this seems overly complicated. The information in this section is only meant to satisfy your curiosity.

Note: For additional information on the floating-point representation in Python and the potential pitfalls, see Floating Point Arithmetic: Issues and Limitations in the Python documentation.

Almost all platforms represent Python float values as 64-bit (double-precision) values, according to the IEEE 754 standard. In that case, a floating-point number’s maximum value is approximately 1.8 ⨉ 10 308 . Python will indicate this number, and any numbers greater than that, by the "inf" string:

The closest a nonzero number can be to zero is approximately 5.0 ⨉ 10 -324 . Anything closer to zero than that is effectively considered to be zero:

Python internally represents floating-point numbers as binary (base-2) fractions. Most decimal fractions can’t be represented exactly as binary fractions. So, in most cases, the internal representation of a floating-point number is an approximation of its actual value.

In practice, the difference between the actual and represented values is small and should be manageable. However, check out Make Python Lie to You for some challenges you should be aware of.

The built-in float type has a few methods and attributes which can be useful in some situations. Here’s a quick summary of them:

Method Description
Returns a pair of integers whose ratio is exactly equal to the original
Returns if the float instance is finite with integral value, and otherwise
Returns a representation of a floating-point number as a hexadecimal string
Builds the from a hexadecimal string

The .as_integer_ratio() method on a float value returns a pair of integers whose ratio equals the original number. You can use this method in scientific computations that require high precision. In these situations, you may need to avoid precision loss due to floating-point rounding errors.

For example, say that you need to perform computations with the gravitational constant:

With this exact ratio, you can perform calculations and prevent floating-point errors that may alter the results of your research.

The .is_integer() method allows you to check whether a given float value is an integer:

When the number after the decimal point is 0 , the .is_integer() method returns True . Otherwise, it returns False .

Finally, the .hex() and .fromhex() methods allow you to work with floating-point values using a hexadecimal representation:

The .hex() method returns a string that represents the target float value as a hexadecimal value. Note that .hex() is an instance method. The .fromhex() method takes a string that represents a floating-point number as an argument and builds an actual float number from it.

In both methods, the hexadecimal string has the following format:

In this template, apart from the integer identifier, the components are optional. Here’s what they mean:

  • sign defines whether the number is positive or negative. It may be either + or - . Only the - sign is required because + is the default.
  • "0x" is the hexadecimal prefix.
  • integer is a string of hexadecimal digits representing the whole part of the float number.
  • "." is a dot that separates the whole and fractional parts.
  • fraction is a string of hexadecimal digits representing the fractional part of the float number.
  • "p" allows for adding an exponent value.
  • exponent is a decimal integer with an optional leading sign.

With these components, you’ll be able to create valid hexadecimal strings to process your floating-point numbers with the .hex() and .fromhex() methods.

The built-in float() function provides another way to create floating-point values. When you call float() with no argument, then you get 0.0 :

Again, this feature of float() allows you to use it as a factory function.

The float() function also helps you convert other data types into float , provided that they’re valid numeric values:

In these examples, you first use float() to convert an integer number into a float. Then, you convert a string into a float. Again, with strings, you need to make sure that the input string is a valid numeric value. Otherwise, you get a ValueError exception.

Complex Numbers

Python has a built-in type for complex numbers . Complex numbers are composed of real and imaginary parts. They have the form a + bi , where a and b are real numbers, and i is the imaginary unit . In Python, you’ll use a j instead of an i . For example:

In this example, the argument to type() may look like an expression. However, it’s a literal of a complex number in Python. If you pass the literal to the type() function, then you’ll get the complex type back.

Note: To dive deeper into complex numbers, check out the Simplify Complex Numbers With Python tutorial.

In the following sections, you’ll learn the basics of creating complex numbers in Python. You’ll also explore the methods of this data type.

In Python, you can define complex numbers using literals that look like a + bj , where a is the real part, and bj is the imaginary part:

As you can conclude from these examples, there are many ways to create complex numbers using literals. The key is that you need to use the j letter in one of the components. Note that the j can’t be used alone. If you try to do so, you get a NameError exception because Python thinks that you’re creating an expression . Instead, you need to write 1j .

In Python, the complex type has a single method called .conjugate() . When you call this method on a complex number, you get the conjugate:

The conjugate() method flips the sign of the imaginary part, returning the complex conjugate .

You can also use the built-in complex() function to create complex numbers by providing the real and imaginary parts as arguments:

When you call complex() with no argument, you get 0j . If you call the function with a single argument, that argument is the real part, and the imaginary part will be 0j . If you want only the imaginary part, you can pass 0 as the first argument. Note that you can also use negative numbers. In general, you can use integers and floating-point numbers as arguments to complex() .

You can also use complex() to convert strings to complex numbers:

To convert strings into complex numbers, you must provide a string that follows the format of complex numbers. For example, you can’t have spaces between the components. If you add spaces, then you get a ValueError exception.

Finally, note that you can’t use strings to provide the imaginary part of complex numbers. If you do that, then you get a TypeError exception.

Strings and Characters

In Python, strings are sequences of character data that you can use to represent and store textual data. The string type in Python is called str :

In this example, the argument to type() is a string literal that you commonly create using double quotes to enclose some text.

Note: Check out the Strings and Character Data in Python tutorial to dive deeper into Python strings.

In the following sections, you’ll learn the basics of how to create, use, format, and manipulate strings in Python.

You can also use literals to create strings. To build a single-line string literal, you can use double ( "" ) or single quotes ( '' ) and, optionally, a sequence of characters in between them. All the characters between the opening and closing quotes are part of the string:

Python’s strings can contain as many characters as you need. The only limit is your computer’s memory.

You can define empty strings by using the quotes without placing characters between them:

An empty string doesn’t contain any characters, so when you use the built-in len() function with an empty string as an argument, you get 0 as a result.

There is yet another way to delimit strings in Python. You can create triple-quoted string literals, which can be delimited using either three single quotes or three double quotes. Triple-quoted strings are commonly used to build multiline string literals. However, you can also use them to create single-line literals:

Even though you can use triple-quoted strings to create single-line string literals, the main use case of them would be to create multiline strings. In Python code, probably the most common use case for these string literals is when you need to provide docstrings for your packages, modules , functions , classes , and methods .

What if you want to include a quote character as part of the string itself? Your first impulse might be to try something like this:

As you can see, that doesn’t work so well. The string in this example opens with a single quote, so Python assumes the next single quote—the one in parentheses—is the closing delimiter. The final single quote is then a stray, which causes the syntax error shown.

If you want to include either type of quote character within the string, then you can delimit the string with the other type. In other words, if a string is to contain a single quote, delimit it with double quotes and vice versa:

In these examples, your first string includes a single quote as part of the text. To do this, you use double quotes to delimit the literal. In the second example, you do the opposite.

Sometimes, you want Python to interpret a character or sequence of characters within a string differently. This may occur in one of two ways. You may want to:

  • Apply special meaning to characters
  • Suppress special character meaning

You can accomplish these goals by using a backslash ( \ ) character to indicate that the characters following it should be interpreted specially. The combination of a backslash and a specific character is called an escape sequence . That’s because the backslash causes the subsequent character to escape its usual meaning.

You already know that if you use single quotes to delimit a string, then you can’t directly embed a single quote character as part of the string because, for that string, the single quote has a special meaning —it terminates the string. You can eliminate this limitation by using double quotes to delimit the string.

Alternatively, you can escape the quote character using a backslash:

In this example, the backslash escapes the single quote character by suppressing its usual meaning. Now, Python knows that your intention isn’t to terminate the string but to embed the single quote.

The following is a table of escape sequences that cause Python to suppress the usual special interpretation of a character in a string:

Character Usual Interpretation Escape Sequence Escaped Interpretation
Delimit a string literal Literal single quote ( ) character
Delimit a string literal Literal double quote ( ) character
Terminates the input line Newline is ignored
Introduces an escape sequence Literal backslash ( ) character

You already have an idea of how the first two escape sequences work. Now, how does the newline escape sequence work? Usually, a newline character terminates a physical line of input. So, pressing Enter in the middle of a string will cause an error:

When you press Enter after typing Hello , you get a SyntaxError . If you need to break up a string over more than one line, then you can include a backslash before each new line:

By using a backslash before pressing enter, you make Python ignore the new line and interpret the whole construct as a single line.

Finally, sometimes you need to include a literal backslash character in a string. If that backslash doesn’t precede a character with a special meaning, then you can insert it right away:

In this example, the character after the backslash doesn’t match any known escape sequence, so Python inserts the actual backslash for you. Note how the resulting string automatically doubles the backslash. Even though this example works, the best practice is to always double the backslash when you need this character in a string.

However, you may have the need to include a backslash right before a character that makes up an escape sequence:

Because the sequence \" matches a known escape sequence, your string fails with a SyntaxError . To avoid this issue, you can double the backslash:

In this update, you double the backslash to escape the character and prevent Python from raising an error.

Note: When you use the built-in print() function to print a string that includes an escaped backslash, then you won’t see the double backslash in the output:

In this example, the output only displays one backslash, producing the desired effect.

Up to this point, you’ve learned how to suppress the meaning of a given character by escaping it. Suppose you need to create a string containing a tab character. Some text editors may allow you to insert a tab character directly into your code. However, this is considered a poor practice for several reasons:

  • Computers can distinguish between tabs and a sequence of spaces, but human beings can’t because these characters are visually indistinguishable.
  • Some text editors automatically eliminate tabs by expanding them to an appropriate number of spaces.
  • Some Python REPL environments will not insert tabs into code.

In Python, you can specify a tab character by the \t escape sequence:

The \t escape sequence changes the usual meaning of the letter t . Instead, Python interprets the combination as a tab character.

Here is a list of escape sequences that cause Python to apply special meaning to some characters instead of interpreting them literally:

Escape Sequence Escaped Interpretation
ASCII Bell ( ) character
ASCII Backspace ( ) character
ASCII Formfeed ( ) character
ASCII Linefeed ( ) character
Character from Unicode database with given
ASCII Carriage return ( ) character
ASCII Horizontal tab ( ) character
Unicode character with 16-bit hex value
Unicode character with 32-bit hex value
ASCII Vertical tab ( ) character
Character with octal value
Character with hex value

Of these escape sequences, the newline or linefeed character ( \n ) is probably the most popular. This sequence is commonly used to create nicely formatted text outputs.

Here are a few examples of the escape sequences in action:

These escape sequences are typically useful when you need to insert characters that aren’t readily generated from the keyboard or aren’t easily readable or printable.

A raw string is a string that doesn’t translate the escape sequences. Any backslash characters are left in the string.

Note: To learn more about raw strings, check out the What Are Python Raw Strings? tutorial.

To create a raw string, you can precede the literal with an r or R :

The raw string suppresses the meaning of the escape sequence and presents the characters as they are. This behavior comes in handy when you’re creating regular expressions because it allows you to use several different characters that may have special meanings without restrictions.

Python has another type of string literal called formatted strings or f-strings for short. F-strings allow you to interpolate values into your strings and format them as you need.

Note: To dive deeper into f-strings, check out the Python’s F-String for String Interpolation and Formatting tutorial

To build f-string literals, you must prepend an f or F letter to the string literal. Because the idea behind f-strings is to interpolate values and format them into the final string, you need to use something called a replacement field in your string literal. You create these fields using curly brackets.

Here’s a quick example of an f-string literal:

In this example, you interpolate the variable name into your string using an f-string literal and a replacement field.

You can also use f-strings to format the interpolated values. To do that, you can use format specifiers that use the syntax defined in Python’s string format mini-language . For example, here’s how you can present numeric values using a currency format:

Inside the replacement field, you have the variable you want to interpolate and the format specifier, which is the string that starts with a colon ( : ). In this example, the format specifier defines a floating-point number with two decimal places.

Python’s str data type is probably the built-in type with the most available methods . In fact, you’ll find methods for most string processing operations. Here’s a summary of the methods that perform some string processing and return a transformed string object:

Method Description
Converts the first character to uppercase and the rest to lowercase
Converts the string into lowercase
Centers the string between using
Encodes the string using the specified
Replaces tab characters with spaces according to
Interpolates and formats the specified values
Interpolates and formats the specified values using a dictionary
Joins the items in an iterable with the string as a separator
Returns a left-justified version of the string
Returns a right-justified version of the string
Converts the string into lowercase
Trims the string by removing from the beginning and end
Trims the string by removing from the beginning
Trims the string by removing from the end
Removes from the beginning of the string
Removes from the end of the string
Returns a string where the substring is replaced with
Converts lowercase letters to uppercase letters and vice versa
Converts the first character of each word to uppercase and the rest to lowercase
Converts a string into uppercase
Fills the string with a specified number of zeroes at the beginning

All the above methods allow you to perform a specific transformation on an existing string. In all cases, you get a new string as a result:

As you can see, the methods in these examples perform a specific transformation on the original string and return a new string object.

You’ll also find that the str class has several Boolean-valued methods or predicate methods:

Method Result
if the string ends with the specified suffix, otherwise
if the string starts with the specified prefix, otherwise
if all characters in the string are alphanumeric, otherwise
if all characters in the string are letters, otherwise
if the string is empty or all characters in the string are ASCII, otherwise
if all characters in the string are decimals, otherwise
if all characters in the string are digits, otherwise
if the string is a valid Python name, otherwise
if all characters in the string are lowercase, otherwise
if all characters in the string are numeric, otherwise
if all characters in the string are printable, otherwise
if all characters in the string are whitespaces, otherwise
if the string follows title case, otherwise
if all characters in the string are uppercase, otherwise

All these methods allow you to check for various conditions in your strings. Here are a few demonstrative examples:

In these examples, the methods check for specific conditions in the target string and return a Boolean value as a result.

Finally, you’ll find a few other methods that allow you to run several other operations on your strings:

Method Description
Returns the number of occurrences of a substring
Searches the string for a specified value and returns the position of where it was found
Searches the string for a specified value and returns the last position of where it was found
Searches the string for a specified value and returns the position of where it was found
Searches the string for a specified value and returns the last position of where it was found
Splits the string at the specified separator and returns a list
Splits the string at line breaks and returns a list
Splits the string at the first occurance of
Splits the string at the last occurance of
Splits the string at the specified separator and returns a list
Returns a translation table to be used in translations
Returns a translated string

The first method counts the number of repetitions of a substring in an existing string. Then, you have four methods that help you find substrings in a string.

The .split() method is especially useful when you need to split a string into a list of individual strings using a given character as a separator, which defaults to whitespaces. You can also use .partition() or .rpartition() if you need to divide the string in exactly two parts:

In these toy examples, you’ve used the .split() method to build a list of words from a sentence. Note that by default, the method uses whitespace characters as separators. You also used .partition() and .rpartition() to separate out the first and last number from a string with numbers.

The .maketrans() and .translate() are nice tools for playing with strings . For example, say that you want to implement the Cesar cipher algorithm . This algorithm allows for basic text encryption by shifting the alphabet by a number of letters. For example, if you shift the letter a by three, then you get the letter d , and so on.

The following code implements cipher() , a function that takes a character and rotates it by three:

In this example, you use .maketrans() to create a translation table that matches the lowercase alphabet to a shifted alphabet. Then, you apply the translation table to a string using the .translate() method.

Python’s strings are sequences of characters. As other built-in sequences like lists and tuples , strings support a set of operations that are known as common sequence operations . The table below is a summary of all the operations that are common to most sequence types in Python:

Operation Example Result
The length of
The item at index
A slice of from index to
A slice of from index to with step
The smallest item of
The largest item of
if an item of is equal to , else
if an item of is equal to , else
The concatenation of and
or The repetition of a number of times specified by
The index of the first occurrence of in
The total number of occurrences of in

Sometimes, you need to determine the number of characters in a string. In this situation, you can use the built-in len() function:

When you call len() with a string as an argument, you get the number of characters in the string at hand.

Another common operation you’d run on strings is retrieving a single character or a substring from an existing string. In these situations, you can use indexing and slicing, respectively:

To retrieve a character from an existing string, you use the indexing operator [index] with the index of the target character. Note that indices are zero-based, so the first character lives at index 0 .

To retrieve a slice or substring from an existing string, you use the slicing operator with the appropriate indices. In the example above, you don’t provide the start index i , so Python assumes that you want to start from the beginning of the string. Then, you give the end index j to tell Python where to stop the slicing.

You can take a leap and try the rest of the operations by yourself. It will be a great learning exercise!

When it comes to creating and working with strings, you have two functions that can help you out and make your life easier:

The built-in str() function allows you to create new strings and also convert other data types into strings:

In these examples, you use the str() function to convert objects from different built-in types into strings. In the first example, you use the function to create an empty string. In the other examples, you get strings consisting of the object’s literals between quotes, which provide user-friendly representations of the objects.

At first glance, these results may not seem useful. However, there are use cases where you need to use str() .

For example, say that you have a list of numeric values and want to join them using the str.join() method. This method only accepts iterables of strings, so you need to convert the numbers:

If you try to pass a list of numeric values to .join() , then you get a TypeError exception because the function only joins strings. To work around this issue, you use a generator expression to convert each number to its string representation.

Behind the str() function, you’ll have the .__str__() special method . In other words, when you call str() , Python automatically calls the .__str__() special method on the underlying object. You can use this special method to support str() in your own classes.

Consider the following Person class:

In this class, you have two instance attributes , .name and .age . Then, you have .__str__() special methods to provide user-friendly string representations for your class.

Here’s how this class works:

In this code snippet, you create an instance of Person . Then, you call str() using the object as an argument. As a result, you get a descriptive message back, which is the user-friendly string representation of your class.

Note: To learn more about objects’ string representations in Python, check out the When Should You Use .__repr__() vs .__str__() in Python? tutorial.

Similarly, when you pass an object to the built-in repr() function, you get a developer-friendly string representation of the object itself:

In the case of built-in types, the string representation you get with repr() is the same as the one you get with the str() function. This is because the representations are the literals of each object, and you can directly use them to re-create the object at hand.

Ideally, you should be able to re-create the current object using this representation. To illustrate, go ahead and update the Person class:

The .__repr__() special method allows you to provide a developer-friendly string representation for your class:

You should be able to copy and paste the resulting representation to re-create the object. That’s why this string representation is said to be developer-friendly.

Bytes and Bytearrays

Bytes are immutable sequences of single bytes. In Python, the bytes class allows you to build sequences of bytes. This data type is commonly used for manipulating binary data, encoding and decoding text, processing file input and output, and communicating through networks.

Python also has a bytearray class as a mutable counterpart to bytes objects:

In the following sections, you’ll learn the basics of how to create and work with bytes and bytearray objects in Python.

To create a bytes literal, you’ll use a syntax that’s largely the same as that for string literals. The difference is that you need to prepend a b to the string literal. As with string literals, you can use different types of quotes to define bytes literals:

There is yet another difference between string literals and bytes literals. To define bytes literals, you can only use ASCII characters. If you need to insert binary values over the 127 characters, then you have to use the appropriate escape sequence:

In this example, \xc3\xb1 is the escape sequence for the letter ñ in the Spanish word "España" . Note that if you try to use the ñ directly, you get a SyntaxError .

The built-in bytes() function provides another way to create bytes objects. With no arguments, the function returns an empty bytes object:

You can use the bytes() function to convert string literals to bytes objects:

In these examples, you first use bytes() to convert a string into a bytes object. Note that for this to work, you need to provide the appropriate character encoding . In this example, you use the UTF-8 encoding. If you try to convert a string literal without providing the encoding, then you get a TypeError exception.

You can also use bytes() with an iterable of integers where each number is the Unicode code point of the individual characters:

In this example, each number in the list you use as an argument to bytes() is the code point for a specific letter. For example, 65 is the code point for A , 66 for B , and so on. You can get the Unicode code point of any character using the built-in ord() function.

Python doesn’t have dedicated literal syntax for bytearray objects. To create them, you’ll always use the class constructor bytearray() , which is also known as a built-in function in Python. Here are a few examples of how to create bytearray objects using this function:

In the first example, you call bytearray() without an argument to create an empty bytearray object. In the second example, you call the function with an integer as an argument. In this case, you create a bytearray with five zero-filled items.

Next, you use a list of code points to create a bytearray . This call works the same as with bytes objects. Finally, you use a bytes literal to build up the bytearray object.

In Python, bytes and bytearray objects are quite similar to strings. Instead of being sequences of characters, bytes and bytearray objects are sequences of integer numbers, with values from 0 to 255 .

Because of their similarities with strings, the bytes and bytearray types support mostly the same methods as strings , so you won’t repeat them in this section. If you need detailed explanations of specific methods, then check out the Bytes and Bytearray Operations section in Python’s documentation.

Finally, both bytes and bytearray objects support the common sequence operations that you learned in the Common Sequence Operations on Strings section.

Boolean logic relies on the truth value of expressions and objects. The truth value of an expression or object can take one of two possible values: true or false . In Python, these two values are represented by True and False , respectively:

Both True and False are instances of the bool data type, which is built into Python. In the following sections, you’ll learn the basics about Python’s bool data type.

Python provides a built-in Boolean data type. Objects of this type may have one of two possible values: True or False . These values are defined as built-in constants with values of 1 and 0 , respectively. In practice, the bool type is a subclass of int . Therefore, True and False are also instances of int :

In Python, the bool type is a subclass of the int type. It has only two possible values, 0 and 1 , which map to the constants False and True .

These constant values are also the literals of the bool type:

Boolean objects that are equal to True are truthy, and those equal to False are falsy. In Python, non-Boolean objects also have a truth value. In other words, Python objects are either truthy or falsy.

You can use the built-in bool() function to convert any Python object to a Boolean value. Internally, Python uses the following rules to identify falsy objects:

  • Constants that are defined to be false: None and False
  • The zero of any numeric type: 0 , 0.0 , 0j , Decimal(0) , Fraction(0, 1)
  • Empty sequences and collections: '' , () , [] , {} , set() , range(0)

The rest of the objects are considered truthy in Python. You can use the built-in bool() function to explicitly learn the truth value of any Python object:

In these examples, you use bool() with arguments of different types. In each case, the function returns a Boolean value corresponding to the object’s truth value.

Note: You rarely need to call bool() yourself. Instead, you can rely on Python calling bool() under the hood when necessary. For example, you can say if numbers: instead of if bool(numbers): to check whether numbers is truthy.

You can also use the bool() function with custom classes:

By default, all instances of custom classes are true. If you want to modify this behavior, you can use the .__bool__() special method. Consider the following update of your Point class:

The .__bool__() method returns False when both coordinates are equal to 0 and True otherwise. Here’s how your class works now:

Now, when both coordinates are 0 , you get False from calling bool() . For the rest of the points, you get True .

You’ve learned about the basic built-in data types that Python provides. These types are the building blocks of most Python programs. With them, you can represent numeric, textual, byte, and Boolean data.

In this tutorial, you’ve learned about:

  • Python’s numeric types, such as int , float , and complex
  • The str data type, which represents textual data in Python
  • Boolean values with Python’s bool data type

With this knowledge, you’re ready to start using all of the basic data types that are built into Python.

🐍 Python Tricks 💌

Get a short & sweet Python Trick delivered to your inbox every couple of days. No spam ever. Unsubscribe any time. Curated by the Real Python team.

Python Tricks Dictionary Merge

About Leodanis Pozo Ramos

Leodanis Pozo Ramos

Leodanis is an industrial engineer who loves Python and software development. He's a self-taught Python developer with 6+ years of experience. He's an avid technical writer with a growing number of articles published on Real Python and other sites.

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

Aldren Santos

Master Real-World Python Skills With Unlimited Access to Real Python

Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:

Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:

What Do You Think?

What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment below and let us know.

Commenting Tips: The most useful comments are those written with the goal of learning from or helping out other students. Get tips for asking good questions and get answers to common questions in our support portal . Looking for a real-time conversation? Visit the Real Python Community Chat or join the next “Office Hours” Live Q&A Session . Happy Pythoning!

Keep Learning

Related Topics: basics python

Recommended Video Course: Basic Data Types in Python

Keep reading Real Python by creating a free account or signing in:

Already have an account? Sign-In

Almost there! Complete this form and click the button below to gain instant access:

Basic Data Types in Python: A Quick Exploration (Sample Code)

🔒 No spam. We take your privacy seriously.

representation of a data type

Most of us write numbers in Arabic form, ie, 1, 2, 3,..., 9. Some people write them differently, such as I, II, III, IV,..., IX. Nomatter what type of representation, most human beings can understand, at least the two types I mentioned. Unfortunately the computer doesn't. Computer is the most stupid thing you can ever encounter in your life.

Modern computers are built up with transistors. Whenever an electric current pass into the transistors either an or status will be established. Therefore the computer can only reconize two numbers, for OFF, and for ON, which can be referred to as . There is nothing in between Bit 0 and Bit 1 (eg Bit 0.5 doesn't exist). Hence computers can be said to be discrete machines. The number system consists only of two numbers is called . And to distinguish the different numbering systems, the numbers human use, ie 1,2,3,4..., will be called (since they are based 10 numbers) from now on.

How, therefore, can computer understand numbers larger than 1? The answer is simple, 2 is simply 1+1, (like 10 = 9+1 for human) the numbers are added and overflow digit is carred over to the left position. So (decimal) 2 is representated in Binary as 10. To further illustrate the relationship, I have listed the numbers 1 to 9 in both systems for compaison:

0 0000 0000
1 0000 0001
2 0000 0010
3 0000 0011
4 0000 0100
5 0000 0101
6 0000 0110
7 0000 0111
8 0000 1000
9 0000 1001

You may ask why do I always put 8 binary digits there. Well, the smallest unit in the computer's memory to store data is called a BYTE , which consists of 8 BITS. One Byte allows upto 256 different combinations of data representation (2 8 = 256). What happens when we have numbers greater than 256? The computer simply uses more Bytes to hold the value, 2 Bytes can hold values upto 65536 (2 16 ) and so forth.

Not only does the computer not understand the (decimal) numbers you use, it doesn't even understand letters like "ABCDEFG...". The fact is, it doesn't care. Whatever letters you input into the computer, the computer just saves it there and delivers to you when you instruct it so. It saves these letters in the same Binary format as digits, in accordance to a pattern. In PC (including DOS, Windows 95/98/NT, and UNIX), the pattern is called ASCII (pronounced ask-ee ) which stands for A merican S tandard C ode for I nformation I nterchange .

In this format, the letter "A" is represented by "0100 0001" ,or most often, referred to decimal 65 in the ASCII Table. The standard coding under ASCII is here . When performing comparison of characters, the computer actually looks up the associated ASCII codes and compare the ASCII values instead of the characters. Therefore the letter "B" which has ASCII value of 66 is greater than the letter "A" with ASCII value of 65.

The computer stores data in different formats or types . The number 10 can be stored as numeric value as in "10 dollars" or as character as in the address "10 Main Street" .  So how can the computer tell? Once again the computer doesn't care, it is your responsibility to ensure that you get the correct data out of it. (For illustration character 10 and numeric 10 are represented by 0011-0001-0011-0000 and 0000-1010 respectively — you can see how different they are.) Different programming launguages have different data types , although the foundamental ones are usually very similar.

C++ has many data types. The followings are some basic data types you will be facing in these chapters. Note that there are more complicated data types. You can even create your own data types. Some of these will be discussed later in the tutorial.

char 

1

ASCII -128 to127

 
unsigned char 

1

ASCII 0 to 255 

including high ASCII chars 
int

2

-32768 to 32767

Integer
unsigned (unsigned int)

2

0 to 65535

non-negative integer
long int 

4

� 2 billions

double sized integer
unsigned long int

4

0 to 4 billion

non-negative long integer 
float

4

3.4 �e38 

6 significant digits 
double 

8

1.7 �e308

15 significant digits 

char is basically used to store alphanumerics (numbers are stored in character form). Recall that character is stored as ASCII representation in PC. ASCII -128 to -1 do not exist, so char accomodates data from ASCII 0 (null zero) to ASCII 127 (DEL key). The original C++ does not have a String data type (but string is available through the inclusion of a library — to be discussed later). String can be stored as an one-dimensional array (list) with a "null zero" (ASCII 0) store in the last "cell" in the array. Unsigned char effectively accomodates the use of Extended ASCII characters which represent most special characters like the copyright sign �, registered trademark sign � etc plus some European letters like �, �, etc. Both char and unsigned char are stored internally as integers so they can effectively be compared (to be greater or less than).

Whenever you write a char (letter) in your program you must include it in single quotes. When you write strings (words or sentences) you must include them in double quotes. Otherwise C++ will treat these letters/words/sentences as tokens (to be discussed in Chapter 4). Remember in C/C++, A, 'A', "A" are all different. The first A (without quotes) means a variable or constant (discussed in Chapter 4), the second 'A' (in single quotes) means a character A which occupies one byte of memory. The third "A" (in double quotes) means a string containing the letter A followed by a null character which occupies 2 bytes of memory (will use more memory if store in a variable/constant of bigger size). See these examples: letter = 'A'; cout << 'A'; cout << "10 Main Street";

int (integer) represents all non-frational real numbers. Since int has a relatively small range (upto 32767), whenever you need to store value that has the possibility of going beyond this limit, long int should be used instead. The beauty of using int is that since it has no frational parts, its value is absolute and calculations of int are extremely accurate. However note that dividing an int by another may result in truncation, eg int 10 / int 3 will result in 3, not 3.3333 (more on this will be discussed later).

float , on the other hand, contains fractions. However real fractional numbers are not possible in computers since they are discrete machines (they can only handle the numbers 0 and 1, not 1.5 nor 1.75 or anything in between 0 and 1). No matter how many digits your calculator can show, you cannot produce a result of 2/3 without rounding, truncating, or by approximation. Mathameticians always write 2/3 instead of 0.66666.......... when they need the EXACT values. Since computer cannot produce real fractions the issue of significant digits comes to sight. For most applications a certain significant numbers are all you need. For example when you talk about money, $99.99 has no difference to $99.988888888888 (rounded to nearest cent); when you talk about the wealth of Bill Gates, it make little sense of saying $56,123,456,789.95 instead of just saying approximately $56 billions (these figures are not real, I have no idea how much money Bill has, although I wish he would give me the roundings). As you may see from the above table, float has only 6 significant digits, so for some applications it may not be sufficient, espically in scentific calculations, in which case you may want to use double or even long double to handle the numbers. There is also another problem in using float/double . Since numbers are represented internally as binary values, whenever a frational number is calculated or translated to/from binary there will be a rounding/truncaion error. So if you have a float 0, add 0.01 to it for 100 times, then minus 1.00 from it ( see the codes here or get the executable codes here ), you will not get 0 as it should be, rather you will get a value close to zero, but not really zero. Using double or long double will reduce the error but will not eliminate it. However as I mentioned earlier, the relevance may not affect our real life, just mean you may need to exercise caution when programming with floating point numbers.

There is another C++ data type I haven't included here — bool (boolean) data type which can only store a value of either 0 (false) or 1 (true). I will be using int (integer) to handle logical comparisons which poses more challenge and variety of use.

Escape Sequences are not data types but I feel I would better discuss them here. I mentioned earlier that you have to include a null zero at the end of a "string" in using an array of char to represent string. The easiest way to do this is to write the escape sequence '\0' which is understood by C++ as null zero. The followings are Escape Sequences in C++:

\a Alarm \t Tab \" Double Quote
\b Backspace \v Vertical Tab \000 Octal Num
\f Form Feed \\ Backslash \xhh Hex number
\n New Line \? Question Mark \0 Null Zero
\r Carriage Return \' Single Quote    

Earlier I said you can create your own data types. Here I will show you how. In fact you not only can create new data types but you can also create an alias of existing data type. For example you are writing a program which deals with dollar values. Since dollar values have fractional parts you have to either use float or double data types (eg assign float data type to salary by writing float salary . You can create an alias of the same data type MONEY and write MONEY salary. You do this by adding the following type definition into your program:

typedef double MONEY;

You can also create new data types. I will discuss more on this when we come to Arrays in Chapter 10. But the following illustrates how you create a new data type of array from a base data type:

Data types #

Data type objects

Array types and conversions between types #

NumPy supports a much greater variety of numerical types than Python does. This section shows which are available, and how to modify an array’s data-type.

NumPy numerical types are instances of numpy.dtype (data-type) objects, each having unique characteristics. Once you have imported NumPy using import numpy as np you can create arrays with a specified dtype using the scalar types in the numpy top-level API, e.g. numpy.bool , numpy.float32 , etc.

These scalar types as arguments to the dtype keyword that many numpy functions or methods accept. For example:

Array types can also be referred to by character codes, for example:

See Specifying and constructing data types for more information about specifying and constructing data type objects, including how to specify parameters like the byte order.

To convert the type of an array, use the .astype() method. For example:

Note that, above, we could have used the Python float object as a dtype instead of numpy.float64 . NumPy knows that int refers to numpy.int_ , bool means numpy.bool , that float is numpy.float64 and complex is numpy.complex128 . The other data-types do not have Python equivalents.

To determine the type of an array, look at the dtype attribute:

dtype objects also contain information about the type, such as its bit-width and its byte-order. The data type can also be used indirectly to query properties of the type, such as whether it is an integer:

Numerical Data Types #

There are 5 basic numerical types representing booleans ( bool ), integers ( int ), unsigned integers ( uint ) floating point ( float ) and complex . A basic numerical type name combined with a numeric bitsize defines a concrete type. The bitsize is the number of bits that are needed to represent a single value in memory. For example, numpy.float64 is a 64 bit floating point data type. Some types, such as numpy.int_ and numpy.intp , have differing bitsizes, dependent on the platforms (e.g. 32-bit vs. 64-bit CPU architectures). This should be taken into account when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.

Data Types for Strings and Bytes #

In addition to numerical types, NumPy also supports storing unicode strings, via the numpy.str_ dtype ( U character code), null-terminated byte sequences via numpy.bytes_ ( S character code), and arbitrary byte sequences, via numpy.void ( V character code).

All of the above are fixed-width data types. They are parameterized by a width, in either bytes or unicode points, that a single data element in the array must fit inside. This means that storing an array of byte sequences or strings using this dtype requires knowing or calculating the sizes of the longest text or byte sequence in advance.

As an example, we can create an array storing the words "hello" and "world!" :

Here the data type is detected as a unicode string that is a maximum of 6 code points long, enough to store both entries without truncation. If we specify a shorter or longer data type, the string is either truncated or zero-padded to fit in the specified width:

We can see the zero-padding a little more clearly if we use the bytes data type and ask NumPy to print out the bytes in the array buffer:

Each entry is padded with two extra null bytes. Note however that NumPy cannot tell the difference between intentionally stored trailing nulls and padding nulls:

If you need to store and round-trip any trailing null bytes, you will need to use an unstructured void data type:

Advanced types, not listed above, are explored in section Structured arrays .

Relationship Between NumPy Data Types and C Data Types #

NumPy provides both bit sized type names and names based on the names of C types. Since the definition of C types are platform dependent, this means the explicitly bit sized should be preferred to avoid platform-dependent behavior in programs using NumPy.

To ease integration with C code, where it is more natural to refer to platform-dependent C types, NumPy also provides type aliases that correspond to the C types for the platform. Some dtypes have trailing underscore to avoid confusion with builtin python type names, such as numpy.bool_ .

Canonical Python API name

Python API “C-like” name

Actual C type

Description

or

N/A

(defined in )

Boolean (True or False) stored as a byte.

char

Platform-defined integer type with 8 bits.

char

Platform-defined integer type with 8 bits without sign.

Platform-defined integer type with 16 bits.

short

Platform-defined integer type with 16 bits without sign.

Platform-defined integer type with 32 bits.

int

Platform-defined integer type with 32 bits without sign.

N/A

/

Platform-defined integer of size ; used e.g. for sizes.

N/A

Platform-defined integer type capable of storing the maximum allocation size.

N/A

Guaranteed to hold pointers. Character code only (Python and C).

N/A

Guaranteed to hold pointers. Character code only (Python and C).

or

Platform-defined integer type with at least 32 bits.

or

long

Platform-defined integer type with at least 32 bits without sign.

N/A

long

Platform-defined integer type with at least 64 bits.

N/A

long long

Platform-defined integer type with at least 64 bits without sign.

N/A

Half precision float: sign bit, 5 bits exponent, 10 bits mantissa.

Platform-defined single precision float: typically sign bit, 8 bits exponent, 23 bits mantissa.

Platform-defined double precision float: typically sign bit, 11 bits exponent, 52 bits mantissa.

or

double

Platform-defined extended-precision float.

complex

Complex number, represented by two single-precision floats (real and imaginary components).

complex

Complex number, represented by two double-precision floats (real and imaginary components).

or

double complex

Complex number, represented by two extended-precision floats (real and imaginary components).

Since many of these have platform-dependent definitions, a set of fixed-size aliases are provided (See Sized aliases ).

Array scalars #

NumPy generally returns elements of arrays as array scalars (a scalar with an associated dtype). Array scalars differ from Python scalars, but for the most part they can be used interchangeably (the primary exception is for versions of Python older than v2.x, where integer array scalars cannot act as indices for lists and tuples). There are some exceptions, such as when code requires very specific attributes of a scalar or when it checks specifically whether a value is a Python scalar. Generally, problems are easily fixed by explicitly converting array scalars to Python scalars, using the corresponding Python type function (e.g., int , float , complex , str ).

The primary advantage of using array scalars is that they preserve the array type (Python may not have a matching scalar type available, e.g. int16 ). Therefore, the use of array scalars ensures identical behaviour between arrays and scalars, irrespective of whether the value is inside an array or not. NumPy scalars also have many of the same methods arrays do.

Overflow errors #

The fixed size of NumPy numeric types may cause overflow errors when a value requires more memory than available in the data type. For example, numpy.power evaluates 100 ** 9 correctly for 64-bit integers, but gives -1486618624 (incorrect) for a 32-bit integer.

The behaviour of NumPy and Python integer types differs significantly for integer overflows and may confuse users expecting NumPy integers to behave similar to Python’s int . Unlike NumPy, the size of Python’s int is flexible. This means Python integers may expand to accommodate any integer and will not overflow.

NumPy provides numpy.iinfo and numpy.finfo to verify the minimum or maximum values of NumPy integer and floating point values respectively

If 64-bit integers are still too small the result may be cast to a floating point number. Floating point numbers offer a larger, but inexact, range of possible values.

Extended precision #

Python’s floating-point numbers are usually 64-bit floating-point numbers, nearly equivalent to numpy.float64 . In some unusual situations it may be useful to use floating-point numbers with more precision. Whether this is possible in numpy depends on the hardware and on the development environment: specifically, x86 machines provide hardware floating-point with 80-bit precision, and while most C compilers provide this as their long double type, MSVC (standard for Windows builds) makes long double identical to double (64 bits). NumPy makes the compiler’s long double available as numpy.longdouble (and np.clongdouble for the complex numbers). You can find out what your numpy provides with np.finfo(np.longdouble) .

NumPy does not provide a dtype with more precision than C’s long double ; in particular, the 128-bit IEEE quad precision data type (FORTRAN’s REAL*16 ) is not available.

For efficient memory alignment, numpy.longdouble is usually stored padded with zero bits, either to 96 or 128 bits. Which is more efficient depends on hardware and development environment; typically on 32-bit systems they are padded to 96 bits, while on 64-bit systems they are typically padded to 128 bits. np.longdouble is padded to the system default; np.float96 and np.float128 are provided for users who want specific padding. In spite of the names, np.float96 and np.float128 provide only as much precision as np.longdouble , that is, 80 bits on most x86 machines and 64 bits in standard Windows builds.

Be warned that even if numpy.longdouble offers more precision than python float , it is easy to lose that extra precision, since python often forces values to pass through float . For example, the % formatting operator requires its arguments to be converted to standard python types, and it is therefore impossible to preserve extended precision even if many decimal places are requested. It can be useful to test your code with the value 1 + np.finfo(np.longdouble).eps .

Data Representation in Computer: Number Systems, Characters, Audio, Image and Video

representation of a data type

What is Data Representation in Computer?

A computer uses a fixed number of bits to represent a piece of data which could be a number, a character, image, sound, video, etc. Data representation is the method used internally to represent data in a computer. Let us see how various types of data can be represented in computer memory.

Number Systems

Number systems are the technique to represent numbers in the computer system architecture, every value that you are saving or getting into/from computer memory has a defined number system.

The number 289 is pronounced as two hundred and eighty-nine and it consists of the symbols 2, 8, and 9. Similarly, there are other number systems. Each has its own symbols and method for constructing a number.

A number system has a unique base, which depends upon the number of symbols. The number of symbols used in a number system is called the base or radix of a number system.

Let us discuss some of the number systems. Computer architecture supports the following number of systems:

Binary Number System

Octal number system.

The octal number system has only eight (8) digits from 0 to 7. Every number (value) represents with 0,1,2,3,4,5,6 and 7 in this number system. The base of the octal number system is 8, because it has only 8 digits.

Decimal Number System

The decimal number system has only ten (10) digits from 0 to 9. Every number (value) represents with 0,1,2,3,4,5,6, 7,8 and 9 in this number system. The base of decimal number system is 10, because it has only 10 digits.

Hexadecimal Number System

Data representation of characters.

There are different methods to represent characters . Some of them are discussed below:

The code called ASCII (pronounced ‘􀀏’.S-key”), which stands for American Standard Code for Information Interchange, uses 7 bits to represent each character in computer memory. The ASCII representation has been adopted as a standard by the U.S. government and is widely accepted.

Since there are exactly 128 unique combinations of 7 bits, this 7-bit code can represent only128 characters. Another version is ASCII-8, also called extended ASCII, which uses 8 bits for each character, can represent 256 different characters.

If ASCII-coded data is to be used in a computer that uses EBCDIC representation, it is necessary to transform ASCII code to EBCDIC code. Similarly, if EBCDIC coded data is to be used in an ASCII computer, EBCDIC code has to be transformed to ASCII.

Using 8-bit ASCII we can represent only 256 characters. This cannot represent all characters of written languages of the world and other symbols. Unicode is developed to resolve this problem. It aims to provide a standard character encoding scheme, which is universal and efficient.

Data Representation of Audio, Image and Video

In most cases, we may have to represent and process data other than numbers and characters. This may include audio data, images, and videos. We can see that like numbers and characters, the audio, image, and video data also carry information.

Multimedia data such as audio, image, and video are stored in different types of files. The variety of file formats is due to the fact that there are quite a few approaches to compressing the data and a number of different ways of packaging the data.

For example, an image is most popularly stored in Joint Picture Experts Group (JPEG ) file format. An image file consists of two parts – header information and image data. Information such as the name of the file, size, modified data, file format, etc. is stored in the header part.

Numerous such techniques are used to achieve compression. Depending on the application, images are stored in various file formats such as bitmap file format (BMP), Tagged Image File Format (TIFF), Graphics Interchange Format (GIF), Portable (Public) Network Graphic (PNG).

For example, WAV file format typically contains uncompressed sound and MP3 files typically contain compressed audio data. The synthesized music data is stored in MIDI(Musical Instrument Digital Interface) files.

Similarly, video is also stored in different files such as AVI (Audio Video Interleave) – a file format designed to store both audio and video data in a standard package that allows synchronous audio with video playback, MP3, JPEG-2, WMV, etc.

FAQs About Data Representation in Computer

What is number system with example, leave a reply cancel reply.

Robot

Download the Learning Outcomes App Today

Embibe Logo

Share this article

link

Table of Contents

Latest updates.

1 Million Means: 1 Million in Rupees, Lakhs and Crores

1 Million Means: 1 Million in Rupees, Lakhs and Crores

Ways To Improve Learning Outcomes: Learn Tips & Tricks

Ways To Improve Learning Outcomes: Learn Tips & Tricks

The Three States of Matter: Solids, Liquids, and Gases

The Three States of Matter: Solids, Liquids, and Gases

Types of Motion: Introduction, Parameters, Examples

Types of Motion: Introduction, Parameters, Examples

Understanding Frequency Polygon: Detailed Explanation

Understanding Frequency Polygon: Detailed Explanation

Uses of Silica Gel in Packaging?

Uses of Silica Gel in Packaging?

Visual Learning Style for Students: Pros and Cons

Visual Learning Style for Students: Pros and Cons

Air Pollution: Know the Causes, Effects & More

Air Pollution: Know the Causes, Effects & More

Sexual Reproduction in Flowering Plants

Sexual Reproduction in Flowering Plants

Integers Introduction: Check Detailed Explanation

Integers Introduction: Check Detailed Explanation

Tag cloud :.

  • entrance exams
  • engineering
  • ssc cgl 2024
  • Written By Priya_Singh
  • Last Modified 24-01-2023

Data Representation: Definition, Types, Examples

Data Representation: Data representation is a technique for analysing numerical data. The relationship between facts, ideas, information, and concepts is depicted in a diagram via data representation. It is a fundamental learning strategy that is simple and easy to understand. It is always determined by the data type in a specific domain. Graphical representations are available in many different shapes and sizes.

In mathematics, a graph is a chart in which statistical data is represented by curves or lines drawn across the coordinate point indicated on its surface. It aids in the investigation of a relationship between two variables by allowing one to evaluate the change in one variable’s amount in relation to another over time. It is useful for analysing series and frequency distributions in a given context. On this page, we will go through two different types of graphs that can be used to graphically display data. Continue reading to learn more.

Learn Informative Blog

Data Representation in Maths

Definition: After collecting the data, the investigator has to condense them in tabular form to study their salient features. Such an arrangement is known as the presentation of data.

Any information gathered may be organised in a frequency distribution table, and then shown using pictographs or bar graphs. A bar graph is a representation of numbers made up of equally wide bars whose lengths are determined by the frequency and scale you choose.

The collected raw data can be placed in any one of the given ways:

  • Serial order of alphabetical order
  • Ascending order
  • Descending order

Data Representation Example

Example: Let the marks obtained by \(30\) students of class VIII in a class test, out of \(50\)according to their roll numbers, be:

\(39,\,25,\,5,\,33,\,19,\,21,\,12,41,\,12,\,21,\,19,\,1,\,10,\,8,\,12\)

\(17,\,19,\,17,\,17,\,41,\,40,\,12,41,\,33,\,19,\,21,\,33,\,5,\,1,\,21\)

The data in the given form is known as raw data or ungrouped data. The above-given data can be placed in the serial order as shown below:

Data Representation Example

Now, for say you want to analyse the standard of achievement of the students. If you arrange them in ascending or descending order, it will give you a better picture.

Ascending order:

\(1,\,1,\,5,\,5,\,8,\,10,\,12,12,\,12,\,12,\,17,\,17,\,17,\,19,\,19\)

\(19,\,19,\,21,\,21,\,21,\,25,\,33,33,\,33,\,39,\,40,\,41,\,41,\,41\)

Descending order:

\(41,\,41,\,41,\,40,\,39,\,33,\,33,33,\,25,\,21,\,21,\,21,\,21,\,19,\,19\)

\(19,\,19,\,17,\,17,\,17,\,12,\,12,12,\,12,\,10,\,8,\,5,\,5,1,\,1\)

When the raw data is placed in ascending or descending order of the magnitude is known as an array or arrayed data.

Graph Representation in Data Structure

A few of the graphical representation of data is given below:

  • Frequency distribution table

Pictorial Representation of Data: Bar Chart

The bar graph represents the ​qualitative data visually. The information is displayed horizontally or vertically and compares items like amounts, characteristics, times, and frequency.

The bars are arranged in order of frequency, so more critical categories are emphasised. By looking at all the bars, it is easy to tell which types in a set of data dominate the others. Bar graphs can be in many ways like single, stacked, or grouped.

Bar Chart

Graphical Representation of Data: Frequency Distribution Table

A frequency table or frequency distribution is a method to present raw data in which one can easily understand the information contained in the raw data.

The frequency distribution table is constructed by using the tally marks. Tally marks are a form of a numerical system with the vertical lines used for counting. The cross line is placed over the four lines to get a total of \(5\).

Frequency Distribution Table

Consider a jar containing the different colours of pieces of bread as shown below:

Frequency Distribution Table Example

Construct a frequency distribution table for the data mentioned above.

Frequency Distribution Table Example

Graphical Representation of Data: Histogram

The histogram is another kind of graph that uses bars in its display. The histogram is used for quantitative data, and ranges of values known as classes are listed at the bottom, and the types with greater frequencies have the taller bars.

A histogram and the bar graph look very similar; however, they are different because of the data level. Bar graphs measure the frequency of the categorical data. A categorical variable has two or more categories, such as gender or hair colour.

Histogram

Graphical Representation of Data: Pie Chart

The pie chart is used to represent the numerical proportions of a dataset. This graph involves dividing a circle into different sectors, where each of the sectors represents the proportion of a particular element as a whole. Thus, it is also known as a circle chart or circle graph.

Pie Chart

Graphical Representation of Data: Line Graph

A graph that uses points and lines to represent change over time is defined as a line graph. In other words, it is the chart that shows a line joining multiple points or a line that shows the link between the points.

The diagram illustrates the quantitative data between two changing variables with the straight line or the curve that joins a series of successive data points. Linear charts compare two variables on the vertical and the horizontal axis.

Line Graph

General Rules for Visual Representation of Data

We have a few rules to present the information in the graphical representation effectively, and they are given below:

  • Suitable Title:  Ensure that the appropriate title is given to the graph, indicating the presentation’s subject.
  • Measurement Unit:  Introduce the measurement unit in the graph.
  • Proper Scale:  To represent the data accurately, choose an appropriate scale.
  • Index:  In the Index, the appropriate colours, shades, lines, design in the graphs are given for better understanding.
  • Data Sources:  At the bottom of the graph, include the source of information wherever necessary.
  • Keep it Simple:  Build the graph in a way that everyone should understand easily.
  • Neat:  You have to choose the correct size, fonts, colours etc., in such a way that the graph must be a model for the presentation of the information.

Solved Examples on Data Representation

Q.1. Construct the frequency distribution table for the data on heights in \(({\rm{cm}})\) of \(20\) boys using the class intervals \(130 – 135,135 – 140\) and so on. The heights of the boys in \({\rm{cm}}\) are: 

Data Representation Example 1

Ans: The frequency distribution for the above data can be constructed as follows:

Data Representation Example

Q.2. Write the steps of the construction of Bar graph? Ans: To construct the bar graph, follow the given steps: 1. Take a graph paper, draw two lines perpendicular to each other, and call them horizontal and vertical. 2. You have to mark the information given in the data like days, weeks, months, years, places, etc., at uniform gaps along the horizontal axis. 3. Then you have to choose the suitable scale to decide the heights of the rectangles or the bars and then mark the sizes on the vertical axis. 4. Draw the bars or rectangles of equal width and height marked in the previous step on the horizontal axis with equal spacing. The figure so obtained will be the bar graph representing the given numerical data.

Q.3. Read the bar graph and then answer the given questions: I. Write the information provided by the given bar graph. II. What is the order of change of the number of students over several years? III. In which year is the increase of the student maximum? IV. State whether true or false. The enrolment during \(1996 – 97\) is double that of \(1995 – 96\)

pictorial representation of data

Ans: I. The bar graph represents the number of students in class \({\rm{VI}}\) of a school during the academic years \(1995 – 96\,to\,1999 – 2000\). II. The number of stcccccudents is changing in increasing order as the heights of bars are growing. III. The increase in the number of students in uniform and the increase in the height of bars is uniform. Hence, in this case, the growth is not maximum in any of the years. The enrolment in the years is \(1996 – 97\, = 200\). and the enrolment in the years is \(1995 – 96\, = 150\). IV. The enrolment in \(1995 – 97\,\) is not double the enrolment in \(1995 – 96\). So the statement is false.

Q.4. Write the frequency distribution for the given information of ages of \(25\) students of class VIII in a school. \(15,\,16,\,16,\,14,\,17,\,17,\,16,\,15,\,15,\,16,\,16,\,17,\,15\) \(16,\,16,\,14,\,16,\,15,\,14,\,15,\,16,\,16,\,15,\,14,\,15\) Ans: Frequency distribution of ages of \(25\) students:

Data Representation Example

Q.5. There are \(20\) students in a classroom. The teacher asked the students to talk about their favourite subjects. The results are listed below:

Data Representation Example

By looking at the above data, which is the most liked subject? Ans: Representing the above data in the frequency distribution table by using tally marks as follows:

Data Representation Example

From the above table, we can see that the maximum number of students \((7)\) likes mathematics.

Also, Check –

  • Diagrammatic Representation of Data

In the given article, we have discussed the data representation with an example. Then we have talked about graphical representation like a bar graph, frequency table, pie chart, etc. later discussed the general rules for graphic representation. Finally, you can find solved examples along with a few FAQs. These will help you gain further clarity on this topic.

Test Informative Blog

FAQs on Data Representation

Q.1: How is data represented? A: The collected data can be expressed in various ways like bar graphs, pictographs, frequency tables, line graphs, pie charts and many more. It depends on the purpose of the data, and accordingly, the type of graph can be chosen.

Q.2: What are the different types of data representation? A : The few types of data representation are given below: 1. Frequency distribution table 2. Bar graph 3. Histogram 4. Line graph 5. Pie chart

Q.3: What is data representation, and why is it essential? A: After collecting the data, the investigator has to condense them in tabular form to study their salient features. Such an arrangement is known as the presentation of data. Importance: The data visualization gives us a clear understanding of what the information means by displaying it visually through maps or graphs. The data is more natural to the mind to comprehend and make it easier to rectify the trends outliners or trends within the large data sets.

Q.4: What is the difference between data and representation? A: The term data defines the collection of specific quantitative facts in their nature like the height, number of children etc., whereas the information in the form of data after being processed, arranged and then presented in the state which gives meaning to the data is data representation.

Q.5: Why do we use data representation? A: The data visualization gives us a clear understanding of what the information means by displaying it visually through maps or graphs. The data is more natural to the mind to comprehend and make it easier to rectify the trends outliners or trends within the large data sets.

Related Articles

1 Million Means: 1 million in numerical is represented as 10,00,000. The Indian equivalent of a million is ten lakh rupees. It is not a...

Ways To Improve Learning Outcomes: With the development of technology, students may now rely on strategies to enhance learning outcomes. No matter how knowledgeable a...

The Three States of Matter: Anything with mass and occupied space is called ‘Matter’. Matters of different kinds surround us. There are some we can...

Motion is the change of a body's position or orientation over time. The motion of humans and animals illustrates how everything in the cosmos is...

Understanding Frequency Polygon: Students who are struggling with understanding Frequency Polygon can check out the details here. A graphical representation of data distribution helps understand...

When you receive your order of clothes or leather shoes or silver jewellery from any online shoppe, you must have noticed a small packet containing...

Visual Learning Style: We as humans possess the power to remember those which we have caught visually in our memory and that too for a...

Air Pollution: In the past, the air we inhaled was pure and clean. But as industrialisation grows and the number of harmful chemicals in the...

In biology, flowering plants are known by the name angiosperms. Male and female reproductive organs can be found in the same plant in flowering plants....

Integers Introduction: To score well in the exam, students must check out the Integers introduction and understand them thoroughly. The collection of negative numbers and whole...

Human Respiratory System – Detailed Explanation

Human Respiratory System: Students preparing for the NEET and Biology-related exams must have an idea about the human respiratory system. It is a network of tissues...

Place Value of Numbers: Detailed Explanation

Place Value of Numbers: Students must understand the concept of the place value of numbers to score high in the exam. In mathematics, place value...

The Leaf: Types, Structures, Parts

The Leaf: Students who want to understand everything about the leaf can check out the detailed explanation provided by Embibe experts. Plants have a crucial role...

Factors Affecting Respiration: Definition, Diagrams with Examples

In plants, respiration can be regarded as the reversal of the photosynthetic process. Like photosynthesis, respiration involves gas exchange with the environment. Unlike photosynthesis, respiration...

General Terms Related to Spherical Mirrors

General terms related to spherical mirrors: A mirror with the shape of a portion cut out of a spherical surface or substance is known as a...

Number System: Types, Conversion and Properties

Number System: Numbers are highly significant and play an essential role in Mathematics that will come up in further classes. In lower grades, we learned how...

Types of Respiration

Every living organism has to "breathe" to survive. The process by which the living organisms use their food to get energy is called respiration. It...

Animal Cell: Definition, Diagram, Types of Animal Cells

Animal Cell: An animal cell is a eukaryotic cell with membrane-bound cell organelles without a cell wall. We all know that the cell is the fundamental...

Conversion of Percentages: Conversion Method & Examples

Conversion of Percentages: To differentiate and explain the size of quantities, the terms fractions and percent are used interchangeably. Some may find it difficult to...

Arc of a Circle: Definition, Properties, and Examples

Arc of a circle: A circle is the set of all points in the plane that are a fixed distance called the radius from a fixed point...

Ammonia (NH3): Preparation, Structure, Properties and Uses

Ammonia, a colourless gas with a distinct odour, is a chemical building block and a significant component in producing many everyday items. It is found...

CGPA to Percentage: Calculator for Conversion, Formula, & More

CGPA to Percentage: The average grade point of a student is calculated using their cumulative grades across all subjects, omitting any supplemental coursework. Many colleges,...

Uses of Ether – Properties, Nomenclature, Uses, Disadvantages

Uses of Ether:  Ether is an organic compound containing an oxygen atom and an ether group connected to two alkyl/aryl groups. It is formed by the...

General and Middle Terms: Definitions, Formula, Independent Term, Examples

General and Middle terms: The binomial theorem helps us find the power of a binomial without going through the tedious multiplication process. Further, the use...

Mutually Exclusive Events: Definition, Formulas, Solved Examples

Mutually Exclusive Events: In the theory of probability, two events are said to be mutually exclusive events if they cannot occur simultaneously or at the...

Geometry: Definition, Shapes, Structure, Examples

Geometry is a branch of mathematics that is largely concerned with the forms and sizes of objects, their relative positions, and the qualities of space....

Bohr’s Model of Hydrogen Atom: Expressions for Radius, Energy

Rutherford’s Atom Model was undoubtedly a breakthrough in atomic studies. However, it was not wholly correct. The great Danish physicist Niels Bohr (1885–1962) made immediate...

representation of a data type

39 Insightful Publications

World Economic Forum

Embibe Is A Global Innovator

accenture

Innovator Of The Year Education Forever

Interpretable And Explainable AI

Interpretable And Explainable AI

Tedx

Revolutionizing Education Forever

Amazon AI Conclave

Best AI Platform For Education

Forbes India

Enabling Teachers Everywhere

ACM

Decoding Performance

World Education Summit

Leading AI Powered Learning Solution Provider

Journal of Educational Data Mining

Auto Generation Of Tests

BW Disrupt

Disrupting Education In India

Springer

Problem Sequencing Using DKT

Fortune India Forty Under Fourty

Help Students Ace India's Toughest Exams

Edtech Digest

Best Education AI Platform

Nasscom Product Connect

Unlocking AI Through Saas

Tech In Asia

Fixing Student’s Behaviour With Data Analytics

Your Story

Leveraging Intelligence To Deliver Results

City AI

Brave New World Of Applied AI

vccircle

You Can Score Higher

INK Talks

Harnessing AI In Education

kstart

Personalized Ed-tech With AI

StartUpGrind

Exciting AI Platform, Personalizing Education

Digital Women Award

Disruptor Award For Maximum Business Impact

The Mumbai Summit 2020 AI

Top 20 AI Influencers In India

USPTO

Proud Owner Of 9 Patents

StartUpGrind

Innovation in AR/VR/MR

StartUpGrind

Best Animated Frames Award 2024

Close

Trending Searches

Previous year question papers, sample papers.

Unleash Your True Potential With Personalised Learning on EMBIBE

Pattern

Ace Your Exam With Personalised Learning on EMBIBE

Enter mobile number.

By signing up, you agree to our Privacy Policy and Terms & Conditions

Interpretable SHAP Model Combining Meta-learning and Vision Transformer for Lithology Classification Using Limited and Unbalanced Drilling Data in Well Logging

  • Original Paper
  • Published: 19 August 2024

Cite this article

representation of a data type

  • Youzhuang Sun 1 , 2 ,
  • Shanchen Pang 1 , 2 ,
  • Zhiyuan Zhao 1 , 2 &
  • Yongan Zhang 1 , 2  

12 Accesses

Explore all metrics

Recent advances in geological exploration and oil and gas development have highlighted the critical need for accurate classification and prediction of subterranean lithologies. To address this, we introduce the Meta-Vision Transformer (Meta-ViT) method, a novel approach. This technique synergistically combines the adaptability of meta-learning with the analytical prowess of ViT. Meta-learning excels in identifying nuanced similarities across tasks, significantly enhancing learning efficiency. Simultaneously, the ViT leverages these meta-learning insights to navigate the complex landscape of geological exploration, improving lithology identification accuracy. The Meta-ViT model employs a support set to extract crucial insights through meta-learning, and a query set to test the generalizability of these insights. This dual-framework setup enables the ViT to detect various underground rock types with unprecedented precision. Additionally, by simulating diverse tasks and data scenarios, meta-learning broadens the model's applicational scope. Integrating the SHAP (SHapley Additive exPlanations) model, rooted in Shapley values from cooperative game theory, greatly enhances the interpretability of rock type classifications. We also utilized the ADASYN (Adaptive Synthetic Sampling) technique to optimize sample representation, generating new samples based on existing densities to ensure uniform distribution. Our extensive testing across various training and testing set ratios showed that the Meta-ViT model outperforms dramatically traditional machine learning models. This approach not only refines learning processes but it also adeptly addresses the inherent challenges of geological data analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save.

  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

representation of a data type

Explore related subjects

  • Artificial Intelligence

Alhudhaif, A. (2021). A novel multi-class imbalanced EEG signals classification based on the adaptive synthetic sampling (ADASYN) approach. PeerJ Computer Science, 7 (1), 523–529.

Article   Google Scholar  

Blagus, R., & Lusa, L. (2013). SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14 (1), 1–16.

Google Scholar  

Bressan, T. S., Souza, M. K., & Girelli, T. J. (2020). Evaluation of machine learning methods for lithology classification using geophysical data. Computers & Geosciences, 139 (1), 104–112.

Chen, H., Xue, S., & Zheng, X. (2023). Coal mine microseismic identification and first-arrival picking based on Conv-LSTM-Unet. Acta Geophysica, 71 (1), 161–173.

Corina, A. N., & Hovda, S. (2018). Automatic lithology prediction from well logging using kernel density estimation. Journal of Petroleum Science and Engineering, 170 (1), 664–674.

Article   CAS   Google Scholar  

Dev, V. A., & Eden, M. R. (2018). Evaluating the boosting approach to machine learning for formation lithology classification. In Computer Aided Chemical Engineering, 44 (1), 1465–1470.

Gaudart, J., Giusiano, B., & Huiart, L. (2004). Comparison of the performance of multi-layer perceptron and linear regression for epidemiological data. Computational Statistics & Data Analysis, 44 (4), 547–570.

Han, K., Wang, Y., & Chen, H. (2022). A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45 (1), 87–110.

Hospedales, T., Antoniou, A., & Micaelli, P. (2021). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44 (9), 5149–5169.

Jiang, C., Zhang, D., & Chen, S. (2021). Lithology identification from well-log curves via neural networks with additional geologic constraint. Geophysics, 86 (5), 85–100.

Kumar, T., Seelam, N. K., & Rao, G. S. (2022). Lithology prediction from well log data using machine learning techniques: A case study from Talcher coalfield, Eastern India. Journal of Applied Geophysics, 199 (1), 104–112.

Li, C., Cui, Y., & Luo, N. (2022). Trans-resnet: Integrating transformers and CNNs for Alzheimer’s disease classification. In 2022 IEEE 19th International Symposium on Biomedical Imaging .

Li, J., Lü, S., & Li, Z. (2022b). Unsupervised domain adaptation via softmax-based prototype construction and adaptation. Information Sciences, 609 (2), 257–275.

Li, X., Tripe, D., & Malone, C. (2020). Measuring systemic risk contribution. The leave-one-out z-score method. Finance Research Letters, 36 (3), 101–113.

CAS   Google Scholar  

Lin, J., Li, H., & Liu, N. (2020). Automatic lithology identification by applying LSTM to logging data: A case study in X tight rock reservoirs. IEEE Geoscience and Remote Sensing Letters, 18 (8), 1361–1365.

Liu, C., Liu, A., & Halabi, S. (2011). A min–max combination of biomarkers to improve diagnostic accuracy. Statistics in Medicine, 30 (16), 2005–2014.

Min, X., Pengbo, Q., & Fengwei, Z. (2020). Research and application of logging lithology identification for igneous reservoirs based on deep learning. Journal of Applied Geophysics, 173 (4), 103–115.

Mishra, A., Sharma, A., & Patidar, A. K. (2022). Evaluation and development of a predictive model for geophysical well log data analysis and reservoir characterization: Machine learning applications to lithology prediction. Natural Resources Research, 31 (6), 3195–3222.

Ogundokun, R. O., Maskeliunas, R., & Misra, S. (2022). Improved CNN based on batch normalization and Adam optimizer. In International Conference on Computational Science and Its Applications .

Schweighofer, N., & Doya, K. (2003). Meta-learning in reinforcement learning. Neural Networks, 16 (1), 5–9.

Shahid, F., Zameer, A., & Muneeb, M. (2021). A novel genetic LSTM model for wind power forecast. Energy, 223 (4), 120–132.

Shakirov, A., Molchanov, A., & Ismailova, L. (2023). Quantitative assessment of rock lithology from gamma-ray and mud logging data. Geoenergy Science and Engineering, 225 (2), 211–223.

Su, J., Ahmed, M., & Lu, Y. (2024). Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568 (4), 127–138.

Sun, J., Li, Q., & Chen, M. (2019). Optimization of models for a rapid identification of lithology while drilling: A win-win strategy based on machine learning. Journal of Petroleum Science and Engineering, 176 (2), 321–341.

Sun, Y., Pang, S., & Zhang, J. (2024a). Porosity prediction through well logging data: A combined approach of convolutional neural network and transformer model (CNN-transformer). Physics of Fluids, 36 (2), 26–35.

Sun, Y., Pang, S., & Zhang, Y. (2024b). Application of the dynamic transformer model with well logging data for formation porosity prediction. Physics of Fluids, 36 (3), 45–55.

Sun, Y., Pang, S., & Zhang, Y. (2024c). Application of Adaboost-Transformer Algorithm for lithology identification based on well logging data. IEEE Geoscience and Remote Sensing Letters, 21 (1), 1–5.

Wang, J. X. (2021). Meta-learning in natural and artificial intelligence. Current Opinion in Behavioral Sciences, 38 (1), 90–95.

Wang, Y., Xiao, Z., & Cao, G. (2022). A convolutional neural network method based on Adam optimizer with power-exponential learning rate for bearing fault diagnosis. Journal of Vibroengineering, 24 (4), 666–678.

Xie, Y., Zhu, C., & Zhou, W. (2018). Evaluation of machine learning methods for formation lithology identification: A comparison of tuning processes and model performances. Journal of Petroleum Science and Engineering, 160 (2), 182–193.

Xu, Z., Chen, X., & Tang, W. (2021). Meta weight learning via model-agnostic meta-learning. Neurocomputing, 432 (2), 124–132.

Yan, T., Xu, R., & Sun, S. H. (2024). A real-time intelligent lithology identification method based on a dynamic felling strategy weighted random forest algorithm. Petroleum Science, 21 (2), 1135–1148.

Zhang, G., Wang, Z., & Chen, Y. (2018). Deep learning for seismic lithology prediction. Geophysical Journal International, 215 (2), 1368–1387.

Zhang, J., He, Y., & Zhang, Y. (2022a). Well-logging-based lithology classification using machine learning methods for high-quality reservoir identification: A case study of Baikouquan Formation in Mahu Area of Junggar Basin. NW China. Energies, 15 (10), 3675–3685.

Zhang, P., Srivastava, A., & Wang, T. Y. (2022b). C-memmap: Clustering-driven compact, adaptable, and generalizable meta-LSTM models for memory access prediction. International Journal of Data Science and Analytics, 13 (1), 3–16.

Zhang, Y., Xu, B., & Zhao, T. (2020). Convolutional multi-head self-attention on memory for aspect sentiment classification. IEEE/CAA Journal of Automatica Sinica, 7 (4), 1038–1044.

Zou, Y., Chen, Y., & Deng, H. (2021). Gradient boosting decision tree for lithology identification with well logs: A case study of Zhaoxian gold deposit, Shandong Peninsula. China. Natural Resources Research, 30 (5), 3197–3217.

Download references

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. The authors declare that they have no financial support or funding to report for this study.

Author information

Authors and affiliations.

College of Computer Science, China University of Petroleum (East China), Qingdao, Shandong, China

Youzhuang Sun, Shanchen Pang, Zhiyuan Zhao & Yongan Zhang

Qingdao College of Software, China University of Petroleum (East China), Qingdao, Shandong, China

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Shanchen Pang .

Ethics declarations

Conflict of interest.

The authors declare that they have no conflict of interest regarding the publication of this manuscript. All financial and non-financial relationships and activities that could be perceived as potential sources of bias have been disclosed.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Sun, Y., Pang, S., Zhao, Z. et al. Interpretable SHAP Model Combining Meta-learning and Vision Transformer for Lithology Classification Using Limited and Unbalanced Drilling Data in Well Logging. Nat Resour Res (2024). https://doi.org/10.1007/s11053-024-10396-4

Download citation

Received : 08 April 2024

Accepted : 02 August 2024

Published : 19 August 2024

DOI : https://doi.org/10.1007/s11053-024-10396-4

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Logging parameters
  • Meta-learning
  • Vision transformer
  • Lithology prediction
  • Machine learning
  • Find a journal
  • Publish with us
  • Track your research
  • Number System and Arithmetic
  • Trigonometry
  • Probability
  • Mensuration
  • Linear Algebra
  • CBSE Class 8 Maths Formulas
  • CBSE Class 9 Maths Formulas
  • CBSE Class 10 Maths Formulas
  • CBSE Class 11 Maths Formulas

What are the different ways of Data Representation?

The process of collecting the data and analyzing that data in large quantity is known as statistics. It is a branch of mathematics trading with the collection, analysis, interpretation, and presentation of numeral facts and figures.

It is a numerical statement that helps us to collect and analyze the data in large quantity the statistics are based on two of its concepts:

  • Statistical Data 
  • Statistical Science

Statistics must be expressed numerically and should be collected systematically.

Data Representation

The word data refers to constituting people, things, events, ideas. It can be a title, an integer, or anycast.  After collecting data the investigator has to condense them in tabular form to study their salient features. Such an arrangement is known as the presentation of data.

It refers to the process of condensing the collected data in a tabular form or graphically. This arrangement of data is known as Data Representation.

The row can be placed in different orders like it can be presented in ascending orders, descending order, or can be presented in alphabetical order. 

Example: Let the marks obtained by 10 students of class V in a class test, out of 50 according to their roll numbers, be: 39, 44, 49, 40, 22, 10, 45, 38, 15, 50 The data in the given form is known as raw data. The above given data can be placed in the serial order as shown below: Roll No. Marks 1 39 2 44 3 49 4 40 5 22 6 10 7 45 8 38 9 14 10 50 Now, if you want to analyse the standard of achievement of the students. If you arrange them in ascending or descending order, it will give you a better picture. Ascending order: 10, 15, 22, 38, 39, 40, 44. 45, 49, 50 Descending order: 50, 49, 45, 44, 40, 39, 38, 22, 15, 10 When the row is placed in ascending or descending order is known as arrayed data.

Types of Graphical Data Representation

Bar chart helps us to represent the collected data visually. The collected data can be visualized horizontally or vertically in a bar chart like amounts and frequency. It can be grouped or single. It helps us in comparing different items. By looking at all the bars, it is easy to say which types in a group of data influence the other.

Now let us understand bar chart by taking this example  Let the marks obtained by 5 students of class V in a class test, out of 10 according to their names, be: 7,8,4,9,6 The data in the given form is known as raw data. The above given data can be placed in the bar chart as shown below: Name Marks Akshay 7 Maya 8 Dhanvi 4 Jaslen 9 Muskan 6

A histogram is the graphical representation of data. It is similar to the appearance of a bar graph but there is a lot of difference between histogram and bar graph because a bar graph helps to measure the frequency of categorical data. A categorical data means it is based on two or more categories like gender, months, etc. Whereas histogram is used for quantitative data.

For example:

The graph which uses lines and points to present the change in time is known as a line graph. Line graphs can be based on the number of animals left on earth, the increasing population of the world day by day, or the increasing or decreasing the number of bitcoins day by day, etc. The line graphs tell us about the changes occurring across the world over time. In a  line graph, we can tell about two or more types of changes occurring around the world.

For Example:

Pie chart is a type of graph that involves a structural graphic representation of numerical proportion. It can be replaced in most cases by other plots like a bar chart, box plot, dot plot, etc. As per the research, it is shown that it is difficult to compare the different sections of a given pie chart, or if it is to compare data across different pie charts.

Frequency Distribution Table

A frequency distribution table is a chart that helps us to summarise the value and the frequency of the chart. This frequency distribution table has two columns, The first column consist of the list of the various outcome in the data, While the second column list the frequency of each outcome of the data. By putting this kind of data into a table it helps us to make it easier to understand and analyze the data. 

For Example: To create a frequency distribution table, we would first need to list all the outcomes in the data. In this example, the results are 0 runs, 1 run, 2 runs, and 3 runs. We would list these numerals in numerical ranking in the foremost queue. Subsequently, we ought to calculate how many times per result happened. They scored 0 runs in the 1st, 4th, 7th, and 8th innings, 1 run in the 2nd, 5th, and the 9th innings, 2 runs in the 6th inning, and 3 runs in the 3rd inning. We set the frequency of each result in the double queue. You can notice that the table is a vastly more useful method to show this data.  Baseball Team Runs Per Inning Number of Runs Frequency           0       4           1        3            2        1            3        1

Sample Questions

Question 1: Considering the school fee submission of 10 students of class 10th is given below:

Muskan  Paid
Kritika Not paid
Anmol Not paid
Raghav Paid
Nitin Paid
Dhanvi Paid
Jasleen Paid
Manas Not paid
Anshul Not paid
Sahil Paid
In order to draw the bar graph for the data above, we prepare the frequency table as given below. Fee submission No. of Students Paid   6 Not paid    4 Now we have to represent the data by using the bar graph. It can be drawn by following the steps given below: Step 1: firstly we have to draw the two axis of the graph X-axis and the Y-axis. The varieties of the data must be put on the X-axis (the horizontal line) and the frequencies of the data must be put on the Y-axis (the vertical line) of the graph. Step 2: After drawing both the axis now we have to give the numeric scale to the Y-axis (the vertical line) of the graph It should be started from zero and ends up with the highest value of the data. Step 3: After the decision of the range at the Y-axis now we have to give it a suitable difference of the numeric scale. Like it can be 0,1,2,3…….or 0,10,20,30 either we can give it a numeric scale like 0,20,40,60… Step 4: Now on the X-axis we have to label it appropriately. Step 5: Now we have to draw the bars according to the data but we have to keep in mind that all the bars should be of the same length and there should be the same distance between each graph

Question 2: Watch the subsequent pie chart that denotes the money spent by Megha at the funfair. The suggested colour indicates the quantity paid for each variety. The total value of the data is 15 and the amount paid on each variety is diagnosed as follows:

Chocolates – 3

Wafers – 3

Toys – 2

Rides – 7

To convert this into pie chart percentage, we apply the formula:  (Frequency/Total Frequency) × 100 Let us convert the above data into a percentage: Amount paid on rides: (7/15) × 100 = 47% Amount paid on toys: (2/15) × 100 = 13% Amount paid on wafers: (3/15) × 100 = 20% Amount paid on chocolates: (3/15) × 100 = 20 %

Question 3: The line graph given below shows how Devdas’s height changes as he grows.

Given below is a line graph showing the height changes in Devdas’s as he grows. Observe the graph and answer the questions below.

representation of a data type

(i) What was the height of  Devdas’s at 8 years? Answer: 65 inches (ii) What was the height of  Devdas’s at 6 years? Answer:  50 inches (iii) What was the height of  Devdas’s at 2 years? Answer: 35 inches (iv) How much has  Devdas’s grown from 2 to 8 years? Answer: 30 inches (v) When was  Devdas’s 35 inches tall? Answer: 2 years.

Please Login to comment...

Similar reads.

  • Mathematics
  • School Learning

Improve your Coding Skills with Practice

 alt=

What kind of Experience do you want to share?

IMAGES

  1. DataTypes in Php Programming

    representation of a data type

  2. Data Types: The Complete Guide for Beginners (2022)

    representation of a data type

  3. data type

    representation of a data type

  4. Data Representation

    representation of a data type

  5. Programming: Types of Data Types

    representation of a data type

  6. Data Type Formal Definitions

    representation of a data type

COMMENTS

  1. Data type

    The standard type hierarchy of Python 3. In computer science and computer programming, a data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these values as machine types. [1] A data type specification in a program constrains the possible values that an ...

  2. How exactly are data types represented in a computer?

    2. Really, datatypes are an abstraction that allows your programming language to treat a few bytes at some address as some kind of numeric type. Consider the data type as a lens that lets you see a piece of memory as an int, or a float. In reality, it's all just bits to the computer. answered Jan 9, 2010 at 17:30.

  3. Data Types in Programming

    There are various kind of data types available according to the various kind of data available. Data types are of 3 types. Primitive Data type: int, float, char, bool. Composite Data Types: string, array, pointers. User Defined Data Type. Summer-time is here and so is the time to skill-up!

  4. PDF Data Types And Representation

    Two major approaches: structural equivalence and name equivalence. Name equivalence is based on declarations. Two types are the same only if they have the same name. (Each type definition introduces a new type) strict: aliases (i.e. declaring a type to be equal to another type) are distinct. loose: aliases are equivalent.

  5. Data Representation

    This guide to data representation covers all the key concepts you need to know to understand the principles of representing data in computer systems. Whether you're a GCSE, IB or A-level computer science student, our guide provides a detailed explanation of how data is represented in binary, hexadecimal, and ASCII formats, as well as the ...

  6. A Tutorial on Data Representation

    The interpretation of binary pattern is called data representation or encoding. Furthermore, it is important that the data representation schemes are agreed-upon by all the parties, i.e., industrial standards need to be formulated and straightly followed. ... The char data type are based on the original 16-bit Unicode standard called UCS-2. The ...

  7. Decoding Computation Through Data Representation

    Primitive data types: Computers deal with binary data at the most basic level. In most programming languages, integers, floating-point numbers, characters, and Booleans are foundational data types. Their representation involves bit patterns in memory, with specifics such as endian-ness, precision, and overflow/underflow considerations.

  8. PDF Data Representation

    Data Representation Data Representation Eric Roberts CS 106A February 10, 2016 Claude Shannon Claude Shannon was one of the pioneers who shaped computer science in its early ... • When you pass an argument of a primitive type to a method, Java copies the value of the argument into the parameter variable. As a result, changes to the parameter ...

  9. PDF Bits and Bytes Data Representation 1

    Data Representation Computer Organization I 4 CS@VT ©2005-2020 WD McQuain Integer Data Types We need to provide support for a variety of data types. For integer values, we need to provide a variety of types that allow the user to choose based upon memory considerations and range of representation. For contemporary programming languages, we ...

  10. Basic Data Types in Python: A Quick Exploration

    With these types, you can represent numeric values, text and binary data, and Boolean values in your code. So, these data types are the basic building blocks of most Python programs and projects. In this tutorial, you'll learn about: Numeric types, such as int, float, and complex. The str data type, which represents textual data.

  11. PDF Data Representaton: Bits, Data Types, Operations (Chapter 2)

    Data Type •In a computer system, we need a representation of data and operations that can be performed on the data by the machine instructions or the computer language. •This combination of representation + operationsis known as a data type. • The type tells the compiler how the programmer intends to use it •Prog.

  12. Data Representation and Data Types

    Data Representation and Data Types. Data Representation. Most of us write numbers in Arabic form, ie, 1, 2, 3,..., 9. Some people write them differently, such as I, II, III, IV,...,IX. Nomatter what type of representation, most human beings canunderstand, at least the two types I mentioned. Unfortunately thecomputer doesn't.

  13. 2.1: Types of Data Representation

    2.1: Types of Data Representation. Page ID. Two common types of graphic displays are bar charts and histograms. Both bar charts and histograms use vertical or horizontal bars to represent the number of data points in each category or interval. The main difference graphically is that in a bar chart there are spaces between the bars and in a ...

  14. Data representations

    Data representations are useful for interpreting data and identifying trends and relationships. When working with data representations, pay close attention to both the data values and the key words in the question. When matching data to a representation, check that the values are graphed accurately for all categories.

  15. PDF Data Representation Type of Data Representation

    Data Representation • At its most basic level, all digital information must reduce to 0s and 1s, which can be discussed as binary, octal, or hex data. • There's no practical limit on how it can be interpreted to represent more complex structured data Type of Data Representation Integers 2's complement Floating point numbers IEEE 754

  16. Data types

    Numerical Data Types#. There are 5 basic numerical types representing booleans (bool), integers (int), unsigned integers (uint) floating point (float) and complex.A basic numerical type name combined with a numeric bitsize defines a concrete type.

  17. C++ Data Types

    C++ supports the following data types: Primary or Built-in or Fundamental data type; Derived data types; User-defined data types; Data Types in C++ are Mainly Divided into 3 Types: 1. Primitive Data Types: These data types are built-in or predefined data types and can be used directly by the user to declare variables. example: int, char, float, bool, etc. Primitive data types available in C++ are:

  18. Data Representation in Computer: Number Systems, Characters

    A computer uses a fixed number of bits to represent a piece of data which could be a number, a character, image, sound, video, etc. Data representation is the method used internally to represent data in a computer. Let us see how various types of data can be represented in computer memory. Before discussing data representation of numbers, let ...

  19. Data Types in C

    int var_name;. The integer data type can also be used as. unsigned int: Unsigned int data type in C is used to store the data values from zero to positive numbers but it can't store negative values like signed int. short int: It is lesser in size than the int by 2 bytes so can only store values from -32,768 to 32,767. long int: Larger version of the int datatype so can store values greater ...

  20. 10 Data Types (With Definitions and Examples)

    6. Short. Similar to the long data type, a short is a variable integer. Programmers represent these as whole numbers, and they can be positive or negative. Sometimes a short data type is a single integer. 7. String. A string data type is a combination of characters that can be either constant or variable.

  21. Data Representation: Definition, Types, Examples

    Data Representation: Data representation is a technique for analysing numerical data. The relationship between facts, ideas, information, and concepts is depicted in a diagram via data representation. It is a fundamental learning strategy that is simple and easy to understand. It is always determined by the data type in a specific domain.

  22. Graphical Representation of Data

    A bar graph is a type of graphical representation of the data in which bars of uniform width are drawn with equal spacing between them on one axis (x-axis usually), depicting the variable. The values of the variables are represented by the height of the bars. Histograms.

  23. Interpretable SHAP Model Combining Meta-learning and Vision ...

    These data are essential for analyzing subsurface geological structures. The raw data are preprocessed—normalization and standardization—to enhance the accuracy and efficiency of model processing in subsequent steps. Step 2: Meta-Learning Training Set Construction. Using the preprocessed data, construct a meta-training set.

  24. What are the different ways of Data Representation?

    When the row is placed in ascending or descending order is known as arrayed data. Types of Graphical Data Representation. Bar Chart. Bar chart helps us to represent the collected data visually. The collected data can be visualized horizontally or vertically in a bar chart like amounts and frequency. It can be grouped or single.