0 00000001 00000000000000000000000B
E = 1, F = 0
N(min) = 1.0B × 2^-126
(≈1.17549435 × 10^-38)
7F7F FFFFH 0 11111110 11111111111111111111111B E = 254, F = 11111111111111111111111B N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127 (≈3.4028235 × 10^38) |
Double | 0010 0000 0000 0000H N(min) = 1.0B × 2^-1022 (≈2.2250738585072014 × 10^-308) | 7FEF FFFF FFFF FFFFH N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023 (≈1.7976931348623157 × 10^308) |
Denormalized Floating-Point Numbers
If E = 0 , but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:
- For single-precision, E = 0 , N = (-1)^S × 0.F × 2^(-126)
- For double-precision, E = 0 , N = (-1)^S × 0.F × 2^(-1022)
Denormalized form can represent very small numbers close to zero, as well as zero itself, which cannot be represented in normalized form, as shown in the above figure.
The minimum and maximum of denormalized floating-point numbers are:
Precision | Denormalized D(min) | Denormalized D(max) |
Single | 0000 0001H 0 00000000 00000000000000000000001B E = 0, F = 00000000000000000000001B D(min) = 0.0...1 × 2^-126 = 1 × 2^-23 × 2^-126 = 2^-149 (≈1.4 × 10^-45) | 007F FFFFH 0 00000000 11111111111111111111111B E = 0, F = 11111111111111111111111B D(max) = 0.1...1 × 2^-126 = (1-2^-23)×2^-126 (≈1.1754942 × 10^-38) |
Double | 0000 0000 0000 0001H D(min) = 0.0...1 × 2^-1022 = 1 × 2^-52 × 2^-1022 = 2^-1074 (≈4.9 × 10^-324) | 000F FFFF FFFF FFFFH D(max) = 0.1...1 × 2^-1022 = (1-2^-52)×2^-1022 (≈2.2250738585072009 × 10^-308) |
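The denormalized limits can be checked by decoding the bit patterns directly. The following is a minimal Python sketch using the standard `struct` module; the helper names `bits_to_single` and `bits_to_double` are illustrative, and the largest denormalized double is taken here as 000F FFFF FFFF FFFFH (E = 0, F all ones).

```python
import struct

def bits_to_single(hex_pattern: str) -> float:
    """Interpret 8 hex digits as a big-endian IEEE 754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(hex_pattern))[0]

def bits_to_double(hex_pattern: str) -> float:
    """Interpret 16 hex digits as a big-endian IEEE 754 double-precision value."""
    return struct.unpack(">d", bytes.fromhex(hex_pattern))[0]

d_min_single = bits_to_single("00000001")           # E = 0, F = 0...01
d_max_single = bits_to_single("007FFFFF")           # E = 0, F = 1...11
n_min_single = bits_to_single("00800000")           # E = 1, F = 0 (smallest normalized)
d_min_double = bits_to_double("0000000000000001")   # E = 0, F = 0...01
d_max_double = bits_to_double("000FFFFFFFFFFFFF")   # E = 0, F = 1...11

print(d_min_single, d_max_single, n_min_single)
print(d_min_double, d_max_double)
```

The largest denormalized value sits just below the smallest normalized value, so the two ranges join without a gap.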
Special Values
Zero : Zero cannot be represented in the normalized form, and must be represented in denormalized form with E=0 and F=0 . There are two representations for zero: +0 with S=0 and -0 with S=1 .
Infinity : The value of +infinity (e.g., 1/0 ) and -infinity (e.g., -1/0 ) are represented with an exponent of all 1's ( E = 255 for single-precision and E = 2047 for double-precision), F=0 , and S=0 (for +INF ) and S=1 (for -INF ).
Not a Number (NaN) : NaN denotes a value that cannot be represented as a real number (e.g., 0/0 ). NaN is represented with an exponent of all 1's ( E = 255 for single-precision and E = 2047 for double-precision) and any non-zero fraction.
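As a rough illustration of these special bit patterns, the sketch below assembles single-precision values from their S, E and F fields. The helper `single_from_fields` is a hypothetical name introduced for this example, and the NaN uses one arbitrary non-zero fraction among many possible ones.

```python
import math
import struct

def single_from_fields(sign: int, exponent: int, fraction: int) -> float:
    """Pack S (1 bit), E (8 bits) and F (23 bits) into a single-precision value."""
    pattern = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", pattern.to_bytes(4, "big"))[0]

pos_zero = single_from_fields(0, 0, 0)          # E = 0, F = 0, S = 0
neg_zero = single_from_fields(1, 0, 0)          # E = 0, F = 0, S = 1
pos_inf = single_from_fields(0, 255, 0)         # E = 255, F = 0, S = 0
neg_inf = single_from_fields(1, 255, 0)         # E = 255, F = 0, S = 1
nan = single_from_fields(0, 255, 1 << 22)       # E = 255, F non-zero

print(pos_zero, neg_zero, pos_inf, neg_inf, nan)
```

Note that +0 and -0 compare equal even though their bit patterns differ, and that NaN does not compare equal to anything, including itself.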
Character Encoding
In computer memory, characters are "encoded" (or "represented") using a chosen "character encoding scheme" (aka "character set", "charset", "character map", or "code page").
For example, in ASCII (as well as Latin1, Unicode, and many other character sets):
- code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z' , respectively.
- code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z' , respectively.
- code numbers 48D (30H) to 57D (39H) represent '0' to '9' , respectively.
It is important to note that the representation scheme must be known before a binary pattern can be interpreted. E.g., the 8-bit pattern " 0100 0010B " could represent anything under the sun, known only to the person who encoded it.
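To make the point concrete, here is a small Python sketch that reads the same 8-bit pattern under three different schemes. The choice of schemes (unsigned integer, ASCII, packed BCD) is only an illustration.

```python
pattern = 0b0100_0010  # the 8-bit pattern 0100 0010B

as_unsigned = pattern                              # read as an unsigned integer
as_ascii = chr(pattern)                            # read as an ASCII code number
as_bcd = (pattern >> 4) * 10 + (pattern & 0x0F)    # read as two packed BCD digits

print(as_unsigned, as_ascii, as_bcd)
```

The bits never change; only the agreed interpretation does.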
The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for Western European characters, and Unicode (ISO/IEC 10646) for internationalization (i18n).
A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit character encoding scheme (such as Latin-x) can represent 256 characters and symbols, whereas a 16-bit encoding scheme (such as Unicode UCS-2) can represent 65,536 characters and symbols.
7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)
- ASCII (American Standard Code for Information Interchange) is one of the earlier character coding schemes.
- ASCII is originally a 7-bit code. It has been extended to 8-bit to better utilize the 8-bit computer memory organization. (The 8th-bit was originally used for parity check in the early computers.)
Hex | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
2 | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7 | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | |
Dec | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
3 | | | SP | ! | " | # | $ | % | & | ' |
4 | ( | ) | * | + | , | - | . | / | 0 | 1 |
5 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; |
6 | < | = | > | ? | @ | A | B | C | D | E |
7 | F | G | H | I | J | K | L | M | N | O |
8 | P | Q | R | S | T | U | V | W | X | Y |
9 | Z | [ | \ | ] | ^ | _ | ` | a | b | c |
10 | d | e | f | g | h | i | j | k | l | m |
11 | n | o | p | q | r | s | t | u | v | w |
12 | x | y | z | { | | | } | ~ | | | |
- Code number 32D (20H) is the blank or space character.
- '0' to '9' : 30H-39H (0011 0000B to 0011 1001B) or (0011 xxxxB where xxxx is the equivalent integer value )
- 'A' to 'Z' : 41H-5AH (0100 0001B to 0101 1010B) or (010x xxxxB) . 'A' to 'Z' are continuous without gap.
- 'a' to 'z' : 61H-7AH (0110 0001B to 0111 1010B) or (011x xxxxB) . 'a' to 'z' are also continuous without gap. However, there is a gap between uppercase and lowercase letters. To convert between upper and lowercase, flip the value of bit-5.
- 09H for Tab ( '\t' ).
- 0AH for Line-Feed or newline (LF or '\n' ) and 0DH for Carriage-Return (CR or '\r' ), which are used as line delimiter (aka line separator , end-of-line ) for text files. There is unfortunately no standard for line delimiter: Unixes and Mac use 0AH (LF or " \n "), Windows uses 0D0AH (CR+LF or " \r\n "). Programming languages such as C/C++/Java (which were created on Unix) use 0AH (LF or " \n ").
- In programming languages such as C/C++/Java, line-feed ( 0AH ) is denoted as '\n' , carriage-return ( 0DH ) as '\r' , tab ( 09H ) as '\t' .
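The bit-level regularities listed above can be sketched in a few lines of Python. The function names `toggle_case` and `digit_value` are hypothetical helpers, and the sketch only handles the ASCII range.

```python
def toggle_case(ch: str) -> str:
    """Flip bit 5 (value 20H) of an ASCII letter to switch its case."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # leave digits, punctuation and non-ASCII characters alone
    return chr(ord(ch) ^ 0x20)

def digit_value(ch: str) -> int:
    """Return the integer held in the low four bits of '0' to '9'."""
    if not ("0" <= ch <= "9"):
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) & 0x0F

print(toggle_case("A"), toggle_case("z"), digit_value("7"))
print(hex(ord(" ")), hex(ord("\t")), hex(ord("\n")), hex(ord("\r")))
```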
DEC | HEX | Code | Meaning | DEC | HEX | Code | Meaning |
0 | 00 | NUL | Null | 17 | 11 | DC1 | Device Control 1 |
1 | 01 | SOH | Start of Heading | 18 | 12 | DC2 | Device Control 2 |
2 | 02 | STX | Start of Text | 19 | 13 | DC3 | Device Control 3 |
3 | 03 | ETX | End of Text | 20 | 14 | DC4 | Device Control 4 |
4 | 04 | EOT | End of Transmission | 21 | 15 | NAK | Negative Ack. |
5 | 05 | ENQ | Enquiry | 22 | 16 | SYN | Sync. Idle |
6 | 06 | ACK | Acknowledgment | 23 | 17 | ETB | End of Transmission Block |
7 | 07 | BEL | Bell | 24 | 18 | CAN | Cancel |
8 | 08 | BS | Back Space | 25 | 19 | EM | End of Medium |
9 | 09 | HT | Horizontal Tab | 26 | 1A | SUB | Substitute |
10 | 0A | LF | Line Feed | 27 | 1B | ESC | Escape |
11 | 0B | VT | Vertical Tab | 28 | 1C | IS4 | File Separator |
12 | 0C | FF | Form Feed | 29 | 1D | IS3 | Group Separator |
13 | 0D | CR | Carriage Return | 30 | 1E | IS2 | Record Separator |
14 | 0E | SO | Shift Out | 31 | 1F | IS1 | Unit Separator |
15 | 0F | SI | Shift In | | | | |
16 | 10 | DLE | Data Link Escape | 127 | 7F | DEL | Delete |
8-bit Latin-1 (aka ISO/IEC 8859-1)
ISO/IEC-8859 is a collection of 8-bit character encoding standards for the western languages.
ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for Western European languages. It has 191 printable characters from the Latin script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are given as follows:
Hex | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
A | NBSP | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | SHY | ® | ¯ |
B | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
C | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
D | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
E | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
F | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
ISO/IEC-8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central European (Polish, Czech, Hungarian, etc), Part 3 for South European (Turkish, etc), Part 4 for North European (Estonian, Latvian, etc), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandoned, Part 13 for Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern European.
Other 8-bit Extension of US-ASCII (ASCII Extensions)
Besides the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each other.
ANSI (American National Standards Institute) (aka Windows-1252 , or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1 with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) are replaced with question marks or some strange symbols. This happens because the document is labeled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat charset ISO-8859-1 as Windows-1252 in order to accommodate such mis-labeling.
Hex | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
8 | € | | ‚ | ƒ | „ | … | † | ‡ | ˆ | ‰ | Š | ‹ | Œ | | Ž | |
9 | | ‘ | ’ | “ | ” | • | – | — | ˜ | ™ | š | › | œ | | ž | Ÿ |
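The mislabeling problem can be reproduced with a short Python sketch. It assumes a four-byte sample produced by Windows-1252 software and uses Python's built-in `cp1252` and `latin-1` codecs; the sample text is made up for illustration.

```python
# "Hi" wrapped in smart double-quotes, as Windows-1252 software would store it
raw = bytes([0x93, 0x48, 0x69, 0x94])

as_cp1252 = raw.decode("cp1252")    # 93H and 94H map to the curly quotes
as_latin1 = raw.decode("latin-1")   # 93H and 94H map to unprintable code points

print(repr(as_cp1252))
print(repr(as_latin1))
```

Decoding with the wrong label does not fail; it silently yields code points in the 80H to 9FH range, which is why the damage usually shows up only when the text is displayed.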
EBCDIC (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.
Unicode (aka ISO/IEC 10646 Universal Character Set)
Before Unicode, no single character encoding scheme could represent characters in all languages. For example, Western European languages use several encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes conflict with each other, i.e., the same code number is assigned to different characters.
Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous. Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org ). Unicode is an ISO/IEC standard 10646.
Unicode is backward compatible with the 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as US-ASCII; and the first 256 characters are the same as Latin-1.
Unicode originally used 16 bits (called UCS-2 or Universal Character Set - 2 byte), which can represent up to 65,536 characters. It has since been expanded beyond 16 bits and currently stands at 21 bits. The range of the legal codes in ISO/IEC 10646 is now from U+0000H to U+10FFFFH (21 bits, or 1,114,112 code numbers), covering all current and ancient historical scripts. The original 16-bit range of U+0000H to U+FFFFH (65536 characters) is known as Basic Multilingual Plane (BMP), covering all the major languages in use currently. The characters outside BMP are called Supplementary Characters , which are not frequently used.
Unicode has two encoding schemes:
- UCS-2 (Universal Character Set - 2 Byte): Uses 2 bytes (16 bits), covering 65,536 characters in the BMP. BMP is sufficient for most of the applications. UCS-2 is now obsolete.
- UCS-4 (Universal Character Set - 4 Byte): Uses 4 bytes (32 bits), covering BMP and the supplementary characters.
UTF-8 (Unicode Transformation Format - 8-bit)
The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if the document contains mainly ASCII characters, because each character occupies two or four bytes of storage. Variable-length encoding schemes, such as UTF-8, which uses 1-4 bytes to represent a character, were devised to improve the efficiency. In UTF-8, the 128 commonly-used US-ASCII characters use only 1 byte, but some less common characters may require up to 4 bytes. Overall, the efficiency improves for documents containing mainly US-ASCII text.
The transformation between Unicode and UTF-8 is as follows:
Bits | Unicode | UTF-8 Code | Bytes |
7 | 00000000 0xxxxxxx | 0xxxxxxx | 1 (ASCII) |
11 | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx | 2 |
16 | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx | 3 |
21 | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx | 4 |
In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters are padded with a leading zero; thus has the same value as ASCII. Hence, UTF-8 can be used with all software using ASCII. Unicode numbers of 128 and above, which are less frequently used, are encoded using more bytes (2-4 bytes). UTF-8 generally requires less storage and is compatible with ASCII. The drawback of UTF-8 is more processing power needed to unpack the code due to its variable length. UTF-8 is the most popular format for Unicode.
- UTF-8 uses 1-3 bytes for the characters in BMP (16-bit), and 4 bytes for supplementary characters outside BMP (21-bit).
- The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte. Most European and Middle East characters use a 2-byte sequence, which includes extended Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) use three-byte sequences.
- All the bytes, except the 128 ASCII characters, have a leading '1' bit. In other words, the ASCII bytes, with a leading '0' bit, can be identified and decoded easily.
Example : 您好 (Unicode: 60A8H 597DH)
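The transformation table can be followed step by step in code. Below is a minimal Python sketch of the bit shuffling; `utf8_encode` is an illustrative name, and the sketch does not reject invalid inputs such as surrogate code numbers. Real programs would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code number by following the UTF-8 transformation table."""
    if code_point < 0x80:                       # 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 11 bits -> 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 16 bits -> 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 21 bits -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

encoded = b"".join(utf8_encode(ord(ch)) for ch in "您好")
print(encoded.hex(" ").upper())   # E6 82 A8 E5 A5 BD
```

For 60A8H (0110 0000 1010 1000B), the groups are zzzz = 0110, yyyyyy = 000010 and xxxxxx = 101000, giving 1110 0110, 1000 0010, 1010 1000, i.e., E6 82 A8.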
UTF-16 (Unicode Transformation Format - 16-bit)
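UTF-16 is a variable-length Unicode character encoding scheme, which uses 2 or 4 bytes. UTF-16 is less commonly used than UTF-8 for files and data interchange, although some platforms (such as Java) use it internally. The transformation table is as follows: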
Unicode | UTF-16 Code | Bytes |
xxxxxxxx xxxxxxxx | Same as UCS-2 - no encoding | 2 |
000uuuuu zzzzyyyy yyxxxxxx (uuuuu≠0) | 110110ww wwzzzzyy 110111yy yyxxxxxx (wwww = uuuuu - 1) | 4 |
Take note that for the 65536 characters in BMP, the UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP.
For supplementary characters, each character requires a pair of 16-bit values, the first from the high-surrogates range ( \uD800-\uDBFF ), the second from the low-surrogates range ( \uDC00-\uDFFF ).
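The surrogate-pair computation can be sketched as follows. The function name `to_surrogates` is hypothetical, and the sample code number 1F600H is just one arbitrary supplementary character chosen for the example.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code number into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000            # 20 bits: wwwwzzzzyy yyxxxxxx
    high = 0xD800 | (offset >> 10)           # 110110ww wwzzzzyy
    low = 0xDC00 | (offset & 0x3FF)          # 110111yy yyxxxxxx
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")   # D83D DE00
```

Subtracting 10000H is equivalent to the rule wwww = uuuuu - 1 in the table, since the subtraction only affects the top five bits.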
UTF-32 (Unicode Transformation Format - 32-bit)
Same as UCS-4, which uses 4 bytes for each character - unencoded.
Formats of Multi-Byte (e.g., Unicode) Text Files
Endianness (or byte-order) : For a multi-byte character, you need to take care of the order of the bytes in storage. In big endian , the most significant byte is stored at the memory location with the lowest address (big byte first). In little endian , the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number of 60A8H ) is stored as 60 A8 in big endian; and stored as A8 60 in little endian. Big endian, which produces a more readable hex dump, is more commonly used, and is often the default.
BOM (Byte Order Mark) : BOM is a special Unicode character having code number of FEFFH , which is used to differentiate big-endian and little-endian. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH . Unicode reserves these two code numbers to prevent it from clashing with another character.
Unicode text files could take on these formats:
- Big Endian: UCS-2BE, UTF-16BE, UTF-32BE.
- Little Endian: UCS-2LE, UTF-16LE, UTF-32LE.
- UTF-16 with BOM. The first character of the file is a BOM character, which specifies the endianness. For big-endian, BOM appears as FE FFH in the storage. For little-endian, BOM appears as FF FEH .
UTF-8 is defined as a sequence of bytes in a fixed order, so endianness does not arise and the BOM plays no part in it. However, in some systems (in particular Windows), a BOM is added as the first character in the UTF-8 file as the signature to identify the file as UTF-8 encoded. The BOM character ( FEFFH ) is encoded in UTF-8 as EF BB BF . Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted in other systems. You can have a UTF-8 file without BOM.
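Byte order and the BOM can be observed with Python's built-in codecs. This is only a sketch: it uses the codec names `utf-16-be`, `utf-16-le`, `utf-16` and `utf-8-sig`, plus the BOM constants from the standard `codecs` module.

```python
import codecs

text = "您"   # Unicode number 60A8H

big = text.encode("utf-16-be")        # bytes 60 A8
little = text.encode("utf-16-le")     # bytes A8 60
utf8_signed = text.encode("utf-8-sig")  # EF BB BF followed by the UTF-8 bytes

# With a BOM in front, the generic "utf-16" decoder works out the byte order.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")

print(big.hex(" "), little.hex(" "), utf8_signed.hex(" "))
print(from_big == from_little == text)
```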
Formats of Text Files
Line Delimiter or End-Of-Line (EOL) : Sometimes, when you use the Windows NotePad to open a text file (created in Unix or Mac), all the lines are joined together. This is because different operating platforms use different character as the so-called line delimiter (or end-of-line or EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).
- Windows/DOS uses 0D0AH (CR+LF or " \r\n ") as EOL.
- Unix and Mac use 0AH (LF or " \n ") only.
Windows' CMD Codepage
Character encoding scheme (charset) in Windows is called codepage . In CMD shell, you can issue command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.
Take note that:
- The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII , which is different from Latin-1 for code numbers above 127.
- Codepage 1252 (Windows-1252) is not exactly the same as Latin-1. It assigns code numbers 80H to 9FH to letters and punctuation, such as smart single-quotes and double-quotes. A common problem is that browsers display quotes and apostrophes as question marks or boxes because the page is encoded in Windows-1252 but mislabelled as ISO-8859-1.
- For internationalization and Chinese character sets: codepage 65001 for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for Chinese characters in GB2312, codepage 950 for Chinese characters in Big5.
Chinese Character Sets
Unicode supports all languages, including Asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (collectively called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which unfortunately requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16).
Worse still, there are also various Chinese character sets, which are not compatible with Unicode:
- GB2312/GBK: for simplified Chinese characters. GB2312 uses 2 bytes for each Chinese character. The most significant bit (MSB) of both bytes is set to 1 to co-exist with 7-bit ASCII, which has an MSB of 0. There are about 6700 characters. GBK is an extension of GB2312, which includes more characters as well as traditional Chinese characters.
- BIG5: for traditional Chinese characters. BIG5 also uses 2 bytes for each Chinese character. The MSB of the first byte is set to 1; the second byte lies in the range 40H-7EH or A1H-FEH, so its MSB is not always 1. BIG5 is not compatible with GBK, i.e., the same code number is assigned to different characters.
For example, the world is made more interesting with these many standards:
Script | Standard | Characters | Codes |
Simplified | GB2312 | 和谐 | BACD D0B3 |
Simplified | UCS-2 | 和谐 | 548C 8C10 |
Simplified | UTF-8 | 和谐 | E5928C E8B090 |
Traditional | BIG5 | 和諧 | A94D BFD3 |
Traditional | UCS-2 | 和諧 | 548C 8AE7 |
Traditional | UTF-8 | 和諧 | E5928C E8ABA7 |
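The codes in this table can be regenerated with Python's built-in codecs. The sketch below assumes the codec names `gb2312`, `big5`, `utf-16-be` (used here as a stand-in for UCS-2 big-endian, which is identical for BMP characters) and `utf-8`; `hex_codes` is an illustrative helper.

```python
def hex_codes(text: str, encoding: str) -> str:
    """Return the encoded bytes of each character as upper-case hex, space-separated."""
    return " ".join(ch.encode(encoding).hex().upper() for ch in text)

simplified = "和谐"
traditional = "和諧"

for label, text, encoding in [
    ("GB2312", simplified, "gb2312"),
    ("UCS-2 ", simplified, "utf-16-be"),
    ("UTF-8 ", simplified, "utf-8"),
    ("BIG5  ", traditional, "big5"),
    ("UCS-2 ", traditional, "utf-16-be"),
    ("UTF-8 ", traditional, "utf-8"),
]:
    print(label, text, hex_codes(text, encoding))
```

The printed codes should line up with the rows of the table, which makes the script a quick way to check any other characters of interest.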
Notes for Windows' CMD Users : To display Chinese characters correctly in the CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You can use command " chcp " to display the current code page and command " chcp codepage_number " to change the codepage. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT Raster font).
Collating Sequences (for Ranking Characters)
A string consists of a sequence of characters in upper or lower cases, e.g., "apple" , "BOY" , "Cat" . In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the example would be "BOY" , "Cat" , "apple" because uppercase letters have a smaller code number than lowercase letters. This does not agree with the so-called dictionary order , where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is that "10" (ten) is at times ordered in front of "2" to "9" , because the strings are compared character by character and '1' has a smaller code number than '2' .
Hence, in sorting or comparison of strings, a so-called collating sequence (or collation ) is often defined, which specifies the ranks for letters (uppercase, lowercase), numbers, and special symbols. There are many collating sequences available. It is entirely up to you to choose a collating sequence to meet your application's specific requirements. Some case-insensitive dictionary-order collating sequences have the same rank for same uppercase and lowercase letters, i.e., 'A' , 'a' ⇒ 'B' , 'b' ⇒ ... ⇒ 'Z' , 'z' . Some case-sensitive dictionary-order collating sequences put the uppercase letter before its lowercase counterpart, i.e., 'A' ⇒ 'a' ⇒ 'B' ⇒ 'b' ⇒ ... ⇒ 'Z' ⇒ 'z' . Typically, space is ranked before digits '0' to '9' , followed by the letters.
Collating sequence is often language dependent, as different languages use different sets of characters (e.g., á, é, a, α) with their own orders.
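The difference between code-number order and a simple case-insensitive collation can be seen in a short Python sketch. Using `str.casefold` and `int` as sort keys is just one possible choice of collating rule, not a full language-aware collation.

```python
words = ["apple", "BOY", "Cat"]

by_code_number = sorted(words)                  # plain code-number comparison
by_dictionary = sorted(words, key=str.casefold) # case-insensitive ranking

numbers = ["10", "9", "1", "2"]
as_strings = sorted(numbers)                    # character-by-character comparison
as_integers = sorted(numbers, key=int)          # numeric ranking

print(by_code_number)   # ['BOY', 'Cat', 'apple']
print(by_dictionary)    # ['apple', 'BOY', 'Cat']
print(as_strings)       # ['1', '10', '2', '9']
print(as_integers)      # ['1', '2', '9', '10']
```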
For Java Programmers - java.nio.Charset
JDK 1.4 introduced a new java.nio.charset package to support encoding/decoding of characters from UCS-2 used internally in Java program to any supported charset used by external devices.
Example : The following program encodes some Unicode texts in various encoding schemes, and displays the Hex codes of the encoded byte sequences.
For Java Programmers - char and String
The char data type is based on the original 16-bit Unicode standard called UCS-2. Unicode has since evolved to 21 bits, with a code range of U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is known as the Basic Multilingual Plane ( BMP ). Characters above U+FFFF are called supplementary characters. A 16-bit Java char cannot hold a supplementary character.
Recall that in the UTF-16 encoding scheme, a BMP character uses 2 bytes. It is the same as UCS-2. A supplementary character uses 4 bytes, and requires a pair of 16-bit values, the first from the high-surrogates range ( \uD800-\uDBFF ), the second from the low-surrogates range ( \uDC00-\uDFFF ).
In Java, a String is a sequence of Unicode characters. Java, in fact, uses UTF-16 for String and StringBuffer . For BMP characters, they are the same as UCS-2. For supplementary characters, each character requires a pair of char values.
Java methods that accept a 16-bit char value do not support supplementary characters. Methods that accept a 32-bit int value support all Unicode characters (in the lower 21 bits), including supplementary characters.
This is meant to be an academic discussion. I have yet to encounter the use of supplementary characters!
Displaying Hex Values & Hex Editors
At times, you may need to display the hex values of a file, especially in dealing with Unicode characters. A Hex Editor is a handy tool that a good programmer should possess in his/her toolbox. There are many freeware/shareware Hex Editors available. Try googling "Hex Editor".
I used the following:
- NotePad++ with Hex Editor Plug-in: Open-source and free. You can toggle between Hex view and Normal view by pushing the "H" button.
- PSPad: Freeware. You can toggle to Hex view by choosing "View" menu and select "Hex Edit Mode".
- TextPad: Shareware without expiration period. To view the Hex value, you need to "open" the file by choosing the file format of "binary" (??).
- UltraEdit: Shareware, not free, 30-day trial only.
Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between Hex and normal view, free, ....
The following Java program can be used to display hex code for Java Primitives (integer, character and floating-point):
public class PrintHexCode {   // illustrative class name; the original declaration is not shown
   public static void main(String[] args) {
      int i = 12345;   // assumed value, inferred from the outputs below
      System.out.println("Hex is " + Integer.toHexString(i));       // 3039
      System.out.println("Binary is " + Integer.toBinaryString(i)); // 11000000111001
      System.out.println("Octal is " + Integer.toOctalString(i));   // 30071
      System.out.printf("Hex is %x\n", i);    // 3039
      System.out.printf("Octal is %o\n", i);  // 30071

      char c = 'a';
      System.out.println("Character is " + c);        // a
      System.out.printf("Character is %c\n", c);      // a
      System.out.printf("Hex is %x\n", (short)c);     // 61
      System.out.printf("Decimal is %d\n", (short)c); // 97

      float f = 3.5f;
      System.out.println("Decimal is " + f);    // 3.5
      System.out.println(Float.toHexString(f)); // 0x1.cp1 (Fraction=1.c, Exponent=1)

      f = -0.75f;
      System.out.println("Decimal is " + f);    // -0.75
      System.out.println(Float.toHexString(f)); // -0x1.8p-1 (F=-1.8, E=-1)

      double d = 11.22;
      System.out.println("Decimal is " + d);     // 11.22
      System.out.println(Double.toHexString(d)); // 0x1.670a3d70a3d71p3 (F=1.670a3d70a3d71 E=3)
   }
}
In Eclipse, you can view the hex code for integer primitive Java variables in debug mode as follows: In debug perspective, "Variable" panel ⇒ Select the "menu" (inverted triangle) ⇒ Java ⇒ Java Preferences... ⇒ Primitive Display Options ⇒ Check "Display hexadecimal values (byte, short, char, int, long)".
Summary - Why Bother about Data Representation?
Integer number 1 , floating-point number 1.0 , character symbol '1' , and string "1" are totally different inside the computer memory. You need to know the difference to write good and high-performance programs.
- In 8-bit signed integer , integer number 1 is represented as 00000001B .
- In 8-bit unsigned integer , integer number 1 is represented as 00000001B .
- In 16-bit signed integer , integer number 1 is represented as 00000000 00000001B .
- In 32-bit signed integer , integer number 1 is represented as 00000000 00000000 00000000 00000001B .
- In 32-bit floating-point representation , number 1.0 is represented as 0 01111111 0000000 00000000 00000000B , i.e., S=0 , E=127 , F=0 .
- In 64-bit floating-point representation , number 1.0 is represented as 0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B , i.e., S=0 , E=1023 , F=0 .
- In 8-bit Latin-1, the character symbol '1' is represented as 00110001B (or 31H ).
- In 16-bit UCS-2, the character symbol '1' is represented as 00000000 00110001B .
- In UTF-8, the character symbol '1' is represented as 00110001B .
If you "add" a 16-bit signed integer 1 and Latin-1 character '1' or a string "1", you could get a surprise.
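The representations listed above can be inspected with a few lines of Python. This is a sketch using the standard `struct` module and built-in codecs; `utf-16-be` stands in for 16-bit UCS-2, which is identical for this character.

```python
import struct

int16 = (1).to_bytes(2, "big", signed=True)   # 16-bit signed integer 1
float32 = struct.pack(">f", 1.0)              # 32-bit floating-point 1.0
float64 = struct.pack(">d", 1.0)              # 64-bit floating-point 1.0
latin1 = "1".encode("latin-1")                # character '1' in Latin-1
ucs2 = "1".encode("utf-16-be")                # character '1' in 16-bit Unicode
utf8 = "1".encode("utf-8")                    # character '1' in UTF-8

# "Adding" the integer 1 to the code number of the character '1'
surprise = 1 + ord("1")

for name, raw in [("int16", int16), ("float32", float32), ("float64", float64),
                  ("latin1", latin1), ("ucs2", ucs2), ("utf8", utf8)]:
    print(f"{name:8}", raw.hex(" ").upper())
print("1 + ord('1') =", surprise)
```

The sum is 50 rather than 2, because the character '1' is stored as code number 49 (31H).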
Exercises (Data Representation)
For the following 16-bit codes:
- 0000 0000 0010 1010 (002AH)
- 1000 0000 0010 1010 (802AH)
Give their values, if they are representing:
- a 16-bit unsigned integer;
- a 16-bit signed integer;
- two 8-bit unsigned integers;
- two 8-bit signed integers;
- a 16-bit Unicode character;
- two 8-bit ISO-8859-1 characters.
Ans: (1) 42 , 32810 ; (2) 42 , -32726 ; (3) 0 , 42 ; 128 , 42 ; (4) 0 , 42 ; -128 , 42 ; (5) '*' ; '耪' ; (6) NUL , '*' ; PAD , '*' .
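The answers can be cross-checked with a short Python sketch. It assumes the two codes are 002AH and 802AH, which is what the answers above imply; `interpretations` is a hypothetical helper name.

```python
def interpretations(code: int) -> dict:
    """Read one 16-bit code under each of the six schemes in the exercise."""
    raw = code.to_bytes(2, "big")
    return {
        "u16": int.from_bytes(raw, "big"),
        "s16": int.from_bytes(raw, "big", signed=True),
        "u8": tuple(raw),
        "s8": tuple(int.from_bytes(raw[i:i + 1], "big", signed=True) for i in range(2)),
        "unicode": chr(code),
        "latin1": raw.decode("latin-1"),
    }

for code in (0x002A, 0x802A):
    print(f"{code:04X}H", interpretations(code))
```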
REFERENCES & RESOURCES
- (Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".
- (ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992), "Information technology - 7-bit coded character set for information interchange".
- (Latin-1 Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No. 1".
- (Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".
- Unicode Consortium @ http://www.unicode.org .
Last modified: January, 2014
Basic Data Types in Python: A Quick Exploration
Python has several basic data types that are built into the language. With these types, you can represent numeric values, text and binary data, and Boolean values in your code. So, these data types are the basic building blocks of most Python programs and projects.
In this tutorial, you’ll learn about:
- Numeric types, such as int , float , and complex
- The str data type, which represents textual data
- The bytes and bytearray data types for storing bytes
- Boolean values with the bool data type
In this tutorial, you’ll learn only the basics of each data type. To learn more about a specific data type, you’ll find useful resources in the corresponding section.
Python has several built-in data types that you can use out of the box because they’re built into the language. From all the built-in types available, you’ll find that a few of them represent basic objects, such as numbers, strings and characters, bytes, and Boolean values.
Note that the term basic refers to objects that can represent data you typically find in real life, such as numbers and text. It doesn’t include composite data types, such as lists , tuples , dictionaries , and others.
In Python, the built-in data types that you can consider basic are the following:
Class | Basic Type |
int | Integer numbers |
float | Floating-point numbers |
complex | Complex numbers |
str | Strings and characters |
bytes , bytearray | Bytes |
bool | Boolean values |
In the following sections, you’ll learn the basics of how to create, use, and work with all of these built-in data types in Python.
Integer Numbers
Integer numbers are whole numbers with no decimal places. They can be positive or negative numbers. For example, 0 , 1 , 2 , 3 , -1 , -2 , and -3 are all integers. Usually, you’ll use positive integer numbers to count things.
In Python, the integer data type is represented by the int class:
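As a quick check, you can ask Python for the type of an integer value. The variable names below are only for illustration:

```python
number = 42
kind = type(number)

print(kind)                     # <class 'int'>
print(isinstance(number, int))  # True
```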
In the following sections, you’ll learn the basics of how to create and work with integer numbers in Python.
When you need to use integer numbers in your code, you’ll often use integer literals directly. Literals are constant values of built-in types spelled out literally, such as integers. Python provides a few different ways to create integer literals. The most common way is to use base-ten literals that look the same as integers look in math:
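A minimal sketch of base-ten integer literals, with variable names chosen only for this example:

```python
positive = 42
negative = -42
zero = 0

print(positive, negative, zero)
```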
Here, you have three integer numbers: a positive one, a negative one, and zero. Note that to create negative integers, you need to prepend the minus sign ( - ) to the number.
Python has no limit to how long an integer value can be. The only constraint is the amount of memory your system has. Beyond that, an integer can be as long as you need:
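For instance, the following sketch builds an integer far larger than a 64-bit machine word can hold. The name `googol` is just a label for 10 to the power of 100:

```python
googol = 10 ** 100          # a 1 followed by 100 zeros
bigger = googol * googol    # arithmetic keeps full precision

print(googol.bit_length())  # number of bits needed, far more than 64
print(len(str(bigger)))     # 201 decimal digits
```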
For a really, really long integer, you can get a ValueError when converting it to a string:
If you need to print an integer number beyond the 4300-digit limit, then you can use the sys.set_int_max_str_digits() function to increase the limit and make your code work.
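The conversion limit and its adjustment might look like the sketch below. It assumes an interpreter that provides `sys.get_int_max_str_digits()` and `sys.set_int_max_str_digits()` (Python 3.11 and later); on older versions the `hasattr()` guard skips the adjustment. The helper `digit_count` and the value 6000 are arbitrary choices for this example.

```python
import sys

huge = 10 ** 5000  # an integer with 5,001 decimal digits

def digit_count(value):
    """Return the number of decimal digits, or None if str() refuses to convert."""
    try:
        return len(str(value))
    except ValueError:
        return None

if hasattr(sys, "set_int_max_str_digits"):
    original_limit = sys.get_int_max_str_digits()
    before = digit_count(huge)        # None when the default 4300-digit limit applies
    sys.set_int_max_str_digits(6000)  # raise the limit for this process
    after = digit_count(huge)
    sys.set_int_max_str_digits(original_limit)  # restore the previous setting
else:
    before = after = digit_count(huge)  # no limit on this interpreter

print(before, after)
```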
When you’re working with long integers, you can use the underscore character to make the literals more readable:
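Here's a small sketch showing that underscores change only how the literal reads, not its value:

```python
plain = 1000000
grouped = 1_000_000

print(plain == grouped)  # True
print(grouped)           # 1000000
```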
With the underscore as a thousands separator, you can make your integer literals more readable for fellow programmers reading your code.
You can also use other bases to represent integers. You can prepend the following characters to an integer value to indicate a base other than 10 :
Prefix | Representation | Base |
0b or 0B (Zero + b or B) | Binary | 2 |
0o or 0O (Zero + o or O) | Octal | 8 |
0x or 0X (Zero + x or X) | Hexadecimal | 16 |
Using the above characters, you can create integer literals using binary , octal , and hexadecimal representations. For example:
Note that the underlying type of a Python integer is always int . So, in all cases, the built-in type() function returns int , irrespective of the base you use to build the literal.
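As a short illustration of underscores and base prefixes in integer literals, consider the sketch below. The variable names and values are made up for the example:

```python
# Underscores group digits for readability and don't change the value.
population = 8_000_000_000

# The same kind of object, written in three different bases.
binary = 0b1010       # 10 in base 2
octal = 0o17          # 15 in base 8
hexadecimal = 0xFF    # 255 in base 16

print(population, binary, octal, hexadecimal)
print(type(binary), type(octal), type(hexadecimal))
```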
The built-in int type has a few methods that you can use in some situations. Here’s a quick summary of these methods:
Method | Description |
.as_integer_ratio() | Returns a pair of integers whose ratio is equal to the original integer and has a positive denominator |
.bit_count() | Returns the number of ones in the binary representation of the absolute value of the integer |
.bit_length() | Returns the number of bits necessary to represent an integer in binary, excluding the sign and leading zeros |
.from_bytes() | Returns the integer represented by the given array of bytes |
.to_bytes() | Returns an array of bytes representing an integer |
.is_integer() | Returns True |
When you call the .as_integer_ratio() method on an integer value, you get the integer as the numerator and 1 as the denominator. As you’ll see in a moment, this method is more useful in floating-point numbers.
Note that the int type also has a method called .is_integer() , which always returns True . This method exists for duck typing compatibility with floating-point numbers, which have the method as part of their public interface .
Note: To access an integer method on a literal, you need to wrap the literal in parentheses:
The parentheses are required because the dot character ( . ) also defines floating-point numbers, as you’ll learn in a moment. If you don’t use the parentheses, then you get a SyntaxError .
The .bit_count() and .bit_length() methods can help you when working on digital signal processing . For example, you may want every transmitted signal to have an even number of set bits:
In this toy example, you use .bit_count() to ensure that the received signal has the correct parity. This way, you implement a basic error detection mechanism.
Finally, the .from_bytes() and .to_bytes() methods can be useful in network programming. Often, you need to send and receive data over the network in binary format. To do this, you can use .to_bytes() to convert the message for network transmission. Similarly, you can use .from_bytes() to convert the message back.
The built-in int() function provides another way to create integer values using different representations. With no arguments, the function returns 0 :
This feature makes int() especially useful when you need a factory function for classes like defaultdict from the collections module.
Note: In Python, the built-in functions associated with data types, such as int() , float() , str() , and bytes() , are classes with a function-style name. The Python documentation calls them functions, so you’ll follow that practice in this tutorial. However, keep in mind that something like int() is really a class constructor rather than a regular function.
The int() function is commonly used to convert other data types into integers, provided that they’re valid numeric values:
In these examples, you first use int() to convert a floating-point number into an integer. Then, you convert a string into an integer. Note that when it comes to strings, you must ensure that the input string is a valid numeric value. Otherwise, you’ll get a ValueError exception.
Note: When you use the int() function to convert floating-point numbers, you must be aware that the function just removes the decimal or fractional part.
This function can take an additional argument called base , which defaults to 10 for decimal integers. This argument allows you to convert strings that represent integer values, which are expressed using a different base:
In this case, the first argument must be a string representing an integer value with or without a prefix. Then, you must provide the appropriate base in the second argument to run the conversion. Once you call the function, you get the resulting integer value.
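To tie these uses of int() together, here is a small sketch. The input values and the word-counting example are illustrative assumptions, not part of the original examples:

```python
from collections import defaultdict

zero = int()                          # No arguments gives 0
truncated_positive = int(42.9)        # Fractional part is dropped
truncated_negative = int(-42.9)       # Truncates toward zero
from_string = int("42")
from_binary = int("0b1010", base=2)   # The prefix is optional
from_hex = int("ff", base=16)

try:
    int("forty-two")
    invalid_rejected = False
except ValueError:
    invalid_rejected = True

# int() as a factory function: missing keys start at 0.
word_counts = defaultdict(int)
for word in "to be or not to be".split():
    word_counts[word] += 1

print(zero, truncated_positive, truncated_negative)
print(from_string, from_binary, from_hex, invalid_rejected)
print(dict(word_counts))
```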
Floating-Point Numbers
Floating-point numbers, or just floats, are numbers with a decimal place. For example, 1.0 and 3.14 are floating-point numbers. You can also have negative float numbers, such as -2.75 . In Python, the float class represents floating-point numbers:
In the following sections, you’ll learn the basics of how to create and work with floating-point numbers in Python.
The float type in Python designates floating-point numbers. To create these types of numbers, you can also use literals, similar to what you use in math. However, in Python, the dot character ( . ) is what you must use to create floating-point literals:
In these quick examples, you create floating-point numbers in three different ways. First, you have a literal built using an integer part, the dot, and the decimal part. You can also create a literal using the dot without specifying the decimal part, which defaults to 0 . Finally, you make a literal without specifying the integer part, which also defaults to 0 .
You can also have negative float numbers:
To create a negative floating-point number using a literal, you need to prepend the minus sign ( - ) to the number.
Similar to integer numbers, if you’re working with long floating-point numbers, you can use the underscore character as a thousands separator:
By using an underscore, you can make your floating-point literals more readable for humans, which is great.
Optionally, you can use the characters e or E followed by a positive or negative integer to express the number using scientific notation :
By using the e or E character, you can represent any floating-point number using scientific notation, as you did in the above examples.
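The different literal forms described in this section can be sketched as follows. The values are arbitrary and only serve as an illustration:

```python
full = 3.14              # Integer part, dot, and decimal part
no_fraction = 10.        # Decimal part defaults to 0
no_integer = .5          # Integer part defaults to 0
negative = -2.75         # Minus sign in front of the literal
grouped = 1_000_000.5    # Underscores as thousands separators
large = 1.5e6            # Scientific notation: 1.5 times 10 to the 6th
small = 2.5E-3           # Uppercase E and a negative exponent

print(full, no_fraction, no_integer, negative)
print(grouped, large, small)
```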
Now, you can take a more in-depth look at how Python internally represents floating-point numbers. You can readily use floating-point numbers in Python without understanding them to this level, so don’t worry if this seems overly complicated. The information in this section is only meant to satisfy your curiosity.
Note: For additional information on the floating-point representation in Python and the potential pitfalls, see Floating Point Arithmetic: Issues and Limitations in the Python documentation.
Almost all platforms represent Python float values as 64-bit (double-precision) values, according to the IEEE 754 standard. In that case, a floating-point number’s maximum value is approximately 1.8 × 10^308 . Python represents any number that overflows this maximum as infinity, which it displays as inf :
The closest a nonzero number can be to zero is approximately 5.0 × 10^-324 . Anything closer to zero than that is effectively considered to be zero:
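A minimal sketch of these limits, assuming a platform that uses IEEE 754 double precision for float, might look like this:

```python
import math
import sys

largest = sys.float_info.max    # Largest finite float
overflowed = 1.8e308            # Too large, so it becomes infinity
smallest_subnormal = 5e-324     # Smallest positive (denormalized) float
underflowed = 1e-325            # Too small, so it becomes zero

print(largest)
print(overflowed, math.isinf(overflowed))
print(smallest_subnormal, underflowed)
```

The smallest positive value corresponds to the denormalized minimum of 2^-1074 for double precision.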
Python internally represents floating-point numbers as binary (base-2) fractions. Most decimal fractions can’t be represented exactly as binary fractions. So, in most cases, the internal representation of a floating-point number is an approximation of its actual value.
In practice, the difference between the actual and represented values is small and should be manageable. However, check out Make Python Lie to You for some challenges you should be aware of.
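A classic illustration of this approximation is the sum of 0.1 and 0.2. The sketch below also shows math.isclose() as one common way to compare floats with a tolerance:

```python
import math

total = 0.1 + 0.2

exactly_equal = total == 0.3              # Exact comparison fails
close_enough = math.isclose(total, 0.3)   # Comparison with a tolerance

print(total, exactly_equal, close_enough)
```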
The built-in float type has a few methods and attributes which can be useful in some situations. Here’s a quick summary of them:
Method | Description |
.as_integer_ratio() | Returns a pair of integers whose ratio is exactly equal to the original float |
.is_integer() | Returns True if the float instance is finite with integral value, and False otherwise |
.hex() | Returns a representation of a floating-point number as a hexadecimal string |
.fromhex(string) | Builds the float from a hexadecimal string |
The .as_integer_ratio() method on a float value returns a pair of integers whose ratio equals the original number. You can use this method in scientific computations that require high precision. In these situations, you may need to avoid precision loss due to floating-point rounding errors.
For example, say that you need to perform computations with the gravitational constant:
With this exact ratio, you can perform calculations and prevent floating-point errors that may alter the results of your research.
The .is_integer() method allows you to check whether a given float value is an integer:
When the number after the decimal point is 0 , the .is_integer() method returns True . Otherwise, it returns False .
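Here is a hedged sketch of both methods. The value used for the gravitational constant is an assumption made for illustration, and the fractions module is used only to show that the ratio is exact:

```python
from fractions import Fraction

# Assumed value of the gravitational constant, for illustration only.
G = 6.67430e-11

numerator, denominator = G.as_integer_ratio()
exact = Fraction(numerator, denominator)  # Exact rational form of the float

half_ratio = (0.5).as_integer_ratio()
whole = (42.0).is_integer()
not_whole = (42.5).is_integer()

print(exact == Fraction(G))
print(half_ratio, whole, not_whole)
```

Note that the ratio is exact with respect to the stored binary value of the float, which may itself be an approximation of the decimal literal you typed.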
Finally, the .hex() and .fromhex() methods allow you to work with floating-point values using a hexadecimal representation:
The .hex() method returns a string that represents the target float value as a hexadecimal value. Note that .hex() is an instance method. The .fromhex() method takes a string that represents a floating-point number as an argument and builds an actual float number from it.
In both methods, the hexadecimal string has the following format:
In this template, apart from the integer identifier, the components are optional. Here’s what they mean:
- sign defines whether the number is positive or negative. It may be either + or - . Only the - sign is required because + is the default.
- "0x" is the hexadecimal prefix.
- integer is a string of hexadecimal digits representing the whole part of the float number.
- "." is a dot that separates the whole and fractional parts.
- fraction is a string of hexadecimal digits representing the fractional part of the float number.
- "p" allows for adding an exponent value.
- exponent is a decimal integer with an optional leading sign.
With these components, you’ll be able to create valid hexadecimal strings to process your floating-point numbers with the .hex() and .fromhex() methods.
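As a brief sketch of this format in use, the following example converts floats to hexadecimal strings and back. The input values are arbitrary:

```python
one_hex = (1.0).hex()                    # Instance method on a float

# 0x1.8 is 1.5 in decimal, and p1 multiplies by 2 to the power of 1.
three = float.fromhex("0x1.8p1")

# A sign, a zero integer part, and a fraction of 0x0.8, which is 0.5.
minus_half = float.fromhex("-0x0.8p0")

# Converting to hexadecimal and back preserves the value exactly.
value = 3.14
round_trip = float.fromhex(value.hex())

print(one_hex, three, minus_half, round_trip == value)
```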
The built-in float() function provides another way to create floating-point values. When you call float() with no argument, then you get 0.0 :
Again, this feature of float() allows you to use it as a factory function.
The float() function also helps you convert other data types into float , provided that they’re valid numeric values:
In these examples, you first use float() to convert an integer number into a float. Then, you convert a string into a float. Again, with strings, you need to make sure that the input string is a valid numeric value. Otherwise, you get a ValueError exception.
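These conversions can be sketched as follows, with illustrative input values:

```python
default = float()                # No argument gives 0.0
from_int = float(42)
from_string = float("3.14")
from_scientific = float("1e3")   # Strings can use scientific notation

try:
    float("three")
    invalid_rejected = False
except ValueError:
    invalid_rejected = True

print(default, from_int, from_string, from_scientific, invalid_rejected)
```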
Complex Numbers
Python has a built-in type for complex numbers . Complex numbers are composed of real and imaginary parts. They have the form a + bi , where a and b are real numbers, and i is the imaginary unit . In Python, you’ll use a j instead of an i . For example:
In this example, the argument to type() may look like an expression. However, it’s a literal of a complex number in Python. If you pass the literal to the type() function, then you’ll get the complex type back.
Note: To dive deeper into complex numbers, check out the Simplify Complex Numbers With Python tutorial.
In the following sections, you’ll learn the basics of creating complex numbers in Python. You’ll also explore the methods of this data type.
In Python, you can define complex numbers using literals that look like a + bj , where a is the real part, and bj is the imaginary part:
As you can conclude from these examples, there are many ways to create complex numbers using literals. The key is that you need to use the j letter in one of the components. Note that the j can’t be used alone. If you try to do so, you get a NameError exception because Python interprets a bare j as a variable name rather than a literal. Instead, you need to write 1j .
In Python, the complex type has a single method called .conjugate() . When you call this method on a complex number, you get the conjugate:
The conjugate() method flips the sign of the imaginary part, returning the complex conjugate .
You can also use the built-in complex() function to create complex numbers by providing the real and imaginary parts as arguments:
When you call complex() with no argument, you get 0j . If you call the function with a single argument, that argument is the real part, and the imaginary part will be 0j . If you want only the imaginary part, you can pass 0 as the first argument. Note that you can also use negative numbers. In general, you can use integers and floating-point numbers as arguments to complex() .
You can also use complex() to convert strings to complex numbers:
To convert strings into complex numbers, you must provide a string that follows the format of complex numbers. For example, you can’t have spaces between the components. If you add spaces, then you get a ValueError exception.
Finally, note that you can’t use strings to provide the imaginary part of complex numbers. If you do that, then you get a TypeError exception.
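The following sketch summarizes the ways of creating complex numbers discussed in this section, along with .conjugate() . The specific values are illustrative:

```python
z = 3 + 4j                   # Complex literal
conjugate = z.conjugate()    # Flips the sign of the imaginary part
magnitude = abs(z)

origin = complex()           # No arguments
built = complex(1, -2)       # Real and imaginary parts
parsed = complex("1+2j")     # String without spaces

try:
    complex("1 + 2j")        # Spaces between components aren't allowed
    spaces_rejected = False
except ValueError:
    spaces_rejected = True

try:
    complex(1, "2")          # The imaginary part can't be a string
    string_part_rejected = False
except TypeError:
    string_part_rejected = True

print(z.real, z.imag, conjugate, magnitude)
print(origin, built, parsed, spaces_rejected, string_part_rejected)
```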
Strings and Characters
In Python, strings are sequences of character data that you can use to represent and store textual data. The string type in Python is called str :
In this example, the argument to type() is a string literal that you commonly create using double quotes to enclose some text.
Note: Check out the Strings and Character Data in Python tutorial to dive deeper into Python strings.
In the following sections, you’ll learn the basics of how to create, use, format, and manipulate strings in Python.
You can also use literals to create strings. To build a single-line string literal, you can use double ( "" ) or single quotes ( '' ) and, optionally, a sequence of characters in between them. All the characters between the opening and closing quotes are part of the string:
Python’s strings can contain as many characters as you need. The only limit is your computer’s memory.
You can define empty strings by using the quotes without placing characters between them:
An empty string doesn’t contain any characters, so when you use the built-in len() function with an empty string as an argument, you get 0 as a result.
There is yet another way to delimit strings in Python. You can create triple-quoted string literals, which can be delimited using either three single quotes or three double quotes. Triple-quoted strings are commonly used to build multiline string literals. However, you can also use them to create single-line literals:
Even though you can use triple-quoted strings to create single-line string literals, the main use case of them would be to create multiline strings. In Python code, probably the most common use case for these string literals is when you need to provide docstrings for your packages, modules , functions , classes , and methods .
What if you want to include a quote character as part of the string itself? Your first impulse might be to try something like this:
As you can see, that doesn’t work so well. The string in this example opens with a single quote, so Python assumes the next single quote—the one in parentheses—is the closing delimiter. The final single quote is then a stray, which causes the syntax error shown.
If you want to include either type of quote character within the string, then you can delimit the string with the other type. In other words, if a string is to contain a single quote, delimit it with double quotes and vice versa:
In these examples, your first string includes a single quote as part of the text. To do this, you use double quotes to delimit the literal. In the second example, you do the opposite.
Sometimes, you want Python to interpret a character or sequence of characters within a string differently. This may occur in one of two ways. You may want to:
- Apply special meaning to characters
- Suppress special character meaning
You can accomplish these goals by using a backslash ( \ ) character to indicate that the characters following it should be interpreted specially. The combination of a backslash and a specific character is called an escape sequence . That’s because the backslash causes the subsequent character to escape its usual meaning.
You already know that if you use single quotes to delimit a string, then you can’t directly embed a single quote character as part of the string because, for that string, the single quote has a special meaning —it terminates the string. You can eliminate this limitation by using double quotes to delimit the string.
Alternatively, you can escape the quote character using a backslash:
In this example, the backslash escapes the single quote character by suppressing its usual meaning. Now, Python knows that your intention isn’t to terminate the string but to embed the single quote.
The following is a table of escape sequences that cause Python to suppress the usual special interpretation of a character in a string:
Character | Usual Interpretation | Escape Sequence | Escaped Interpretation |
' | Delimit a string literal | \' | Literal single quote ( ' ) character |
" | Delimit a string literal | \" | Literal double quote ( " ) character |
<newline> | Terminates the input line | \<newline> | Newline is ignored |
\ | Introduces an escape sequence | \\ | Literal backslash ( \ ) character |
You already have an idea of how the first two escape sequences work. Now, how does the newline escape sequence work? Usually, a newline character terminates a physical line of input. So, pressing Enter in the middle of a string will cause an error:
When you press Enter after typing Hello , you get a SyntaxError . If you need to break up a string over more than one line, then you can include a backslash before each new line:
By using a backslash before pressing enter, you make Python ignore the new line and interpret the whole construct as a single line.
Finally, sometimes you need to include a literal backslash character in a string. If that backslash doesn’t precede a character with a special meaning, then you can insert it right away:
In this example, the character after the backslash doesn’t match any known escape sequence, so Python inserts the actual backslash for you. Note how the representation of the resulting string doubles the backslash. Even though this example works, recent Python versions emit a warning for unrecognized escape sequences, so the best practice is to always double the backslash when you need this character in a string.
However, you may have the need to include a backslash right before a character that makes up an escape sequence:
Because the sequence \" matches a known escape sequence, your string fails with a SyntaxError . To avoid this issue, you can double the backslash:
In this update, you double the backslash to escape the character and prevent Python from raising an error.
Note: When you use the built-in print() function to print a string that includes an escaped backslash, then you won’t see the double backslash in the output:
In this example, the output only displays one backslash, producing the desired effect.
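The escape sequences that suppress a character's special meaning can be sketched together like this. The sample strings are made up for the example:

```python
# Two equivalent ways of embedding a single quote.
with_double_quotes = "This string contains a single quote (') character"
with_escape = 'This string contains a single quote (\') character'

# Doubled backslashes produce single literal backslashes.
path = "C:\\Users\\Public"

# A backslash at the end of the line makes Python ignore the newline.
joined = "Hello, \
World!"

print(with_double_quotes == with_escape)
print(path, len(path))
print(joined)
```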
Up to this point, you’ve learned how to suppress the meaning of a given character by escaping it. Suppose you need to create a string containing a tab character. Some text editors may allow you to insert a tab character directly into your code. However, this is considered a poor practice for several reasons:
- Computers can distinguish between tabs and a sequence of spaces, but human beings can’t because these characters are visually indistinguishable.
- Some text editors automatically eliminate tabs by expanding them to an appropriate number of spaces.
- Some Python REPL environments will not insert tabs into code.
In Python, you can specify a tab character by the \t escape sequence:
The \t escape sequence changes the usual meaning of the letter t . Instead, Python interprets the combination as a tab character.
Here is a list of escape sequences that cause Python to apply special meaning to some characters instead of interpreting them literally:
Escape Sequence | Escaped Interpretation |
\a | ASCII Bell ( BEL ) character |
\b | ASCII Backspace ( BS ) character |
\f | ASCII Formfeed ( FF ) character |
\n | ASCII Linefeed ( LF ) character |
\N{<name>} | Character from Unicode database with given <name> |
\r | ASCII Carriage return ( CR ) character |
\t | ASCII Horizontal tab ( TAB ) character |
\uxxxx | Unicode character with 16-bit hex value xxxx |
\Uxxxxxxxx | Unicode character with 32-bit hex value xxxxxxxx |
\v | ASCII Vertical tab ( VT ) character |
\ooo | Character with octal value ooo |
\xhh | Character with hex value hh |
Of these escape sequences, the newline or linefeed character ( \n ) is probably the most popular. This sequence is commonly used to create nicely formatted text outputs.
Here are a few examples of the escape sequences in action:
These escape sequences are typically useful when you need to insert characters that aren’t readily generated from the keyboard or aren’t easily readable or printable.
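A few of these sequences are sketched below with illustrative strings. The example assumes nothing beyond the escape sequences listed in the table:

```python
tabbed = "a\tb"                         # Tab between two letters
lines = "first\nsecond"                 # Linefeed between two words
arrow_by_name = "\N{RIGHTWARDS ARROW}"  # Lookup by Unicode name
arrow_by_code = "\u2192"                # Same character by 16-bit hex value
letter_hex = "\x41"                     # Hex value 41 is the letter A
letter_octal = "\101"                   # Octal value 101 is also the letter A

print(len(tabbed))
print(lines)
print(arrow_by_name == arrow_by_code, letter_hex, letter_octal)
```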
A raw string is a string that doesn’t translate the escape sequences. Any backslash characters are left in the string.
Note: To learn more about raw strings, check out the What Are Python Raw Strings? tutorial.
To create a raw string, you can precede the literal with an r or R :
The raw string suppresses the meaning of the escape sequence and presents the characters as they are. This behavior comes in handy when you’re creating regular expressions because it allows you to use several different characters that may have special meanings without restrictions.
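As a sketch of the difference, the example below compares a regular string with a raw string and then uses a raw string in a regular expression. The pattern and sample text are hypothetical:

```python
import re

regular = "a\tb"   # Three characters: a, tab, b
raw = r"a\tb"      # Four characters: a, backslash, t, b

# Raw strings keep regular expression syntax free of doubled backslashes.
pattern = re.compile(r"\d+\.\d+")
match = pattern.search("Python version 3.12 is installed")
version = match.group() if match else None

print(len(regular), len(raw), version)
```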
Python has another type of string literal called formatted strings or f-strings for short. F-strings allow you to interpolate values into your strings and format them as you need.
Note: To dive deeper into f-strings, check out the Python’s F-String for String Interpolation and Formatting tutorial.
To build f-string literals, you must prepend an f or F letter to the string literal. Because the idea behind f-strings is to interpolate values and format them into the final string, you need to use something called a replacement field in your string literal. You create these fields using curly brackets.
Here’s a quick example of an f-string literal:
In this example, you interpolate the variable name into your string using an f-string literal and a replacement field.
You can also use f-strings to format the interpolated values. To do that, you can use format specifiers that use the syntax defined in Python’s string format mini-language . For example, here’s how you can present numeric values using a currency format:
Inside the replacement field, you have the variable you want to interpolate and the format specifier, which is the string that starts with a colon ( : ). In this example, the format specifier defines a floating-point number with two decimal places.
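Both uses of f-strings can be sketched as follows. The variable names and values are illustrative assumptions:

```python
name = "Jane"
greeting = f"Hello, {name}!"           # Plain interpolation

income = 1234.5
# The format specifier adds a thousands separator and two decimal places.
formatted = f"Income: ${income:,.2f}"

print(greeting)
print(formatted)
```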
Python’s str data type is probably the built-in type with the most available methods . In fact, you’ll find methods for most string processing operations. Here’s a summary of the methods that perform some string processing and return a transformed string object:
Method | Description |
.capitalize() | Converts the first character to uppercase and the rest to lowercase |
.casefold() | Converts the string into lowercase using case folding, which suits caseless comparisons |
.center(width[, fillchar]) | Centers the string between width using fillchar |
.encode(encoding, errors) | Encodes the string using the specified encoding |
.expandtabs(tabsize) | Replaces tab characters with spaces according to tabsize |
.format(*args, **kwargs) | Interpolates and formats the specified values |
.format_map(mapping) | Interpolates and formats the specified values using a dictionary |
.join(iterable) | Joins the items in an iterable with the string as a separator |
.ljust(width[, fillchar]) | Returns a left-justified version of the string |
.rjust(width[, fillchar]) | Returns a right-justified version of the string |
.lower() | Converts the string into lowercase |
.strip([chars]) | Trims the string by removing chars from the beginning and end |
.lstrip([chars]) | Trims the string by removing chars from the beginning |
.rstrip([chars]) | Trims the string by removing chars from the end |
.removeprefix(prefix) | Removes prefix from the beginning of the string |
.removesuffix(suffix) | Removes suffix from the end of the string |
.replace(old, new[, count]) | Returns a string where the old substring is replaced with new |
.swapcase() | Converts lowercase letters to uppercase letters and vice versa |
.title() | Converts the first character of each word to uppercase and the rest to lowercase |
.upper() | Converts a string into uppercase |
.zfill(width) | Fills the string with a specified number of zeroes at the beginning |
All the above methods allow you to perform a specific transformation on an existing string. In all cases, you get a new string as a result:
As you can see, the methods in these examples perform a specific transformation on the original string and return a new string object.
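A handful of these methods are sketched below on made-up strings. The example assumes Python 3.9 or later for .removeprefix() :

```python
original = "hello, world"

capitalized = original.capitalize()
titled = original.title()
replaced = original.replace("world", "Python")

stripped = "  padded  ".strip()
centered = "hi".center(6, "*")
zero_filled = "42".zfill(5)
joined = "-".join(["a", "b", "c"])
without_prefix = "test_login".removeprefix("test_")

# Strings are immutable, so the original is unchanged.
print(original, capitalized, titled, replaced)
print(stripped, centered, zero_filled, joined, without_prefix)
```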
You’ll also find that the str class has several Boolean-valued methods or predicate methods:
Method | Result |
.endswith(suffix) | True if the string ends with the specified suffix, False otherwise |
.startswith(prefix) | True if the string starts with the specified prefix, False otherwise |
.isalnum() | True if all characters in the string are alphanumeric, False otherwise |
.isalpha() | True if all characters in the string are letters, False otherwise |
.isascii() | True if the string is empty or all characters in the string are ASCII, False otherwise |
.isdecimal() | True if all characters in the string are decimals, False otherwise |
.isdigit() | True if all characters in the string are digits, False otherwise |
.isidentifier() | True if the string is a valid Python name, False otherwise |
.islower() | True if all characters in the string are lowercase, False otherwise |
.isnumeric() | True if all characters in the string are numeric, False otherwise |
.isprintable() | True if all characters in the string are printable, False otherwise |
.isspace() | True if all characters in the string are whitespaces, False otherwise |
.istitle() | True if the string follows title case, False otherwise |
.isupper() | True if all characters in the string are uppercase, False otherwise |
All these methods allow you to check for various conditions in your strings. Here are a few demonstrative examples:
In these examples, the methods check for specific conditions in the target string and return a Boolean value as a result.
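The sketch below collects a few predicate calls in a dictionary so you can inspect the results. The sample strings are arbitrary:

```python
checks = {
    "endswith": "report.pdf".endswith(".pdf"),
    "isalnum": "Python3".isalnum(),
    "isalpha": "Python3".isalpha(),         # The digit makes this False
    "isdigit": "42".isdigit(),
    "valid_name": "total_count".isidentifier(),
    "invalid_name": "2fast".isidentifier(),  # Names can't start with a digit
    "istitle": "Hello World".istitle(),
    "isspace": " \t\n".isspace(),
    "empty_isascii": "".isascii(),
}

for name, result in checks.items():
    print(f"{name}: {result}")
```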
Finally, you’ll find a few other methods that allow you to run several other operations on your strings:
Method | Description |
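.count(sub[, start[, end]]) | Returns the number of occurrences of a substring |
.find(sub[, start[, end]]) | Searches the string for a specified value and returns the position where it was found, or -1 if it wasn’t found |
.rfind(sub[, start[, end]]) | Searches the string for a specified value and returns the last position where it was found, or -1 if it wasn’t found |
.index(sub[, start[, end]]) | Searches the string for a specified value and returns the position where it was found, raising a ValueError if it wasn’t found |
.rindex(sub[, start[, end]]) | Searches the string for a specified value and returns the last position where it was found, raising a ValueError if it wasn’t found |
.split(sep=None, maxsplit=-1) | Splits the string at the specified separator and returns a list |
.splitlines(keepends=False) | Splits the string at line breaks and returns a list |
.partition(sep) | Splits the string at the first occurrence of sep |
.rpartition(sep) | Splits the string at the last occurrence of sep |
.rsplit(sep=None, maxsplit=-1) | Splits the string at the specified separator, starting from the right, and returns a list |
.maketrans(x[, y[, z]]) | Returns a translation table to be used in translations |
.translate(table) | Returns a translated string |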
The first method counts the number of repetitions of a substring in an existing string. Then, you have four methods that help you find substrings in a string.
The .split() method is especially useful when you need to split a string into a list of individual strings using a given character as a separator, which defaults to whitespaces. You can also use .partition() or .rpartition() if you need to divide the string in exactly two parts:
In these toy examples, you’ve used the .split() method to build a list of words from a sentence. Note that by default, the method uses whitespace characters as separators. You also used .partition() and .rpartition() to separate out the first and last number from a string with numbers.
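Here is a minimal sketch of splitting, partitioning, and searching. The sample strings are assumptions made for the example:

```python
sentence = "Python is a great language"
words = sentence.split()          # Whitespace is the default separator

numbers = "1-2-3-4"
first, _, rest = numbers.partition("-")         # Split at the first dash
beginning, _, last = numbers.rpartition("-")    # Split at the last dash

position = sentence.find("great")    # Index of the first match
missing = sentence.find("awful")     # -1 when there's no match

try:
    sentence.index("awful")          # .index() raises instead
    index_raised = False
except ValueError:
    index_raised = True

print(words)
print(first, rest, beginning, last)
print(position, missing, index_raised)
```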
The .maketrans() and .translate() methods are nice tools for playing with strings. For example, say that you want to implement the Caesar cipher algorithm. This algorithm allows for basic text encryption by shifting the alphabet by a number of letters. For example, if you shift the letter a by three, then you get the letter d , and so on.
The following code implements cipher() , a function that takes a string and rotates each of its lowercase letters by three:
In this example, you use .maketrans() to create a translation table that matches the lowercase alphabet to a shifted alphabet. Then, you apply the translation table to a string using the .translate() method.
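One possible version of such a function is sketched below. It's a minimal illustration that only handles lowercase ASCII letters, and the helper names alphabet, shifted, and table are assumptions:

```python
def cipher(text):
    """Shift each lowercase letter in text three places forward."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    shifted = alphabet[3:] + alphabet[:3]  # "defghijklmnopqrstuvwxyzabc"
    table = str.maketrans(alphabet, shifted)
    # Characters that aren't in the table are left unchanged.
    return text.translate(table)


encrypted = cipher("python")
print(encrypted)
```

Because the table maps only lowercase letters, uppercase letters, digits, and punctuation pass through untouched.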
Python’s strings are sequences of characters. Like other built-in sequences, such as lists and tuples, strings support a set of operations that are known as common sequence operations. The table below is a summary of all the operations that are common to most sequence types in Python:
Operation | Example | Result |
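Length | len(s) | The length of s |
Indexing | s[i] | The item at index i |
Slicing | s[i:j] | A slice of s from index i to j |
Slicing | s[i:j:k] | A slice of s from index i to j with step k |
Minimum | min(s) | The smallest item of s |
Maximum | max(s) | The largest item of s |
Membership | x in s | True if an item of s is equal to x , else False |
Membership | x not in s | False if an item of s is equal to x , else True |
Concatenation | s + t | The concatenation of s and t |
Repetition | s * n or n * s | The repetition of s a number of times specified by n |
Index | s.index(x[, i[, j]]) | The index of the first occurrence of x in s |
Count | s.count(x) | The total number of occurrences of x in s |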
Sometimes, you need to determine the number of characters in a string. In this situation, you can use the built-in len() function:
When you call len() with a string as an argument, you get the number of characters in the string at hand.
Another common operation you’d run on strings is retrieving a single character or a substring from an existing string. In these situations, you can use indexing and slicing, respectively:
To retrieve a character from an existing string, you use the indexing operator [index] with the index of the target character. Note that indices are zero-based, so the first character lives at index 0 .
To retrieve a slice or substring from an existing string, you use the slicing operator with the appropriate indices. In the example above, you don’t provide the start index i , so Python assumes that you want to start from the beginning of the string. Then, you give the end index j to tell Python where to stop the slicing.
You can take a leap and try the rest of the operations by yourself. It will be a great learning exercise!
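As a starting point for that exercise, the sketch below applies each common sequence operation to a sample string. The string itself is arbitrary:

```python
greeting = "Hello, World!"

length = len(greeting)
first_char = greeting[0]          # Indices start at 0
last_char = greeting[-1]          # Negative indices count from the end
hello = greeting[:5]              # Slice from the beginning up to index 5
every_other = greeting[::2]       # Slice with a step of 2

smallest = min("hello")           # Compared by Unicode code point
largest = max("hello")
contains_lo = "lo" in greeting
combined = greeting + " Bye!"
line = "-" * 10
w_index = greeting.index("W")
l_count = greeting.count("l")

print(length, first_char, last_char, hello, every_other)
print(smallest, largest, contains_lo, combined, line, w_index, l_count)
```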
When it comes to creating and working with strings, you have two built-in functions that can help you out and make your life easier: str() and repr() .
The built-in str() function allows you to create new strings and also convert other data types into strings:
In these examples, you use the str() function to convert objects from different built-in types into strings. In the first example, you use the function to create an empty string. In the other examples, you get strings consisting of the object’s literals between quotes, which provide user-friendly representations of the objects.
At first glance, these results may not seem useful. However, there are use cases where you need to use str() .
For example, say that you have a list of numeric values and want to join them using the str.join() method. This method only accepts iterables of strings, so you need to convert the numbers:
If you try to pass a list of numeric values to .join() , then you get a TypeError exception because the function only joins strings. To work around this issue, you use a generator expression to convert each number to its string representation.
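This situation can be sketched as follows, using an illustrative list of numbers:

```python
numbers = [1, 2, 3, 4]

try:
    ", ".join(numbers)           # Fails because the items aren't strings
    join_failed = False
except TypeError:
    join_failed = True

# Convert each number to a string before joining.
joined = ", ".join(str(number) for number in numbers)

print(join_failed, joined)
```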
Behind the str() function, you’ll have the .__str__() special method . In other words, when you call str() , Python automatically calls the .__str__() special method on the underlying object. You can use this special method to support str() in your own classes.
Consider the following Person class:
In this class, you have two instance attributes, .name and .age . Then, you have the .__str__() special method to provide a user-friendly string representation for your class.
Here’s how this class works:
In this code snippet, you create an instance of Person . Then, you call str() using the object as an argument. As a result, you get a descriptive message back, which is the user-friendly string representation of your class.
Note: To learn more about objects’ string representations in Python, check out the When Should You Use .__repr__() vs .__str__() in Python? tutorial.
Similarly, when you pass an object to the built-in repr() function, you get a developer-friendly string representation of the object itself:
In the case of many built-in types, such as numbers, lists, and dictionaries, the string representation you get with repr() is the same as the one you get with the str() function. Strings are a notable exception because repr() includes the enclosing quotes. In all of these cases, the repr() result is a literal of the object, and you can directly use it to re-create the object at hand.
Ideally, you should be able to re-create the current object using this representation. To illustrate, go ahead and update the Person class:
The .__repr__() special method allows you to provide a developer-friendly string representation for your class:
You should be able to copy and paste the resulting representation to re-create the object. That’s why this string representation is said to be developer-friendly.
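A minimal sketch of such a Person class, with both special methods in place, could look like the following. The exact wording of the returned strings and the sample data are assumptions made for illustration:

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        # User-friendly representation, used by str() and print().
        return f"I'm {self.name}, and I'm {self.age} years old."

    def __repr__(self):
        # Developer-friendly representation that mirrors the constructor call.
        return f"{type(self).__name__}(name={self.name!r}, age={self.age!r})"


jane = Person("Jane Doe", 25)
user_text = str(jane)
developer_text = repr(jane)

# The repr() output is valid Python that re-creates an equivalent object.
# eval() is used here only to demonstrate this with trusted input.
clone = eval(developer_text, {"Person": Person})

print(user_text)
print(developer_text)
print(clone.name, clone.age)
```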
Bytes and Bytearrays
Bytes are immutable sequences of single bytes. In Python, the bytes class allows you to build sequences of bytes. This data type is commonly used for manipulating binary data, encoding and decoding text, processing file input and output, and communicating through networks.
Python also has a bytearray class as a mutable counterpart to bytes objects:
In the following sections, you’ll learn the basics of how to create and work with bytes and bytearray objects in Python.
To create a bytes literal, you’ll use a syntax that’s largely the same as that for string literals. The difference is that you need to prepend a b to the string literal. As with string literals, you can use different types of quotes to define bytes literals:
There is yet another difference between string literals and bytes literals. To define bytes literals, you can only use ASCII characters. If you need to insert byte values above 127, then you have to use the appropriate escape sequence:
In this example, \xc3\xb1 is the escape sequence for the letter ñ in the Spanish word "España" . Note that if you try to use the ñ directly, you get a SyntaxError .
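The following sketch shows the escaped literal in action. The variable name `spain` is arbitrary, and compile() is used only to trigger the SyntaxError without breaking the rest of the script:

```python
# The two escaped bytes are the UTF-8 encoding of the letter ñ.
spain = b"Espa\xc3\xb1a"
print(spain)                  # b'Espa\xc3\xb1a'
print(spain.decode("utf-8"))  # España
print(len(spain))             # 7, because ñ takes two bytes in UTF-8

# A non-ASCII character inside a bytes literal is rejected by the parser.
try:
    compile('b"España"', "<example>", "eval")
except SyntaxError as error:
    print(f"SyntaxError: {error.msg}")
```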
The built-in bytes() function provides another way to create bytes objects. With no arguments, the function returns an empty bytes object:
You can use the bytes() function to convert string literals to bytes objects:
In these examples, you first use bytes() to convert a string into a bytes object. Note that for this to work, you need to provide the appropriate character encoding . In this example, you use the UTF-8 encoding. If you try to convert a string literal without providing the encoding, then you get a TypeError exception.
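A short sketch of these conversions, using an arbitrary sample string, could look like this:

```python
# With no arguments, bytes() returns an empty bytes object.
empty = bytes()
print(empty)  # b''

# Converting a string requires an explicit encoding.
greeting = bytes("Hello, World!", encoding="utf-8")
print(greeting)  # b'Hello, World!'

# Without the encoding, the conversion fails.
try:
    bytes("Hello, World!")
except TypeError as error:
    print(f"TypeError: {error}")
```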
You can also use bytes() with an iterable of integers in the range 0 to 255, where each number is the value of one byte:
In this example, each number in the list you use as an argument to bytes() is the ASCII code for a specific letter. For example, 65 is the code for A , 66 for B , and so on. For ASCII characters, this value matches the Unicode code point, which you can get using the built-in ord() function.
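Here's a small sketch with the values 65, 66, and 67 mentioned above. It also shows that values outside the range 0 to 255 are rejected:

```python
letters = bytes([65, 66, 67])
print(letters)   # b'ABC'
print(ord("A"))  # 65

# Every item must fit in a single byte.
try:
    bytes([256])
except ValueError as error:
    print(f"ValueError: {error}")
```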
Python doesn’t have dedicated literal syntax for bytearray objects. To create them, you’ll always use the class constructor bytearray() , which is also known as a built-in function in Python. Here are a few examples of how to create bytearray objects using this function:
In the first example, you call bytearray() without an argument to create an empty bytearray object. In the second example, you call the function with an integer as an argument. In this case, you create a bytearray with five zero-filled items.
Next, you use a list of code points to create a bytearray . This call works the same as with bytes objects. Finally, you use a bytes literal to build up the bytearray object.
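The four calls described above can be sketched as follows. The variable names and the sample values are assumptions:

```python
empty = bytearray()                  # No argument: an empty bytearray
zeros = bytearray(5)                 # Integer argument: five zero-filled items
from_ints = bytearray([65, 66, 67])  # Iterable of integers
from_literal = bytearray(b"Hello")   # Bytes literal

print(empty)         # bytearray(b'')
print(zeros)         # bytearray(b'\x00\x00\x00\x00\x00')
print(from_ints)     # bytearray(b'ABC')
print(from_literal)  # bytearray(b'Hello')

# Unlike bytes objects, bytearray objects can be modified in place.
from_literal[0] = ord("J")
print(from_literal)  # bytearray(b'Jello')
```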
In Python, bytes and bytearray objects are quite similar to strings. Instead of being sequences of characters, bytes and bytearray objects are sequences of integer numbers, with values from 0 to 255 .
Because of their similarities with strings, the bytes and bytearray types support mostly the same methods as strings , so you won’t repeat them in this section. If you need detailed explanations of specific methods, then check out the Bytes and Bytearray Operations section in Python’s documentation.
Finally, both bytes and bytearray objects support the common sequence operations that you learned in the Common Sequence Operations on Strings section.
Boolean logic relies on the truth value of expressions and objects. The truth value of an expression or object can take one of two possible values: true or false . In Python, these two values are represented by True and False , respectively:
Both True and False are instances of the bool data type, which is built into Python. In the following sections, you’ll learn the basics about Python’s bool data type.
Python provides a built-in Boolean data type. Objects of this type may have one of two possible values: True or False . These values are defined as built-in constants with values of 1 and 0 , respectively. In practice, the bool type is a subclass of int . Therefore, True and False are also instances of int :
In Python, the bool type is a subclass of the int type. It has only two possible values, 0 and 1 , which map to the constants False and True .
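You can check this relationship yourself with a few built-in functions. The snippet below is a quick sketch, and the variable names are arbitrary:

```python
is_subclass = issubclass(bool, int)
print(is_subclass)            # True
print(isinstance(True, int))  # True
print(type(True))             # <class 'bool'>

# Because they're integers, True and False take part in arithmetic.
print(True == 1, False == 0)  # True True
total = sum([True, False, True])
print(total)                  # 2
```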
These constant values are also the literals of the bool type:
Boolean objects that are equal to True are truthy, and those equal to False are falsy. In Python, non-Boolean objects also have a truth value. In other words, Python objects are either truthy or falsy.
You can use the built-in bool() function to convert any Python object to a Boolean value. Internally, Python uses the following rules to identify falsy objects:
- Constants that are defined to be false: None and False
- The zero of any numeric type: 0 , 0.0 , 0j , Decimal(0) , Fraction(0, 1)
- Empty sequences and collections: '' , () , [] , {} , set() , range(0)
The rest of the objects are considered truthy in Python. You can use the built-in bool() function to explicitly learn the truth value of any Python object:
In these examples, you use bool() with arguments of different types. In each case, the function returns a Boolean value corresponding to the object’s truth value.
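As a sketch, you can run bool() over the falsy objects listed above and over a few truthy ones. The truthy samples are assumptions picked for illustration:

```python
from decimal import Decimal
from fractions import Fraction

falsy_values = [
    None, False,                                # Constants defined to be false
    0, 0.0, 0j, Decimal(0), Fraction(0, 1),     # The zero of any numeric type
    "", (), [], {}, set(), range(0),            # Empty sequences and collections
]
truthy_values = [1, -0.5, "0", " ", [0], {"key": None}]

falsy_results = [bool(value) for value in falsy_values]
truthy_results = [bool(value) for value in truthy_values]

print(any(falsy_results))   # False
print(all(truthy_results))  # True
```

Note that "0" and [0] are truthy because they're a non-empty string and a non-empty list, regardless of what they contain.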
Note: You rarely need to call bool() yourself. Instead, you can rely on Python calling bool() under the hood when necessary. For example, you can say if numbers: instead of if bool(numbers): to check whether numbers is truthy.
You can also use the bool() function with custom classes:
By default, all instances of custom classes are truthy. If you want to modify this behavior, you can use the .__bool__() special method. Consider the following update of your Point class:
The .__bool__() method returns False when both coordinates are equal to 0 and True otherwise. Here’s how your class works now:
Now, when both coordinates are 0 , you get False from calling bool() . For the rest of the points, you get True .
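A Point class with this behavior might be sketched like this. The attribute names .x and .y and the sample coordinates are assumptions:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __bool__(self):
        # False only for the origin, True for every other point
        return not (self.x == 0 and self.y == 0)


origin = Point(0, 0)
point = Point(2, 4)

print(bool(origin))  # False
print(bool(point))   # True

# The same method is used implicitly in conditionals.
if not origin:
    print("The point is at the origin")
```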
You’ve learned about the basic built-in data types that Python provides. These types are the building blocks of most Python programs. With them, you can represent numeric, textual, byte, and Boolean data.
In this tutorial, you’ve learned about:
- Python’s numeric types, such as int , float , and complex
- The str data type, which represents textual data in Python
- Boolean values with Python’s bool data type
With this knowledge, you’re ready to start using all of the basic data types that are built into Python.
About Leodanis Pozo Ramos
Leodanis is an industrial engineer who loves Python and software development. He's a self-taught Python developer with 6+ years of experience. He's an avid technical writer with a growing number of articles published on Real Python and other sites.
Most of us write numbers in Arabic form, i.e., 1, 2, 3, ..., 9. Some people write them differently, such as I, II, III, IV, ..., IX. No matter which representation is used, most human beings can understand at least the two types I mentioned. Unfortunately, the computer can't. The computer is the most stupid thing you will ever encounter in your life.
Modern computers are built from transistors. Whenever an electric current passes into a transistor, either an ON or an OFF status is established. Therefore the computer can only recognize two numbers, 0 for OFF and 1 for ON, which are referred to as bits. There is nothing in between Bit 0 and Bit 1 (e.g., Bit 0.5 doesn't exist). Hence computers can be said to be discrete machines. A number system that consists of only two digits is called binary. To distinguish the different numbering systems, the numbers humans use, i.e., 1, 2, 3, 4, ..., will be called decimal (since they are base-10 numbers) from now on.
How, then, can a computer understand numbers larger than 1? The answer is simple: 2 is simply 1+1 (like 10 = 9+1 for humans); the numbers are added and the overflow digit is carried over to the next position on the left. So (decimal) 2 is represented in binary as 10. To further illustrate the relationship, I have listed the numbers 0 to 9 in both systems for comparison:
Decimal | Binary |
0 | 0000 0000 |
1 | 0000 0001 |
2 | 0000 0010 |
3 | 0000 0011 |
4 | 0000 0100 |
5 | 0000 0101 |
6 | 0000 0110 |
7 | 0000 0111 |
8 | 0000 1000 |
9 | 0000 1001 |
You may ask why I always put 8 binary digits there. Well, the smallest addressable unit in the computer's memory is called a BYTE , which consists of 8 BITS. One byte allows up to 256 different combinations (2^8 = 256), i.e. the values 0 to 255. What happens when we have numbers greater than 255? The computer simply uses more bytes to hold the value: 2 bytes allow 65,536 combinations (2^16), i.e. the values 0 to 65,535, and so forth.
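If you want to check the table and the byte capacities yourself, here is a small sketch. It is written in Python rather than C++, purely for illustration, and the helper name to_byte_string is my own invention:

```python
def to_byte_string(number):
    """Format a value from 0 to 255 as two groups of four binary digits."""
    bits = format(number, "08b")
    return f"{bits[:4]} {bits[4:]}"


# Reproduce the decimal/binary table for 0 to 9.
for number in range(10):
    print(number, "|", to_byte_string(number))

# Number of combinations that fit in one and two bytes.
one_byte = 2 ** 8
two_bytes = 2 ** 16
print(one_byte, two_bytes)  # 256 65536

# 300 does not fit in one byte, so it is spread over two: 1 * 256 + 44.
parts = list((300).to_bytes(2, "big"))
print(parts)  # [1, 44]
```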
Not only does the computer not understand the (decimal) numbers you use, it doesn't even understand letters like "ABCDEFG...". The fact is, it doesn't care. Whatever letters you input into the computer, the computer just saves them and delivers them back to you when you instruct it to. It saves these letters in the same binary format as digits, in accordance with a pattern. On the PC (including DOS, Windows 95/98/NT, and UNIX), the pattern is called ASCII (pronounced ask-ee ), which stands for American Standard Code for Information Interchange.
In this format, the letter "A" is represented by "0100 0001", or, most often, referred to as decimal 65 in the ASCII table. The standard coding under ASCII is here . When comparing characters, the computer actually looks up the associated ASCII codes and compares the ASCII values instead of the characters. Therefore the letter "B", which has an ASCII value of 66, is greater than the letter "A", which has an ASCII value of 65.
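A quick way to see these codes is the following sketch, again in Python for illustration only. The variable names are arbitrary:

```python
code_a = ord("A")
code_b = ord("B")
print(code_a, format(code_a, "08b"))  # 65 01000001
print(code_b, format(code_b, "08b"))  # 66 01000010

# Characters compare by their codes, so "B" is greater than "A".
b_is_greater = "B" > "A"
print(b_is_greater)  # True
```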
The computer stores data in different formats or types . The number 10 can be stored as a numeric value, as in "10 dollars", or as characters, as in the address "10 Main Street" . So how can the computer tell? Once again, the computer doesn't care; it is your responsibility to ensure that you get the correct data out of it. (For illustration, character 10 and numeric 10 are represented by 0011-0001-0011-0000 and 0000-1010 respectively; you can see how different they are.) Different programming languages have different data types , although the fundamental ones are usually very similar.
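The two bit patterns quoted above can be reproduced with a short sketch (Python, for illustration only):

```python
# The characters "1" and "0" stored as two ASCII bytes.
text_bits = " ".join(format(byte, "08b") for byte in "10".encode("ascii"))
print(text_bits)  # 00110001 00110000

# The numeric value 10 stored in a single byte.
number_bits = format(10, "08b")
print(number_bits)  # 00001010
```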
C++ has many data types. The following are some basic data types you will encounter in these chapters. The sizes shown are typical of older 16-bit compilers; on most current platforms int occupies 4 bytes. Note that there are more complicated data types. You can even create your own data types. Some of these will be discussed later in the tutorial.
Data type | Size (bytes) | Range | Remarks |
char | 1 | ASCII -128 to 127 | |
unsigned char | 1 | ASCII 0 to 255 | including high ASCII chars |
int | 2 | -32768 to 32767 | integer |
unsigned (unsigned int) | 2 | 0 to 65535 | non-negative integer |
long int | 4 | ±2 billion | double-sized integer |
unsigned long int | 4 | 0 to 4 billion | non-negative long integer |
float | 4 | ±3.4e38 | 6 significant digits |
double | 8 | ±1.7e308 | 15 significant digits |
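The integer ranges in the table follow directly from the number of bytes. Here is a sketch of the arithmetic in Python (not C++); the helper names signed_range and unsigned_range are made up for this illustration, and the signed formula assumes the usual two's-complement layout:

```python
import sys


def signed_range(size_in_bytes):
    """Smallest and largest value of a two's-complement integer."""
    bits = size_in_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def unsigned_range(size_in_bytes):
    """Smallest and largest value of an unsigned integer."""
    return 0, 2 ** (size_in_bytes * 8) - 1


print(signed_range(2))    # (-32768, 32767)
print(unsigned_range(2))  # (0, 65535)
print(signed_range(4))    # (-2147483648, 2147483647)
print(unsigned_range(4))  # (0, 4294967295)

# Python's float is an 8-byte double on typical platforms.
print(sys.float_info.max)  # about 1.7e308
print(sys.float_info.dig)  # 15 significant decimal digits
```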
char is basically used to store alphanumerics (numbers are stored in character form). Recall that a character is stored as its ASCII representation on the PC. ASCII -128 to -1 do not exist, so char accommodates data from ASCII 0 (the null character) to ASCII 127 (the DEL character). The original C++ does not have a string data type (but string is available through the inclusion of a library to be discussed later). A string can be stored as a one-dimensional array (list) with a "null zero" (ASCII 0) stored in the last "cell" of the array. Unsigned char effectively accommodates the use of Extended ASCII characters, which represent most special characters like the copyright sign ©, the registered trademark sign ®, etc., plus some accented European letters. Both char and unsigned char are stored internally as integers, so they can effectively be compared (to be greater or less than).
Whenever you write a char (letter) in your program you must enclose it in single quotes. When you write strings (words or sentences) you must enclose them in double quotes. Otherwise C++ will treat these letters/words/sentences as tokens (to be discussed in Chapter 4). Remember that in C/C++, A, 'A', and "A" are all different. The first, A (without quotes), means a variable or constant (discussed in Chapter 4). The second, 'A' (in single quotes), means the character A, which occupies one byte of memory. The third, "A" (in double quotes), means a string containing the letter A followed by a null character, which occupies 2 bytes of memory (it will use more memory if stored in a variable/constant of bigger size). See these examples: letter = 'A'; cout << 'A'; cout << "10 Main Street";
int (integer) represents whole numbers, i.e. numbers without a fractional part. Since int has a relatively small range (up to 32767), whenever you need to store a value that could go beyond this limit, long int should be used instead. The beauty of using int is that, since it has no fractional part, its value is exact and calculations with int are exact as long as the result stays within range. However, note that dividing an int by another may result in truncation, e.g. int 10 / int 3 will result in 3, not 3.3333 (more on this will be discussed later).
float , on the other hand, can hold fractions. However, exact fractional numbers are generally not possible in computers, since they are discrete machines (they can only handle the digits 0 and 1, not 1.5 or 1.75 or anything in between 0 and 1). No matter how many digits your calculator can show, you cannot produce the result of 2/3 without rounding, truncating, or approximating. Mathematicians always write 2/3 instead of 0.66666... when they need the EXACT value.

Since the computer cannot produce exact fractions, the issue of significant digits comes into play. For most applications a certain number of significant digits is all you need. For example, when you talk about money, $99.99 is no different from $99.988888888888 (rounded to the nearest cent); when you talk about the wealth of Bill Gates, it makes little sense to say $56,123,456,789.95 instead of just saying approximately $56 billion (these figures are not real; I have no idea how much money Bill has, although I wish he would give me the roundings). As you can see from the above table, float has only 6 significant digits, so for some applications it may not be sufficient, especially in scientific calculations, in which case you may want to use double or even long double to handle the numbers.

There is another problem in using float/double . Since numbers are represented internally as binary values, whenever a fractional number is calculated or translated to/from binary there may be a rounding/truncation error. So if you have a float 0, add 0.01 to it 100 times, then subtract 1.00 from it ( see the codes here or get the executable codes here ), you will not get 0 as you should; rather, you will get a value close to zero, but not exactly zero. Using double or long double will reduce the error but will not eliminate it. However, as I mentioned earlier, the error may not matter in real life; it just means you need to exercise caution when programming with floating point numbers.
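The add-0.01-a-hundred-times experiment can be sketched in Python as well. Note that this is not the original linked program: Python's float is a double, so the leftover error is smaller than with a C++ float, but it is still there. The Fraction comparison is my own addition to show what an exact result looks like:

```python
from fractions import Fraction

# Add 0.01 one hundred times, then subtract 1.00.
total = 0.0
for _ in range(100):
    total += 0.01
residual = total - 1.0
print(residual)  # A tiny value close to zero, but not exactly zero

# The same experiment with exact fractions (1/100) leaves no error.
exact_total = sum(Fraction(1, 100) for _ in range(100))
exact_residual = exact_total - 1
print(exact_residual)  # 0
```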
There is another C++ data type I haven't included here: the bool (boolean) data type, which can only store a value of either 0 (false) or 1 (true). I will be using int (integer) to handle logical comparisons, which poses more of a challenge and offers more variety of use.
Escape sequences are not data types, but I feel I had better discuss them here. I mentioned earlier that you have to include a null zero at the end of a "string" when using an array of char to represent a string. The easiest way to do this is to write the escape sequence '\0', which is understood by C++ as null zero. The following are the escape sequences in C++:
Sequence | Meaning | Sequence | Meaning | Sequence | Meaning |
\a | Alarm | \t | Tab | \" | Double Quote |
\b | Backspace | \v | Vertical Tab | \000 | Octal Num |
\f | Form Feed | \\ | Backslash | \xhh | Hex number |
\n | New Line | \? | Question Mark | \0 | Null Zero |
\r | Carriage Return | \' | Single Quote | | |
Earlier I said you can create your own data types. Here I will show you how. In fact, you can not only create new data types but also create an alias for an existing data type. For example, say you are writing a program which deals with dollar values. Since dollar values have fractional parts, you have to use either the float or the double data type (e.g. assign the float data type to salary by writing float salary ). You can create an alias of the same data type, MONEY, and write MONEY salary. You do this by adding the following type definition to your program:
typedef double MONEY;
You can also create new data types. I will discuss more on this when we come to Arrays in Chapter 10. But the following illustrates how you create a new data type of array from a base data type:
Data types
Data type objects
Array types and conversions between types
NumPy supports a much greater variety of numerical types than Python does. This section shows which are available, and how to modify an array’s data-type.
NumPy numerical types are instances of numpy.dtype (data-type) objects, each having unique characteristics. Once you have imported NumPy using import numpy as np you can create arrays with a specified dtype using the scalar types in the numpy top-level API, e.g. numpy.bool , numpy.float32 , etc.
These scalar types can be passed as arguments to the dtype keyword that many NumPy functions or methods accept. For example:
Array types can also be referred to by character codes, for example:
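A minimal sketch of both ways of specifying a data type is shown below. The array contents and variable names are arbitrary sample values:

```python
import numpy as np

# Specify the dtype with a scalar type from the top-level API.
x = np.array([1, 2, 3], dtype=np.float32)
y = np.zeros(3, dtype=np.int8)

# Specify the dtype with a character code ('f' is single-precision float).
z = np.array([1, 2, 3], dtype="f")

print(x.dtype)  # float32
print(y.dtype)  # int8
print(z.dtype)  # float32
```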
See Specifying and constructing data types for more information about specifying and constructing data type objects, including how to specify parameters like the byte order.
To convert the type of an array, use the .astype() method. For example:
Note that, above, we could have used the Python float object as a dtype instead of numpy.float64 . NumPy knows that int refers to numpy.int_ , bool means numpy.bool , that float is numpy.float64 and complex is numpy.complex128 . The other data-types do not have Python equivalents.
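As a sketch of the conversion and of the Python-type equivalences described above (the sample array is an assumption):

```python
import numpy as np

z = np.arange(3, dtype=np.uint8)

# Convert with an explicit NumPy type or with the Python float object.
as_float64 = z.astype(np.float64)
as_python_float = z.astype(float)

print(as_float64)             # [0. 1. 2.]
print(as_python_float.dtype)  # float64

# Python types map onto NumPy dtypes.
print(np.dtype(int) == np.dtype(np.int_))            # True
print(np.dtype(complex) == np.dtype(np.complex128))  # True
```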
To determine the type of an array, look at the dtype attribute:
dtype objects also contain information about the type, such as its bit-width and its byte-order. The data type can also be used indirectly to query properties of the type, such as whether it is an integer:
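The following sketch inspects a dtype object for a sample 16-bit integer type. The choice of numpy.int16 is arbitrary:

```python
import numpy as np

d = np.dtype(np.int16)

print(d.name)       # int16
print(d.itemsize)   # 2 (bytes per element)
print(d.byteorder)  # '=' means native byte order

# Query properties of the type indirectly.
is_integer = np.issubdtype(d, np.integer)
is_floating = np.issubdtype(d, np.floating)
print(is_integer, is_floating)  # True False
```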
Numerical Data Types
There are 5 basic numerical types representing booleans ( bool ), integers ( int ), unsigned integers ( uint ), floating point ( float ), and complex . A basic numerical type name combined with a numeric bitsize defines a concrete type. The bitsize is the number of bits that are needed to represent a single value in memory. For example, numpy.float64 is a 64 bit floating point data type. Some types, such as numpy.int_ and numpy.intp , have differing bitsizes, dependent on the platforms (e.g. 32-bit vs. 64-bit CPU architectures). This should be taken into account when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.
Data Types for Strings and Bytes
In addition to numerical types, NumPy also supports storing unicode strings, via the numpy.str_ dtype ( U character code), null-terminated byte sequences via numpy.bytes_ ( S character code), and arbitrary byte sequences, via numpy.void ( V character code).
All of the above are fixed-width data types. They are parameterized by a width, in either bytes or unicode points, that a single data element in the array must fit inside. This means that storing an array of byte sequences or strings using this dtype requires knowing or calculating the sizes of the longest text or byte sequence in advance.
As an example, we can create an array storing the words "hello" and "world!" :
Here the data type is detected as a unicode string that is a maximum of 6 code points long, enough to store both entries without truncation. If we specify a shorter or longer data type, the string is either truncated or zero-padded to fit in the specified width:
We can see the zero-padding a little more clearly if we use the bytes data type and ask NumPy to print out the bytes in the array buffer:
Each entry is padded with two extra null bytes. Note however that NumPy cannot tell the difference between intentionally stored trailing nulls and padding nulls:
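The fixed-width behaviour described above can be sketched as follows. The widths 5 and 7 are sample values chosen to trigger truncation and padding:

```python
import numpy as np

# The width is detected from the longest entry: 6 code points.
words = np.array(["hello", "world!"])
print(words.dtype == np.dtype("U6"))  # True

# A narrower width truncates the longer entry.
truncated = np.array(["hello", "world!"], dtype="U5")
print(truncated)  # ['hello' 'world']

# With the bytes data type, the padding nulls are visible in the buffer.
raw = np.array(["hello", "world"], dtype="S7").tobytes()
print(raw)  # b'hello\x00\x00world\x00\x00'

# Intentional trailing nulls are indistinguishable from padding.
original = [b"hello\0\0", b"world"]
stored = np.array(original, dtype="S7")
print(stored[0])                 # b'hello'
print(stored[0] == original[0])  # False
```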
If you need to store and round-trip any trailing null bytes, you will need to use an unstructured void data type:
Advanced types, not listed above, are explored in section Structured arrays .
Relationship Between NumPy Data Types and C Data Types
NumPy provides both bit-sized type names and names based on the names of C types. Since the definitions of C types are platform dependent, the explicitly bit-sized names should be preferred to avoid platform-dependent behavior in programs using NumPy.
To ease integration with C code, where it is more natural to refer to platform-dependent C types, NumPy also provides type aliases that correspond to the C types for the platform. Some dtypes have trailing underscore to avoid confusion with builtin python type names, such as numpy.bool_ .
Canonical Python API name | Python API “C-like” name | Actual C type | Description |
numpy.bool or numpy.bool_ | N/A | bool (defined in stdbool.h) | Boolean (True or False) stored as a byte. |
numpy.int8 | numpy.byte | signed char | Platform-defined integer type with 8 bits. |
numpy.uint8 | numpy.ubyte | unsigned char | Platform-defined integer type with 8 bits without sign. |
numpy.int16 | numpy.short | short | Platform-defined integer type with 16 bits. |
numpy.uint16 | numpy.ushort | unsigned short | Platform-defined integer type with 16 bits without sign. |
numpy.int32 | numpy.intc | int | Platform-defined integer type with 32 bits. |
numpy.uint32 | numpy.uintc | unsigned int | Platform-defined integer type with 32 bits without sign. |
numpy.intp | N/A | ssize_t / Py_ssize_t | Platform-defined integer of size size_t ; used e.g. for sizes. |
numpy.uintp | N/A | size_t | Platform-defined integer type capable of storing the maximum allocation size. |
N/A | 'p' | intptr_t | Guaranteed to hold pointers. Character code only (Python and C). |
N/A | 'P' | uintptr_t | Guaranteed to hold pointers. Character code only (Python and C). |
numpy.int32 or numpy.int64 | numpy.long | long | Platform-defined integer type with at least 32 bits. |
numpy.uint32 or numpy.uint64 | numpy.ulong | unsigned long | Platform-defined integer type with at least 32 bits without sign. |
N/A | numpy.longlong | long long | Platform-defined integer type with at least 64 bits. |
N/A | numpy.ulonglong | unsigned long long | Platform-defined integer type with at least 64 bits without sign. |
numpy.float16 | numpy.half | N/A | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa. |
numpy.float32 | numpy.single | float | Platform-defined single precision float: typically sign bit, 8 bits exponent, 23 bits mantissa. |
numpy.float64 | numpy.double | double | Platform-defined double precision float: typically sign bit, 11 bits exponent, 52 bits mantissa. |
numpy.float96 or numpy.float128 | numpy.longdouble | long double | Platform-defined extended-precision float. |
numpy.complex64 | numpy.csingle | float complex | Complex number, represented by two single-precision floats (real and imaginary components). |
numpy.complex128 | numpy.cdouble | double complex | Complex number, represented by two double-precision floats (real and imaginary components). |
numpy.complex192 or numpy.complex256 | numpy.clongdouble | long double complex | Complex number, represented by two extended-precision floats (real and imaginary components). |
Since many of these have platform-dependent definitions, a set of fixed-size aliases are provided (See Sized aliases ).
Array scalars
NumPy generally returns elements of arrays as array scalars (a scalar with an associated dtype). Array scalars differ from Python scalars, but for the most part they can be used interchangeably (the primary exception is for versions of Python older than v2.x, where integer array scalars cannot act as indices for lists and tuples). There are some exceptions, such as when code requires very specific attributes of a scalar or when it checks specifically whether a value is a Python scalar. Generally, problems are easily fixed by explicitly converting array scalars to Python scalars, using the corresponding Python type function (e.g., int , float , complex , str ).
The primary advantage of using array scalars is that they preserve the array type (Python may not have a matching scalar type available, e.g. int16 ). Therefore, the use of array scalars ensures identical behaviour between arrays and scalars, irrespective of whether the value is inside an array or not. NumPy scalars also have many of the same methods arrays do.
Overflow errors
The fixed size of NumPy numeric types may cause overflow errors when a value requires more memory than available in the data type. For example, numpy.power evaluates 100 ** 9 correctly for 64-bit integers, but gives -1486618624 (incorrect) for a 32-bit integer.
The behaviour of NumPy and Python integer types differs significantly for integer overflows and may confuse users expecting NumPy integers to behave similar to Python’s int . Unlike NumPy, the size of Python’s int is flexible. This means Python integers may expand to accommodate any integer and will not overflow.
NumPy provides numpy.iinfo and numpy.finfo to verify the minimum or maximum values of NumPy integer and floating point values respectively.
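The overflow described above, and the limits reported by numpy.iinfo and numpy.finfo , can be sketched as follows. The variable names are arbitrary:

```python
import numpy as np

# 100 ** 9 fits in a 64-bit integer but wraps around in a 32-bit integer.
wide = np.power(100, 9, dtype=np.int64)
narrow = np.power(100, 9, dtype=np.int32)
print(wide)    # 1000000000000000000
print(narrow)  # -1486618624 (incorrect, the value overflowed)

# Python's own int grows as needed and does not overflow.
python_result = 100 ** 9
print(python_result)  # 1000000000000000000

# Query the limits of fixed-size types.
int32_info = np.iinfo(np.int32)
print(int32_info.min, int32_info.max)  # -2147483648 2147483647
print(np.finfo(np.float32).max)        # about 3.4e38
```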
If 64-bit integers are still too small the result may be cast to a floating point number. Floating point numbers offer a larger, but inexact, range of possible values.
Extended precision
Python’s floating-point numbers are usually 64-bit floating-point numbers, nearly equivalent to numpy.float64 . In some unusual situations it may be useful to use floating-point numbers with more precision. Whether this is possible in numpy depends on the hardware and on the development environment: specifically, x86 machines provide hardware floating-point with 80-bit precision, and while most C compilers provide this as their long double type, MSVC (standard for Windows builds) makes long double identical to double (64 bits). NumPy makes the compiler’s long double available as numpy.longdouble (and np.clongdouble for the complex numbers). You can find out what your numpy provides with np.finfo(np.longdouble) .
NumPy does not provide a dtype with more precision than C’s long double ; in particular, the 128-bit IEEE quad precision data type (FORTRAN’s REAL*16 ) is not available.
For efficient memory alignment, numpy.longdouble is usually stored padded with zero bits, either to 96 or 128 bits. Which is more efficient depends on hardware and development environment; typically on 32-bit systems they are padded to 96 bits, while on 64-bit systems they are typically padded to 128 bits. np.longdouble is padded to the system default; np.float96 and np.float128 are provided for users who want specific padding. In spite of the names, np.float96 and np.float128 provide only as much precision as np.longdouble , that is, 80 bits on most x86 machines and 64 bits in standard Windows builds.
Be warned that even if numpy.longdouble offers more precision than python float , it is easy to lose that extra precision, since python often forces values to pass through float . For example, the % formatting operator requires its arguments to be converted to standard python types, and it is therefore impossible to preserve extended precision even if many decimal places are requested. It can be useful to test your code with the value 1 + np.finfo(np.longdouble).eps .
Data Representation in Computer: Number Systems, Characters, Audio, Image and Video
What is Data Representation in Computer?
A computer uses a fixed number of bits to represent a piece of data which could be a number, a character, image, sound, video, etc. Data representation is the method used internally to represent data in a computer. Let us see how various types of data can be represented in computer memory.
Number Systems
A number system is a technique for representing numbers in computer system architecture; every value that you store in or read from computer memory is expressed in a defined number system.
The number 289 is pronounced as two hundred and eighty-nine and it consists of the symbols 2, 8, and 9. Similarly, there are other number systems. Each has its own symbols and method for constructing a number.
A number system has a unique base, which depends upon the number of symbols. The number of symbols used in a number system is called the base or radix of a number system.
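To make the idea of a base concrete, here is a small Python sketch that writes the number 289 in several bases by repeated division. The function name to_base and the digit alphabet are assumptions made for this illustration:

```python
DIGITS = "0123456789ABCDEF"


def to_base(number, base):
    """Convert a non-negative integer to a string in the given base (2 to 16)."""
    if number == 0:
        return "0"
    symbols = []
    while number > 0:
        number, remainder = divmod(number, base)
        symbols.append(DIGITS[remainder])  # Remainders give digits right to left
    return "".join(reversed(symbols))


for base in (2, 8, 10, 16):
    print(base, to_base(289, base))
# 2 100100001
# 8 441
# 10 289
# 16 121

# The built-in int() converts back from any base.
print(int("441", 8))  # 289
```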
Let us discuss some of the number systems. Computer architecture supports the following number systems:
Binary Number System
Octal Number System
The octal number system has only eight (8) digits, from 0 to 7. Every number (value) is represented with 0, 1, 2, 3, 4, 5, 6 and 7 in this number system. The base of the octal number system is 8, because it has only 8 digits.
Decimal Number System
The decimal number system has only ten (10) digits, from 0 to 9. Every number (value) is represented with 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 in this number system. The base of the decimal number system is 10, because it has only 10 digits.
Hexadecimal Number System
Data Representation of Characters
There are different methods to represent characters . Some of them are discussed below:
The code called ASCII (pronounced "AS-key"), which stands for American Standard Code for Information Interchange, uses 7 bits to represent each character in computer memory. The ASCII representation has been adopted as a standard by the U.S. government and is widely accepted.
Since there are exactly 128 unique combinations of 7 bits, this 7-bit code can represent only 128 characters. Another version, ASCII-8, also called extended ASCII, uses 8 bits for each character and can represent 256 different characters.
If ASCII-coded data is to be used in a computer that uses EBCDIC representation, it is necessary to transform ASCII code to EBCDIC code. Similarly, if EBCDIC coded data is to be used in an ASCII computer, EBCDIC code has to be transformed to ASCII.
Using 8-bit ASCII we can represent only 256 characters. This cannot represent all characters of written languages of the world and other symbols. Unicode is developed to resolve this problem. It aims to provide a standard character encoding scheme, which is universal and efficient.
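The limits of ASCII and the way Unicode lifts them can be illustrated with a short Python sketch. The sample characters are arbitrary, and UTF-8 is used here as one common Unicode encoding:

```python
# ASCII characters have codes below 128 and fit in 7 bits.
code = ord("A")
print(code, format(code, "07b"))  # 65 1000001

# Characters outside ASCII cannot be encoded with it.
try:
    "ñ".encode("ascii")
except UnicodeEncodeError as error:
    print(f"UnicodeEncodeError: {error.reason}")

# Unicode assigns every character a code point; UTF-8 stores it in 1 to 4 bytes.
euro_code_point = ord("€")
euro_bytes = "€".encode("utf-8")
print(euro_code_point)  # 8364
print(euro_bytes)       # b'\xe2\x82\xac'
```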
Data Representation of Audio, Image and Video
In most cases, we may have to represent and process data other than numbers and characters. This may include audio data, images, and videos. We can see that like numbers and characters, the audio, image, and video data also carry information.
Multimedia data such as audio, image, and video are stored in different types of files. The variety of file formats is due to the fact that there are quite a few approaches to compressing the data and a number of different ways of packaging the data.
For example, an image is most popularly stored in the Joint Photographic Experts Group (JPEG) file format. An image file consists of two parts: header information and image data. Information such as the name of the file, size, modified date, file format, etc. is stored in the header part.
Numerous such techniques are used to achieve compression. Depending on the application, images are stored in various file formats such as bitmap file format (BMP), Tagged Image File Format (TIFF), Graphics Interchange Format (GIF), and Portable Network Graphics (PNG).
For example, WAV file format typically contains uncompressed sound and MP3 files typically contain compressed audio data. The synthesized music data is stored in MIDI(Musical Instrument Digital Interface) files.
Similarly, video is also stored in different file formats such as AVI (Audio Video Interleave), a file format designed to store both audio and video data in a standard package that allows synchronous audio-with-video playback, as well as MP4, MPEG-2, WMV, etc.
- Written By Priya_Singh
- Last Modified 24-01-2023
Data Representation: Definition, Types, Examples
Data Representation: Data representation is a technique for analysing numerical data. The relationship between facts, ideas, information, and concepts is depicted in a diagram via data representation. It is a fundamental learning strategy that is simple and easy to understand. It is always determined by the data type in a specific domain. Graphical representations are available in many different shapes and sizes.
In mathematics, a graph is a chart in which statistical data is represented by curves or lines drawn across the coordinate points indicated on its surface. It aids in the investigation of a relationship between two variables by allowing one to evaluate the change in one variable's amount in relation to another over time. It is useful for analysing series and frequency distributions in a given context. On this page, we will go through several types of graphs that can be used to display data graphically.
Data Representation in Maths
Definition: After collecting the data, the investigator has to condense them in tabular form to study their salient features. Such an arrangement is known as the presentation of data.
Any information gathered may be organised in a frequency distribution table, and then shown using pictographs or bar graphs. A bar graph is a representation of numbers made up of equally wide bars whose lengths are determined by the frequency and scale you choose.
The collected raw data can be placed in any one of the given ways:
- Serial order or alphabetical order
- Ascending order
- Descending order
Data Representation Example
Example: Let the marks obtained by \(30\) students of class VIII in a class test, out of \(50\), according to their roll numbers, be:
\(39,\,25,\,5,\,33,\,19,\,21,\,12,\,41,\,12,\,21,\,19,\,1,\,10,\,8,\,12\)
\(17,\,19,\,17,\,17,\,41,\,40,\,12,\,41,\,33,\,19,\,21,\,33,\,5,\,1,\,21\)
The data in the given form is known as raw data or ungrouped data. The above-given data can be placed in the serial order as shown below:
Now, suppose you want to analyse the standard of achievement of the students. Arranging the marks in ascending or descending order gives a better picture.
Ascending order:
\(1,\,1,\,5,\,5,\,8,\,10,\,12,\,12,\,12,\,12,\,17,\,17,\,17,\,19,\,19\)
\(19,\,19,\,21,\,21,\,21,\,21,\,25,\,33,\,33,\,33,\,39,\,40,\,41,\,41,\,41\)
Descending order:
\(41,\,41,\,41,\,40,\,39,\,33,\,33,\,33,\,25,\,21,\,21,\,21,\,21,\,19,\,19\)
\(19,\,19,\,17,\,17,\,17,\,12,\,12,\,12,\,12,\,10,\,8,\,5,\,5,\,1,\,1\)
When the raw data is placed in ascending or descending order of magnitude, it is known as an array or arrayed data.
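The arrangement of the raw marks into arrayed data can be sketched in Python as follows. The list holds the \(30\) marks from the example; the variable names are chosen only for illustration.

```python
# Raw marks of the 30 students, in roll-number order.
marks = [39, 25, 5, 33, 19, 21, 12, 41, 12, 21, 19, 1, 10, 8, 12,
         17, 19, 17, 17, 41, 40, 12, 41, 33, 19, 21, 33, 5, 1, 21]

ascending = sorted(marks)                 # smallest mark first
descending = sorted(marks, reverse=True)  # largest mark first

print(ascending)
print(descending)
```

Sorting keeps every mark, so both arrays still contain \(30\) values; only the order changes.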
Graph Representation in Data Structure
A few of the graphical representations of data are given below:
- Frequency distribution table
Pictorial Representation of Data: Bar Chart
The bar graph represents the qualitative data visually. The information is displayed horizontally or vertically and compares items like amounts, characteristics, times, and frequency.
The bars are arranged in order of frequency, so more critical categories are emphasised. By looking at all the bars, it is easy to tell which types in a set of data dominate the others. Bar graphs can be in many ways like single, stacked, or grouped.
Graphical Representation of Data: Frequency Distribution Table
A frequency table or frequency distribution is a method of presenting raw data so that the information it contains can be understood easily.
The frequency distribution table is constructed by using tally marks. Tally marks are a form of numeral system that uses vertical lines for counting. A cross line is drawn over four vertical lines to make a group of \(5\).
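The grouping rule for tally marks can be sketched in Python as shown below. The helper name `tally` and the text notation, in which `||||/` stands for four vertical lines with a cross line, are assumptions made for this illustration.

```python
def tally(count):
    """Return tally marks for count: crossed groups of five, then single strokes."""
    groups, rest = divmod(count, 5)
    parts = ["||||/"] * groups      # each crossed group counts as 5
    if rest:
        parts.append("|" * rest)    # remaining strokes, fewer than 5
    return " ".join(parts)

print(tally(7))   # ||||/ ||
print(tally(10))  # ||||/ ||||/
```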
Consider a jar containing the different colours of pieces of bread as shown below:
Construct a frequency distribution table for the data mentioned above.
Graphical Representation of Data: Histogram
The histogram is another kind of graph that uses bars in its display. The histogram is used for quantitative data: ranges of values, known as classes, are listed at the bottom, and the classes with greater frequencies have the taller bars.
A histogram and a bar graph look very similar; however, they differ in the level of data they show. Bar graphs measure the frequency of categorical data. A categorical variable has two or more categories, such as gender or hair colour. A histogram, by contrast, shows the frequency of quantitative data grouped into classes.
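The grouping of quantitative data into classes can be sketched in Python as follows. The heights, the starting value of 130 and the class width of 5 are hypothetical values chosen for illustration, and the lower limit of each class is assumed to be inclusive.

```python
from collections import Counter

# Hypothetical heights in cm, used only for illustration.
heights = [131, 134, 138, 142, 147, 133, 139, 141, 136, 144]

START = 130  # assumed lower limit of the first class
WIDTH = 5    # assumed class width

def class_interval(value):
    """Return the (lower, upper) class containing value; lower limit inclusive."""
    lower = START + ((value - START) // WIDTH) * WIDTH
    return (lower, lower + WIDTH)

frequency = Counter(class_interval(h) for h in heights)

# Print a simple text histogram, one row per class.
for (lower, upper), count in sorted(frequency.items()):
    print(f"{lower}-{upper}: {'#' * count} ({count})")
```

Each row of the output corresponds to one bar of the histogram, with the bar length equal to the class frequency.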
Graphical Representation of Data: Pie Chart
The pie chart is used to represent the numerical proportions of a dataset. This graph involves dividing a circle into different sectors, where each sector represents the proportion of a particular element as a part of the whole. Thus, it is also known as a circle chart or circle graph.
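Since a full circle measures 360 degrees, the angle of each sector is the share of that element in the total multiplied by 360. A minimal Python sketch is given below; the subject names and counts are hypothetical values chosen for illustration.

```python
# Hypothetical counts of students per favourite subject (illustration only).
counts = {"Mathematics": 7, "Science": 5, "English": 4, "Art": 4}

total = sum(counts.values())

# Sector angle in degrees = (count / total) * 360.
angles = {name: count * 360 / total for name, count in counts.items()}

for name, angle in angles.items():
    print(f"{name}: {angle} degrees")
```

The sector angles always add up to 360 degrees, whatever the counts are.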
Graphical Representation of Data: Line Graph
A graph that uses points and lines to represent change over time is defined as a line graph. In other words, it is the chart that shows a line joining multiple points or a line that shows the link between the points.
The diagram illustrates the quantitative relationship between two changing variables with a straight line or curve that joins a series of successive data points. Line graphs compare two variables, one on the vertical axis and one on the horizontal axis.
General Rules for Visual Representation of Data
We have a few rules to present the information in the graphical representation effectively, and they are given below:
- Suitable Title: Ensure that the appropriate title is given to the graph, indicating the presentation’s subject.
- Measurement Unit: Introduce the measurement unit in the graph.
- Proper Scale: To represent the data accurately, choose an appropriate scale.
- Index: In the Index, the appropriate colours, shades, lines, design in the graphs are given for better understanding.
- Data Sources: At the bottom of the graph, include the source of information wherever necessary.
- Keep it Simple: Build the graph in such a way that everyone can understand it easily.
- Neat: Choose the size, fonts, colours, etc. so that the graph serves as a clear model for presenting the information.
Solved Examples on Data Representation
Q.1. Construct the frequency distribution table for the data on heights in \(({\rm{cm}})\) of \(20\) boys using the class intervals \(130 – 135,135 – 140\) and so on. The heights of the boys in \({\rm{cm}}\) are:
Ans: The frequency distribution for the above data can be constructed as follows:
Q.2. Write the steps of the construction of a bar graph.
Ans: To construct the bar graph, follow the given steps:
1. Take a graph paper, draw two lines perpendicular to each other, and call them the horizontal and vertical axes.
2. Mark the information given in the data, like days, weeks, months, years, places, etc., at uniform gaps along the horizontal axis.
3. Choose a suitable scale to decide the heights of the rectangles or bars, and then mark the heights on the vertical axis.
4. Draw the bars or rectangles of equal width, with the heights marked in the previous step, on the horizontal axis with equal spacing.
The figure so obtained will be the bar graph representing the given numerical data.
Q.3. Read the bar graph and then answer the given questions:
I. Write the information provided by the given bar graph.
II. What is the order of change of the number of students over several years?
III. In which year is the increase in the number of students maximum?
IV. State whether true or false: the enrolment during \(1996 – 97\) is double that of \(1995 – 96\).
Ans:
I. The bar graph represents the number of students in class \({\rm{VI}}\) of a school during the academic years \(1995 – 96\) to \(1999 – 2000\).
II. The number of students is changing in increasing order, as the heights of the bars are growing.
III. The increase in the number of students is uniform, as the increase in the height of the bars is uniform. Hence, in this case, the growth is not maximum in any one of the years.
IV. The enrolment in \(1996 – 97\) is \(200\), and the enrolment in \(1995 – 96\) is \(150\). Since \(200\) is not double \(150\), the enrolment in \(1996 – 97\) is not double the enrolment in \(1995 – 96\). So the statement is false.
Q.4. Write the frequency distribution for the given information of ages of \(25\) students of class VIII in a school.
\(15,\,16,\,16,\,14,\,17,\,17,\,16,\,15,\,15,\,16,\,16,\,17,\,15\)
\(16,\,16,\,14,\,16,\,15,\,14,\,15,\,16,\,16,\,15,\,14,\,15\)
Ans: Frequency distribution of ages of \(25\) students:
Age | Frequency |
14 | 4 |
15 | 8 |
16 | 10 |
17 | 3 |
Total | 25 |
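The counting needed for this frequency distribution can be sketched in Python using `collections.Counter` from the standard library. The ages are the \(25\) values given in the question; the variable names are chosen only for illustration.

```python
from collections import Counter

# Ages of the 25 students, as listed in the question.
ages = [15, 16, 16, 14, 17, 17, 16, 15, 15, 16, 16, 17, 15,
        16, 16, 14, 16, 15, 14, 15, 16, 16, 15, 14, 15]

frequency = Counter(ages)  # maps each age to the number of times it occurs

for age in sorted(frequency):
    print(age, frequency[age])
```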
Q.5. There are \(20\) students in a classroom. The teacher asked the students to talk about their favourite subjects. The results are listed below:
By looking at the above data, which is the most liked subject? Ans: Representing the above data in the frequency distribution table by using tally marks as follows:
From the above table, we can see that the maximum number of students \((7)\) likes mathematics.
In this article, we discussed data representation with an example. We then covered graphical representations such as the bar graph, frequency table, and pie chart, followed by the general rules for graphical representation. Finally, the solved examples and FAQs should help you gain further clarity on this topic.
FAQs on Data Representation
Q.1: How is data represented? A: The collected data can be expressed in various ways like bar graphs, pictographs, frequency tables, line graphs, pie charts and many more. It depends on the purpose of the data, and accordingly, the type of graph can be chosen.
Q.2: What are the different types of data representation? A: A few types of data representation are given below: 1. Frequency distribution table 2. Bar graph 3. Histogram 4. Line graph 5. Pie chart
Q.3: What is data representation, and why is it essential? A: After collecting the data, the investigator has to condense them in tabular form to study their salient features. Such an arrangement is known as the presentation of data. Importance: Data visualization gives us a clear understanding of what the information means by displaying it visually through maps or graphs. This makes the data easier for the mind to comprehend and makes it easier to identify trends and outliers within large data sets.
Q.4: What is the difference between data and representation? A: The term data refers to a collection of specific quantitative facts, such as heights or the number of children, whereas data representation is the information obtained after the data has been processed, arranged, and presented in a form that gives it meaning.
Q.5: Why do we use data representation? A: Data visualization gives us a clear understanding of what the information means by displaying it visually through maps or graphs. This makes the data easier for the mind to comprehend and makes it easier to identify trends and outliers within large data sets.
Interpretable SHAP Model Combining Meta-learning and Vision Transformer for Lithology Classification Using Limited and Unbalanced Drilling Data in Well Logging
- Original Paper
- Published: 19 August 2024
- Youzhuang Sun 1 , 2 ,
- Shanchen Pang 1 , 2 ,
- Zhiyuan Zhao 1 , 2 &
- Yongan Zhang 1 , 2
Recent advances in geological exploration and oil and gas development have highlighted the critical need for accurate classification and prediction of subterranean lithologies. To address this, we introduce the Meta-Vision Transformer (Meta-ViT) method, a novel approach. This technique synergistically combines the adaptability of meta-learning with the analytical prowess of ViT. Meta-learning excels in identifying nuanced similarities across tasks, significantly enhancing learning efficiency. Simultaneously, the ViT leverages these meta-learning insights to navigate the complex landscape of geological exploration, improving lithology identification accuracy. The Meta-ViT model employs a support set to extract crucial insights through meta-learning, and a query set to test the generalizability of these insights. This dual-framework setup enables the ViT to detect various underground rock types with unprecedented precision. Additionally, by simulating diverse tasks and data scenarios, meta-learning broadens the model's scope of application. Integrating the SHAP (SHapley Additive exPlanations) model, rooted in Shapley values from cooperative game theory, greatly enhances the interpretability of rock type classifications. We also utilized the ADASYN (Adaptive Synthetic Sampling) technique to optimize sample representation, generating new samples based on existing densities to ensure uniform distribution. Our extensive testing across various training and testing set ratios showed that the Meta-ViT model dramatically outperforms traditional machine learning models. This approach not only refines learning processes but also adeptly addresses the inherent challenges of geological data analysis.
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. The authors declare that they have no financial support or funding to report for this study.
Author information
Authors and affiliations.
College of Computer Science, China University of Petroleum (East China), Qingdao, Shandong, China
Youzhuang Sun, Shanchen Pang, Zhiyuan Zhao & Yongan Zhang
Qingdao College of Software, China University of Petroleum (East China), Qingdao, Shandong, China
Corresponding author
Correspondence to Shanchen Pang .
Ethics declarations
Conflict of interest.
The authors declare that they have no conflict of interest regarding the publication of this manuscript. All financial and non-financial relationships and activities that could be perceived as potential sources of bias have been disclosed.
Sun, Y., Pang, S., Zhao, Z. et al. Interpretable SHAP Model Combining Meta-learning and Vision Transformer for Lithology Classification Using Limited and Unbalanced Drilling Data in Well Logging. Nat Resour Res (2024). https://doi.org/10.1007/s11053-024-10396-4
Received : 08 April 2024
Accepted : 02 August 2024
Published : 19 August 2024
DOI : https://doi.org/10.1007/s11053-024-10396-4
- Logging parameters
- Meta-learning
- Vision transformer
- Lithology prediction
- Machine learning
What are the different ways of Data Representation?
The process of collecting data and analyzing that data in large quantities is known as statistics. It is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of numerical facts and figures.
It is a numerical statement that helps us to collect and analyze data in large quantities. Statistics is based on two concepts:
- Statistical Data
- Statistical Science
Statistics must be expressed numerically and should be collected systematically.
Data Representation
The word data refers to facts about people, things, events, or ideas. It can be a title, an integer, or any other value. After collecting data, the investigator has to condense them in tabular form to study their salient features. Such an arrangement is known as the presentation of data.
It refers to the process of condensing the collected data in a tabular form or graphically. This arrangement of data is known as Data Representation.
The raw data can be placed in different orders: ascending order, descending order, or alphabetical order.
Example: Let the marks obtained by 10 students of class V in a class test, out of 50, according to their roll numbers, be:
39, 44, 49, 40, 22, 10, 45, 38, 15, 50
The data in the given form is known as raw data. The above data can be placed in serial order as shown below:
Roll No. | Marks |
1 | 39 |
2 | 44 |
3 | 49 |
4 | 40 |
5 | 22 |
6 | 10 |
7 | 45 |
8 | 38 |
9 | 15 |
10 | 50 |
Now, suppose you want to analyse the standard of achievement of the students. Arranging the marks in ascending or descending order gives a better picture.
Ascending order: 10, 15, 22, 38, 39, 40, 44, 45, 49, 50
Descending order: 50, 49, 45, 44, 40, 39, 38, 22, 15, 10
When the raw data is placed in ascending or descending order, it is known as arrayed data.
Types of Graphical Data Representation
A bar chart helps us represent the collected data visually. Quantities such as amounts and frequencies can be drawn as horizontal or vertical bars, and the bars can be grouped or single. A bar chart makes it easy to compare different items: by looking at all the bars, we can quickly see which items in a group have the largest and smallest values.
Let us understand the bar chart through an example. Let the marks obtained by 5 students of class V in a class test, out of 10, listed according to their names, be:
7, 8, 4, 9, 6
Data in this form is known as raw data. It can be tabulated for a bar chart as shown below:
Name | Marks |
Akshay | 7 |
Maya | 8 |
Dhanvi | 4 |
Jaslen | 9 |
Muskan | 6 |
A histogram is a graphical representation of data. It looks similar to a bar graph, but the two are quite different: a bar graph shows the frequency of categorical data, that is, data based on two or more categories such as gender or months, whereas a histogram is used for quantitative data.
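To make the idea of quantitative data concrete, here is a minimal sketch that groups numeric values into equal-width intervals, which is the counting step behind a histogram. It reuses the class V marks from the earlier example; the interval width of 10 and the text-based bars are assumptions made only for this illustration.

```python
from collections import Counter

marks = [39, 44, 49, 40, 22, 10, 45, 38, 15, 50]
bin_width = 10  # assumed width of each class interval

# Map each mark to the lower edge of its interval, e.g. 39 -> 30 (interval 30-39).
bin_counts = Counter((m // bin_width) * bin_width for m in marks)

# Print one row of '#' symbols per interval as a rough text histogram.
for lower in sorted(bin_counts):
    upper = lower + bin_width - 1
    print(f"{lower:2d}-{upper:2d} | {'#' * bin_counts[lower]}")
```

Unlike the bars of a bar chart, these intervals sit next to each other on a numeric scale, so their order is fixed.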
A graph that uses points joined by lines to show change over time is known as a line graph. A line graph can show, for example, the number of animals left on earth, the growth of the world's population day by day, or the rising or falling number of bitcoins day by day. Line graphs tell us about changes occurring over time, and a single line graph can show two or more such changes together.
A pie chart is a type of graph that represents numerical proportions as slices of a circle. In most cases it can be replaced by other plots such as a bar chart, box plot, or dot plot. Research has shown that it is difficult to compare the different sections of a given pie chart, or to compare data across different pie charts.
Frequency Distribution Table
A frequency distribution table is a chart that summarises the values in a data set and how often each one occurs. The table has two columns: the first column lists the various outcomes in the data, while the second column lists the frequency of each outcome. Putting data into such a table makes it easier to understand and analyze.
For Example: To create a frequency distribution table for the runs a baseball team scored per inning, we first list all the outcomes in the data. In this example, the outcomes are 0 runs, 1 run, 2 runs, and 3 runs. We list these values in numerical order in the first column. Next, we count how many times each outcome occurred. The team scored 0 runs in the 1st, 4th, 7th, and 8th innings, 1 run in the 2nd, 5th, and 9th innings, 2 runs in the 6th inning, and 3 runs in the 3rd inning. We enter the frequency of each outcome in the second column. The resulting table is a far more useful way to show this data.
Baseball Team Runs Per Inning
Number of Runs | Frequency |
0 | 4 |
1 | 3 |
2 | 1 |
3 | 1 |
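The counting procedure described above can be sketched with Python's `collections.Counter`. The list below encodes the runs for innings 1 through 9 as stated in the example; the variable names are illustrative.

```python
from collections import Counter

# Runs scored in innings 1 through 9, as described in the example.
runs_per_inning = [0, 1, 3, 0, 1, 2, 0, 0, 1]

# Count how many times each outcome occurred.
frequency = Counter(runs_per_inning)

# First column: outcomes in numerical order. Second column: their frequencies.
print("Number of Runs | Frequency")
for runs in sorted(frequency):
    print(f"{runs:14d} | {frequency[runs]}")
```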
Sample Questions
Question 1: The school fee submission status of 10 students of class 10 is given below. Represent the data using a bar graph.
Student | Fee submission |
Muskan | Paid |
Kritika | Not paid |
Anmol | Not paid |
Raghav | Paid |
Nitin | Paid |
Dhanvi | Paid |
Jasleen | Paid |
Manas | Not paid |
Anshul | Not paid |
Sahil | Paid |
In order to draw the bar graph for the data above, we prepare the frequency table as given below.
Fee submission | No. of Students |
Paid | 6 |
Not paid | 4 |
Now we represent the data using a bar graph. It can be drawn by following the steps given below:
Step 1: Draw the two axes of the graph, the X-axis and the Y-axis. The categories of the data go on the X-axis (the horizontal line) and the frequencies go on the Y-axis (the vertical line).
Step 2: Give a numeric scale to the Y-axis. It should start from zero and end at the highest value in the data.
Step 3: Choose a suitable interval for the numeric scale on the Y-axis, such as 0, 1, 2, 3, … or 0, 10, 20, 30, … or 0, 20, 40, 60, …
Step 4: Label the X-axis appropriately.
Step 5: Draw the bars according to the data, keeping in mind that all the bars should be of the same width and that there should be the same distance between adjacent bars.
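As a rough sketch of the tabulation step, the frequency table can be computed from the list of students, and a text-only stand-in for the bar graph printed from it. The dictionary below copies the data from the question; using `#` symbols in place of drawn bars is an assumption made to keep the example free of plotting libraries.

```python
from collections import Counter

# Fee submission status of the 10 students from the question.
fee_status = {
    "Muskan": "Paid", "Kritika": "Not paid", "Anmol": "Not paid",
    "Raghav": "Paid", "Nitin": "Paid", "Dhanvi": "Paid",
    "Jasleen": "Paid", "Manas": "Not paid", "Anshul": "Not paid",
    "Sahil": "Paid",
}

# Frequency table: number of students in each category.
counts = Counter(fee_status.values())

# One row per category; the bar length is the frequency.
for category, n in counts.items():
    print(f"{category:<8} | {'#' * n} ({n})")
```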
Question 2: Observe the following pie chart, which shows the money spent by Megha at the funfair. Each colour indicates the amount paid for one item. The total value of the data is 15, and the amount paid for each item is as follows:
Chocolates – 3
Wafers – 3
Toys – 2
Rides – 7
To convert this into pie chart percentages, we apply the formula: (Frequency / Total Frequency) × 100
Let us convert the above data into percentages:
Amount paid on rides: (7/15) × 100 ≈ 47%
Amount paid on toys: (2/15) × 100 ≈ 13%
Amount paid on wafers: (3/15) × 100 = 20%
Amount paid on chocolates: (3/15) × 100 = 20%
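The same percentage formula can be checked with a short Python sketch. The amounts are taken from the question; the dictionary and variable names are illustrative.

```python
# Amount Megha paid for each item at the funfair.
spending = {"Chocolates": 3, "Wafers": 3, "Toys": 2, "Rides": 7}

total = sum(spending.values())

# (Frequency / Total Frequency) x 100 for each item.
percentages = {item: amount / total * 100 for item, amount in spending.items()}

for item, pct in percentages.items():
    print(f"{item}: {pct:.0f}%")
```

Rounding each share to a whole number is what gives 47% and 13% for rides and toys; the unrounded shares add up to 100%.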
Question 3: The line graph given below shows how Devdas's height changes as he grows. Observe the graph and answer the questions below.
(i) What was the height of Devdas at 8 years? Answer: 65 inches
(ii) What was the height of Devdas at 6 years? Answer: 50 inches
(iii) What was the height of Devdas at 2 years? Answer: 35 inches
(iv) How much has Devdas grown from 2 to 8 years? Answer: 65 − 35 = 30 inches
(v) When was Devdas 35 inches tall? Answer: At 2 years.
The standard type hierarchy of Python 3. In computer science and computer programming, a data type (or simply type) is a collection or grouping of data values, usually specified by a set of possible values, a set of allowed operations on these values, and/or a representation of these values as machine types. [1] A data type specification in a program constrains the possible values that an ...
Really, datatypes are an abstraction that allows your programming language to treat a few bytes at some address as some kind of numeric type. Consider the data type as a lens that lets you see a piece of memory as an int, or a float. In reality, it's all just bits to the computer.
There are various kinds of data types, corresponding to the various kinds of data. Data types are of 3 types. Primitive Data Types: int, float, char, bool. Composite Data Types: string, array, pointers. User Defined Data Types.
Two major approaches: structural equivalence and name equivalence. Name equivalence is based on declarations. Two types are the same only if they have the same name. (Each type definition introduces a new type) strict: aliases (i.e. declaring a type to be equal to another type) are distinct. loose: aliases are equivalent.
This guide to data representation covers all the key concepts you need to know to understand the principles of representing data in computer systems. Whether you're a GCSE, IB or A-level computer science student, our guide provides a detailed explanation of how data is represented in binary, hexadecimal, and ASCII formats, as well as the ...
The interpretation of a binary pattern is called data representation or encoding. Furthermore, it is important that the data representation schemes are agreed upon by all the parties, i.e., industrial standards need to be formulated and strictly followed. ... The char data type is based on the original 16-bit Unicode standard called UCS-2. The ...
Primitive data types: Computers deal with binary data at the most basic level. In most programming languages, integers, floating-point numbers, characters, and Booleans are foundational data types. Their representation involves bit patterns in memory, with specifics such as endian-ness, precision, and overflow/underflow considerations.
Data Representation Data Representation Eric Roberts CS 106A February 10, 2016 Claude Shannon Claude Shannon was one of the pioneers who shaped computer science in its early ... • When you pass an argument of a primitive type to a method, Java copies the value of the argument into the parameter variable. As a result, changes to the parameter ...
Data Representation Computer Organization I 4 CS@VT ©2005-2020 WD McQuain Integer Data Types We need to provide support for a variety of data types. For integer values, we need to provide a variety of types that allow the user to choose based upon memory considerations and range of representation. For contemporary programming languages, we ...
With these types, you can represent numeric values, text and binary data, and Boolean values in your code. So, these data types are the basic building blocks of most Python programs and projects. In this tutorial, you'll learn about: Numeric types, such as int, float, and complex. The str data type, which represents textual data.
Data Type • In a computer system, we need a representation of data and operations that can be performed on the data by the machine instructions or the computer language. • This combination of representation + operations is known as a data type. • The type tells the compiler how the programmer intends to use it •Prog.
Data Representation and Data Types. Data Representation. Most of us write numbers in Arabic form, i.e., 1, 2, 3, ..., 9. Some people write them differently, such as I, II, III, IV, ..., IX. No matter what type of representation, most human beings can understand at least the two types I mentioned. Unfortunately the computer doesn't.
2.1: Types of Data Representation. Two common types of graphic displays are bar charts and histograms. Both bar charts and histograms use vertical or horizontal bars to represent the number of data points in each category or interval. The main difference graphically is that in a bar chart there are spaces between the bars and in a ...
Data representations are useful for interpreting data and identifying trends and relationships. When working with data representations, pay close attention to both the data values and the key words in the question. When matching data to a representation, check that the values are graphed accurately for all categories.
Data Representation • At its most basic level, all digital information must reduce to 0s and 1s, which can be discussed as binary, octal, or hex data. • There's no practical limit on how it can be interpreted to represent more complex structured data Type of Data Representation Integers 2's complement Floating point numbers IEEE 754
Numerical Data Types. There are 5 basic numerical types representing booleans (bool), integers (int), unsigned integers (uint), floating point (float) and complex. A basic numerical type name combined with a numeric bitsize defines a concrete type.
C++ supports the following data types: Primary or Built-in or Fundamental data type; Derived data types; User-defined data types; Data Types in C++ are Mainly Divided into 3 Types: 1. Primitive Data Types: These data types are built-in or predefined data types and can be used directly by the user to declare variables. example: int, char, float, bool, etc. Primitive data types available in C++ are:
A computer uses a fixed number of bits to represent a piece of data which could be a number, a character, image, sound, video, etc. Data representation is the method used internally to represent data in a computer. Let us see how various types of data can be represented in computer memory. Before discussing data representation of numbers, let ...
int var_name;. The integer data type can also be used as. unsigned int: Unsigned int data type in C is used to store the data values from zero to positive numbers but it can't store negative values like signed int. short int: It is lesser in size than the int by 2 bytes so can only store values from -32,768 to 32,767. long int: Larger version of the int datatype so can store values greater ...
6. Short. Similar to the long data type, a short is a variable integer. Programmers represent these as whole numbers, and they can be positive or negative. Sometimes a short data type is a single integer. 7. String. A string data type is a combination of characters that can be either constant or variable.
Data Representation: Data representation is a technique for analysing numerical data. The relationship between facts, ideas, information, and concepts is depicted in a diagram via data representation. It is a fundamental learning strategy that is simple and easy to understand. It is always determined by the data type in a specific domain.
A bar graph is a type of graphical representation of the data in which bars of uniform width are drawn with equal spacing between them on one axis (x-axis usually), depicting the variable. The values of the variables are represented by the height of the bars. Histograms.
These data are essential for analyzing subsurface geological structures. The raw data are preprocessed—normalization and standardization—to enhance the accuracy and efficiency of model processing in subsequent steps. Step 2: Meta-Learning Training Set Construction. Using the preprocessed data, construct a meta-training set.