We’ll look at the two string classes from a number of viewpoints: semantics, literal values, operations, comparison operators, built-in functions, methods and modules. Additionally, we have a digression on the immutability of string objects.
A String (the formal class name is str) is an immutable sequence of ASCII characters.
A Unicode String (unicode) is an immutable sequence of Unicode characters.
Since a string (either str or unicode) is a sequence, all of the common operations on sequences apply. We can concatenate string objects together and select characters from a string. When we select a slice from a string, we’ve extracted a substring.
An individual character is simply a string of length one.
Important
Python 3.0
The Python 2 str class, which is limited to single-byte ASCII characters, does two separate things: it represents text as well as a collection of bytes.
The text features of str gain the features from the Unicode String class, unicode. The new str class will represent strings of text, irrespective of the underlying encoding. It can be ASCII, UTF-8, UTF-16 or any other encoding.
The “array of bytes” features of the Python 2 str class will be moved into a new class, bytes. This new class will implement simple sequences of bytes and will support conversion between bytes and strings using encoding and decoding functions.
A str is a sequence of ASCII characters. The literal value for a str is written by surrounding the value with quotes or apostrophes. There are several variations to provide some additional features.
Basic String. Strings are enclosed in matching quotes (") or apostrophes ('). A string enclosed in quotes (") can contain apostrophes ('); similarly, a string enclosed in apostrophes (') can contain quotes ("). A basic str must be completed on a single line, or continued with a \ as the very last character of a line. Examples:
"consultive"
'syncopated'
"don't do that"
'"Okay," he said.'
Multi-Line String. Also called “Triple-Quoted String”. A multi-line str is enclosed in triple quotes (""") or triple apostrophes ('''). It continues across line boundaries until the concluding triple-quote or triple-apostrophe. Examples:
"""A very long
string"""
'''SELECT *
FROM THIS, THAT
WHERE THIS.KEY = THAT.FK
AND THIS.CODE = 'Active'
'''
Unicode String. A Unicode String uses the above quoting rules, but prefaces the quote with (u"), (u'), (u""") or (u'''). Unicode is the Universal Character Set; each character requires from 1 to 4 bytes of storage, depending on the encoding. ASCII is a single-byte character set; each of the 128 ASCII characters fits in a single byte of storage. Unicode permits any character in any of the languages in common use around the world. A special \uxxxx escape sequence is used for Unicode characters that don’t happen to occur on your ASCII keyboard. Examples:
u'\u65e5\u672c'
u"All ASCII"
Raw String. A Raw String uses the above quoting rules, but prefaces the quote with (r"), (r'), (r""") or (r'''). The backslash characters (\) are not interpreted as escapes by Python, but are left as is. This is handy for Windows file names that contain \. It is also handy for regular expressions that make extensive use of backslashes. Examples:
newline_literal= r'\n'
filename= r"C:\mumbo\jumbo"
pattern= r"(\*\S+\*)"
The newline_literal is a two character string, not the newline character.
Outside of raw strings, non-printing characters and Unicode characters that aren’t found on your keyboard are created using escapes. A table of escapes is provided below. These are Python representations for unprintable ASCII characters. They’re called escapes because the \ is an escape from the usual meaning of the following character.
| Escape | Meaning |
|---|---|
| \\ | Backslash (\) |
| \' | Apostrophe (') |
| \" | Quote (") |
| \a | Audible Signal; the ASCII code called BEL. Some OS’s translate this to a screen flash or ignore it completely. |
| \b | Backspace (ASCII BS) |
| \f | Formfeed (ASCII FF). On a paper-based printer, this would move to the top of the next page. |
| \n | Linefeed (ASCII LF), also known as newline. This would move the paper up one line. |
| \r | Carriage Return (ASCII CR). On a paper-based printer, this returned the print carriage to the start of the line. |
| \t | Horizontal Tab (ASCII TAB) |
| \ooo | An ASCII character with the given octal value. The ooo is any octal number. |
| \xhh | An ASCII character with the given hexadecimal value. The x is required. The hh is any hex number. |
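A minimal sketch of how a few of these escapes behave. The variable names are illustrative only, and the code is written so that it should run unchanged under both Python 2 and Python 3.

```python
# Each escape below produces exactly one character.
newline = "\n"        # linefeed
tab = "\t"            # horizontal tab
octal_a = "\101"      # octal 101 is decimal 65, the letter A
hex_a = "\x41"        # hexadecimal 41 is also decimal 65

# A raw string keeps the backslash, so r"\n" is two characters.
raw_newline = r"\n"

assert len(newline) == 1 and len(tab) == 1
assert len(raw_newline) == 2
assert octal_a == hex_a == "A"

print("escaped: %d character, raw: %d characters"
      % (len(newline), len(raw_newline)))
```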
Adjacent Strings. Note that adjacent string objects are automatically concatenated to make a single string.
"ab" "cd" "ef" is the same as "abcdef".
The most common use for this is the following:
msg = "A very long " \
"message, which didn't fit on " \
"one line."
Unicode Characters. For Unicode, a special \uxxxx escape is provided. This requires the four digit Unicode character identification.
For example, “日本” is made up of Unicode characters U+65e5 and U+672c. In Python, we write this string as u'\u65e5\u672c'.
There are a variety of Unicode encoding schemes, for example, UTF-8, UTF-16 and LATIN-1. The codecs module provides mechanisms for encoding and decoding Unicode Strings.
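As a rough illustration of encoding and decoding, the sketch below uses the encode() and decode() methods of the string objects rather than calling the codecs module directly. The variable names are assumptions made for the example.

```python
# The two characters of the example string, written with \uxxxx escapes.
nihon = u'\u65e5\u672c'

# Encoding turns Unicode text into bytes in a particular scheme.
as_utf8 = nihon.encode('utf-8')        # three bytes per character here
as_utf16 = nihon.encode('utf-16-le')   # two bytes per character, no byte-order mark

# Decoding with the matching scheme recovers the original text.
assert as_utf8.decode('utf-8') == nihon
assert as_utf16.decode('utf-16-le') == nihon

print("%d characters, %d UTF-8 bytes, %d UTF-16 bytes"
      % (len(nihon), len(as_utf8), len(as_utf16)))
```

The same text occupies a different number of bytes in each scheme, which is why the encoding has to be known before bytes can be turned back into text.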
There are a number of operations on str objects, operations which create strs and operations which create other objects from strs.
There are three operations (+ , * , [ ]) that work with all sequences (including strs) and a unique operation, %, that can be performed only with str objects.
The + operator creates a new string as the concatenation of the arguments.
>>> "hi " + 'mom'
'hi mom'
The * operator between str and numbers (number * str or str * number) creates a new str that is a number of repetitions of the input str.
>>> print 3*"cool!"
cool!cool!cool!
The [ ] operator can extract a single character or a slice from the string. There are two forms: the single-item form and the slice form.
The single item format is string [ index ]. Characters are numbered from 0 to len(string)-1. Characters are also numbered in reverse from -len(string) to -1.
The slice format is string [ start : end ]. Characters from start to end-1 are chosen to create a new str as a slice of the original str; there will be end-start characters in the resulting str.
If start is omitted it is the beginning of the string (position 0).
If end is omitted, the slice runs through the end of the string (position len(string)).
Yes, you can omit both (someString[:]) to make a copy of a string.
>>> s="adenosine"
>>> s[2]
'e'
>>> s[:5]
'adeno'
>>> s[5:]
'sine'
>>> s[-5:]
'osine'
>>> s[:-5]
'aden'
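The indexing and slicing rules can be checked with a few assertions. This is only a sketch; the names start, end and piece are chosen for the example.

```python
s = "adenosine"
start, end = 2, 6

piece = s[start:end]           # characters at positions 2, 3, 4 and 5
assert piece == "enos"
assert len(piece) == end - start

# Index -1 is the last character; index len(s) - 1 names the same one.
assert s[-1] == s[len(s) - 1] == "e"

# Omitting both positions copies the whole string.
assert s[:] == s

# A slice that runs past the end is quietly trimmed to the string's length.
assert s[5:100] == "sine"
```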
The String Formatting Operation, %. The % operator is sometimes called string interpolation, since it interpolates literal text and converted values. We prefer to call it string formatting, since that is a more apt description. Much of the formatting is taken straight from the C library’s printf() function.
This operator has three forms. You can use % with a str and a single value, a str and a tuple, or a str and a dict. We’ll cover tuple and dict in detail later.
The string on the left-hand side of % contains a mixture of literal text plus conversion specifications. A conversion specification begins with %. For example, integers are converted with %i. Each conversion specification will use a corresponding value from the tuple. The first conversion uses the first value of the tuple, the second conversion uses the second value from the tuple.
For example:
import random
d1, d2 = random.randrange(1,7), random.randrange(1,7)
r= "die 1 shows %i, and die 2 shows %i" % ( d1, d2 )
The first %i will convert the value for d1 to a string and insert the value, the second %i will convert the value for d2 to a string. The % operator returns the new string based on the format, with each conversion specification replaced with the appropriate values.
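The three forms of the % operator can be put side by side. This is a sketch with fixed values in place of random dice, so that the results are predictable; the names one, two and three are illustrative.

```python
# Form 1: a single value on the right-hand side.
one = "rolled %i" % 4

# Form 2: a tuple supplies one value per conversion, in order.
two = "die 1 shows %i, and die 2 shows %i" % (3, 5)

# Form 3: a dict supplies values by name; each conversion names its key
# in parentheses right after the %.
three = "die 1 shows %(d1)i, and die 2 shows %(d2)i" % {"d1": 3, "d2": 5}

assert one == "rolled 4"
assert two == "die 1 shows 3, and die 2 shows 5"
assert two == three
print(three)
```

The dict form is convenient when the same value appears several times in the format, or when the order of the values should not matter.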
Conversion Specifications. Each conversion specification has from one to four elements, following this pattern:
% [ flags ][ width [. precision ]] code
The % and the final code in each conversion specification are required. The other elements are optional.
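The optional flags element can have any combination of the following values:
| Flag | Meaning |
|---|---|
| - | Left-justify the converted value within the field. |
| + | Always include a sign (+ or -) for numeric conversions. |
| (space) | Leave a blank before a positive number. |
| 0 | Pad numeric values with leading zeros to fill the field width. |
| # | Use the alternate form, such as a leading 0x for hexadecimal. |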
The optional width element is a number that specifies the total number of characters for the field, including signs and decimal points. If omitted, the width is just big enough to hold the output number. If a * is used instead of a number, an item from the tuple of values is used as the width of the field. For example, "%*i" % ( 3, d1 ) uses the value 3 from the tuple as the field width and d1 as the value to convert to a string.
The optional precision element (which must be preceded by a dot, . if it is present) has a few different purposes. For numeric conversions, this is the number of digits to the right of the decimal point. For string conversions, this is the maximum number of characters to be printed; longer strings will be truncated. If a * is used instead of a number, an item from the tuple of values is used as the precision of the conversion. For example, "%*.*f" % ( 6, 2, avg ) uses the value 6 from the tuple as the field width, the value 2 from the tuple as the precision and avg as the value.
The standard conversion rules also permit a long or short indicator: l or h. These are tolerated by Python so that these formats will be compatible with C, but they have no effect. They reflect internal representation considerations for C programming, not external formatting of the data.
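The required one-letter code element specifies the conversion to perform. The codes are listed below.
| Code | Meaning |
|---|---|
| d, i | Signed decimal integer. |
| o | Octal integer. |
| x, X | Hexadecimal integer, using lowercase or uppercase letters. |
| e, E | Floating-point number in exponential (scientific) notation. |
| f, F | Floating-point number in fixed decimal notation. |
| g, G | Floating-point number in either the e or the f form, depending on the size of the exponent. |
| c | Single character, given as an integer code or a one-character string. |
| r | String produced by applying repr() to the value. |
| s | String produced by applying str() to the value. |
| % | A literal % character; no value is consumed. |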
Here are some examples.
"%i: %i win, %i loss, %6.3f" % (count,win,loss,float(win)/loss)
This example does four conversions: three simple integer conversions and one floating-point conversion with a width of 6 and 3 digits of precision, so the number appears in the form -0.000. The rest of the string is literally included in the output.
"Spin %3i: %2i, %s" % (spin,number,color)
This example does three conversions: one number is converted into a field with a width of 3, another converted with a width of 2, and a string is converted, using as much space as the string requires.
>>> a=6.02E23
>>> "%e" % a
'6.020000e+23'
>>> "%E" % a
'6.020000E+23'
>>>
This example shows simple conversion of a floating-point number to the default scientific notation, which has a width of 12 and a precision of 6.
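The flags, width and precision elements can be tried out with a handful of assertions. This is a sketch only; avg is an arbitrary sample value, and the -, 0 and + flags used here are the standard printf-style flags.

```python
avg = 3.14159

# Width 8, precision 3: right-justified in eight characters.
assert "%8.3f" % avg == "   3.142"

# The - flag left-justifies; the 0 flag pads with zeros; + forces a sign.
assert "%-8.3f|" % avg == "3.142   |"
assert "%08.3f" % avg == "0003.142"
assert "%+.2f" % avg == "+3.14"

# A * takes the width or precision from the tuple of values.
assert "%*.*f" % (6, 2, avg) == "  3.14"

# For %s, the precision is a maximum length, so long strings are truncated.
assert "%.3s" % "syncopated" == "syn"
```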
The standard comparisons (<, <=, >, >=, ==, !=) apply to str objects. These comparisons use the standard character-by-character comparison rules for ASCII or Unicode.
There are two additional comparisons: in and not in. These check to see if a substring occurs in a longer string. The in operator returns True when the substring is found, False if the substring is not found. The not in operator returns True if the substring is not found.
>>> 'a' in 'xyzzyabcxyzzy'
True
>>> 'abc' in 'xyzzyabc'
True
Don’t be fooled by the fact that string representations of integers don’t seem to sort properly. String comparison does not magically recognize that the strings are representations of numbers. It’s simple “alphabetical order” rules applied to digits.
>>> '100' < '25'
True
This is true because '1' < '2'.
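When numeric order is what we want, the usual remedy is to convert the strings to numbers for the comparison. The sketch below assumes a small list of digit strings invented for the example.

```python
values = ['100', '25', '9']

# Plain string sorting compares character by character.
assert sorted(values) == ['100', '25', '9']

# Converting to int for the comparison gives numeric order.
assert sorted(values, key=int) == ['9', '25', '100']
assert int('100') > int('25')
```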
The for statement will step through all elements of a sequence. In the case of a string, it will step through each character of the string.
For example:
for letter in "forestland":
    print letter
This will print each letter of the given string.
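The following built-in functions are relevant to str manipulation.
chr(). Returns the one-character str for the given integer code, which must be in the range 0 to 255.
ord(). Returns the integer code of a one-character string.
unichr(). Returns the one-character Unicode string for the given integer code.
unicode(). Creates a new Unicode object from the given encoded string. encoding defaults to the current default string encoding. errors defines the error handling, defaults to ‘strict’.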
The unicode() function decodes a str from a specific external encoding into a Unicode string. Unless the default has been changed, the default encoding is ASCII, with ‘strict’ error handling.
Choices for errors are ‘strict’, ‘replace’ and ‘ignore’. Strict raises an exception for unrecognized characters, replace substitutes the Unicode replacement character ( \uFFFD ) and ignore skips over invalid characters.
The codecs and unicodedata modules provide more functions for working with Unicode.
>>> unicode("hi mom","UTF-16")
u'\u6968\u6d20\u6d6f'
>>> unicode("hi mom","UTF-8")
u'hi mom'
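The three choices for errors can be compared directly. This sketch uses the decode() method of a byte string, which takes the same encoding and errors arguments as unicode() and is also available in Python 3; the sample bytes are invented for the example.

```python
# Two ASCII letters followed by one byte that is not valid ASCII.
raw = b"hi\xff"

# 'strict' (the default) raises an exception on the bad byte.
try:
    raw.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed

# 'replace' substitutes U+FFFD; 'ignore' silently drops the bad byte.
assert raw.decode("ascii", "replace") == u"hi\ufffd"
assert raw.decode("ascii", "ignore") == u"hi"
```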
Important
Python 3
The ord(), chr(), unichr() and unicode() functions will be simplified in Python 3.
Python 3 no longer separates ASCII from Unicode strings. These functions will all implicitly work with Unicode strings. Note that the UTF-8 encoding of Unicode overlaps with ASCII, so this simplification to use Unicode will not significantly disrupt programs that work with ASCII files.
Several important functions were defined earlier in String Conversion Functions.
repr(). Returns a canonical string representation of the object. For most object types, eval(repr(object)) == object.
For simple numeric types, the result of repr() isn’t very interesting. For more complex types, however, it often reveals details of their structure.
>>> a="""a very
... long string
... in multiple lines
... """
>>> repr(a)
"'a very \\nlong string \\nin multiple lines\\n'"
This representation shows the newline characters ( \n ) embedded within the triple-quoted string.
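The relationship eval(repr(object)) == object can be checked for simple built-in values. This is a sketch with made-up sample values; it says nothing about objects of arbitrary classes.

```python
text = "a very\nlong string\n"
pair = (355, 113, "pi")

# repr() shows the escapes that print would render as line breaks.
assert "\\n" in repr(text)

# For simple built-in values, evaluating the repr rebuilds an equal object.
assert eval(repr(text)) == text
assert eval(repr(pair)) == pair
```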
Important
Python 3
The “reverse quotes” (`a`) work like repr(a). The reverse quote syntax is rarely used, and will be dropped in Python 3.
str(). Return a nice string representation of the object. If the argument is a string, the return value is the same object.
>>> a= str(355.0/113.0)
>>> a
'3.14159292035'
>>> len(a)
13
Some other functions apply to strings as well as other sequence objects.
len(). For strings, this function returns the number of characters.
>>> len("abcdefg")
7
>>> len(r"\n")
2
>>> len("\n")
1
max(). For strings, this function returns the maximum character.
min(). For strings, this function returns the minimum character.
sorted(). Iterate through the string’s characters in sorted order. This expands the string into an explicit list of individual characters.
>>> sorted( "malapertly" )
['a', 'a', 'e', 'l', 'l', 'm', 'p', 'r', 't', 'y']
>>> "".join( sorted( "malapertly" ) )
'aaellmprty'
reversed(). Iterate through the string’s characters in reverse order. This creates an iterator. The iterator can be used with a variety of functions or statements.
>>> reversed( "malapertly" )
<reversed object at 0x600230>
>>> "".join( reversed( "malapertly" ) )
'yltrepalam'
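A short sketch of min(), max() and reversed() working together on the same word. The variable names are illustrative.

```python
word = "malapertly"

# min() and max() compare characters by their character codes.
assert min(word) == "a"
assert max(word) == "y"

# The reversed() iterator can feed a for statement as well as join().
backwards = ""
for letter in reversed(word):
    backwards += letter
assert backwards == "yltrepalam"

# An extended slice with a step of -1 builds the same string directly.
assert word[::-1] == backwards
```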
A string object has a number of method functions. These can be grouped arbitrarily into transformations, which create new strings from old, and information, which returns a fact about a string.
The following string transformation functions create a new string object from an existing string.
translate(table, deletechars). Return a copy of the string, where all characters occurring in the optional argument deletechars are removed, and the remaining characters have been mapped through the given translation table. The table must be a string of length 256, providing a translation for each 1-byte ASCII character.
The translation tables are built using the string.maketrans() function in the string module.
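A few of the transformation methods in action. This is only a sample of standard str methods, chosen for illustration; each call returns a new string and leaves the original alone.

```python
title = "  near a raven  "

# Each call returns a new string; the original is never changed.
assert title.strip() == "near a raven"
assert title.strip().upper() == "NEAR A RAVEN"
assert title.strip().title() == "Near A Raven"
assert title.strip().replace("raven", "crow") == "near a crow"
assert "7".rjust(3, "0") == "007"
assert title == "  near a raven  "
```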
The following accessor methods provide information about a string.
The following generators create another kind of object, usually a sequence, from a string.
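The difference between accessors and generators can be sketched with a handful of standard str methods. The selection here is illustrative, and the sample line is borrowed from the text used in the exercises.

```python
line = "Midnights so dreary, tired and weary"

# Accessors report a fact about the string.
assert line.startswith("Midnights")
assert line.find("dreary") == 13
assert line.find("raven") == -1        # -1 means "not found"
assert line.count("ear") == 2
assert "1994".isdigit()

# Generators build another object, usually a sequence, from the string.
assert line.split()[:3] == ["Midnights", "so", "dreary,"]
assert "key=value".partition("=") == ("key", "=", "value")
assert "a\nb\n".splitlines() == ["a", "b"]
```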
There is an older module named string. Almost all of the functions in this module are directly available as methods of the string type. The one remaining function of value is the maketrans() function, which creates a translation table to be used by the translate() method of a string.
maketrans(from, to). Return a translation table (a string 256 characters long) suitable for use in str.translate(). The from and to parameters must be strings of the same length. The table will assure that each character in from is mapped to the character in the same position in to.
The following example shows how to make and then apply a translation table.
>>> import string
>>> t= string.maketrans("aeiou","xxxxx")
>>> phrase= "now is the time for all good men to come to the aid of their party"
>>> phrase.translate( t )
'nxw xs thx txmx fxr xll gxxd mxn tx cxmx tx thx xxd xf thxxr pxrty'
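In Python 3 the maketrans() function is no longer in the string module; it is available as str.maketrans() instead. The sketch below, which is an assumption about how one might write version-neutral code rather than anything from the original example, picks whichever one is present.

```python
phrase = "now is the time for all good men"

try:
    # Python 2: the function lives in the string module.
    from string import maketrans
except ImportError:
    # Python 3: it is a static method of str instead.
    maketrans = str.maketrans

table = maketrans("aeiou", "xxxxx")
result = phrase.translate(table)
assert result == "nxw xs thx txmx fxr xll gxxd mxn"
print(result)
```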
The codecs module takes a different approach and has a number of built-in translations.
More importantly, the string module contains a number of definitions of the characters in the ASCII character set. These definitions serve as a central, formal repository for facts about the character set. Note that there are general definitions, applicable to Unicode character sets, different from the ASCII definitions.
| ascii_letters: | The set of all letters, essentially a union of ascii_lowercase and ascii_uppercase. |
|---|---|
| ascii_lowercase: | The lowercase letters in the ASCII character set: 'abcdefghijklmnopqrstuvwxyz' |
| ascii_uppercase: | The uppercase letters in the ASCII character set: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' |
| digits: | The digits used to make decimal numbers: '0123456789' |
| hexdigits: | The digits used to make hexadecimal numbers: '0123456789abcdefABCDEF' |
| letters: | This is the set of all letters, a union of lowercase and uppercase, which depends on the setting of the locale on your system. |
| lowercase: | This is the set of lowercase letters, and depends on the setting of the locale on your system. |
| octdigits: | The digits used to make octal numbers: '01234567' |
| printable: | All printable characters in the character set. This is a union of digits, letters, punctuation and whitespace. |
| punctuation: | All punctuation in the ASCII character set, this is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
| uppercase: | This is the set of uppercase letters, and depends on the setting of the locale on your system. |
| whitespace: | A collection of characters that cause spacing to happen. For ASCII this is '\t\n\x0b\x0c\r ' (tab, newline, vertical tab, formfeed, carriage return and space). |
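These definitions are most often used with the in operator to classify characters. The classify() function below is a hypothetical helper written for this illustration; it uses only the definitions that exist in both Python 2 and Python 3.

```python
import string

def classify(ch):
    """Return a rough category name for a one-character string."""
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    if ch in string.punctuation:
        return "punctuation"
    if ch in string.whitespace:
        return "whitespace"
    return "other"

kinds = [classify(ch) for ch in "A1, "]
assert kinds == ["letter", "digit", "punctuation", "whitespace"]
```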
Check Amount Writing.
Translate a number into the English phrase.
This example algorithm fragment is only to get you started. This shows how to pick off the digits from the right end of a number and assemble a resulting string from the left end of the string.
Note that the right-most two digits have special names, requiring some additional cases above and beyond the simplistic loop shown below. For example, 291 is “two hundred ninety one”, where 29 is “twenty nine”. The word for “2” changes, depending on the context.
As a practical matter, you should analyze the number by taking off three digits at a time; the expression (number % 1000) does this. You would then format the three digit number with words like “million”, “thousand”, etc.
English Words For An Amount, n
Initialization.
Set result ← "" (the empty string).
Set tc ← 0. This is the “tens counter” that shows what position we’re examining.
Loop.
While n > 0.
Get Right Digit. Set digit ← n % 10, the remainder when divided by 10.
Make Phrase. Translate digit to a string from “zero” to “nine”. Translate tc to a string from “” to “thousand”. This is tricky because the “teens” are special, where the “hundreds” and “thousands” are pretty simple.
Assemble Result. Prepend digit string and tc string to the left end of the result string.
Next Digit. Set n ← n // 10. Be sure to use the // integer division operator, or you’ll get floating-point results.
Increment tc by 1.
Result. Return result as the English translation of n.
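The digit-picking loop on its own can be sketched as follows. This is deliberately not a solution to the exercise: it names each digit separately and ignores the position counter, the teens and the tens. The names DIGIT_NAMES and spell_digits are invented for the illustration.

```python
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_digits(n):
    """Name each digit of a positive integer n, left to right."""
    result = ""
    while n > 0:
        digit = n % 10                        # right-most digit
        # Prepending keeps the words in reading order.
        result = DIGIT_NAMES[digit] + " " + result
        n = n // 10                           # drop the right-most digit
    return result.strip()

assert spell_digits(291) == "two nine one"
```

Turning “two nine one” into “two hundred ninety one” is where the position counter and the special cases come in.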
Roman Numerals.
This is similar to translating numbers to English. Instead we will translate them to Roman Numerals.
The algorithm is similar to Check Amount Writing (above). You will pick off successive digits, using % 10 and // 10 to gather the digits from right to left.
The rules for Roman Numerals involve using three pairs of symbols: one for ones and fives, one for tens and fifties, and one for hundreds and five hundreds. An additional symbol for thousands covers all the relevant bases.
When a number is followed by the same or smaller number, it means addition. “II” is two 1’s = 2. “VI” is 5 + 1 = 6.
When one number is followed by a larger number, it means subtraction. “IX” is 1 before 10 = 9. “IIX” isn’t allowed; this would be “VIII”.
For numbers from 1 to 9, the symbols are “I” and “V”, and the coding works like this: 1 = I, 2 = II, 3 = III, 4 = IV, 5 = V, 6 = VI, 7 = VII, 8 = VIII, 9 = IX.
The same rules work for numbers from 10 to 90, using “X” and “L”. Numbers from 100 to 900 use the symbols “C” and “D”. Numbers between 1000 and 4000 use “M”.
Here are some examples. 1994 = MCMXCIV, 1956 = MCMLVI, 3888= MMMDCCCLXXXVIII
Word Lengths.
Analyze the following block of text. You’ll want to break it into words on whitespace boundaries. Then you’ll need to discard all punctuation from before, after or within a word.
What’s left will be a sequence of words composed of ASCII letters. Compute the length of each word, and produce the sequence of digits. (No word is 10 or more letters long.)
Compare the sequence of word lengths with the value of math.pi.
Poe, E.
Near a Raven
Midnights so dreary, tired and weary,
Silently pondering volumes extolling all by-now obsolete lore.
During my rather long nap - the weirdest tap!
An ominous vibrating sound disturbing my chamber's antedoor.
"This", I whispered quietly, "I ignore".
This is based on http://www.cadaeic.net/cadenza.htm.
In Strings and Tuples we noted that string and tuple objects are immutable. They cannot be changed once they are created. Programmers experienced in other languages sometimes find this to be an odd restriction.
Two common questions that arise are how to expand a string and how to remove characters from a string.
Generally, we don’t expand or contract a string, we create a new string that is the concatenation of the original string objects. For example:
>>> a="abc"
>>> a=a+"def"
>>> a
'abcdef'
In effect, Python gives us string objects of arbitrary size. It does this by dynamically creating a new string instead of modifying an existing string.
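We can see that concatenation builds a new object rather than changing the old one. The sketch below keeps a second name, original, bound to the first string so that it can be examined afterwards; the names are chosen for the example.

```python
a = "abc"
original = a          # a second name for the same string object

a = a + "def"         # builds a new object and rebinds the name a

assert a == "abcdef"
assert original == "abc"      # the first object is untouched
assert a is not original

# Assigning to a position is refused, because a str cannot be changed.
try:
    a[0] = "X"
    changed = True
except TypeError:
    changed = False
assert not changed
```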
Some programmers who have extensive experience in other languages will ask if creating a new string from the original string is the most efficient way to accomplish this. Or they suggest that it would be “simpler” to allow a mutable string for this kind of concatenation. The short answer is that Python’s storage management makes this use of immutable strings the simplest and most efficient.
Responses to the immutability of tuple and mutability of list vary, including some of the following frequently asked questions.
Since a list does everything a tuple does and is mutable, why bother with tuple?
Immutable tuple objects are more efficient than variable-length list objects for some operations. Once the tuple is created, it can only be examined. When it is no longer referenced, the normal Python garbage collection will release the storage for the tuple.
Most importantly, a tuple can be reliably hashed to a single value. This makes it a usable key for a mapping.
Many applications rely on fixed-length tuples. A program that works with coordinate geometry in two dimensions may use 2-tuples to represent (x, y) coordinate pairs. Another example might be a program that works with colors as 3-tuples, (r, g, b), of red, green and blue levels. A variable-length list is not appropriate for these kinds of fixed-length tuple.
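Both points, hashing and fixed length, can be shown in a few lines. The coordinate and color values below are made up for the illustration.

```python
# 2-tuples of (x, y) coordinates work as dictionary keys.
labels = {(0, 0): "origin", (1, 0): "unit x"}
assert labels[(0, 0)] == "origin"

# 3-tuples of (r, g, b) levels unpack cleanly because the length is fixed.
r, g, b = (255, 128, 0)
assert (r, g, b) == (255, 128, 0)

# A list is mutable, so it cannot be hashed and cannot be a key.
try:
    labels[[0, 1]] = "unit y"
    list_key_ok = True
except TypeError:
    list_key_ok = False
assert not list_key_ok
```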
Wouldn’t it be “more efficient” to allow mutable strings?
There are a number of axes for efficiency: the two most common are time and memory use.
A mutable string could use less memory. However, this is only true in the benign special case where we are only replacing or shrinking the string within a fixed-size buffer. If the string expands beyond the size of the buffer the program must either crash with an exception, or it must switch to dynamic memory allocation. Python simply uses dynamic memory allocation from the start. C programs often have serious security problems created by attempting to access memory outside of a string buffer. Python avoids this problem by using dynamic allocation of immutable string objects.
Processing a mutable string could use less time. In the cases of changing a string in place or removing characters from a string, a fixed-length buffer would require somewhat less memory management overhead. Rather than indict Python for offering immutable strings, this leads to some productive thinking about string processing in general.
In text-intensive applications we may want to avoid creating separate string objects. Instead, we may want to create a single string object (the input buffer) and work with slices of that buffer. Rather than create strings, we can create slice objects that describe starting and ending offsets within the one-and-only input buffer.
If we then need to manipulate these slices of the input buffer, we can create new string objects only as needed. In this case, our application program is designed for efficiency. We use the Python string objects when we want flexibility and simplicity.
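One way to work with offsets instead of copies is the built-in slice() object. The sketch below is an assumption about how such a design might look, not a recommendation from the text: it records where each line of a small buffer sits and creates a string only when a field is actually needed. The names text_buffer and fields are hypothetical.

```python
text_buffer = "Poe, E.\nNear a Raven\nMidnights so dreary, tired and weary,"

# Record where each line sits instead of copying it out.
fields = []
start = 0
while start < len(text_buffer):
    end = text_buffer.find("\n", start)
    if end == -1:
        end = len(text_buffer)
    fields.append(slice(start, end))
    start = end + 1

# A new string is created only when a particular field is needed.
assert len(fields) == 3
assert text_buffer[fields[0]] == "Poe, E."
assert text_buffer[fields[1]] == "Near a Raven"
```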