Strings

We’ll look at the two string classes from a number of viewpoints: semantics, literal values, operations, comparison operators, built-in functions, methods and modules. Additionally, we have a digression on the immutability of string objects.

String Semantics

A String (the formal class name is str) is an immutable sequence of ASCII characters.

A Unicode String (unicode) is an immutable sequence of Unicode characters.

Since a string (either str or unicode) is a sequence, all of the common operations on sequences apply. We can concatenate string objects together and select characters from a string. When we select a slice from a string, we’ve extracted a substring.

An individual character is simply a string of length one.

Important

Python 3.0

The Python 2 str class, which is limited to single-byte ASCII characters, does two separate things: it represents text as well as a collection of bytes.

The text features of str gain the features from the Unicode String class, unicode. The new str class will represent strings of text, irrespective of the underlying encoding. It can be ASCII, UTF-8, UTF-16 or any other encoding.

The “array of bytes” features of the Python 2 str class will be moved into a new class, bytes. This new class will implement simple sequences of bytes and will support conversion between bytes and strings using encoding and decoding functions.

String Literal Values

A str is a sequence of ASCII characters. The literal value for a str is written by surrounding the value with quotes or apostrophes. There are several variations to provide some additional features.

Basic String:

Strings are enclosed in matching quotes (") or apostrophes ('). A string enclosed in quotes (") can contain apostrophes ('); similarly, a string enclosed in apostrophes (') can contain quotes ("). A basic str must be completed on a single line, or continued with a \ as the very last character of a line.

Examples:

"consultive"
'syncopated'
"don't do that"
'"Okay," he said.'
Multi-Line String:

Also called “Triple-Quoted String”.

A multi-line str is enclosed in triple quotes (""") or triple apostrophes ('''). It continues on across line boundaries until the concluding triple-quote or triple-apostrophe.

Examples:

"""A very long
string"""

'''SELECT *
FROM THIS, THAT
WHERE THIS.KEY = THAT.FK
AND THIS.CODE = 'Active'
'''
Unicode String:

A Unicode String uses the above quoting rules, but prefaces the quote with (u"), (u'), (u""") or (u''').

Unicode is the Universal Character Set; depending on the encoding, each character requires from 1 to 4 bytes of storage. ASCII is a 7-bit character set; each of the 128 ASCII characters fits in a single byte of storage. Unicode permits any character in any of the languages in common use around the world.

A special \uxxxx escape sequence is used for Unicode characters that don’t happen to occur on your ASCII keyboard.

Examples:

u'\u65e5\u672c'
u"All ASCII"
Raw String:

A Raw String uses the above quoting rules, but prefaces the quote with (r"), (r'), (r""") or (r''').

The backslash characters (\) are not interpreted as escapes by Python, but are left as is. This is handy for Windows file names that contain \. It is also handy for regular expressions that make extensive use of backslashes.

Examples:

newline_literal= r'\n'
filename= r"C:\mumbo\jumbo"
pattern= r"(\*\S+\*)"

The newline_literal is a two character string, not the newline character.
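As a small illustration of why raw strings suit regular expressions, the sketch below compares a raw pattern with the same pattern written using doubled backslashes. The sample text is made up for the example.

```python
import re

# A raw string keeps every backslash, so the pattern reads as the regex engine sees it.
raw_pattern = r"\*\S+\*"
# Without the r prefix, each backslash has to be doubled to survive as a backslash.
escaped_pattern = "\\*\\S+\\*"

sample = "some *bold* and *loud* words"   # hypothetical sample text
matches = re.findall(raw_pattern, sample)
```

Both spellings produce the same seven-character pattern; the raw form is simply easier to read and to get right.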

Outside of raw strings, non-printing characters and Unicode characters that aren’t found on your keyboard are created using escapes. A table of escapes is provided below. These are Python representations for unprintable ASCII characters. They’re called escapes because the \ is an escape from the usual meaning of the following character.

Escape Meaning
\\ Backslash (\)
\' Apostrophe (')
\" Quote (")
\a Audible Signal; the ASCII code called BEL. Some OS’s translate this to a screen flash or ignore it completely.
\b Backspace (ASCII BS)
\f Formfeed (ASCII FF). On a paper-based printer, this would move to the top of the next page.
\n Linefeed (ASCII LF), also known as newline. This would move the paper up one line.
\r Carriage Return (ASCII CR). On a paper based printer, this returned the print carriage to the start of the line.
\t Horizontal Tab (ASCII TAB)
\ooo An ASCII character with the given octal value. The ooo is any octal number.
\xhh An ASCII character with the given hexadecimal value. The x is required. The hh is any hex number.
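The following sketch shows a few of these escapes in use. Each escape sequence occupies several characters in the source text but produces a single character in the string.

```python
tab = "\t"
newline = "\n"
octal_a = "\101"      # octal 101 is decimal 65, the letter A
hex_a = "\x41"        # hexadecimal 41 is also decimal 65
backslash = "\\"      # two characters in the source, one backslash in the string

# Every one of these strings has a length of exactly one.
lengths = [len(c) for c in (tab, newline, octal_a, hex_a, backslash)]
```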

Adjacent Strings. Note that adjacent string objects are automatically concatenated to make a single string.

"ab" "cd" "ef" is the same as "abcdef".

The most common use for this is the following:

msg = "A very long " \
"message, which didn't fit on " \
"one line."

Unicode Characters. For Unicode, a special \uxxxx escape is provided. This requires the four digit Unicode character identification.

For example, “日本” is made up of Unicode characters U+65e5 and U+672c. In Python, we write this string as u'\u65e5\u672c'.

There are a variety of Unicode encoding schemes, for example, UTF-8, UTF-16 and LATIN-1. The codecs module provides mechanisms for encoding and decoding Unicode Strings.
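As a rough sketch of what the codecs module does, the example below encodes the two-character string from above in two different schemes and decodes it again. It uses only codecs.encode() and codecs.decode(); the variable names are illustrative.

```python
import codecs

text = u"\u65e5\u672c"                           # the characters U+65E5 and U+672C

utf8_bytes = codecs.encode(text, "utf-8")        # three bytes per character here
utf16_bytes = codecs.encode(text, "utf-16-le")   # two bytes per character, no byte-order mark

round_trip = codecs.decode(utf8_bytes, "utf-8")  # back to the original text

try:
    codecs.encode(text, "latin-1")               # LATIN-1 has no codes for these characters
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

The same text occupies a different number of bytes in each encoding, and an encoding that lacks a character raises an exception under the default strict error handling.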

String Operations

There are a number of operations on str objects, operations which create strs and operations which create other objects from strs.

There are three operations (+ , * , [ ]) that work with all sequences (including strs) and a unique operation, %, that can be performed only with str objects.

The + operator creates a new string as the concatenation of the arguments.

>>> "hi " + 'mom'
'hi mom'

The * operator between str and numbers (number * str or str * number) creates a new str that is a number of repetitions of the input str.

>>> print 3*"cool!"
cool!cool!cool!

The [ ] operator can extract a single character or a slice from the string. There are two forms: the single-item form and the slice form.

  • The single item format is string [ index ]. Characters are numbered from 0 to len(string)-1. Characters are also numbered in reverse from -len(string) to -1.

  • The slice format is string [ start : end ]. Characters from start to end -1 are chosen to create a new str as a slice of the original str; there will be end-start characters in the resulting str.

    If start is omitted it is the beginning of the string (position 0).

    If end is omitted, the slice runs through the last character of the string (as if end were len(string)).

    Yes, you can omit both (someString[:]) to make a copy of a string.

>>> s="adenosine"
>>> s[2]
'e'
>>> s[:5]
'adeno'
>>> s[5:]
'sine'
>>> s[-5:]
'osine'
>>> s[:-5]
'aden'
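The indexing and slicing rules can be checked with a few more expressions on the same string. This is a minimal sketch; the variable names are only for illustration.

```python
s = "adenosine"

first, last = s[0], s[-1]            # 'a' and 'e'
same_char = s[len(s) - 1] == s[-1]   # the last valid index is len(s) - 1
middle = s[2:6]                      # positions 2, 3, 4 and 5: four characters
copy = s[:]                          # both ends omitted
rejoined = s[:5] + s[5:]             # splitting at any position loses nothing
```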

The String Formatting Operation, %. The % operator is sometimes called string interpolation, since it interpolates literal text and converted values. We prefer to call it string formatting, since that is a more apt description. Much of the formatting is taken straight from the C library’s printf() function.

This operator has three forms. You can use % with a str and a single value, a str and a tuple, or a str and a dict. We’ll cover tuple and dict in detail later.

The string on the left-hand side of % contains a mixture of literal text plus conversion specifications. A conversion specification begins with %. For example, integers are converted with %i. Each conversion specification will use a corresponding value from the tuple. The first conversion uses the first value of the tuple, the second conversion uses the second value from the tuple.

For example:

import random
d1, d2 = random.randrange(1,7), random.randrange(1,7)
r= "die 1 shows %i, and die 2 shows %i" % ( d1, d2 )

The first %i will convert the value for d1 to a string and insert the value, the second %i will convert the value for d2 to a string. The % operator returns the new string based on the format, with each conversion specification replaced with the appropriate values.
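The three forms of the % operator can be seen side by side in the sketch below. The keys name and roll, and the values used, are hypothetical. In the dict form, each conversion specification names its key in parentheses between the % and the code.

```python
single = "total: %i" % 7                      # one value
from_tuple = "%i of %i" % (3, 10)             # values taken from the tuple in order
from_dict = "%(name)s rolled %(roll)i" % {"name": "Pat", "roll": 4}   # values looked up by key
```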

Conversion Specifications. Each conversion specification has from one to four elements, following this pattern:

% [ flags ][ width [. precision ]] code

The % and the final code in each conversion specification are required. The other elements are optional.

The optional flags element can have any combination of the following values:

-
Left adjust the converted value in a field that has a length given by the width element. The default is right adjustment.
+
Show positive signs (sign will be + or -). The default is to show negative signs only.
⎵ (a space)
Show positive signs with a space (sign will be ⎵ or -). The default is negative signs only.
#
Use the Python literal rules (0 for octal, 0x for hexadecimal, etc.) The default is decoration-free notation.
0
Zero-fill the field that has a length given by the width element. The default is to space-fill the field. This doesn’t make a lot of sense with the - (left-adjust) flag.

The optional width element is a number that specifies the total number of characters for the field, including signs and decimal points. If omitted, the width is just big enough to hold the output number. If a * is used instead of a number, an item from the tuple of values is used as the width of the field. For example, "%*i" % ( 3, d1 ) uses the value 3 from the tuple as the field width and d1 as the value to convert to a string.

The optional precision element (which must be preceded by a dot, . if it is present) has a few different purposes. For numeric conversions, this is the number of digits to the right of the decimal point. For string conversions, this is the maximum number of characters to be printed; longer strings will be truncated. If a * is used instead of a number, an item from the tuple of values is used as the precision of the conversion. For example, "%*.*f" % ( 6, 2, avg ) uses the value 6 from the tuple as the field width, the value 2 from the tuple as the precision and avg as the value.

The standard conversion rules also permit a long or short indicator: l or h. These are tolerated by Python so that these formats will be compatible with C, but they have no effect. They reflect internal representation considerations for C programming, not external formatting of the data.
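The flags, width and precision elements are easier to follow with concrete values. The sketch below applies each one to a small integer or float; the values are arbitrary.

```python
value = 42

right = "%5i" % value                 # default right adjustment in a field of 5
left = "%-5i|" % value                # left adjustment; the | marks the end of the field
plus = "%+i" % value                  # explicit positive sign
space = "% i" % value                 # a space in the sign position
zeros = "%05i" % value                # zero-filled field
hexed = "%#x" % 255                   # Python-style hexadecimal literal
fixed = "%06.2f" % 3.14159            # width 6, precision 2, zero-filled
starred = "%*.*f" % (6, 2, 3.14159)   # width and precision taken from the tuple
clipped = "%.3s" % "abcdef"           # precision truncates a string
```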

The required one-letter code element specifies the conversion to perform. The codes are listed below.

%
Not a conversion, this creates a % in the resulting str. Use %% to put a % in the output str.
c
Convert a single-character str. This will also convert an integer value to the corresponding ASCII character. For example, "%c" % ( 65, ) results in "A".
s
Convert a str. This will convert non- str objects by implicitly calling the str() function.
r
Call the repr() function, and insert that value.
i d
Convert a numeric value, showing ordinary decimal output. The code i stands for integer, d stands for decimal. They mean the same thing; but it’s hard to reach a consensus on which is “correct”.
u
Convert an unsigned number. While relevant to C programming, this is the same as the i or d format conversion.
o
Convert a numeric value, showing the octal representation. %#o gets the Python-style value with a leading zero. This is similar to the oct() function.
x X
Convert a numeric value, showing the hexadecimal representation. %#X gets the Python-style value with a leading 0X; %#x gets the Python-style value with a leading 0x. This is similar to the hex() function.
e E
Convert a numeric value, showing scientific notation. %e produces ±d.ddd e ±xx, %E produces ±d.ddd E ±xx.
f F
Convert a numeric value, using ordinary decimal notation. In case the number is gigantic, this will switch to %g or %G notation.
g G
“Generic” floating-point conversion. If the exponent of the value is at least -4 and less than the precision element, the %f format is used. If the exponent is less than -4, or not less than the precision, the %e or %E format is used.

Here are some examples.

"%i: %i win, %i loss, %6.3f" % (count,win,loss,float(win)/loss)

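This example does four conversions: three simple integer conversions and one floating-point conversion with a width of 6 and 3 digits of precision. The floating-point value is laid out like -0.000: a position for the sign, one digit, the decimal point and three decimal places. The rest of the string is literally included in the output.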

"Spin %3i: %2i, %s" % (spin,number,color)

This example does three conversions: one number is converted into a field with a width of 3, another converted with a width of 2, and a string is converted, using as much space as the string requires.

>>> a=6.02E23
>>> "%e" % a
'6.020000e+23'
>>> "%E" % a
'6.020000E+23'
>>>

This example shows simple conversion of a floating-point number to the default scientific notation which has a width of 12 and a precision of 6.
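A few of the other conversion codes are sketched below with arbitrary values. The %g cases show the switch between ordinary and scientific notation at the default precision of 6.

```python
as_char = "%c" % 65                 # integer converted to the matching character
as_repr = "%r" % "hi"               # repr() keeps the quotes
percent = "%i%%" % 50               # %% produces a single %
hex_lower, hex_upper = "%x" % 255, "%X" % 255

small = "%g" % 0.00001234           # exponent below -4: scientific notation
plain = "%g" % 1234.5               # exponent in range: ordinary notation
large = "%g" % 123456789.0          # exponent 8 is not less than the precision 6
```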

String Comparison Operations

The standard comparisons (<, <=, >, >=, ==, !=) apply to str objects. These comparisons use the standard character-by-character comparison rules for ASCII or Unicode.

There are two additional comparisons: in and not in. These check to see if a substring occurs in a longer string. The in operator returns a True when the substring is found, False if the substring is not found. The not in operator returns True if the substring is not found.

>>> 'a' in 'xyzzyabcxyzzy'
True
>>> 'abc' in 'xyzzyabc'
True

Don’t be fooled by the fact that string representations of integers don’t seem to sort properly. String comparison does not magically recognize that the strings are representations of numbers. It’s simple “alphabetical order” rules applied to digits.

>>> '100' < '25'
True

This is true because '1' < '2'.
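When strings of digits need to be ordered by numeric value, one common approach is to convert them for the comparison only. The sketch below assumes every string is a valid integer; the list is made up for the example.

```python
labels = ["100", "25", "9"]

as_text = sorted(labels)              # character-by-character comparison
as_numbers = sorted(labels, key=int)  # compare the integer values, keep the strings
```

The first result keeps '100' ahead of '25' and '9'; the second orders the same strings as 9, 25, 100.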

String Statements

The for statement will step through all elements of a sequence. In the case of a string, it will step through each character of the string.

For example:

for letter in "forestland":
    print letter

This will print each letter of the given string.

String Built-in Functions

The following built-in functions are relevant to str manipulation.

chr(i) → String
Return a str of one character with ordinal i. Note that 0 ≤ i < 256; only values below 128 are proper ASCII characters.
unichr(u) → Unicode String
Return a Unicode String (unicode) of one character with ordinal u. 0 ≤ u < 65536.
ord(c) → integer
Return the integer ordinal of a one character str. This works for any character, including Unicode characters.
unicode(string[, encoding][, errors]) → Unicode String

Creates a new Unicode object from the given encoded string. encoding defaults to the current default string encoding. errors defines the error handling, defaults to ‘strict’.

The unicode() function decodes the string from the given external representation. In Python 2 the default encoding is normally ‘ascii’, with ‘strict’ error handling.

Choices for errors are ‘strict’, ‘replace’ and ‘ignore’. Strict raises an exception for unrecognized characters, replace substitutes the Unicode replacement character ( \uFFFD ) and ignore skips over invalid characters.

The codecs and unicodedata modules provide more functions for working with Unicode.

>>> unicode("hi mom","UTF-16")
u'\u6968\u6d20\u6d6f'
>>> unicode("hi mom","UTF-8")
u'hi mom'
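The three choices for errors can be compared with a byte string that contains one byte outside the ASCII range. This sketch uses the decode() method of the byte string rather than unicode(), so that it should also run under Python 3; the sample bytes are made up.

```python
data = b"caf\xe9"     # four bytes; the last one is not valid ASCII

replaced = data.decode("ascii", "replace")   # the bad byte becomes U+FFFD
ignored = data.decode("ascii", "ignore")     # the bad byte is dropped

try:
    data.decode("ascii", "strict")           # the default handling raises an exception
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```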

Important

Python 3

The ord(), chr(), unichr() and unicode() functions will be simplified in Python 3.

Python 3 no longer separates ASCII from Unicode strings. These functions will all implicitly work with Unicode strings. Note that the UTF-8 encoding of Unicode overlaps with ASCII, so this simplification to use Unicode will not significantly disrupt programs that work with ASCII files.

Several important functions were defined earlier in String Conversion Functions.

  • repr(). Returns a canonical string representation of the object. For most object types, eval(repr(object)) == object.

    For simple numeric types, the result of repr() isn’t very interesting. For more complex types, however, it often reveals details of their structure.

    >>> a="""a very
    ... long string
    ... in multiple lines
    ... """
    >>> repr(a)
    "'a very \\nlong string \\nin multiple lines\\n'"
    

    This representation shows the newline characters ( \n ) embedded within the triple-quoted string.

    Important

    Python 3

    The “reverse quotes” (`a`) work like repr(a). The reverse quote syntax is rarely used, and will be dropped in Python 3.

  • str(). Return a nice string representation of the object. If the argument is a string, the return value is the same object.

    >>> a= str(355.0/113.0)
    >>> a
    '3.14159292035'
    >>> len(a)
    13
    

Some other functions which apply to strings as well as other sequence objects.

  • len(). For strings, this function returns the number of characters.

    >>> len("abcdefg")
    7
    >>> len(r"\n")
    2
    >>> len("\n")
    1
    
  • max(). For strings, this function returns the maximum character.

  • min(). For strings, this function returns the minimum character.

  • sorted(). Iterate through the string’s characters in sorted order. This expands the string into an explicit list of individual characters.

    >>> sorted( "malapertly" )
    ['a', 'a', 'e', 'l', 'l', 'm', 'p', 'r', 't', 'y']
    >>> "".join( sorted( "malapertly" ) )
    'aaellmprty'
    
  • reversed(). Iterate through the string’s characters in reverse order. This creates an iterator. The iterator can be used with a variety of functions or statements.

    >>> reversed( "malapertly" )
    <reversed object at 0x600230>
    >>> "".join( reversed( "malapertly" )  )
    'yltrepalam'
    

String Methods

A string object has a number of method functions. These can be grouped arbitrarily into transformations, which create new strings from old, and information, which returns a fact about a string.

The following string transformation functions create a new string object from an existing string.

str.capitalize() → string
Create a copy of the string with only its first character capitalized.
str.center(width) → string
Create a copy of the string centered in a string of length width. Padding is done using spaces.
str.encode(encoding[, errors]) → string
Return an encoded version of string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. Default is ‘strict’ meaning that encoding errors raise a ValueError. Other possible values are ‘ignore’ and ‘replace’.
str.expandtabs([tabsize]) → string
Return a copy of string where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 characters is assumed.
str.join(sequence) → string
Return a string which is the concatenation of the strings in the sequence. Each separator between elements is a copy of the given string object.
str.ljust(width) → string
Return a copy of string left justified in a string of length width. Padding is done using spaces.
str.lower() → string
Return a copy of string converted to lowercase.
str.lstrip() → string
Return a copy of string with leading whitespace removed.
str.replace(old, new[, count]) → string
Return a copy of string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
str.rjust(width) → string
Return a copy of string right justified in a string of length width. Padding is done using spaces.
str.rstrip() → string
Return a copy of string with trailing whitespace removed.
str.strip() → string
Return a copy of string with leading and trailing whitespace removed.
str.swapcase() → string
Return a copy of string with uppercase characters converted to lowercase and vice versa.
str.title() → string
Return a copy of string with words starting with uppercase characters, all remaining characters in lowercase.
str.translate(table[, deletechars]) → string

Return a copy of the string, where all characters occurring in the optional argument deletechars are removed, and the remaining characters have been mapped through the given translation table. The table must be a string of length 256, providing a translation for each 1-byte ASCII character.

The translation tables are built using the string.maketrans() function in the string module.

str.upper() → string
Return a copy of string converted to uppercase.
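A handful of these transformations are sketched below on a sample string. Each call returns a new string and leaves the original untouched.

```python
title = "  the quick brown fox  "

stripped = title.strip()                        # leading and trailing spaces removed
capitalized = stripped.capitalize()             # only the first character in uppercase
titled = stripped.title()                       # every word starts with uppercase
centered = "fox".center(7)                      # two spaces of padding on each side
joined = ", ".join(["a", "b", "c"])             # the separator is the string itself
replaced_once = "a-b-c".replace("-", "+", 1)    # only the first occurrence is replaced
unchanged = title == "  the quick brown fox  "  # the original is never modified
```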

The following accessor methods provide information about a string.

str.count(sub[, start][, end]) → integer
Return the number of occurrences of substring sub in string. If start or end are present, these have the same meanings as a slice string[start:end].
str.endswith(suffix[, start][, end]) → boolean
Return True if string ends with the specified suffix, otherwise return False. The suffix can be a single string or a tuple of strings to try. If start or end are present, these have the same meanings as a slice string[start:end].
str.find(sub[, start][, end]) → integer
Return the lowest index in string where substring sub is found. Return -1 if the substring is not found. If start or end are present, these have the same meanings as a slice string[start:end].
str.index(sub[, start][, end]) → integer
Return the lowest index in string where substring sub is found. Raise ValueError if the substring is not found. If start or end are present, these have the same meanings as a slice string[start:end].
str.isalnum() → boolean
Return True if all characters in string are alphanumeric and there is at least one character in string; False otherwise.
str.isalpha() → boolean
Return True if all characters in string are alphabetic and there is at least one character in string; False otherwise.
str.isdigit() → boolean
Return True if all characters in string are digits and there is at least one character in string; False otherwise.
str.islower() → boolean
Return True if all characters in string are lowercase and there is at least one cased character in string; False otherwise.
str.isspace() → boolean
Return True if all characters in string are whitespace and there is at least one character in string, False otherwise.
str.istitle() → boolean
Return True if string is titlecased. Uppercase characters may only follow uncased characters (whitespace, punctuation, etc.) and lowercase characters only cased ones, False otherwise.
str.isupper() → boolean
Return True if all characters in string are uppercase and there is at least one cased character in string; False otherwise.
str.rfind(sub[, start][, end]) → integer
Return the highest index in string where substring sub is found. Return -1 if the substring is not found. If start or end are present, these have the same meanings as a slice string[start:end].
str.rindex(sub[, start][, end]) → integer
Return the highest index in string where substring sub is found. Raise ValueError if the substring is not found. If start or end are present, these have the same meanings as a slice string[start:end].
str.startswith(prefix[, start][, end]) → boolean
Return True if string starts with the specified prefix, otherwise return False. The prefix can be a single string or a tuple of strings to try. If start or end are present, these have the same meanings as a slice string[start:end].
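The difference between find() and index(), and the optional start argument, can be seen in a short sketch. The sample string is the one used in the slicing examples.

```python
s = "adenosine"

count_n = s.count("n")              # occurrences of 'n'
first_n = s.find("n")               # lowest index
last_n = s.rfind("n")               # highest index
missing = s.find("z")               # -1 signals "not found"

try:
    s.index("z")                    # index() raises instead of returning -1
    index_raised = False
except ValueError:
    index_raised = True

ends = s.endswith(("ine", "ose"))   # a tuple of candidate suffixes
starts = s.startswith("den", 1)     # comparison begins at position 1
```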

The following methods create another kind of object, usually a sequence, from a string.

str.partition(separator) → 3-tuple
Return three values: the text prior to the first occurrence of separator in string, the separator itself, and the text after the first occurrence of the separator. If the separator doesn’t occur, all of the input string is in the first element of the 3-tuple; the other two elements are empty strings.
str.split([separator[, maxsplit]]) → sequence
Return a list of the words in the string, using separator as the delimiter. If maxsplit is given, at most maxsplit splits are done. If separator is not specified, any run of whitespace characters is a separator.
str.splitlines([keepends]) → sequence
Return a list of the lines in string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and set to True.
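These three methods are sketched below with made-up input strings, including the cases where the separator is missing or omitted.

```python
key, sep, value = "name = Pat = Smith".partition(" = ")   # splits at the first separator only
no_sep = "no separator".partition("=")                    # separator absent
words = "  now is   the time ".split()                    # runs of whitespace separate words
fields = "a,b,c,d".split(",", 2)                          # at most two splits
lines = "one\ntwo\n".splitlines()                         # line breaks dropped
kept = "one\ntwo\n".splitlines(True)                      # line breaks kept
```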

String Modules

There is an older module named string. Almost all of the functions in this module are directly available as methods of the string type. The one remaining function of value is the maketrans() function, which creates a translation table to be used by the translate() method of a string.

maketrans(from, to) → string

Return a translation table (a string 256 characters long) suitable for use in str.translate(). The from and to parameters must be strings of the same length. The table will assure that each character in from is mapped to the character in the same position in to.

The following example shows how to make and then apply a translation table.

>>> import string
>>> t= string.maketrans("aeiou","xxxxx")
>>> phrase= "now is the time for all good men to come to the aid of their party"

>>> phrase.translate( t )
'nxw xs thx txmx fxr xll gxxd mxn tx cxmx tx thx xxd xf thxxr pxrty'
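In Python 3 the maketrans() function is no longer in the string module; it is available as str.maketrans() instead. The following sketch picks whichever one exists, on the assumption that the code may need to run under either version.

```python
import string

# Python 2 provides string.maketrans(); Python 3 provides str.maketrans().
maketrans = getattr(string, "maketrans", None) or str.maketrans

table = maketrans("aeiou", "xxxxx")
result = "good men".translate(table)   # each vowel is mapped to 'x'
```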

The codecs module takes a different approach and has a number of built-in translations.

More importantly, the string module contains a number of definitions of the characters in the ASCII character set. These definitions serve as a central, formal repository for facts about the character set. Note that there are general definitions, applicable to Unicode character sets, different from the ASCII definitions.

ascii_letters:
 The set of all letters, essentially a union of ascii_lowercase and ascii_uppercase.
ascii_lowercase:
 The lowercase letters in the ASCII character set: 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase:
 The uppercase letters in the ASCII character set: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
digits:
 The digits used to make decimal numbers: '0123456789'
hexdigits:
 The digits used to make hexadecimal numbers: '0123456789abcdefABCDEF'
letters:
 This is the set of all letters, a union of lowercase and uppercase, which depends on the setting of the locale on your system.
lowercase:
 This is the set of lowercase letters, and depends on the setting of the locale on your system.
octdigits:
 The digits used to make octal numbers: '01234567'
printable:
 All printable characters in the character set. This is a union of digits, letters, punctuation and whitespace.
punctuation:
 All punctuation in the ASCII character set: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
uppercase:
 This is the set of uppercase letters, and depends on the setting of the locale on your system.
whitespace:
 A collection of characters that cause spacing to happen. For ASCII this is '\t\n\x0b\x0c\r ' (tab, linefeed, vertical tab, formfeed, carriage return and space).
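These definitions are handy for membership tests. The two helper functions below are hypothetical names chosen for this sketch; they use only the hexdigits and punctuation definitions.

```python
import string

def is_hex(text):
    """Return True if text is non-empty and every character is a hexadecimal digit."""
    return len(text) > 0 and all(c in string.hexdigits for c in text)

def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return "".join(c for c in text if c not in string.punctuation)
```

For example, is_hex("1aF") is True, is_hex("xyz") is False, and strip_punctuation('"This", I whispered') gives 'This I whispered'.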

String Exercises

  1. Check Amount Writing.

    Translate a number into the English phrase.

    This example algorithm fragment is only to get you started. This shows how to pick off the digits from the right end of a number and assemble a resulting string from the left end of the string.

    Note that the right-most two digits have special names, requiring some additional cases above and beyond the simplistic loop shown below. For example, 291 is “two hundred ninety one”, where 29 is “twenty nine”. The word for “2” changes, depending on the context.

    As a practical matter, you should analyze the number by taking off three digits at a time, the expression (number % 1000) does this. You would then format the three digit number with words like “million”, “thousand”, etc.

    English Words For An Amount, n

    1. Initialization.

      Set result ← "" (the empty string).

      Set tc ← 0. This is the “tens counter” that shows what position we’re examining.

    2. Loop. While n > 0.

      1. Get Right Digit. Set digit ← n % 10, the remainder when divided by 10.

      2. Make Phrase. Translate digit to a string from “zero” to “nine”. Translate tc to a string from “” to “thousand”. This is tricky because the “teens” are special, where the “hundreds” and “thousands” are pretty simple.

      3. Assemble Result. Prepend digit string and tc string to the left end of the result string.

      4. Next Digit. Set n ← ⌊n ÷ 10⌋. Be sure to use the // integer division operator, or you’ll get floating-point results.

        Increment tc by 1.

    3. Result. Return result as the English translation of n.

  2. Roman Numerals.

    This is similar to translating numbers to English. Instead we will translate them to Roman Numerals.

    The algorithm is similar to Check Amount Writing (above). You will pick off successive digits, using %10 and //10 to gather the digits from right to left.

    The rules for Roman Numerals involve using three pairs of symbols: for ones and fives, tens and fifties, hundreds and five hundreds. An additional symbol for thousands covers all the relevant bases.

    When a number is followed by the same or smaller number, it means addition. “II” is two 1’s = 2. “VI” is 5 + 1 = 6.

    When one number is followed by a larger number, it means subtraction. “IX” is 1 before 10 = 9. “IIX” isn’t allowed, this would be “VIII”.

    For numbers from 1 to 9, the symbols are “I” and “V”, and the coding works like this.

    1. “I”
    2. “II”
    3. “III”
    4. “IV”
    5. “V”
    6. “VI”
    7. “VII”
    8. “VIII”
    9. “IX”

    The same rules work for numbers from 10 to 90, using “X” and “L”. For numbers from 100 to 900, using the symbols “C” and “D”. For numbers between 1000 and 4000, using “M”.

    Here are some examples. 1994 = MCMXCIV, 1956 = MCMLVI, 3888= MMMDCCCLXXXVIII

  3. Word Lengths.

    Analyze the following block of text. You’ll want to break it into words on whitespace boundaries. Then you’ll need to discard all punctuation from before, after or within a word.

    What’s left will be a sequence of words composed of ASCII letters. Compute the length of each word, and produce the sequence of digits. (No word is 10 or more letters long.)

    Compare the sequence of word lengths with the value of math.pi.

    Poe, E.
    Near a Raven
    
    Midnights so dreary, tired and weary,
    Silently pondering volumes extolling all by-now obsolete lore.
    During my rather long nap - the weirdest tap!
    An ominous vibrating sound disturbing my chamber's antedoor.
    "This", I whispered quietly, "I ignore".

    This is based on http://www.cadaeic.net/cadenza.htm.
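The simplistic loop in the Check Amount Writing exercise above can be sketched as follows. This is only the starting fragment, not a solution: the function name, the word lists, and the use of “ten” as the name of the second position are assumptions made for the sketch, and it deliberately ignores the special cases for the teens and tens. It handles numbers of at most four digits.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
PLACES = ["", "ten", "hundred", "thousand"]   # assumed names for positions 0 to 3

def naive_amount_words(n):
    """Spell out n digit by digit, right to left (no teens or tens handling)."""
    result = ""                  # Initialization
    tc = 0                       # the "tens counter"
    while n > 0:                 # Loop
        digit = n % 10           # Get Right Digit
        phrase = DIGITS[digit]   # Make Phrase
        if PLACES[tc]:
            phrase = phrase + " " + PLACES[tc]
        result = (phrase + " " + result).strip()   # Assemble Result on the left end
        n = n // 10              # Next Digit
        tc += 1
    return result
```

With this fragment, naive_amount_words(291) gives 'two hundred nine ten one', which shows exactly where the extra cases for “ninety” are needed.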

Digression on Immutability of Strings

In Strings and Tuples we noted that string and tuple objects are immutable. They cannot be changed once they are created. Programmers experienced in other languages sometimes find this to be an odd restriction.

Two common questions that arise are how to expand a string and how to remove characters from a string.

Generally, we don’t expand or contract a string, we create a new string that is the concatenation of the original string objects. For example:

>>> a="abc"
>>> a=a+"def"
>>> a
'abcdef'

In effect, Python gives us string objects of arbitrary size. It does this by dynamically creating a new string instead of modifying an existing string.
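The sketch below makes the same point from the other direction: concatenation rebinds the name to a new object, and an attempt to change a character in place is rejected. The name original is introduced only for this illustration.

```python
a = "abc"
original = a          # a second name for the same string object
a = a + "def"         # builds a new string; the old one is untouched

try:
    original[0] = "x"   # strings do not support item assignment
    assignment_failed = False
except TypeError:
    assignment_failed = True
```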

Some programmers who have extensive experience in other languages will ask if creating a new string from the original string is the most efficient way to accomplish this. Or they suggest that it would be “simpler” to allow a mutable string for this kind of concatenation. The short answer is that Python’s storage management makes this use of immutable strings the simplest and most efficient.

Responses to the immutability of tuple and mutability of list vary, including some of the following frequently asked questions.

Since a list does everything a tuple does and is mutable, why bother with tuple?

Immutable tuple objects are more efficient than variable-length list objects for some operations. Once the tuple is created, it can only be examined. When it is no longer referenced, the normal Python garbage collection will release the storage for the tuple.

Most importantly, a tuple can be reliably hashed to a single value. This makes it a usable key for a mapping.

Many applications rely on fixed-length tuples. A program that works with coordinate geometry in two dimensions may use 2-tuples to represent (x, y) coordinate pairs. Another example might be a program that works with colors as 3-tuples, (r, g, b), of red, green and blue levels. A variable-length list is not appropriate for these kinds of fixed-length tuple.
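The hashing point can be illustrated with a small mapping keyed by (r, g, b) tuples. The color table is hypothetical; the sketch shows that a tuple works as a key while a list with the same contents does not.

```python
colors = {(255, 0, 0): "red", (0, 0, 255): "blue"}   # hypothetical lookup table
name = colors[(255, 0, 0)]                           # an equal tuple finds the entry

try:
    colors[[0, 255, 0]] = "green"    # a list is mutable, so it cannot be hashed
    list_key_ok = True
except TypeError:
    list_key_ok = False
```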

Wouldn’t it be “more efficient” to allow mutable strings?

There are a number of axes for efficiency: the two most common are time and memory use.

A mutable string could use less memory. However, this is only true in the benign special case where we are only replacing or shrinking the string within a fixed-size buffer. If the string expands beyond the size of the buffer the program must either crash with an exception, or it must switch to dynamic memory allocation. Python simply uses dynamic memory allocation from the start. C programs often have serious security problems created by attempting to access memory outside of a string buffer. Python avoids this problem by using dynamic allocation of immutable string objects.

Processing a mutable string could use less time. In the cases of changing a string in place or removing characters from a string, a fixed-length buffer would require somewhat less memory management overhead. Rather than indict Python for offering immutable strings, this leads to some productive thinking about string processing in general.

In text-intensive applications we may want to avoid creating separate string objects. Instead, we may want to create a single string object – the input buffer – and work with slices of that buffer. Rather than create strings, we can create slice objects that describe starting and ending offsets within the one-and-only input buffer.

If we then need to manipulate these slices of the input buffer, we can create new string objects only as needed. In this case, our application program is designed for efficiency. We use the Python string objects when we want flexibility and simplicity.
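One way to sketch this idea is with the built-in slice type: record where each word sits in the buffer, work with the offsets, and extract a string only at the end. The buffer contents and variable names are made up, and the re module is used here simply as a convenient way to locate the words.

```python
import re

text_buffer = "now is the time"     # the single input buffer (sample text)

# Record where each word sits instead of copying it out of the buffer.
spans = [slice(m.start(), m.end()) for m in re.finditer(r"\S+", text_buffer)]

# Compare word lengths using the offsets only; no new strings are needed.
longest = max(spans, key=lambda sp: sp.stop - sp.start)

word = text_buffer[longest]         # a new string is created only here
```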
