Sequences of Characters : str and Unicode

A str is a sequence of characters. By “character” we mean any of the 128 US-ASCII characters: the digits, the punctuation marks, the letters.

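In the case of a unicode object, we mean a sequence of any of the Unicode characters. The Unicode standard has room for more than a million code points, of which well over one hundred thousand are assigned to characters.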

We’ll examine a number of aspects of strings.

There is a string module, but it isn’t heavily used. We’ll look at it briefly in Modules That Help Work With Strings. Part 8 of the Python Library Reference [PythonLib] contains 11 modules that work with strings; we won’t dig into these deeply. We’ll return to the most important string module in Text Processing and Pattern Matching : The re Module.

We’ll look at some common patterns of string processing in Some Common Processing Patterns.

What Does Python mean by “String?”

A string is an immutable sequence of characters. Let’s look at this definition in detail.

  • Since a string is a sequence, all of the common operations and built-in functions of sequences apply. This includes +, * and [].
  • Since a string is immutable, it cannot be changed. New strings can be built from other strings, but a string cannot be modified.
  • Since strings are an extension to the basic sequence type, strings have additional method functions.
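The three properties above can be checked directly. This is a minimal sketch; the variable names are ours and purely illustrative.

```python
word = "syncopated"

# Sequence: +, * and [] all work, and each one builds a new string.
longer = word + "!"
doubled = word[:3] * 2
first = word[0]

# Immutable: assigning to a position is rejected with a TypeError.
try:
    word[0] = "S"
    changed = True
except TypeError:
    changed = False

# Extension of the basic sequence type: extra methods such as upper().
shout = word.upper()

print(longer, doubled, first, changed, shout)
```

Note that word itself is never modified; every result is a separate string object.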

Here’s a depiction of a string of 10 characters. The Python value is "syncopated". Each character has a position that identifies the character in the string.

position 0 1 2 3 4 5 6 7 8 9
character s y n c o p a t e d

We get string objects from external devices like the keyboard, files or the network. We present strings to users either as files or on the GUI display. The print statement converts data to a string before showing it to the user. This means that printing a number really involves converting the number to a string of digits before printing the string of digit characters.

Often, our program will need to examine input strings to be sure they are valid. We may be checking a string to see if it is a legal name for a day of the week. Or, we may do a more complex examination to confirm that it is a valid time. There are a number of validations we may have to perform.

Our computations may involve numbers derived from input strings. Consequently, we may have to convert input strings to numbers or convert numbers to strings for presentation.
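As a rough sketch of the validation and conversion steps described above, here are two small helper functions. The names is_day_name and parse_time are hypothetical, and the sketch relies on string methods (strip(), lower(), split(), isdigit()) that are covered later in this chapter.

```python
DAYS = ("monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday")

def is_day_name(text):
    """Return True if text is a legal name for a day of the week."""
    return text.strip().lower() in DAYS

def parse_time(text):
    """Return (hour, minute) for input like '14:30', or None if it is not a valid time."""
    parts = text.strip().split(":")
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return None
    # Only now is it safe to convert the digit strings to numbers.
    hour, minute = int(parts[0]), int(parts[1])
    if 0 <= hour < 24 and 0 <= minute < 60:
        return hour, minute
    return None

print(is_day_name(" Tuesday\n"), parse_time("14:30"), parse_time("25:00"))
```

The validation happens on the string; the conversion to numbers happens only after the string has passed the checks.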

Writing a String in Python

We looked at strings quickly in Strings – Anything Not A Number. A string is a sequence of characters. We can create strings as literals or by using any number of factory functions.

When writing a string literal, we need to separate the characters that are in the string from the surrounding Python values. String literals are created by surrounding the characters with quotes or apostrophes. We call this surrounding punctuation quote characters, even though we can use apostrophes as well as quotes.

There are several variations on the quote characters that we use to define string literals.

Single-quote. A single-quoted string uses either the quote (") or apostrophe ('). A basic string must be completed on a single line. Both of these examples are essentially the same string.

  • Single-Apostrophe looks like this: 'xyz'.
  • Single-Quote looks like this: "xyz".

Triple-quote. Multi-line strings can be enclosed in triple quotes or triple apostrophes. A multi-line string continues on until the matching triple-quote or triple-apostrophe.

  • Triple-Apostrophe looks like this: '''xyz'''.
  • Triple-Quote looks like this: """xyz""".

Here are some examples of creating strings.

a= "consultive"
apos= "Don't turn around."
quote= '"Stop," he said.'

doc_1= """fastexp(n,p) -> integer
Raises n to the p power, where p is a positive integer.

:param n: a number

:param p: an integer power
"""

novel= '''"Just don't shoot," Larry said.'''
a:

A simple string.

apos:

A string using ". It has an ' inside it.

quote:

A string using '. It has two " inside it.

doc_1:

This is a six-line string.

Use repr(doc_1) to see how many lines it has. Better, use doc_1.splitlines().

novel:

This is a one-line string with both " and ' inside it.
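The following sketch checks a few of the claims made about these literals. The shorter doc string is our own stand-in for doc_1, chosen so the line count is easy to verify by eye.

```python
apos = "Don't turn around."
same_apos = 'Don\'t turn around.'    # the same string, written with apostrophes
quote = '"Stop," he said.'
novel = '''"Just don't shoot," Larry said.'''

# A triple-quoted string keeps its line breaks.
doc = """first line
second line
"""
line_count = len(doc.splitlines())

print(apos == same_apos, quote.count('"'), line_count)
print(repr(doc))
```

The choice of quote character affects only how we type the literal, not the resulting string.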

Non-Printing Characters – Really! [How can it be a character and not have a printed representation?]

ASCII has a few dozen characters that are intended to control devices or adjust spacing on a printed document.

There are a few commonly-used non-printing characters: mostly tab and newline. One of the most common escapes is \n which represents the non-printing newline character that appears at the end of every line of a file in GNU/Linux or MacOS. Windows, often, will use a two character end-of-line sequence encoded as \r\n. Most of our editing tools quietly accept either line-ending sequence.

These non-printing characters are created using escapes. A table of escapes is provided below. Normally, the Python compiler translates the escape into the appropriate non-printing character.

Here are a couple of literal strings with a \n character to encode a line break in the middle of the string.

'The first message.\nFollowed by another message.'

"postmarked forestland\nconfigures longitudes."

Python supports a broad selection of \ escapes. These are printed representations for unprintable ASCII characters. They’re called escapes because the \ is an escape from the usual meaning of the following character. We have very little use for most of these ASCII escapes. The newline (\n), backslash (\\), apostrophe (\') and quote (\") escapes are handy to have.

Important

Escapes Become Single Characters

We type two (or more) characters to create an escape, but Python compiles this into a single character in our program.

In the most common case, we type \n and Python translates this into a single ASCII character that doesn’t exist on our keyboard.

Since \ is always the first of two (or more) characters, what if we want a plain-old \ as the single resulting character? How do we stop this escape business?

The answer is we don’t. When we type \\, Python puts a single \ in our program. Okay, it’s clunky, but it’s a character that isn’t used all that often. The few times we need it, we can cope. Further, Python has a “raw” mode that permits us to bypass these escapes.

Escape  Meaning
\\      Backslash (\)
\'      Apostrophe (')
\"      Quote (")
\a      Audible Signal; the ASCII code called BEL. Some OS’s translate this to a screen flash or ignore it completely.
\b      Backspace (ASCII BS)
\f      Formfeed (ASCII FF). On a paper-based printer, this would move to the top of the next page.
\n      Linefeed (ASCII LF), also known as newline. This would move the paper up one line.
\r      Carriage Return (ASCII CR). On a paper-based printer, this returned the print carriage to the start of the line.
\t      Horizontal Tab (ASCII HT)
\ooo    An ASCII character with the given octal value. The ooo is one to three octal digits.
\xhh    An ASCII character with the given hexadecimal value. The x is required. The hh is exactly two hex digits.
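A quick way to confirm that each escape in the table becomes a single character is to ask for the length of the resulting string. This is a small sketch; the variable names are ours.

```python
newline = "\n"
backslash = "\\"
tab_separated = "a\tb"

# Octal 101 and hexadecimal 41 are both decimal 65, the letter A.
octal_a = "\101"
hex_a = "\x41"

lengths = [len(newline), len(backslash), len(tab_separated)]
print(lengths, octal_a, hex_a)
```

Each escape takes two or more keystrokes to type but counts as one character in the string.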

We can also use a \ at the end of a line, which means that the end-of-line is ignored. The string continues on the next line, skipping over the line break. Here’s an example of a single string that was so long we had to break it into multiple lines.

"A manuscript so long \
that it takes more than one \
line to finish it."

Why would we have this special dangling-backslash? Compare the previous example with the following.

"""A manuscript so long
that it takes more than one
line to finish it."""

What’s the difference? Enter them both into IDLE to see what Python displays. One string represents a single line of data, where the other string represents three lines of data. Since the \ escapes the meaning of the newline character, the newline vanishes from the string. This gives us a very fine degree of control over how our output looks.
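If you would rather check the difference in a script than by eye, the following sketch compares the two forms. The names one_line and three_lines are ours.

```python
one_line = "A manuscript so long \
that it takes more than one \
line to finish it."

three_lines = """A manuscript so long
that it takes more than one
line to finish it."""

# The dangling backslash removes the line break; triple quotes keep it.
print("\n" in one_line)
print(len(three_lines.splitlines()))
print(one_line == three_lines.replace("\n", " "))
```

The continuation lines must start in the first column here, because any leading spaces would become part of the string.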

Also note that adjacent string literals are automatically put together to make a longer string. We won’t make much use of this, but it is something that you may encounter when reading someone else’s programs.

"syn" "opti" "cal" is the same as "synoptical".

Unicode Strings. If a u or U is put in front of the string (for example, u"unicode"), this indicates a Unicode string. Without the u, it is an ASCII string. Unicode refers to the Universal Character Set; each character requires from 1 to 4 bytes of storage, depending on the encoding. ASCII is a single-byte character set; each of the 128 ASCII characters fits in a single byte of storage. Unicode permits any character in any of the languages in common use around the world.

For the thousands of Unicode characters that are not on our computer keyboards, a special \uxxxx escape is provided. This requires the four digit Unicode character identification. For example, “日本” is made up of Unicode characters U+65e5 and U+672c. In Python, we write this string as u'\u65e5\u672c'.

Here’s an example that shows the internal representation and the easy-to-read output of this string. This will work nicely if you have an appropriate Unicode font installed on your computer. If this doesn’t work, you’ll need to do an operating system upgrade to get Unicode support.

>>> ch= u'\u65e5\u672c'
>>> ch
u'\u65e5\u672c'
>>> print(ch)
日本

It’s very important to note that Unicode characters are encoded into a sequence of bytes when they are written to a file. A sequence of bytes read from a file can be decoded to get the Unicode characters.

Once inside the computer’s memory, in a Python program, there’s no encoding. Just characters.

There are a variety of Unicode encoding schemes. The choice of encoding is based on assumptions about the typical number of bytes for a character. For example, the UTF-16 codes are most efficient when most of the characters actually use two bytes and there are relatively few exceptions. The UTF-8 codes, on the other hand, work well on the internet where many of the protocols expect only the US ASCII characters.

For the most part, we can use the io module to control opening and closing files with specific encodings.

In the rare event that we need really fine control over the encoding, the codecs module provides mechanisms for encoding and decoding Unicode strings.
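Here is a minimal sketch of encoding and decoding, using the two-character string shown earlier. It is written so that it should also run on Python 3, where every str is already a Unicode string. The file name sample.txt and the temporary folder are our own assumptions for illustration.

```python
import io
import os
import shutil
import tempfile

text = u"\u65e5\u672c"

# Encoding turns characters into bytes; the byte count depends on the scheme.
utf8_bytes = text.encode("utf-8")      # three bytes per character for this text
utf16_bytes = text.encode("utf-16")    # a two-byte order mark, then two bytes per character

# io.open() applies the named encoding on the way out and on the way in.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample.txt")    # hypothetical file name
try:
    with io.open(path, "w", encoding="utf-8") as target:
        target.write(text)
    with io.open(path, "r", encoding="utf-8") as source:
        round_trip = source.read()
finally:
    shutil.rmtree(folder)

print(len(text), len(utf8_bytes), len(utf16_bytes), round_trip == text)
```

Inside the program the string has two characters regardless of which encoding was used for the file.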

See http://www.unicode.org for more information.

Raw Strings. If an r or R is put in front of the string (for example, r"raw\nstring"), this indicates a raw string. This is a string where the backslash characters (\) are not interpreted by the Python compiler but are left as is. This is handy for Windows file names, which contain \. It is also handy for regular expressions that make heavy use of backslashes. We’ll look at these in Text Processing and Pattern Matching : The re Module.

"\n" is an escape that’s converted to a single unprintable newline character.

r"\n" is two characters, \ and n .
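A short sketch makes the difference concrete. The Windows path below is made up for illustration; it was chosen because it contains \n and \t, which would be read as escapes in an ordinary string.

```python
cooked = "\n"      # one character: newline
raw = r"\n"        # two characters: backslash and n

# Hypothetical Windows file name; in a raw string the backslashes survive.
windows_path = r"C:\new\table.txt"

print(len(cooked), len(raw))
print(windows_path)
```

A raw string is only a different way of typing a literal; raw == "\\n" is True because both spell the same two characters.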

String Factory Functions

There is some subtlety to the factory functions which create strings. We have two conflicting interpretations of “string representation” of an object. For simple data types, like numbers, the string version of the number is the sequence of characters. However, for more complex objects, we often want something “readable” that doesn’t contain every nuance of the object’s value. Consequently, we have two factory functions for strings: str() and repr().

str(object) → string

Creates a string from the object. This is usually a human-friendly view of the object.

repr(object) → string

Creates a representation of object in Python syntax. Typically, this is a detailed, complete view of the object. For most object types, eval(repr( object )) == object. This is true for the built-in sequence types that we’ll look at in this part.

unicode( object [,encoding] [,errors] ) → Unicode string

Creates a new Unicode string from the given encoded string. encoding defaults to the current default string encoding. The optional errors parameter defines the error handling and defaults to 'strict'. The codecs module provides a more complete set of functions for encoding and decoding Unicode strings. Generally, you will be using ‘UTF-8’ or ‘UTF-16’ encodings, since these cover much of the data that passes around the Internet.

eval(string) → value

Evaluate a string, which is expected to be a legal Python expression.

You can make use of repr() to get a detailed view of a specific sequence to help you in debugging. This can, for example, reveal non-printing characters in a character string.
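The following sketch shows repr() revealing a non-printing character, and eval() turning the representation back into the original value. The sample values are ours.

```python
text = "tab\there"

friendly = str(text)     # the string itself, with a real tab character inside
detailed = repr(text)    # Python syntax: quotes added, the tab shown as \t

# For the built-in types used here, eval() of the repr() rebuilds an equal object.
rebuilt = eval(detailed)
numbers = [1, 2.5, "three"]
rebuilt_numbers = eval(repr(numbers))

print(detailed)
print(rebuilt == text, rebuilt_numbers == numbers)
```

Only call eval() on strings you created yourself; it will run whatever expression it is given.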

The str() function converts any object to a string. Plus, we’ve seen other functions (like hex() and oct()) that produce strings.

>>> a= str(355.0/113.0)
>>> a
'3.14159292035'
>>> hex(48813)
'0xbead'

The repr() function also converts an object to a string. However, repr() creates a string suitable for use as Python source code. For simple numeric types, it’s not terribly interesting. For more complex types, however, it reveals details of their structure.

Important

Python 3

In Python 2, the repr() function can also be invoked using the backtick (`), also called accent grave.

This ` syntax is not used much and has been removed in Python 3.

Here are several versions of a very long string, showing a number of representations.

>>> a="""a very
... long symbolizer
... on multiple lines"""
>>> repr(a)
"'a very\\nlong symbolizer\\non multiple lines'"
>>> a
'a very\nlong symbolizer\non multiple lines'
>>> print(a)
a very
long symbolizer
on multiple lines
  1. We set a to a very long string with \n characters in it.
  2. The repr() shows a Python expression that produces the string. Interestingly, the result is a string which evaluates to a string.
  3. The value of a is the string, with the \n characters shown explicitly.
  4. When we print a, we see the value with the \n characters interpreted.

The unicode() function converts an encoded str to an internal Unicode string. There are a number of ways of encoding a Unicode string so that it can be placed into email or a database. The default encoding is the current default string encoding, usually 'ascii', with 'strict' error handling; for UTF-8 data, name the encoding explicitly. Choices for errors are 'strict', 'replace' and 'ignore'. Strict raises an exception for unrecognized characters, replace substitutes the Unicode replacement character (\uFFFD) and ignore skips over invalid characters. The codecs and unicodedata modules provide more functions for working with explicit Unicode conversions.

>>> unicode('\xe6\x97\xa5\xe6\x9c\xac','utf-8')
u'\u65e5\u672c'

The above example shows the UTF-8 encoding for 日本 as a string of bytes and as a Python Unicode string. The Unicode character numbers (U+65E5 and U+672C) are easier to read in the Unicode string than they are in the UTF-8 encoding.
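The three error-handling choices can be compared with a deliberately damaged byte string. This sketch uses the decode() method of a byte string rather than the unicode() function so that it should run on both Python 2 and Python 3; the sample bytes are ours.

```python
good = b"\xe6\x97\xa5\xe6\x9c\xac"    # valid UTF-8 for the two characters above
bad = b"abc\xff"                      # \xff can never appear in valid UTF-8

decoded = good.decode("utf-8")

# 'strict' is the default: an invalid byte raises an exception.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = bad.decode("utf-8", "replace")    # invalid byte becomes U+FFFD
ignored = bad.decode("utf-8", "ignore")      # invalid byte is dropped

print(strict_failed, repr(replaced), repr(ignored))
```

In most programs 'strict' is the safer choice, since the other two quietly lose information.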

Operating on String Data

There are a number of operations that apply to string objects. Since a string (even a string of digits) isn’t a number, these operations do simple manipulations on the sequence of characters.

If you need to do arithmetic operations on strings, you’ll need to convert the string to a number using one of the number factory functions int(), float(), long() or complex(). See Functions are Factories (really!) for more information on these functions. Once you have a proper number, you can do arithmetic on it and then convert the result back into a string using str(). We’ll return to this later. For now, we’ll focus on manipulating strings.
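The round trip from string to number and back can be sketched in a few lines. The input strings are ours, standing in for something read from a user or a file.

```python
numerator_text = "355"
denominator_text = "113"

# Convert to numbers, do the arithmetic, then convert the result back.
ratio = float(numerator_text) / int(denominator_text)
label = "ratio = " + str(round(ratio, 4))

# Without the conversion, * means repetition rather than multiplication.
repeated = "2" * 3
multiplied = int("2") * 3

print(label, repeated, multiplied)
```

The last two lines show why the conversion matters: the same operator gives '222' for the string and 6 for the number.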

There are three sequence operations (+, *, []) that work with strings, plus the % operator, which has a formatting meaning that is unique to strings. The % operator is so sophisticated that we’ll devote a separate section to just that operator.

The + Operator. The + operator creates a new string as the concatenation of two strings. A resulting string is created by gluing the two argument strings together.

>>> "hi " + 'mom'
'hi mom'

The * Operator. The * operator between strings and numbers (number * string or string * number) creates a new string that is a number of repetitions of the argument string.

>>> print(2*"way " + "cool!")
way way cool!

The [] operator. The [] operator can extract a single character or a substring from the string. There are two forms for picking items or slices from a string.

This form extracts a single item.

string[index]

Items are numbered from 0 to len(string)-1. Items are also numbered in reverse from -len(string) to -1.

This extracts a slice, creating a sequence from a sequence.

string[start:end]

Characters from start to end-1 are chosen to create a new string as a slice of the original string; when both positions fall within the string, there will be end - start characters in the result. If start is omitted, the slice begins at the beginning of the string (position 0). If end is omitted, the slice runs through the end of the string, as if end were len(string), so the last character (position -1) is included.

For more information on how the numbering works for the [] operator, see Numbering from Zero.

Important

The meaning of []

Note that the [] characters are part of the syntax.

We use [ and ] for optional elements. This is not part of the syntax, but a description of optional syntactic elements. This can lead to confusion because there are two meanings for [] characters.

Since most technical documentation uses [ and ] for optional elements, we’ve elected to stick with that rather than try to adopt something more clear, but atypical.

Here are some examples of picking out individual items or creating a slice composed of several items.

>>> s="artichokes"
>>> s[2]
't'
>>> s[:5]
'artic'
>>> s[5:]
'hokes'
>>> s[2:3]
't'
>>> s[2:2]
''

The last example, s[2:2], shows an empty slice. Since the slice is from position 2 to position 2-1, there can’t be any characters in that range; it’s a kind of contradiction to ask for characters 2 through 1. Python politely returns an empty string, which is a sensible response to the expression.

Recall that string positions are also numbered from right to left using negative numbers. s[-2] is the next-to-last character. We can, then, say things like the following to work from the right-hand side instead of the left-hand side.

>>> s="artichokes"
>>> s[-2]
'e'
>>> s[-3:-1]
'ke'
>>> s[-1:1]
''
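The slicing rules can be summarized in a short sketch that uses the same string as the examples above. The names start, end and piece are ours.

```python
s = "artichokes"
start, end = 2, 6

piece = s[start:end]                 # positions 2, 3, 4 and 5

# Omitted positions mean "from the beginning" and "through the end".
whole_again = s[:5] + s[5:]
last_three = s[-3:]
all_but_last = s[:-1]

print(piece, len(piece) == end - start)
print(whole_again, last_three, all_but_last)
```

Because the end position is not included, s[:n] + s[n:] always rebuilds the original string, whatever n is.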

% : The Message Formatting Operator

The % operator can be used to format a message. The argument values are a template string and a tuple of individual values. The operator creates a new string by folding together two elements:

  • The literal characters in the template string.
  • Characters from the values, which were converted to strings using conversion specifications in the template string.

Here we’ll only look at a quick example. We prefer to use the format method of a string. We’ll cover that in format() : The Format Method.

>>> "Today's temp is %dC (%dF)" % (3, 37.39)
"Today's temp is 3C (37F)"

The template string is "Today's temp is %dC (%dF)". The two values are (3, 37.39). You can see that the values were used to replace the %d conversion specification.

Our template string, then, was really in five parts:

  1. "Today's temp is " is literal text, and appears in the result string.
  2. %d is a conversion specification; it is replaced with the string conversion of 3. Okay, it seems kind of silly, but 3 in Python is a number, not a string, and it has to be converted to a string. The print() function does this automatically. Also, when we work in the IDLE Python Shell, IDLE does this kind of string conversion automatically. We’ve been spoiled.
  3. "C (" is literal text, and appears in the result string.
  4. %d is a conversion specification; it is replaced with the string conversion of 37.39. While it isn’t obvious what happened, here’s a hint: the %d specification produces decimal integers. To produce an integer string from a floating-point number, two conversions had to happen: first from float to integer, then from integer to string.
  5. "F)" is literal text, and appears in the result string.

For details, see the Python Library Reference.

http://docs.python.org/release/2.6/library/stdtypes.html#string-formatting-operations

We’re going to focus on the str.format() method. We’ll cover that in format() : The Format Method.
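For comparison, here is the temperature message built both ways. The format() template is our own choice of an equivalent; other specifications would work too.

```python
old_style = "Today's temp is %dC (%dF)" % (3, 37.39)
new_style = "Today's temp is {0:d}C ({1:.0f}F)".format(3, 37.39)

# %d drops the fraction, while .0f rounds to the nearest whole number.
truncated = "%d" % 37.99
rounded = "{0:.0f}".format(37.99)

print(old_style)
print(new_style)
print(truncated, rounded)
```

The two messages agree for 37.39, but the last line shows that %d and .0f are not interchangeable for every value.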

Built-in Functions for Strings

The following built-in functions are relevant to working with strings and characters.

Perhaps the most important is the print() function.

The print() function must convert each expression to a string before writing the strings to the standard output file. Generally, this is what we expect.

For example, when we do print( abs(-5) ), the argument is an integer and the result is an integer. This integer result is converted to the obvious string value and printed.

If we do print( abs ), what happens? We’re not applying the abs() function to an argument. We’re just converting the function to a string and printing it.

All Python objects have a string representation of some kind. Therefore, the print() function is capable of printing anything.

For character code manipulation, there are three related functions: chr(), ord() and unichr(). chr() returns the ASCII character that belongs to an ASCII code number. unichr() returns the Unicode character that belongs to a Unicode number. ord() transforms an ASCII character to its ASCII code number, or transforms a Unicode character to its Unicode number.

len(iterable) → integer

Return the number of items of a set, sequence or mapping.

>>> len("restudying")
10
>>> len(r"\n")
2
>>> len("\n")
1

Note that a raw string (r"\n") doesn’t use escapes; this is two characters. An ordinary string ("\n") interprets the escapes; this is one unprintable character.

chr(i) → character

Return a string of one character with ordinal i; 0 <= i < 256.

This is the standard US ASCII conversion, chr(65) == 'A'.

ord(character) → integer

Return the integer ordinal of a one character string. For an ordinary character, this will be the US ASCII code. ord('A') == 65.

For a Unicode character this will be the Unicode number. ord(u'\u65e5') == 26085.

unichr(i) → Unicode string

Return a Unicode string of one character with ordinal i; 0 <= i < 65536. This is the Unicode mapping, defined in http://www.unicode.org/.

>>> unichr(26085)
u'\u65e5'
>>> print(unichr(26085))
日
>>> ord(u'\u65e5')
26085
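Since ord() and chr() are inverses, they can be combined to step through the character codes. This sketch includes a small fallback so that it should also run on Python 3, where unichr() no longer exists and chr() covers all of Unicode; that fallback is our addition.

```python
try:
    unichr                 # present in Python 2
except NameError:
    unichr = chr           # Python 3: chr() handles every Unicode number

code = ord("A")
next_letter = chr(code + 1)

kanji = unichr(26085)
kanji_code = ord(kanji)

print(code, next_letter, kanji_code, hex(kanji_code))
```

The hexadecimal form of 26085 is 0x65e5, which matches the \u65e5 escape used earlier.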

Note that min() and max() also apply to strings. The min() function will return the character closest to the front of the alphabet. The max() function returns the character closest to the back of the alphabet.

>>> max('restudying')
'y'
>>> min('restudying')
'd'

Comparing Two Strings – Alphabetical Order

The standard comparisons ( <, <=, >, >=, ==, !=) apply to strings. These comparisons use character-by-character comparison rules for ASCII or Unicode. This will keep things in the expected alphabetical order.

The rules for alphabetical order include a few nuances that may cause some confusion for newbies.

  • All of the digits come before any letters of the alphabet.
  • All Uppercase letters come before any lowercase letters.
  • The punctuation marks are intermixed with the letters and numbers in an obscure way. You’ll have to get an ASCII character chart to see the punctuation marks and how they work with the other letters.
  • Numbers aren’t interpreted numerically, but as a string of characters; consequently '11' comes before '2'. Why? Compare the two strings, position-by-position: the first character, '1', comes before '2'. They may look like numbers to you; but they’re strings to Python.

Here are some examples.

>>> 'hello' < 'world'
True
>>> 'inordinate' > 'in'
True
>>> '1' < '11'
True
>>> '2' < '11'
False

These rules for alphabetical order are much simpler than, for example, the American Library Association Filing Rules. Those rules are quite complex and have a number of exceptions and special cases.
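Sorting a small list shows all of these rules at once. The sample labels are ours, and the key arguments are one common way to get numeric or case-insensitive order when that is what you actually want.

```python
labels = ["2", "11", "apple", "Zebra", "1"]

# Plain string comparison: digits, then uppercase, then lowercase.
as_text = sorted(labels)

# Convert to numbers for the comparison to get numeric order.
as_numbers = sorted(["2", "11", "1"], key=int)

# Compare lowercase copies to ignore case.
ignoring_case = sorted(["apple", "Zebra"], key=str.lower)

print(as_text)
print(as_numbers)
print(ignoring_case)
```

In each case the strings themselves are unchanged; the key only affects how they are compared.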

There are two additional string comparisons: in and not in. These check to see if a substring, often a single character, occurs in a longer string. The in operator returns True when the substring is found in the string, False if it is not found. The not in operator returns True if the substring is not found in the string.

>>> "i" in 'microquake'
True
>>> "i" in 'formulates'
False
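A few more cases, including a multi-character substring, are sketched below. The variable names are ours.

```python
word = "microquake"

has_i = "i" in word
has_quake = "quake" in word          # in also accepts longer substrings
has_quack = "quack" in word          # every character must match, in order
no_z = "z" not in word

print(has_i, has_quake, has_quack, no_z)
```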

Statements and Strings

There are three statements that are associated with strings: the various kinds of assignment statements, the for statement (which deals with sequences of all kinds) and the print statement.

The Assignment Statements. The basic assignment statement applies a new variable name to a string object. This is the expected meaning of assignment.

The += augmented assignment works as expected. a += 'more data' is the same as a = a + 'more data'. Recall that a string is immutable; something like a += 'more data' creates a new string from the old value of a and the string 'more data'.

It turns out that *= also works for a string and an integer. It’s a little surprising, though, when you have something like this.

>>> value= 3
>>> value*= 'hello '
>>> value
'hello hello hello '

When in doubt, break down the *= operator to its component parts. It helps to think of the statement like this: value = value * 'hello '.
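One way to see that augmented assignment builds a new string, rather than changing the old one, is to keep a second name for the original value. This is a small sketch with names of our own choosing.

```python
a = "some data"
b = a                    # a second name for the same string

a += " and more"         # builds a new string and rebinds the name a

value = 3
value *= "hello "        # same as value = value * "hello "

print(repr(a), repr(b), repr(value))
```

The name b still refers to the original string, which shows that the original was never modified.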

The for Statement. Since a string is a sequence, the for statement will visit each character of the string.

for c in "lobstering":
    print(c)

The print Statement. The print statement must convert each expression to a string before writing the strings to the standard output file. We prefer, however, to use the print() function.

Methods Strings Perform

A string object has a number of method functions. These can be separated into three groups:

  • transformations, which create new strings from old.
  • accessors, which access a string and return a fact about that string.
  • parsers, which examine a string and create a different data object from the string.

We’ll look at one of the most important transformations, the string.format() method, separately. Details are in format() : The Format Method, below.

Transformations. The following transformation functions create a new string from an existing string.

class str
str.capitalize() → string

Create a copy of the original string with only its first character capitalized.

"vestibular".capitalize() creates "Vestibular".

str.center(width) → string

Create a copy of the original string centered in a new string of length width. Padding is done using spaces.

"subheading".center(15) creates '   subheading  '. With explicit spaces shown as · this is

'···subheading··'

str.decode(encoding[, errors]) → string

Return a decoded version of the original string. The default encoding is the current default string encoding, usually ‘ascii’. errors may be given to set a different error handling scheme; default is ‘strict’ meaning that encoding errors raise a ValueError. Other possible values for errors are ‘ignore’ and ‘replace’.

Section 4.9.2 of the Python library defines the various decodings available. One of the codings is called “base64”, which mashes complex strings of bytes into ordinary letters, suitable for transmission on the internet.

'c3RvY2thZGluZw=='.decode('base64') creates 'stockading'.

str.encode(encoding[, errors]) → string

Return an encoded version of the original string. The default encoding is the current default string encoding, usually ‘ascii’. errors may be given to set a different error handling scheme; default is ‘strict’ meaning that encoding errors raise a ValueError. Other possible values for errors are ‘ignore’ and ‘replace’.

Section 4.9.2 of the Python library defines the various encodings available. We can use the Unicode UTF-16 code to make multi-byte Unicode characters.

'blathering'.encode('utf16') creates '\xff\xfeb\x00l\x00a\x00t\x00h\x00e\x00r\x00i\x00n\x00g\x00'.

str.expandtabs(tabsize) → string

Return a copy of the original string where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 spaces is assumed.

str.format(value, ...) → string

Insert the values into the template string to create a new string.

"pi = {0:=+10.5f}".format( math.pi ) creates 'pi = +  3.14159'.

The value in the arguments (math.pi) is inserted into the template, following the conversion specification {0:=+10.5f}.

For details, see format() : The Format Method.

str.join(sequence) → string

Return a new string which is the concatenation of the original strings in the sequence. The separator between elements is the string object that does the join.

" and ".join( ["ships","shoes","sealing wax"] ) creates 'ships and shoes and sealing wax'.

str.ljust(width) → string

Return a copy of the original string left justified in a string of length width. Padding is done using spaces on the right.

"reclasping".ljust(15) creates 'reclasping     '. With more visible spaces, this is

'reclasping·····'

str.lower() → string

Return a copy of the original string converted to lowercase.

"SuperLight".lower() creates 'superlight'.

str.lstrip() → string

Return a copy of the original string with leading whitespace removed. This is often used to clean up input.

" precasting \n".lstrip() creates 'precasting \n'.

str.replace(old, new[, count]) → string

Return a copy of the original string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

The most common use is "$HOME/some/place".replace("$HOME","e:/book") replaces the "$HOME" string to create a new string 'e:/book/some/place'.

Once in a while, we’ll need to replace just the first occurrence of some target string, allowing us to do something like the following: 'e:/book/some/place'.replace( 'e', 'f', 1 ).

str.rjust(width) → string

Return a copy of the original string right justified in a string of length width. Padding is done using spaces on the left.

"fulminates".rjust(15) creates '     fulminates'.

With more visible spaces, this is

'·····fulminates'

str.rstrip() → string

Return a copy of the original string with trailing whitespace removed. This has an obvious symmetry with lstrip().

" precasting \n".rstrip() creates ' precasting'.

str.strip() → string

Return a copy of the original string with leading and trailing whitespace removed. This combines lstrip() and rstrip() into one handy package.

" precasting \n".strip() creates 'precasting'.

str.swapcase() → string

Return a copy of the original string with uppercase characters converted to lowercase and vice versa.

str.title() → string

Return a titlecased version of the original string. Words start with uppercase characters, all remaining cased characters are lowercase.

For example, "hello world".title() creates 'Hello World'.

str.upper() → string

Return a copy of the original string converted to uppercase.
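Several of the transformations above are often used together when cleaning up input and laying out output. The following is a minimal sketch using the same sample values as the individual entries; the variable names are ours.

```python
raw_line = "  precasting \n"

# Each call returns a new string; raw_line itself is never changed.
cleaned = raw_line.strip()
left_only = raw_line.lstrip()
right_only = raw_line.rstrip()

phrase = " and ".join(["ships", "shoes", "sealing wax"])

path = "$HOME/some/place".replace("$HOME", "e:/book")
first_only = path.replace("e", "f", 1)     # only the first occurrence

# Padding to a fixed width of 15 characters, three ways.
padded = [cleaned.ljust(15), cleaned.center(15), cleaned.rjust(15)]

print(repr(cleaned), repr(left_only), repr(right_only))
print(phrase)
print(path, first_only)
print(padded)
```

Because every method returns a new string, calls can be chained, as in raw_line.strip().upper().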

Accessors. The following methods provide information about a string.

class str
str.count(sub[, start, end]) → integer

Return the number of occurrences of substring sub in a string. If the optional arguments start and end are given, they are interpreted as if you had said string [ start : end ].

For example "hello world".count("l") is 3.

str.endswith(suffix[, start, end]) → boolean

Return True if the string ends with the specified suffix, otherwise return False. With optional start, or end, the test is applied to string [ start : end ].

"pleonastic".endswith("tic") returns True.

str.find(sub[, start, end]) → integer

Return the lowest index in the string where substring sub is found. If optional arguments start and end are given, then string [ start : end ] is searched. Return -1 on failure.

"rediscount".find("disc") returns 2; "postlaunch".find("not") returns -1.

str.index(sub) → integer

Like find() but raise ValueError when the substring is not found.

See The Unexpected : The try and except statements for more information on processing exceptions.
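The difference between find() and index() is easiest to see side by side. This sketch uses the word from the find() example; the substring "xyz" and the variable names are ours.

```python
word = "rediscount"

found_at = word.find("disc")
missing_at = word.find("xyz")      # find() reports failure with -1

# index() reports failure by raising ValueError instead.
try:
    word.index("xyz")
    raised = False
except ValueError:
    raised = True

occurrences = "hello world".count("l")

print(found_at, missing_at, raised, occurrences)
```

Use find() when a missing substring is a normal outcome to be tested for, and index() when it signals a problem that should stop the processing.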

str.isalnum() → boolean

Return True if all characters in the string are alphanumeric (a mixture of letters and numbers) and there is at least one character in the string. Return False otherwise.

str.isalpha() → boolean

Return True if all characters in the string are alphabetic and there is at least one character in the string. Return False otherwise.

str.isdigit() → boolean

Return True if all characters in the string are decimal digits and there is at least one character in the string, False otherwise.

str.islower() → boolean

Return True if all characters in the string are lowercase and there is at least one cased character in the string, False otherwise.

str.isspace() → boolean

Return True if all characters in the string are whitespace and there is at least one character in the string, False otherwise. Whitespace characters include spaces, tabs, newlines and a handful of other non-printing ASCII characters.

str.istitle() → boolean

Return True if the string is a titlecased string, i.e. uppercase characters may only follow uncased characters and lowercase characters only cased ones, False otherwise.

str.isupper() → boolean

Return True if all characters in the string are uppercase and there is at least one cased character in the string, False otherwise.
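To get a feel for these tests, it can help to run all of them against a few sample strings. The sketch below uses a hypothetical helper, classify(), which is not part of Python; it simply reports which of the is...() methods return True.

```python
from __future__ import print_function

def classify(text):
    """Return the names of the is...() tests that text passes."""
    tests = ["isalnum", "isalpha", "isdigit", "islower",
             "isspace", "istitle", "isupper"]
    # getattr() looks up each method by name so we can call it.
    return [name for name in tests if getattr(text, name)()]

for sample in ["abc123", "2009", "Hello World", "UPPER", "   ", ""]:
    print(repr(sample), classify(sample))
```

Note that the empty string passes none of the tests, since each one requires at least one character.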

str.rfind(sub[, start, end]) → integer

Return the highest index in the string where substring sub is found. Since this is the highest index, this is looking for the right-most occurrence, hence the “r” in the name. If optional arguments start and end are provided, then string [ start : end ] is searched. Return -1 on failure to find the requested substring.

str.rindex(sub[, start, end]) → integer

Like rfind() but raise ValueError when the substring is not found.
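The optional start and end arguments behave much like a slice, with one difference worth seeing: the searching methods still report positions in the whole string. The following sketch, with illustrative variable names, compares the two approaches.

```python
from __future__ import print_function

text = "hello world"

# Counting with a start position matches counting in the slice.
count_from_5 = text.count("l", 5)
count_in_slice = text[5:].count("l")

# find() with a start position reports a position in the whole string;
# searching the slice reports a position within the slice.
find_from_5 = text.find("o", 5)
find_in_slice = text[5:].find("o")

# rfind() locates the right-most occurrence; end limits the search.
last_o = text.rfind("o")
last_o_before_7 = text.rfind("o", 0, 7)

# endswith() applied to the first five characters only.
ends = text.endswith("hello", 0, 5)

print(count_from_5, count_in_slice, find_from_5, find_in_slice)
print(last_o, last_o_before_7, ends)
```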

str.startswith(prefix[, start, end]) → boolean

Return True if the string starts with the specified prefix, otherwise return False. With optional start, or end, test string [ start : end ].

"E:/programming".startswith("E:") is True.

Parsers. The following methods create another kind of object, usually a sequence, from a string.

str.split([sep[, maxsplit]]) → sequence

Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified, any whitespace string is a separator.

We can use this to do things like aList= "a,b,c,d".split(','). We’ll look at the resulting sequence object closely in Flexible Sequences : The list.

str.splitlines([keepends]) → sequence

Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and True. This method can help us process a file: a file can be looked at as if it were a giant string punctuated by \n characters.

We can break up a string into individual lines using statements like lines= "two lines\nof data".splitlines().
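A short sketch of the optional arguments may help here. The sample strings are made up for illustration.

```python
from __future__ import print_function

# An explicit separator splits at every occurrence.
csv_fields = "a,b,c,d".split(",")

# maxsplit limits the number of splits; the rest stays in one piece.
first_only = "a,b,c,d".split(",", 1)

# With no separator, runs of whitespace are treated as one delimiter.
words = "  two   words ".split()

# splitlines() drops the line endings unless keepends is True.
lines = "two lines\nof data".splitlines()
lines_kept = "two lines\nof data\n".splitlines(True)

print(csv_fields, first_only, words)
print(lines, lines_kept)
```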

str.partition(punctuation) → tuple

Locate the left-most occurrence of punctuation. If found, split the string into three parts: the part before, the punctuation that was found and the part after.

If the punctuation was not found, then the first element is the whole string and the last two elements are zero-length strings.

>>> first, punct, last = "label :: several :: values".partition( "::" )
>>> first
'label '
>>> punct
'::'
>>> last
' several :: values'
str.rpartition(punctuation) → tuple

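Similar to str.partition(), except it searches for the right-most occurrence of the punctuation. If the punctuation was not found, the first two elements are zero-length strings and the last element is the whole string.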
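The two methods can be compared on the same input, including the case where the punctuation is missing. This is only a sketch; the sample strings are arbitrary.

```python
from __future__ import print_function

text = "label :: several :: values"

# partition() splits at the left-most "::", rpartition() at the right-most.
left_split = text.partition("::")
right_split = text.rpartition("::")

# When the punctuation is missing, the whole string lands at one end.
missing_left = "no punctuation".partition("::")
missing_right = "no punctuation".rpartition("::")

print(left_split)
print(right_split)
print(missing_left, missing_right)
```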

Here’s another example of using some of the string methods and slicing operations.

temperature.py

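from __future__ import print_function

# Stand-in conversions: c2f() and f2c() are assumed to be defined elsewhere;
# these two definitions only make the script self-contained.
def c2f(c):
    return 32.0 + 9.0 * c / 5.0

def f2c(f):
    return 5.0 * (f - 32.0) / 9.0

temp= raw_input("temperature: ")
if temp.isdigit():
    unit= raw_input("units [C or F]: ")
else:
    unit= temp[-1:]
    temp= temp[:-1]
unit= unit.upper()
if unit.startswith("C"):
    print(temp, c2f(float(temp)))
elif unit.startswith("F"):
    print(temp, f2c(float(temp)))
else:
    print("Units must be C or F")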
  1. The str.isdigit() method tells us if the string is all digits, or contains some extra characters. If the input string ends with C or F, we’ll handle this small typing mistake gracefully.
  2. This is the standard “break a string at a position” pattern. In this case, we are breaking at the last position of the string. The final character will be assigned to the unit variable, which we expect to be C or F.
  3. We use the str.upper() method to create a new string which is only uppercase letters. In the long run, this is simpler and more reliable than messing around with unit.startswith("C") or unit.startswith("c").
  4. We use the str.startswith() method to examine the first part of the user’s input. This will allow the user to spell out “Celsius” or “Fahrenheit”.
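The string handling in this script can be isolated from the keyboard input, which makes it easier to experiment with. The function below, split_reading(), is a hypothetical helper written for this sketch; it applies the same isdigit() test and the same break-at-the-last-position slices.

```python
from __future__ import print_function

def split_reading(text, default_unit=""):
    """Split input like '37C' into a number string and an uppercase unit."""
    text = text.strip()
    if text.isdigit():
        # All digits: there is no unit letter attached to the number.
        return text, default_unit.upper()
    # Otherwise the final character is taken to be the unit.
    return text[:-1], text[-1:].upper()

print(split_reading("37C"))
print(split_reading("98f"))
print(split_reading("100", "c"))
```

One limitation carries over from the original logic: isdigit() rejects a decimal point, so an input like "12.5" with no unit letter is split into "12." and "5".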

format() : The Format Method

The str.format() method is used to format a message. The method is applied to a formatting template. The arguments to this method are the values which are inserted into that template. The method creates a new string by folding together two elements:

  • The literal characters in the template string.
  • Characters from the values, which were converted to strings using conversion specifications in the template string. These are formatting rules surrounded by {} in the template.

First we’ll look at a quick example, then we’ll look at the real processing rules behind this method. This example has a template string and two values that are used to create a resulting string.

>>> "Today's temp is {0:d}C ({1:.2f}F)".format(3, 37.39)
"Today's temp is 3C (37.39F)"

The template string is "Today's temp is {0:d}C ({1:.2f}F)". The two values are (3, 37.39). You can see that the values were used to replace the {0:d} and {1:.2f} conversion specifications.

Our template string, then, was really in five parts:

  1. "Today's temp is " is literal text, and appears in the result string.
  2. {0:d} is a conversion specification; it is replaced with the string conversion of 3. Okay, it seems kind of silly, but 3 in Python is a number, not a string, and it has to be converted to a string.
  3. C ( is literal text, and appears in the result string.
  4. {1:.2f} is a conversion specification; it is replaced with the string conversion of 37.39.
  5. F) is literal text, and appears in the result string.

The {} conversion specifications include several important features. We’ll look at a few of the most common options. For complete details, see the Python Library Reference.

http://docs.python.org/release/2.6/library/string.html#format-string-syntax

The Big Picture. Each specification has one mandatory and one optional part. This leads to two ways to specify a conversion.

{field_name}
{field_name:format}

The mandatory field_name specifies which piece of data is taken from the arguments.

The optional format specifies how that piece of data should be formatted. If there’s no :format then the object is converted to a string using default formatting rules.

The {} and : are part of the syntax.

In the {0:d} example, the field_name is 0 (the first argument value). The format is d.

In the {1:.2f} example, the field_name is 1 (the second argument value). The format is .2f.

This is just an overview of the most important parts. We’ve left quite a bit out.
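Since the field_name selects an argument, the same argument can be used more than once, and the fields need not appear in argument order. The sketch below also uses a keyword argument as a field name, which is one of the features left out of the overview above; the values are made up.

```python
from __future__ import print_function

# Fields may appear in any order.
reordered = "{1} before {0}".format("first", "second")

# One argument, used twice: once with default rules, once with a format.
repeated = "{0}, {0:.1f}".format(37.39)

# A keyword argument can serve as the field_name.
named = "{degrees:d}C".format(degrees=3)

print(reordered)
print(repeated)
print(named)
```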

Format Features. Each format actually has a fairly large number of optional features. The full format has seven parts, counting the two options characters separately. All of these are optional.

[fill][sign][options][width][.precision][type]

The fill takes one or two characters. It’s one of the four alignment characters with an optional prefix. This leads to 8 possibilities.

  • <, fill<. Align to the left. Fill any extra positions on the right with spaces or the fill character.
  • >, fill>. Align to the right. Fill any extra positions on the left with spaces or the fill character. Using *> will prepend * to a number.
  • ^, fill^. Center in the available space. Fill extra positions on left and right with spaces or the fill character.
  • =, fill=. Put the padding between sign and digits. The sign is specified separately, and it’s common to use both fill and sign. For example, use =+ to explicitly show the sign followed by spaces. Another common use is 0=+ to show a sign followed by leading zeroes.

The sign is one character. + shows all signs. - shows only negative signs. A space uses a space for positive and a sign for negative.

One of the options characters is a #. If present, then a prefix (0b, 0o, 0x) is used for binary, octal or hexadecimal conversions.

The other options character is a 0. If present, the number is padded with leading zeroes. This is the same as a 0= fill specification. In effect, it makes the = optional.

The width is the overall number of positions into which the value is converted. By default, numbers are right-aligned with leading spaces and strings are left-aligned with trailing spaces. The various fill and sign options, however, provide a great deal of control over how the value is fit into the available width.

The .precision is the number of decimal places to include. The leading . is required; it separates the precision from the width.

The type is the kind of data conversion to apply. There are two broad categories of conversion: integer and floating-point.

The most common integer conversion codes are d and n. The d conversion is ordinary decimal numbers. Additional integer conversions include b, o, x and X for binary, octal and hexadecimal.

The common float conversion codes are e, E, f, g and G. The e and E conversions give “scientific” notation (3.739000e+01). The f conversion gives ordinary-looking numbers. The g and G conversions choose between f and e formatting. An additional float conversion is % which multiplies by 100 to provide a good-looking percentage value.

Also, there’s an n conversion for localized numbers with proper , or . separators and decimal points.
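The individual features are easier to follow when each is applied to a small value. This is a sketch only; the values 42, 255 and 37.39 are arbitrary, and the | characters are literal text that marks the edges of each field.

```python
from __future__ import print_function

left = "|{0:<6d}|".format(42)        # '|42    |'
right = "|{0:>6d}|".format(42)       # '|    42|'
center = "|{0:^6d}|".format(42)      # '|  42  |'
starred = "{0:*>6d}".format(42)      # '****42'
sign_pad = "{0:=+6d}".format(42)     # '+   42'
zero_pad = "{0:0=+6d}".format(42)    # '+00042'
zero_opt = "{0:+06d}".format(42)     # same result using the 0 option
hexa = "{0:#x}".format(255)          # '0xff'
binary = "{0:#b}".format(5)          # '0b101'
fixed = "{0:8.2f}".format(37.39)     # '   37.39'
science = "{0:e}".format(37.39)      # '3.739000e+01'
percent = "{0:.1%}".format(0.256)    # '25.6%'

# With only a width, numbers go to the right and strings to the left.
default_number = "|{0:6d}|".format(42)
default_string = "|{0:6}|".format("ab")

for result in (left, right, center, starred, sign_pad, zero_pad, zero_opt,
               hexa, binary, fixed, science, percent,
               default_number, default_string):
    print(result)
```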

Examples. Here are some examples of messages with more complex templates.

"{0}: {1} win, {2} loss, {3:6.3f}".format(count,win,loss,float(win)/loss)

This example does four conversions: three simple integer conversions and one floating-point conversion with a width of 6 and 3 digits of precision, which leaves room for a value laid out like -0.000. The rest of the string is literally included in the output.

"Spin {0:>3d}: {1:>2d}, {2}".format(spin,number,color)

This example does three conversions: one number is converted into a right-aligned field with a width of 3, another converted with a width of 2, and a string is converted, using as much space as the string requires.

"Win rate: {0:.1%}".format( win/float(spins) )

This example has one conversion using the % type.

"Pay: {0:*>8d} dollars".format( amount )

This example has one conversion using a leading * fill character.
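To see what these templates produce, we can supply some values. The numbers and the color below are invented for this sketch; they don’t come from any particular program.

```python
from __future__ import print_function

# Hypothetical values for the variables used in the templates.
count, win, loss = 10, 3, 7
spin, number, color = 7, 5, "red"
spins, amount = 8, 1250

record = "{0}: {1} win, {2} loss, {3:6.3f}".format(
    count, win, loss, float(win) / loss)
spin_line = "Spin {0:>3d}: {1:>2d}, {2}".format(spin, number, color)
rate = "Win rate: {0:.1%}".format(win / float(spins))
pay = "Pay: {0:*>8d} dollars".format(amount)

print(record)
print(spin_line)
print(rate)
print(pay)
```

The float() calls matter in Python 2, where dividing one integer by another would discard the fraction.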

Modules That Help Work With Strings

Perhaps the most useful string-related module is the re module. The name is short because it is used so often in so many Python programs. However, it is a little too advanced to cover here. We’ll talk about it in Text Processing and Pattern Matching : The re Module.

The module named string has a number of public module variables which define various subsets of the ASCII characters. These definitions serve as a central, formal repository for facts about the character set. Note that there are general definitions, applicable to Unicode character sets, different from the ASCII definitions.

string.ascii_letters:
 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
string.ascii_lowercase:
 abcdefghijklmnopqrstuvwxyz
string.ascii_uppercase:
 ABCDEFGHIJKLMNOPQRSTUVWXYZ
string.digits:
 0123456789
string.hexdigits:
 0123456789abcdefABCDEF
string.letters:
 All letters; for many locale settings, this will be different from the ASCII letters.
string.lowercase:
 Lowercase letters; for many locale settings, this will be different from the ASCII letters.
string.octdigits:
 01234567
string.printable:
 All printable characters in the character set.
string.punctuation:
 All punctuation in the character set. For ASCII, this is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
string.uppercase:
 Uppercase letters; for many locale settings, this will be different from the ASCII letters.
string.whitespace:
 A collection of characters that cause spacing to happen. For ASCII this is \t\n\x0b\x0c\r followed by a space: Tab (HT), Newline (Line Feed, LF), Vertical Tab (VT), Form Feed (FF), Carriage Return (CR) and space.

You can use these for operations like the following. We often use these string classifiers to test input values we got from a user or read from a file. We use string.uppercase and string.digits in the examples below.

>>> from __future__ import print_function
>>> import string
>>> a= "some input"
>>> a[0] in string.uppercase
False
>>> n= "123-45"
>>> for character in n:
...     if character not in string.digits:
...         print("Invalid character", character)
...
Invalid character -
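The same test can be wrapped in a function so it can be reused for different sets of allowed characters. The name invalid_characters() is made up for this sketch. It uses string.ascii_uppercase rather than string.uppercase, on the assumption that the locale-independent ASCII letters are what we want to accept.

```python
from __future__ import print_function
import string

def invalid_characters(text, allowed=string.digits):
    """Return a list of the characters in text that are not in allowed."""
    return [character for character in text if character not in allowed]

print(invalid_characters("123-45"))
print(invalid_characters("12345"))
print(invalid_characters("A1b", string.ascii_uppercase + string.digits))
```

An empty list means every character passed the test.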

Some Common Processing Patterns

There are a number of common design patterns for manipulating strings. These include adding characters to a string, removing characters from a string and breaking a string into two strings. In some languages, these operations involve some careful planning. In Python, these operations are relatively simple and (hopefully) obvious.

Adding Characters To A String. We add characters to a string by creating a new string that is the concatenation of the original strings. For example:

>>> a="lunch"
>>> a=a+"meats"
>>> a
'lunchmeats'

Some programmers who have extensive experience in other languages will ask if creating a new string from the original strings is the most efficient way to accomplish this. Or they suggest that it would be “simpler” to allow mutable strings for this kind of concatenation. The short answer is that Python’s storage management makes this use of immutable strings the simplest and most efficient. We’ll discuss this in some depth in Sequence FAQ’s.
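A small sketch shows what “creating a new string” means in practice: the original string object is left untouched, and any other variable that refers to it still sees the old value. The sketch also shows str.join() as one way to assemble several pieces in a single step; the variable names are illustrative.

```python
from __future__ import print_function

a = "lunch"
original = a          # a second name for the same string object
a = a + "meats"       # builds a new string; the old one is untouched

# When there are many pieces, str.join() assembles them in one step.
pieces = ["lunch", "meats", "!"]
joined = "".join(pieces)

print(original, a, joined)
```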

Removing Characters From A String. Sometimes we want to remove some characters from a string. Python encourages us to create a new string that is built from pieces of the original string. For example:

>>> s="black,thorn"
>>> s = s[:5] + s[6:]
>>> s
'blackthorn'

In this example, we dropped the comma, which is the sixth character (in position 5). Recall that the positions are numbered from zero. Positions 0, 1 and 2 are the first three characters. Position 5 is the sixth character. Here’s how this example works.

  1. Create a slice of s using characters up to the fifth. This is positions 0 through 4, a total of five characters.
  2. Create a slice of s using characters starting from position 6 (the seventh character) through the end of the string.
  3. Assemble a new string from these two slices; the sixth character (position 5) will have been ignored when we created the two slices.

In other languages, there are sophisticated methods to delete particular characters from a string. Again, Python makes this simpler by letting us create a new string from pieces of the old string.
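The two-slice pattern can be packaged as a small function. The name remove_at() is hypothetical, chosen for this sketch. When we know the character rather than its position, the str.replace() method is an alternative.

```python
from __future__ import print_function

def remove_at(text, position):
    """Return a copy of text without the character at position."""
    # Everything before the position, plus everything after it.
    return text[:position] + text[position + 1:]

by_position = remove_at("black,thorn", 5)

# Removing by value instead of by position.
by_value = "black,thorn".replace(",", "")

print(by_position, by_value)
```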

Breaking a String at a Fixed Position. Often, we will break a string into pieces based on a fixed format. Python gives us a very handy way to do this.

>>> fn="1985 Mar 19"
>>> year= fn[:4]
>>> month= fn[5:8]
>>> day= fn[-2:]
>>> month
'Mar'
>>> day
'19'
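The slices above can be collected into one function that returns all three pieces at once. This is a minimal sketch, assuming the input always follows the fixed layout shown; parse_date() is a made-up name.

```python
from __future__ import print_function

def parse_date(text):
    """Break a 'YYYY Mon DD' string into year, month and day strings."""
    year = text[:4]      # positions 0 through 3
    month = text[5:8]    # positions 5 through 7
    day = text[-2:]      # the last two characters
    return year, month, day

year, month, day = parse_date("1985 Mar 19")
print(year, month, day)

# The pieces are still strings; int() converts the digits to a number.
print(int(year) + 1)
```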

Breaking a String at a Punctuation Mark. There are numerous variations on the parsing theme. We’ll look at just one: locating a punctuation mark to split a string.

>>> prop="name : value which has : in it"
>>> label, _, value = prop.partition( ":" )
>>> label.rstrip()
'name'
>>> value.lstrip()
'value which has : in it'

In this example, we assigned the punctuation mark to the variable _. This variable is sometimes used as a “don’t care” variable. We know that str.partition() always provides three values, but we only want two of them.
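Combining str.partition() with the strip methods gives a compact parser for lines of this shape. The function parse_property() is a hypothetical helper for this sketch, not a library function.

```python
from __future__ import print_function

def parse_property(line):
    """Split 'label : value' at the first colon and trim the spaces."""
    label, _, value = line.partition(":")
    return label.rstrip(), value.lstrip()

print(parse_property("name : value which has : in it"))
print(parse_property("no colon here"))
```

When there is no colon, str.partition() puts the whole line in the first position, so the value comes back as a zero-length string.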

String Exercises

  1. Is Each Letter Unique?

    Given a ten-letter word, is each letter unique? Further, do the letters occur in alphabetical order?

    Let’s say we have a 10-letter word in the variable w. We want to know if each letter occurs just once in the word. For example, “pathogenic” has each letter occurring just once. On the other hand, “pathologic” does not: the letter “o” occurs twice.

    To determine if each letter is unique, we’ll need to extract each letter from the word, and then use the count() method function to determine if that letter occurs just once in the word.

    Write a loop which will examine each letter in a word to see if the count of occurrences is just one or more than one. If all counts are one, this is a ten-letter word with 10 unique letters.

    Here’s a batch of words to use for testing:

    patchworks patentable paternally pathfinder pathogenic

    The alphabetical order test is more difficult. In this case, we need to be sure that each letter comes before the next letter in the alphabet. We’re asking that w[0] <= w[1] <= w[2].... We can break this long set of comparisons down to a shorter expression that we can evaluate in a loop. We can use w[0] <= w[1], and w[1] <= w[2] to examine each letter and its successor.

    Write a loop to examine each character to determine if the letters of the word occur in alphabetical order. Words like “abhors” or “almost” have the letters in alphabetical order.

  2. Roman Numerals.

    This is similar to translating numbers to English. Instead we will translate them to Roman Numerals.

    The algorithm is similar to Check Amount Writing (above). You will pick off successive digits, using amount%10 and amount/10 to gather the digits from right to left.

    The rules for Roman Numerals involve using three pairs of symbols for ones and fives, tens and fifties, hundreds and five hundreds. An additional symbol for thousands covers all the relevant bases.

    When a number is followed by the same or smaller number, it means addition. “II” is two 1’s = 2. “VI” is 5 + 1 = 6.

    When one number is followed by a larger number, it means subtraction. “IX” is 1 before 10 = 9. “IIX” isn’t allowed; this would be “VIII”.

    For numbers from 1 to 9, the symbols are “I” and “V”, and the coding works like this.

    1. “I”
    2. “II”
    3. “III”
    4. “IV”
    5. “V”
    6. “VI”
    7. “VII”
    8. “VIII”
    9. “IX”

    The same rules work for numbers from 10 to 90, using “X” and “L”, and for numbers from 100 to 900, using “C” and “D”. Numbers between 1000 and 4000 use “M”.

    Here are some examples. 1994 = MCMXCIV, 1956 = MCMLVI, 3888= MMMDCCCLXXXVIII

  3. Word Lengths.

    Analyze the following block of text. You’ll want to break it into words on whitespace boundaries. Then you’ll need to discard all punctuation from before, after or within a word.

    What’s left will be a sequence of words composed of ASCII letters. Compute the length of each word, and produce the sequence of digits. (no word is 10 or more letters long.)

    Compare the sequence of word lengths with the value of math.pi.

    Poe, E.
    Near a Raven
    
    Midnights so dreary, tired and weary,
    Silently pondering volumes extolling all by-now obsolete lore.
    During my rather long nap - the weirdest tap!
    An ominous vibrating sound disturbing my chamber's antedoor.
    "This", I whispered quietly, "I ignore".

    This is based on http://www.cadaeic.net/cadenza.htm.