A string, which Python calls str, is a sequence of characters. By “character” we mean any of the 128 US-ASCII characters: the digits, the punctuation marks, the letters.
In the case of a unicode object, we mean a sequence drawn from the far larger set of Unicode characters.
We’ll more fully define string in What Does Python mean by “String?”. We’ll show the syntax for strings in Writing a String in Python and the factory functions that create strings in String Factory Functions.
We’ll look at the standard sequence operators and how they apply to strings in Operating on String Data. We’ll focus on a unique string operator, %, in % : The Message Formatting Operator. We’ll look at some built-in functions in Built-in Functions for Strings. We’ll cover the comparison operators in Comparing Two Strings – Alphabetical Order. There are numerous string methods that we’ll look at in Methods Strings Perform.
There is a string module, but it isn’t heavily used. We’ll look at it briefly in Modules That Help Work With Strings. Part 8 of the Python Library Reference [PythonLib] contains 11 modules that work with strings; we won’t dig into these deeply. We’ll return to the most important string module in Text Processing and Pattern Matching : The re Module.
We’ll look at some common patterns of string processing in Some Common Processing Patterns.
A string is an immutable sequence of characters. Let’s look at this definition in detail.
Here’s a depiction of a string of 10 characters. The Python value is "syncopated". Each character has a position that identifies the character in the string.
| position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| character | s | y | n | c | o | p | a | t | e | d |
We get string objects from external devices like the keyboard, files or the network. We present strings to users either as files or on the GUI display. The print statement converts data to a string before showing it to the user. This means that printing a number really involves converting the number to a string of digits before printing the string of digit characters.
Often, our program will need to examine input strings to be sure they are valid. We may be checking a string to see if it is a legal name for a day of the week. Or, we may do a more complex examination to confirm that it is a valid time. There are a number of validations we may have to perform.
Our computations may involve numbers derived from input strings. Consequently, we may have to convert input strings to numbers or convert numbers to strings for presentation.
We looked at strings quickly in Strings – Anything Not A Number. A String is a sequence of characters. We can create strings as literals or by using any number of factory functions.
When writing a string literal, we need to separate the characters that are in the string from the surrounding Python values. String literals are created by surrounding the characters with quotes or apostrophes. We call this surrounding punctuation quote characters, even though we can use apostrophes as well as quotes.
There are several variations on the quote characters that we use to define string literals.
Single-quote. A single-quoted string uses either the quote (") or apostrophe ('). A basic string must be completed on a single line. Either delimiter produces the same kind of string: "abc" and 'abc' are essentially the same string.
Triple-quote. Multi-line strings can be enclosed in triple quotes or triple apostrophes. A multi-line string continues on until the matching triple-quote or triple-apostrophe.
Here are some examples of creating strings.
a= "consultive"
apos= "Don't turn around."
quote= '"Stop," he said.'
doc_1= """fastexp(n,p) -> integer

Raises n to the p power, where p is a positive integer.

:param n: a number
:param p: an integer power
"""
novel= '''"Just don't shoot," Larry said.'''
| a: | A simple string. |
|---|---|
| apos: | A string using ". It has an ' inside it. |
| quote: | A string using '. It has two " inside it. |
| doc_1: | This is a six-line string. Use repr(doc_1) to see how many lines it has. Better, use doc_1.splitlines(). |
| novel: | This is a one-line string with both " and ' inside it. |
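As a quick check on the examples above, here is a minimal sketch. The names same_1, same_2 and doc are ours, chosen only for illustration. It confirms that the choice of quote character does not change the string, and it uses splitlines() to count the lines in a triple-quoted string. The print calls are written in the function-call form so the sketch runs under both Python 2 and Python 3.

```python
# Quote and apostrophe delimiters build the same string value.
same_1 = "consultive"
same_2 = 'consultive'
print(same_1 == same_2)    # True

# A triple-quoted string keeps its embedded line breaks.
doc = """first line
second line
"""
line_count = len(doc.splitlines())
print(line_count)          # 2
print(repr(doc))           # shows each line break as \n
```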
Non-Printing Characters – Really! [How can it be a character and not have a printed representation?]
ASCII has a few dozen characters that are intended to control devices or adjust spacing on a printed document.
There are a few commonly-used non-printing characters: mostly tab and newline. One of the most common escapes is \n, which represents the non-printing newline character that appears at the end of every line of a file in GNU/Linux or MacOS. Windows will often use a two-character end-of-line sequence encoded as \r\n. Most of our editing tools quietly accept either line-ending sequence.
These non-printing characters are created using escapes. A table of escapes is provided below. Normally, the Python compiler translates the escape into the appropriate non-printing character.
Here are a couple of literal strings with a \n character to encode a line break in the middle of the string.
'The first message.\nFollowed by another message.'
"postmarked forestland\nconfigures longitudes."
Python supports a broad selection of \ escapes. These are printed representations for unprintable ASCII characters. They’re called escapes because the \ is an escape from the usual meaning of the following character. We have very little use for most of these ASCII escapes. The newline (\n), backslash (\\), apostrophe (\') and quote (\") escapes are handy to have.
Important
Escapes Become Single Characters
We type two (or more) characters to create an escape, but Python compiles this into a single character in our program.
In the most common case, we type \n and Python translates this into a single ASCII character that doesn’t exist on our keyboard.
Since \ is always the first of two (or more) characters, what if we want a plain-old \ as the single resulting character? How do we stop this escape business?
The answer is we don’t. When we type \\, Python puts a single \ in our program. Okay, it’s clunky, but it’s a character that isn’t used all that often. The few times we need it, we can cope. Further, Python has a “raw” mode that permits us to bypass these escapes.
| Escape | Meaning |
| \\ | Backslash (\) |
| \' | Apostrophe (') |
| \" | Quote (") |
| \a | Audible Signal; the ASCII code called BEL. Some OS’s translate this to a screen flash or ignore it completely. |
| \b | Backspace (ASCII BS) |
| \f | Formfeed (ASCII FF). On a paper-based printer, this would move to the top of the next page. |
| \n | Linefeed (ASCII LF), also known as newline. This would move the paper up one line. |
| \r | Carriage Return (ASCII CR). On a paper based printer, this returned the print carriage to the start of the line. |
| \t | Horizontal Tab (ASCII TAB) |
| \ooo | An ASCII character with the given octal value. The ooo is any octal number. |
| \xhh | An ASCII character with the given hexadecimal value. The x is required. The hh is any hex number. |
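The rule that an escape becomes a single character is easy to confirm with len(). This is only a small sketch; the variable names are ours. It is written so that it runs under both Python 2 and Python 3.

```python
# Each escape is typed as two or more characters but stored as one.
newline = "\n"
backslash = "\\"
octal_a = "\101"      # octal 101 is decimal 65, the letter A
hex_a = "\x41"        # hexadecimal 41 is also decimal 65

print(len(newline))              # 1
print(len(backslash))            # 1
print(octal_a == hex_a == "A")   # True
```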
We can also use a \ at the end of a line, which means that the end-of-line is ignored. The string continues on the next line, skipping over the line break. Here’s an example of a single string that was so long that we had to break it into multiple lines.
"A manuscript so long \
that it takes more than one \
line to finish it."
Why would we have this special dangling-backslash? Compare the previous example with the following.
"""A manuscript so long
that it takes more than one
line to finish it."""
What’s the difference? Enter them both into IDLE to see what Python displays. One string represents a single line of data, where the other string represents three lines of data. Since the \ escapes the meaning of the newline character, the newline vanishes from the string. This gives us a very fine degree of control over how our output looks.
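If IDLE isn't handy, the comparison can be sketched as a short script. The names one_line and three_lines are ours. The sketch runs under both Python 2 and Python 3.

```python
# A dangling backslash swallows the line break; triple quotes keep it.
one_line = "A manuscript so long \
that it takes more than one \
line to finish it."

three_lines = """A manuscript so long
that it takes more than one
line to finish it."""

print(len(one_line.splitlines()))      # 1
print(len(three_lines.splitlines()))   # 3
print("\n" in one_line)                # False
```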
Also note that adjacent string literals are automatically put together to make a longer string. We won’t make much use of this, but it is something that you may encounter when reading someone else’s programs.
"syn" "opti" "cal" is the same as "synoptical".
Unicode Strings. If a u or U is put in front of the string (for example, u"unicode"), this indicates a Unicode string. Without the u, it is an ordinary str, which we use for ASCII text. Unicode refers to the Universal Character Set; depending on the encoding, each character requires from 1 to 4 bytes of storage. ASCII is a single-byte character set; each of the 128 ASCII characters fits in a single byte of storage. Unicode permits any character in any of the languages in common use around the world.
For the thousands of Unicode characters that are not on our computer keyboards, a special \uxxxx escape is provided. This requires the four digit Unicode character identification. For example, “日本” is made up of Unicode characters U+65e5 and U+672c. In Python, we write this string as u'\u65e5\u672c'.
Here’s an example that shows the internal representation and the easy-to-read output of this string. This will work nicely if you have an appropriate Unicode font installed on your computer. If this doesn’t work, you’ll need to do an operating system upgrade to get Unicode support.
>>> ch= u'\u65e5\u672c'
>>> ch
u'\u65e5\u672c'
>>> print ch
日本
There are a variety of Unicode encoding schemes. The most common encodings make some basic assumptions about the typical number of bytes for a character. For example, the UTF-16 codes are most efficient when most of the characters actually use two bytes and there are relatively few exceptions. The UTF-8 codes work well on the internet, where many of the protocols expect only the US-ASCII characters. In the rare event that we need to control this, the codecs module provides mechanisms for encoding and decoding Unicode strings.
See http://www.unicode.org for more information.
Raw Strings. If an r or R is put in front of the string (for example, r"raw\nstring"), this indicates a raw string. This is a string where the backslash characters (\) are not interpreted by the Python compiler but are left as is. This is handy for Windows file names, which contain \. It is also handy for regular expressions that make heavy use of backslashes. We’ll look at these in Text Processing and Pattern Matching : The re Module.
"\n" is an escape that’s converted to a single unprintable newline character.
r"\n" is two characters, \ and n .
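A Windows-style file name shows why raw strings are handy. The path below is made up for illustration. Without the r prefix, the \n and \t in the middle of the name are quietly turned into a newline and a tab. The sketch runs under both Python 2 and Python 3.

```python
# Hypothetical file name, used only to show the effect of the r prefix.
ordinary = "C:\new\table.txt"    # \n and \t become newline and tab
raw = r"C:\new\table.txt"        # every backslash is kept as typed

print(len(ordinary))                  # 14
print(len(raw))                       # 16
print(raw == "C:\\new\\table.txt")    # True
```

The last line shows the alternative to a raw string: double every backslash.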
There is some subtlety to the factory functions which create strings. We have two conflicting interpretations of “string representation” of an object. For simple data types, like numbers, the string version of the number is the sequence of characters. However, for more complex objects, we often want something “readable” that doesn’t contain every nuance of the object’s value. Consequently, we have two factory functions for strings: str() and repr().
You can make use of repr() to get a detailed view of a specific sequence to help you in debugging. This can, for example, reveal non-printing characters in a character string.
The str() function converts any object to a string. Plus, we’ve seen other functions (like hex() and oct()) that produce strings.
>>> a= str(355.0/113.0)
>>> a
'3.14159292035'
>>> hex(48813)
'0xbead'
The repr() function also converts an object to a string. However, repr() creates a string suitable for use as Python source code. For simple numeric types, it’s not terribly interesting. For more complex types, however, it reveals details of their structure.
Important
Python 3
In Python 2, the repr() function can also be invoked using the backtick (`), also called accent grave.
This ` syntax is not used much and is not available in Python 3.
Here are several versions of a very long string, showing a number of representations.
>>> a="""a very
... long symbolizer
... on multiple lines"""
>>> repr(a)
"'a very\\nlong symbolizer\\non multiple lines'"
>>> a
'a very\nlong symbolizer\non multiple lines'
>>> print a
a very
long symbolizer
on multiple lines
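The difference between the two factory functions can also be measured. In this sketch (the names are ours), repr() adds the surrounding apostrophes and spells the newline as the two characters \ and n, so its result is three characters longer than the string itself. Evaluating that result as Python source gives back the original string. The sketch runs under both Python 2 and Python 3.

```python
text = "a very\nlong symbolizer"

as_str = str(text)      # the string itself, line break and all
as_repr = repr(text)    # source-code form: quotes added, newline shown as \n

print(len(as_str))              # 22
print(len(as_repr))             # 25
print(as_repr)                  # 'a very\nlong symbolizer'
print(eval(as_repr) == text)    # True
```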
The unicode() function converts an encoded str to an internal Unicode String. There are a number of ways of encoding a Unicode string so that it can be placed into email or a database. If no encoding is given, the current default string encoding is used, usually 'ascii', with 'strict' error handling. Choices for errors are 'strict', 'replace' and 'ignore'. Strict raises an exception for unrecognized characters, replace substitutes the Unicode replacement character (\uFFFD) and ignore skips over invalid characters. The codecs and unicodedata modules provide more functions for working with explicit Unicode conversions.
>>> unicode('\xe6\x97\xa5\xe6\x9c\xac','utf-8')
u'\u65e5\u672c'
The above example shows the UTF-8 encoding for 日本 as a string of bytes and as a Python Unicode string. The Unicode string character numbers (\u65e5 and \u672c) are easier to read as a Unicode string than they are in the UTF-8 encoding.
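The three error-handling choices can be seen by decoding a damaged byte string. This sketch makes two substitutions so that it runs under both Python 2.6+ and Python 3: it uses the decode() method, which does the same job as the unicode() function, and it writes the bytes with a b prefix. The names raw_bytes and damaged are ours.

```python
raw_bytes = b'\xe6\x97\xa5\xe6\x9c\xac'    # UTF-8 bytes for the two characters
text = raw_bytes.decode('utf-8')
print(len(raw_bytes))    # 6
print(len(text))         # 2

damaged = raw_bytes[:5]  # chop a byte off, leaving the last character incomplete

try:
    damaged.decode('utf-8')          # 'strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = damaged.decode('utf-8', 'replace')   # bad bytes become \ufffd
ignored = damaged.decode('utf-8', 'ignore')     # bad bytes are dropped

print(strict_failed)     # True
print(repr(replaced))
print(repr(ignored))
```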
There are a number of operations that apply to string objects. Since a string (even a string of digits) isn’t a number, these operations do simple manipulations on the sequence of characters.
If you need to do arithmetic operations on strings, you’ll need to convert the string to a number using one of the number factory functions int(), float(), long() or complex(). See Functions are Factories (really!) for more information on these functions. Once you have a proper number, you can do arithmetic on it and then convert the result back into a string using str(). We’ll return to this later. For now, we’ll focus on manipulating strings.
There are three operations (+, *, [ ]) that work with strings and a unique operation, %, that has a special formatting meaning when it is applied to a string. The % is so sophisticated that we’ll devote a separate section to just that operator.
The + Operator. The + operator creates a new string as the concatenation of two strings. A resulting string is created by gluing the two argument strings together.
>>> "hi " + 'mom'
'hi mom'
The * Operator. The * operator between strings and numbers (number * string or string * number) creates a new string that is a number of repetitions of the argument string.
>>> print 2*"way " + "cool!"
way way cool!
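The two operators above treat a string of digits as text, not as a number. Here is a minimal sketch (the names price and count are ours) that contrasts them with arithmetic done after converting with int(), and then converts the result back with str(). It runs under both Python 2 and Python 3.

```python
price = "12"
count = "3"

glued = price + count              # '123': + on strings is concatenation
repeated = price * 3               # '121212': * repeats the string
total = int(price) * int(count)    # 36: arithmetic needs real numbers
label = "total: " + str(total)     # back to a string for presentation

print(glued)
print(repeated)
print(label)       # total: 36
```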
The [ ] operator. The [ ] operator can extract a single character or a substring from the string. There are two forms for picking items or slices from a string.
The single item operation is string [ index ]. Items are numbered from 0 to len(string)-1. Items are also numbered in reverse from -len(string) to -1.
The slice operation is string [ start : end ]. Characters from start to end-1 are chosen to create a new string as a slice of the original string; there will be end - start characters in the resulting string. If start is omitted, the slice begins at the beginning of the string (position 0); if end is omitted, the slice runs through the last character of the string, as if end were len(string).
For more information on how the numbering works for the [ ] operator, see Numbering from Zero.
Important
The meaning of []
Note that the [] characters are part of the syntax. When you read other Python documents, you will see [] characters used in two senses: as syntax and also to mark optional parts of the syntax.
In the statement summaries in this book, we use 〈 and 〉 for optional elements in an effort to reduce the confusion that can be caused by having two meanings for [] characters.
However, for function and method summaries, the publishing software uses [ and ], which look enough like [ and ] to lead to potential confusion.
Here are some examples of picking out individual items or creating a slice composed of several items.
>>> s="artichokes"
>>> s[2]
't'
>>> s[:5]
'artic'
>>> s[5:]
'hokes'
>>> s[2:3]
't'
>>> s[2:2]
''
The last example, s[2:2], shows an empty slice. Since the slice is from position 2 to position 2-1, there can’t be any characters in that range; it’s a kind of contradiction to ask for characters 2 through 1. Python politely returns an empty string, which is a sensible response to the expression.
Recall that string positions are also numbered from right to left using negative numbers. s[-2] is the next-to-last character. We can, then, say things like the following to work from the right-hand side instead of the left-hand side.
>>> s="artichokes"
>>> s[-2]
'e'
>>> s[-3:-1]
'ke'
>>> s[-1:1]
''
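A few more slices, sketched as a script that runs under both Python 2 and Python 3, pull these rules together. Note in particular that leaving out the end is not the same as writing -1 for the end.

```python
s = "artichokes"
n = len(s)

last_both_ways = (s[-1], s[n - 1])   # two names for the last character
middle = s[2:7]                       # 7 - 2 = 5 characters
whole = s[:4] + s[4:]                 # splitting at any position loses nothing
to_end = s[5:]                        # runs through the last character
short = s[5:-1]                       # stops before the last character

print(last_both_ways)    # ('s', 's')
print(middle)            # ticho
print(whole == s)        # True
print(to_end)            # hokes
print(short)             # hoke
```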
The % operator is used to format a message. The argument values are a template string and a tuple of individual values. The operator creates a new string by folding together these two elements: the values are converted to strings and inserted into the template.
First we’ll look at a quick example, then we’ll look at the real processing rules behind this operator. This example has a template string and two values that are used to create a resulting string.
>>> "Today's temp is %dC (%dF)" % (3, 37.39)
"Today's temp is 3C (37F)"
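The template string is "Today's temp is %dC (%dF)". The two values are (3, 37.39). You can see that the values were used to replace the %d conversion specifications.
Our template string, then, was really in five parts: the literal text "Today's temp is ", a %d conversion, the literal text "C (", a second %d conversion, and the literal text "F)".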
Rules of the Game. There are two important rules for working with formatting strings.
The first rule of the % conversion is that our template string is a mixture of literal text and conversion specifications. The conversion specifications begin with % and end with a letter. They’re generally pretty short, and the % makes them stand out from the literal text. Everything outside the % conversions is just transcribed into the message.
The second rule is that each % conversion specification takes another item from the tuple that has the values to be inserted into the message. The first conversion uses the first value of the tuple, the second conversion uses the second value from the tuple. If the number of conversion specifications and items don’t match exactly, you get an error and your program stops running.
What if we want to have a % in our output? What if we were doing something like "The return is 12.5%"? To include a single % in the resulting string, we use %% in the template.
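The two rules, and the %% escape, can be sketched as follows. The helper try_format() is our own illustrative function, not part of Python. It returns either the formatted message or the text of the TypeError that Python raises when the conversions and the values don't match. The sketch runs under both Python 2.6+ and Python 3.

```python
def try_format(template, values):
    """Return the formatted string, or the TypeError text on a mismatch."""
    try:
        return template % values
    except TypeError as e:
        return "TypeError: " + str(e)

# Literal text is copied, each conversion takes the next value, %% makes one %.
ok = try_format("%d of %d spins won (%d%%)", (3, 12, 25))
too_few = try_format("%d and %d", (1,))      # two conversions, one value
too_many = try_format("%d", (1, 2))          # one conversion, two values

print(ok)          # 3 of 12 spins won (25%)
print(too_few)
print(too_many)
```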
Conversions: Five Things to Control. There are a number of things we need to control when converting numbers to strings.
It is important to note that these conversion specifications match the C programming language printf() function specifications. Since Python is not C, there are some nuances of C-language conversions which don’t make much sense for Python programs. The specification rules are still here, however, to make it easy to convert a C program into Python.
To provide tremendous flexibility, each conversion specification has the following elements. In this syntax summary, note that the 〈 and 〉 indicate that all of the specification elements except the final code letter are optional.
% 〈 flags 〉 〈 width 〉 〈 . precision 〉 code
Here are some common examples of these conversion specifications. We’ll look at these quickly before digging into each part of the conversion specification separately. Then we’ll reassemble the entire message template from literal text and conversion specifications.
>>> "%d" % ( 12.345, )
'12'
>>> "%.2f" % ( 12.345, )
'12.35'
>>> "%-12s" % ( 12.345, )
'12.345      '
>>> "%#x" % ( 12.345, )
'0xc'
The %d conversion is appropriate for decimal integers, so the floating-point number is converted to an integer when it is displayed. The %.2f conversion is for floating-point numbers, and rounds to the given number of positions (2 in this case). The %-12s conversion is appropriate for strings, so the floating-point number is turned into a string, then left-justified in a 12-position string. The %#x conversion shows the hex value of an integer, so the floating-point number is converted to the integer 12, then displayed in Python hexadecimal notation (0xc).
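Flags. The optional flags can have any combination of the following values: - (left-justify within the width), + (always show a sign), a space (leave a blank in front of positive numbers), 0 (pad with zeros instead of spaces) and # (use the alternate form, such as the 0x prefix on hexadecimal).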
Width. The width specifies the total number of characters for the field, including signs and decimal points. If omitted, the width is just big enough to hold the output number.
In order to fill up the width, spaces (or zeros) will be added to the number. The flags of - or 0 determine precisely how the spaces are allocated or if zeros should be used.
Look at the following variations on %d conversion.
>>> "%d" % 12
'12'
>>> "%5d" % 12
'   12'
>>> "%-5d" % 12
'12   '
>>> "%05d" % 12
'00012'
If a * is used for the width, an item from the tuple of values is used as the width of the field. "%*i" % ( 3, d1 ) uses the value 3 from the tuple as the field width and d1 as the value to convert to a string. This makes a single template string somewhat more flexible.
Precision. The precision (which must be preceded by a .) is the number of digits to the right of the decimal point. For string conversions, the precision is the maximum number of characters to be printed; longer strings will be truncated.
This is how we can control the run-on decimal expansion problem. We use conversions like "%.3f" % aNumber to convert the number to a string with the desired number of decimal places.
>>> 2.3
2.2999999999999998
>>> "%.3f" % 2.3
'2.300'
If a * is used for the precision, an item from the tuple of values is used as the precision of the conversion. A * can be used for width also.
For example, "%*.*f" % ( 6, 2, avg ) uses the value 6 from the tuple as the field width, the value 2 from the tuple as the precision and avg as the value. This makes a single template string somewhat more flexible.
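Here is a sketch of the flags, width and precision working together on one value. The value 3.14159 stands in for avg and is our own choice. repr() is used on two of the results so that the padding spaces are visible. The sketch runs under both Python 2 and Python 3.

```python
avg = 3.14159

fixed = "%8.3f" % avg               # width 8, 3 digits after the point
starred = "%*.*f" % (8, 3, avg)     # same layout, sizes taken from the tuple
left = "%-8.3f|" % avg              # the - flag pads on the right instead
zeros = "%08.3f" % avg              # the 0 flag pads with zeros
clipped = "%.3s" % "syncopated"     # precision truncates a string

print(repr(fixed))         # '   3.142'
print(fixed == starred)    # True
print(repr(left))          # '3.142   |'
print(zeros)               # 0003.142
print(clipped)             # syn
```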
Long and Short Indicators. The standard conversion rules also permit a long or short indicator: l or h. These are tolerated by Python, but have no effect. They reflect internal representation considerations for C programming, not external formatting of the data. For programs that were converted from C, this may show up in a template string, and will be gracefully ignored by Python.
Conversion Code. The one-letter code specifies the conversion to perform. The codes are listed below.
| Format Character | Conversion |
| % | Creates a single %. Use %% to put a single % in the resulting string. |
| c | Convert a single character string. Also converts an integer to an ASCII character. |
| s | Apply the str function and include that string. |
| r | Apply the repr function and include that string. |
| i or d | Convert a number to an integer and include the string representation of that integer. |
| u | This is a numeric conversion that’s here for compatibility with legacy C programs. |
| o | Use the oct function and include that octal string. |
| x or X | Use the hex function and include that hexadecimal string. The %x version produces lowercase letters; the %X version produces uppercase letters. |
| e or E | Convert the number to a float and use scientific notation. The %e version produces ±d.ddde±xx; the %E version produces ±d.dddE±xx, for example 6.02E+23. |
| f or F | Convert the number to a float and include the standard string representation of that number. |
| g or G | “Generic” floating-point format. Use %e or %E for very small or very large exponents, otherwise use an %f conversion. |
Examples. Here are some examples of messages with more complex templates.
"%i: %i win, %i loss, %6.3f" % (count,win,loss,float(win)/loss)
This example does four conversions: three simple integer and one floating-point that provides a width of 6 and 3 digits of precision. -0.000 is the expected format. The rest of the string is literally included in the output.
"Spin %3i: %2i, %s" % (spin,number,color)
This example does three conversions: one number is converted into a field with a width of 3, another converted with a width of 2, and a string is converted, using as much space as the string requires.
"Win rate: %.4f%%" % ( win/float(spins) )
This example has one conversion, but includes a literal % , which is created by using %% in the template.
The following built-in functions are relevant to working with strings and characters.
For character code manipulation, there are three related functions: chr(), ord() and unichr(). chr() returns the ASCII character that belongs to an ASCII code number. unichr() returns the Unicode character that belongs to a Unicode number. ord() transforms an ASCII character to its ASCII code number, or transforms a Unicode character to its Unicode number.
len(object): Return the number of items of a set, sequence or mapping.
>>> len("restudying")
10
>>> len(r"\n")
2
>>> len("\n")
1
Note that a raw string (r"\n") doesn’t use escapes; this is two characters. An ordinary string ("\n") interprets the escapes; this is one unprintable character.
chr(i): Return a string of one character with ordinal i, where i is in the range 0 to 255.
This is the standard US ASCII conversion, chr(65) == 'A'.
ord(c): Return the integer ordinal of a one character string. For an ordinary character, this will be the US ASCII code. ord('A') == 65.
For a Unicode character this will be the Unicode number. ord(u'\u65e5') == 26085.
unichr(i): Return a Unicode string of one character with ordinal i, where i is in the range 0 to 0x10ffff (narrow builds of Python 2 stop at 0xffff).
This is the Unicode mapping, defined in http://www.unicode.org/.
>>> unichr(26085)
u'\u65e5'
>>> print unichr(26085)
日
>>> ord(u'\u65e5')
26085
Note that min() and max() also apply to strings. The min() function will return the character closest to the front of the alphabet. The max() function returns the character closest to the back of the alphabet.
>>> max('restudying')
'y'
>>> min('restudying')
'd'
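The functions in this section fit together: min() and max() compare characters by the code numbers that ord() reports, and chr() turns a code number back into a character. In the sketch below, shift() is our own illustrative helper, not a built-in, and it assumes its argument contains only lowercase ASCII letters. The sketch runs under both Python 2 and Python 3.

```python
word = "restudying"

# min() and max() pick the characters with the smallest and largest codes.
lowest = min(word)
highest = max(word)
codes = (ord(lowest), ord(highest))

def shift(text):
    """Move each lowercase letter one place forward, wrapping z to a."""
    shifted = []
    for c in text:
        offset = (ord(c) - ord('a') + 1) % 26
        shifted.append(chr(ord('a') + offset))
    return "".join(shifted)

print(codes)            # (100, 121)
print(shift("hal"))     # ibm
print(shift("xyz"))     # yza
```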
The standard comparisons ( <, <=, >, >=, ==, !=) apply to strings. These comparisons use character-by-character comparison rules for ASCII or Unicode. This will keep things in the expected alphabetical order.
The rules for alphabetical order include a few nuances that may cause some confusion for newbies.
Here are some examples.
>>> 'hello' < 'world'
True
>>> 'inordinate' > 'in'
True
>>> '1' < '11'
True
>>> '2' < '11'
False
These rules for alphabetical order are much simpler than, for example, the American Library Association Filing Rules. Those rules are quite complex and have a number of exceptions and special cases.
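Two consequences of comparing character by character, using code numbers, are worth seeing in a sort. Digits come before uppercase letters, which come before lowercase letters, and strings of digits are not ordered by numeric value. The word list below is our own. Using key=str.lower is one common way to get a case-insensitive order; it is a choice made for this sketch, not something the comparison operators do for us. The sketch runs under both Python 2 and Python 3.

```python
words = ["pear", "Zebra", "apple", "10", "9"]

# Plain comparison: by character code, one character at a time.
ordered = sorted(words)
print(ordered)     # ['10', '9', 'Zebra', 'apple', 'pear']

# Folding case first gives the order most people expect for words.
folded = sorted(words, key=str.lower)
print(folded)      # ['10', '9', 'apple', 'pear', 'Zebra']
```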
There are two additional string comparisons: in and not in. These check to see if a string, often a single character, occurs in a longer string. The in operator returns True when the character is found in the string, False if the character is not found. The not in operator returns True if the character is not found in the string.
>>> "i" in 'microquake'
True
>>> "i" in 'formulates'
False
There are three statements that are associated with strings: the various kinds of assignment statements, the for statement, which deals with sequences of all kinds, and the print statement.
The Assignment Statements. The basic assignment statement applies a new variable name to a string object. This is the expected meaning of assignment.
The augmented assignments – += and *= – work as expected. a += 'more data' is the same as a = a + 'more data'. Recall that a string is immutable; something like a += 'more data' must create a new string from the old value of a and the string 'more data'.
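Immutability is easy to observe by giving one string two names. In this sketch (the names a and b are ours), the augmented assignment rebinds a to a new string while b still refers to the old one, and an attempt to change a character in place is refused. It runs under both Python 2 and Python 3.

```python
a = "some data"
b = a                  # a second name for the same string object
a += " more data"      # builds a new string and rebinds a

print(a)               # some data more data
print(b)               # some data

try:
    b[0] = "S"         # strings do not support item assignment
    changed = True
except TypeError:
    changed = False
print(changed)         # False
```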
The for Statement. Since a string is a sequence, the for statement will visit each character of the string.
for c in "lobstering":
    print c
The print Statement. The print must convert each expression to a string before writing the strings to the standard output file.
Generally, this is what we expect. Sometimes, however, this has odd features. For example, when we do print abs(-5), the argument is an integer and the result is an integer. This integer result is converted to the obvious string value and printed.
If we do print abs, what happens? We’re not applying the abs() function to an argument. We’re just converting the function to a string and printing it.
All Python objects have a string representation of some kind. Therefore, the print statement is capable of printing anything.
A string object has a number of method functions. These can be separated into three groups: transformations, accessors and parsers.
Transformations. The following transformation functions create a new string from an existing string.
capitalize(): Create a copy of the original string with only its first character capitalized.
"vestibular".capitalize() creates "Vestibular".
center(width): Create a copy of the original string centered in a new string of length width. Padding is done using spaces.
"subheading".center(15) creates '   subheading  '. With explicit spaces, this is
'⎵⎵⎵subheading⎵⎵'
decode(〈encoding 〈, errors〉〉): Return a decoded version of the original string. The default encoding is the current default string encoding, usually 'ascii'. errors may be given to set a different error handling scheme; the default is 'strict', meaning that decoding errors raise a ValueError. Other possible values for errors are 'ignore' and 'replace'.
Section 4.9.2 of the Python library defines the various decodings available. One of the codings is called “base64”, which mashes complex strings of bytes into ordinary letters, suitable for transmission on the internet.
'c3RvY2thZGluZw=='.decode('base64') creates 'stockading'.
encode(〈encoding 〈, errors〉〉): Return an encoded version of the original string. The default encoding is the current default string encoding, usually 'ascii'. errors may be given to set a different error handling scheme; the default is 'strict', meaning that encoding errors raise a ValueError. Other possible values for errors are 'ignore' and 'replace'.
Section 4.9.2 of the Python library defines the various encodings available. We can use the Unicode UTF-16 code to make multi-byte Unicode characters.
'blathering'.encode('utf16') creates '\xff\xfeb\x00l\x00a\x00t\x00h\x00e\x00r\x00i\x00n\x00g\x00'.
join(sequence): Return a new string which is the concatenation of the strings in the sequence. The separator between elements is the string object that does the join.
" and ".join( ["ships","shoes","sealing wax"] ) creates 'ships and shoes and sealing wax'.
ljust(width): Return a copy of the original string left justified in a string of length width. Padding is done using spaces on the right.
"reclasping".ljust(15) creates 'reclasping     '. With more visible spaces, this is
'reclasping⎵⎵⎵⎵⎵'
lower(): Return a copy of the original string converted to lowercase.
"SuperLight".lower() creates 'superlight'.
lstrip(): Return a copy of the original string with leading whitespace removed. This is often used to clean up input.
" precasting \n".lstrip() creates 'precasting \n'.
replace(old, new 〈, count〉): Return a copy of the original string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
The most common use is "$HOME/some/place".replace("$HOME","e:/book") replaces the "$HOME" string to create a new string 'e:/book/some/place'.
Once in a while, we’ll need to replace just the first occurrence of some target string, allowing us to do something like the following: 'e:/book/some/place'.replace( 'e', 'f', 1 ).
rjust(width): Return a copy of the original string right justified in a string of length width. Padding is done using spaces on the left.
"fulminates".rjust(15) creates '     fulminates'.
With more visible spaces, this is
'⎵⎵⎵⎵⎵fulminates'
rstrip(): Return a copy of the original string with trailing whitespace removed. This has an obvious symmetry with lstrip().
" precasting \n".rstrip() creates ' precasting'.
strip(): Return a copy of the original string with leading and trailing whitespace removed. This combines lstrip() and rstrip() into one handy package.
" precasting \n".strip() creates 'precasting'.
title(): Return a titlecased version of the original string. Words start with uppercase characters; all remaining cased characters are lowercase.
For example, "hello world".title() creates 'Hello World'.
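Because each transformation returns a new string, the methods can be chained. Here is a minimal sketch that combines several of the transformations above. The input values are our own, and repr() is used where padding spaces would otherwise be invisible. It runs under both Python 2 and Python 3.

```python
# Clean up a line of input, then substitute part of it.
raw_name = "  $HOME/some/place \n"
cleaned = raw_name.strip().replace("$HOME", "e:/book")

# Build presentable text from plain pieces.
title = "the walrus and the carpenter".title()
row = "name".ljust(8) + "|" + "42".rjust(5)
listing = ", ".join(["ships", "shoes", "sealing wax"])

print(cleaned)      # e:/book/some/place
print(title)        # The Walrus And The Carpenter
print(repr(row))    # 'name    |   42'
print(listing)      # ships, shoes, sealing wax
```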
Accessors. The following methods provide information about a string.
count(sub 〈, start 〈, end〉〉): Return the number of occurrences of substring sub in a string. If the optional arguments start and end are given, they are interpreted as if you had said string [ start : end ].
For example "hello world".count("l") is 3.
endswith(suffix 〈, start 〈, end〉〉): Return True if the string ends with the specified suffix, otherwise return False. With optional start, or end, the test is applied to string [ start : end ].
"pleonastic".endswith("tic") is True.
find(sub 〈, start 〈, end〉〉): Return the lowest index in the string where substring sub is found. If optional arguments start and end are given, then string [ start : end ] is searched. Return -1 on failure.
"rediscount".find("disc") returns 2; "postlaunch".find("not") returns -1.
index(sub 〈, start 〈, end〉〉): Like find() but raise ValueError when the substring is not found.
See The Unexpected : The try and except statements for more information on processing exceptions.
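The difference between find() and index() is easiest to see in a small sketch. The helper position_of() is a hypothetical name of ours; it shows one way to turn the ValueError from index() into an ordinary return value.

```python
count_all = "hello world".count("l")            # 3
count_slice = "hello world".count("l", 0, 5)    # 2: only "hello" is examined

found = "rediscount".find("disc")               # 2
missing = "postlaunch".find("not")              # -1, no exception

def position_of(text, sub):
    """Return the index of sub in text, or None when sub is absent."""
    try:
        return text.index(sub)
    except ValueError:
        # index() raises ValueError where find() would return -1.
        return None
```

Which style to prefer is a design choice: find() suits a quick test, while index() makes a missing substring impossible to ignore.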
Return True if the string starts with the specified prefix, otherwise return False. With optional start, or end, test string [ start : end ].
"E:/programming".startswith("E:") is True.
Parsers. The following methods create another kind of object, usually a sequence, from a string.
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified, any whitespace string is a separator.
We can use this to do things like aList= "a,b,c,d".split(','). We’ll look at the resulting sequence object closely in Flexible Sequences : the list.
Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and True. This method can help us process a file: a file can be looked at as if it were a giant string punctuated by \n characters.
We can break up a string into individual lines using statements like lines= "two lines\nof data".splitlines().
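The two parser methods can be tried with a few short statements. This is only a sketch; the variable names are ours.

```python
a_list = "a,b,c,d".split(",")                   # ['a', 'b', 'c', 'd']
one_split = "a,b,c,d".split(",", 1)             # maxsplit=1: ['a', 'b,c,d']
words = "  two   words ".split()                # any run of whitespace separates

lines = "two lines\nof data".splitlines()       # ['two lines', 'of data']
kept = "two lines\nof data".splitlines(True)    # keepends: ['two lines\n', 'of data']
```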
Here’s another example of using some of the string methods and slicing operations.
temperature.py
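temp= raw_input("temperature: ")
if temp.isdigit():
    # Only digits were entered, so ask for the units separately.
    unit= raw_input("units [C or F]: ")
else:
    # Otherwise the last character is the unit and the rest is the number.
    unit= temp[-1:]
    temp= temp[:-1]
unit= unit.upper()
# c2f() and f2c() are conversion functions that must be defined before this runs.
if unit.startswith("C"):
    print temp, c2f(float(temp))
elif unit.startswith("F"):
    print temp, f2c(float(temp))
else:
    print "Units must be C or F"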
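The program above depends on c2f() and f2c(), which are not shown in this section. The sketch below assumes the usual conversion formulas for them and wraps the same parsing logic in a function so it can be tried without typing at a prompt. The name convert() and its unit parameter are our own, not part of temperature.py.

```python
def c2f(c):
    """Assumed helper: Celsius to Fahrenheit."""
    return 32.0 + 9.0 * c / 5.0

def f2c(f):
    """Assumed helper: Fahrenheit to Celsius."""
    return 5.0 * (f - 32.0) / 9.0

def convert(temp, unit=None):
    """Convert text such as "100C", or "212" with a separate unit."""
    if not temp.isdigit():
        # The last character is the unit; the rest is the number.
        unit = temp[-1:]
        temp = temp[:-1]
    unit = (unit or "").upper()
    if unit.startswith("C"):
        return c2f(float(temp))
    elif unit.startswith("F"):
        return f2c(float(temp))
    raise ValueError("Units must be C or F")
```

For example, convert("100C") gives 212.0 and convert("212", "f") gives 100.0. Note that isdigit() is False for text containing a decimal point, so an input like "98.6" with no unit letter is split in the wrong place; the original program shares this limitation.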
Perhaps the most useful string-related module is the re module. The name is short because it is used so often in so many Python programs. However, it is a little too advanced to cover here. We’ll talk about it in Text Processing and Pattern Matching : The re Module.
The module named string has a number of public module variables which define various subsets of the ASCII characters. These definitions serve as a central, formal repository for facts about the character set. Note that there are general definitions, applicable to Unicode character sets, different from the ASCII definitions.
| Name | Value |
|---|---|
| string.ascii_letters | abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ |
| string.ascii_lowercase | abcdefghijklmnopqrstuvwxyz |
| string.ascii_uppercase | ABCDEFGHIJKLMNOPQRSTUVWXYZ |
| string.digits | 0123456789 |
| string.hexdigits | 0123456789abcdefABCDEF |
| string.letters | All letters; for many locale settings, this will be different from the ASCII letters |
| string.lowercase | Lowercase letters; for many locale settings, this will be different from the ASCII letters |
| string.octdigits | 01234567 |
| string.printable | All printable characters in the character set |
| string.punctuation | All punctuation in the character set. For ASCII, this is !"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~ |
| string.uppercase | Uppercase letters. |
| string.whitespace | A collection of characters that cause spacing to happen. For ASCII this is \t\n\x0b\x0c\r⎵; Tab (HT), Newline (Line Feed, LF), Vertical Tab (VT), Form Feed (FF), Carriage Return (CR) and space. |
You can use these for operations like the following. We often use these string classifiers to test input values we got from a user or read from a file. We use string.uppercase and string.digits in the examples below.
>>> import string
>>> a= "some input"
>>> a[0] in string.uppercase
False
>>> n= "123-45"
>>> for character in n:
...     if character not in string.digits:
...         print "Invalid character", character
...
Invalid character -
There are a number of common design patterns for manipulating strings. These include adding characters to a string, removing characters from a string and breaking a string into two strings. In some languages, these operations involve some careful planning. In Python, these operations are relatively simple and (hopefully) obvious.
Adding Characters To A String. We add characters to a string by creating a new string that is the concatenation of the original strings. For example:
>>> a="lunch"
>>> a=a+"meats"
>>> a
'lunchmeats'
Some programmers who have extensive experience in other languages will ask if creating a new string from the original strings is the most efficient way to accomplish this. Or they suggest that it would be “simpler” to allow mutable strings for this kind of concatenation. The short answer is that Python’s storage management makes this use of immutable strings the simplest and most efficient. We’ll discuss this in some depth in Sequence FAQ’s.
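A short sketch can show what “creating a new string” means in practice. The second name, original, is our own addition to the lunch example; it lets us see that the first string is untouched by the concatenation.

```python
a = "lunch"
original = a             # a second name for the same string object
a = a + "meats"          # builds a new string; "lunch" itself is not changed

try:
    original[0] = "L"    # strings do not support item assignment
    mutated = True
except TypeError:
    mutated = False
```

After this runs, a is 'lunchmeats', original is still 'lunch', and mutated is False.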
Removing Characters From A String. Sometimes we want to remove some characters from a string. Python encourages us to create a new string that is built from pieces of the original string. For example:
>>> s="black,thorn"
>>> s = s[:5] + s[6:]
>>> s
'blackthorn'
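In this example, we dropped the sixth character (in position 5), the comma. Recall that the positions are numbered from zero. Positions 0, 1 and 2 are the first three characters. Position 5 is the sixth character. Here’s how this example works: s[:5] is the five characters before position 5, 'black'; s[6:] is everything from position 6 to the end, 'thorn'; the + operator concatenates the two pieces into a new string, which we assign back to s.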
In other languages, there are sophisticated methods to delete particular characters from a string. Again, Python makes this simpler by letting us create a new string from pieces of the old string.
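The slicing pattern above generalizes to any position. Here is a minimal sketch; remove_at() is a hypothetical helper name of ours, and the replace() call shows an alternative when we know the character rather than its position.

```python
def remove_at(s, position):
    """Return a copy of s without the character at the given position."""
    return s[:position] + s[position + 1:]

by_position = remove_at("black,thorn", 5)
by_value = "black,thorn".replace(",", "")   # removes every comma, wherever it is
```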
Breaking a String at a Position. Often, we will break a string into two pieces around a punctuation mark. Python gives us a very handy way to do this.
>>> fn="nonprogrammerbook.rst"
>>> dot= fn.rfind('.')
>>> name= fn[:dot]
>>> ext= fn[dot:]
>>> name
'nonprogrammerbook'
>>> ext
'.rst'
We use the rfind() method to locate the right-most . in the file name. We can then break the string at this position. You can see Python’s standard interpretation: the position returned by find() or rfind() is not included in the material to the left of the position.
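One detail is worth a sketch: rfind() returns -1 when there is no dot, and slicing with -1 would quietly chop off the last character instead. The function below, split_extension(), is a hypothetical helper of ours that guards against this case.

```python
def split_extension(filename):
    """Break filename at its right-most dot; return (name, ext)."""
    dot = filename.rfind(".")
    if dot == -1:
        # No dot at all: everything is the name, and there is no extension.
        return filename, ""
    return filename[:dot], filename[dot:]

name, ext = split_extension("nonprogrammerbook.rst")
```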
Is Each Letter Unique?.
Given a ten-letter word, is each letter unique? Further, do the letters occur in alphabetical order?
Let’s say we have a 10-letter word in the variable w. We want to know if each letter occurs just once in the word. For example, “pathogenic” has each letter occurring just once. On the other hand, “pathologic” does not: the letter “o” occurs twice.
To determine if each letter is unique, we’ll need to extract each letter from the word, and then use the count() method function to determine if that letter occurs just once in the word.
Write a loop which will examine each letter in a word to see if the count of occurrences is just one or more than one. If all counts are one, this is a ten-letter word with 10 unique letters.
Here’s a batch of words to use for testing: patchworks, patentable, paternally, pathfinder, pathogenic.
The alphabetical order test is more difficult. In this case, we need to be sure that each letter comes before the next letter in the alphabet. We’re asking that w[0] <= w[1] <= w[2].... We can break this long set of comparisons down to a shorter expression that we can evaluate in a loop. We can use w[0] <= w[1], and w[1] <= w[2] to examine each letter and its successor.
Write a loop to examine each character to determine if the letters of the word occur in alphabetical order. Words like “almost” or “biopsy” have their letters in alphabetical order; words like “abhorrent” or “immortals” start out in order but then break it.
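If you want to check your own solution, here is one possible sketch of the two tests. The function names all_unique() and in_order() are ours, and this is only one of several reasonable ways to write the loops.

```python
def all_unique(w):
    """True if every letter of w occurs exactly once."""
    for letter in w:
        if w.count(letter) != 1:
            return False
    return True

def in_order(w):
    """True if w[0] <= w[1] <= w[2] ... holds for the whole word."""
    for i in range(len(w) - 1):
        if w[i] > w[i + 1]:
            return False
    return True
```

Both functions stop at the first failure, since one repeated or out-of-order letter is enough to settle the answer.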
Check Amount Writing.
Translate a number into the English phrase.
This example algorithm fragment is only to get you started. This shows how to pick off the digits from the right end of a number and assemble a resulting string from the left end of the string.
Note that the right-most two digits have special names, requiring some additional cases above and beyond the simplistic loop shown below. For example, 291 is “two hundred ninety one”, where 29 is “twenty nine”. The word for “2” changes, depending on the context.
As a practical matter, you should analyze the number by taking off three digits at a time; the expression (number % 1000) does this. You would then format the three digit number with words like “million”, “thousand”, etc.
English Words For An Amount, n
Initialization. Set result to the empty string, "". Set tc to 0. This is the “tens counter” that shows what position we’re examining.
Loop. While n > 0, repeat the following steps.
Get Right Digit. Set digit to n % 10, the remainder when n is divided by 10.
Make Phrase. Translate digit to a string from “zero” to “nine”. Translate tc to a string from “” to “thousand”. This is tricky because the “teens” are special, where the “hundreds” and “thousands” are pretty simple.
Assemble Result. Prepend digit string and tc string to the left end of the result string.
Next Digit. Set n to n // 10. Be sure to use the // integer division operator, or you’ll get floating-point results. Increment tc by 1.
Result. Return result as the English translation of n.
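The three-digits-at-a-time approach suggested above can be sketched as follows. This is one possible design, not the only one: the names ONES, TENS, GROUPS, three_digits() and amount_in_words() are ours, the word lists use simple list objects, and the sketch assumes 0 <= n < 1,000,000,000,000.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
GROUPS = ["", "thousand", "million", "billion"]

def three_digits(n):
    """Words for a number in the range 1 to 999."""
    phrase = ""
    hundreds = n // 100
    rest = n % 100
    if hundreds:
        phrase = ONES[hundreds] + " hundred"
    if rest >= 20:
        # Ordinary case: a tens word, then an optional ones word.
        phrase = phrase + " " + TENS[rest // 10]
        if rest % 10:
            phrase = phrase + " " + ONES[rest % 10]
    elif rest > 0:
        # 1 to 19, including the special "teens", come straight from ONES.
        phrase = phrase + " " + ONES[rest]
    return phrase.strip()

def amount_in_words(n):
    """English phrase for n, built from the right, three digits at a time."""
    if n < 0 or n >= 1000 ** len(GROUPS):
        raise ValueError("n is outside the range this sketch handles")
    if n == 0:
        return "zero"
    result = ""
    group = 0                      # 0 = units, 1 = thousands, 2 = millions...
    while n > 0:
        chunk = n % 1000           # pick off the right-most three digits
        if chunk:
            phrase = (three_digits(chunk) + " " + GROUPS[group]).strip()
            result = (phrase + " " + result).strip()   # prepend on the left
        n = n // 1000
        group += 1
    return result
```

For example, amount_in_words(291) gives 'two hundred ninety one' and amount_in_words(29) gives 'twenty nine', matching the examples in the text.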
Roman Numerals.
This is similar to translating numbers to English. Instead we will translate them to Roman Numerals.
The algorithm is similar to Check Amount Writing (above). You will pick off successive digits, using % 10 and // 10 to gather the digits from right to left.
The rules for Roman Numerals involve using three pairs of symbols: ones and fives, tens and fifties, hundreds and five hundreds. An additional symbol for thousands covers all the relevant bases.
When a number is followed by the same or smaller number, it means addition. “II” is two 1’s = 2. “VI” is 5 + 1 = 6.
When one number is followed by a larger number, it means subtraction. “IX” is 1 before 10 = 9. “IIX” isn’t allowed; this would be written “VIII”.
For numbers from 1 to 9, the symbols are “I” and “V”, and the coding works like this: I, II, III, IV, V, VI, VII, VIII, IX.
The same rules work for numbers from 10 to 90, using “X” and “L”. Numbers from 100 to 900 use the symbols “C” and “D”. Numbers between 1000 and 4000 use “M”.
Here are some examples. 1994 = MCMXCIV, 1956 = MCMLVI, 3888= MMMDCCCLXXXVIII
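Here is one possible sketch of the digit-by-digit translation described above. The names SYMBOLS, digit_to_roman() and to_roman() are ours, and the sketch assumes 1 <= n < 4000, since the text gives no symbols beyond “M”.

```python
# (one, five, ten) symbols for each decimal position, right to left.
SYMBOLS = [("I", "V", "X"), ("X", "L", "C"), ("C", "D", "M"), ("M", "", "")]

def digit_to_roman(digit, one, five, ten):
    """Code a single digit 0-9 with the symbols for its position."""
    if digit == 9:
        return one + ten                   # subtraction: IX, XC, CM
    if digit >= 5:
        return five + one * (digit - 5)    # addition: V, VI, VII, VIII
    if digit == 4:
        return one + five                  # subtraction: IV, XL, CD
    return one * digit                     # "", I, II, III

def to_roman(n):
    """Translate n (1 to 3999) to Roman numerals, right-most digit first."""
    if not 1 <= n < 4000:
        raise ValueError("n must be between 1 and 3999")
    result = ""
    position = 0
    while n > 0:
        digit = n % 10
        one, five, ten = SYMBOLS[position]
        result = digit_to_roman(digit, one, five, ten) + result
        n = n // 10
        position += 1
    return result
```

The three examples from the text make a convenient check: to_roman(1994), to_roman(1956) and to_roman(3888).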
Word Lengths.
Analyze the following block of text. You’ll want to break it into words on whitespace boundaries. Then you’ll need to discard all punctuation from before, after or within a word.
What’s left will be a sequence of words composed of ASCII letters. Compute the length of each word, and produce the sequence of digits. (A ten-letter word, such as “disturbing”, stands for the digit 0; no word is longer than that.)
Compare the sequence of word lengths with the value of math.pi.
Poe, E.
Near a Raven
Midnights so dreary, tired and weary,
Silently pondering volumes extolling all by-now obsolete lore.
During my rather long nap - the weirdest tap!
An ominous vibrating sound disturbing my chamber's antedoor.
"This", I whispered quietly, "I ignore".
This is based on http://www.cadaeic.net/cadenza.htm.
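Here is one possible sketch of the analysis, using only the first lines of the text. The variable names are ours. Two assumptions are worth stating: a floating-point math.pi carries only about 16 significant digits, so only the first 16 words can be checked against it, and a word length is reduced with % 10 so that a ten-letter word would yield 0.

```python
import math
import string

text = """Poe, E.
Near a Raven
Midnights so dreary, tired and weary,
Silently pondering volumes extolling all by-now obsolete lore."""

digits = ""
for raw_word in text.split():
    # Keep only the ASCII letters; this discards punctuation anywhere in the word.
    word = ""
    for character in raw_word:
        if character in string.ascii_letters:
            word = word + character
    if word:
        # Skip items that were nothing but punctuation.
        digits = digits + str(len(word) % 10)

# Format pi with 15 decimal places and drop the decimal point.
pi_digits = ("%.15f" % math.pi).replace(".", "")
matches = digits[:len(pi_digits)] == pi_digits
```

With this text, matches is True: the first 16 word lengths agree with the digits that math.pi can represent.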