This list contains Python libraries for web crawling and data processing.
Network
General
Urllib - Network library (stdlib).
Requests - network library.
Grab - network library (based on pycurl).
Pycurl - Network library (binding to libcurl).
Urllib3 - Python HTTP library with thread-safe connection pooling and file post support.
Httplib2 - Network library.
RoboBrowser - A simple, Pythonic library for browsing the web without a standalone browser.
MechanicalSoup - A Python library for automating interaction with websites.
Mechanize - A stateful, programmable Web browsing library.
Socket - Low-level networking interface (stdlib).
Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages.
Hyper – Python HTTP/2 client.
PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Works as a drop-in replacement for the socket module.
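As a rough illustration of the simplest entry above, Urllib from the standard library, the sketch below fetches one page and decodes it. So that it runs without internet access, it starts a throwaway local server; DemoHandler, PAGE, fetch, demo, and the example-crawler/0.1 user agent are hypothetical names chosen for this example only.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><head><title>hello</title></head><body>ok</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one fixed page for the demo."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(url, timeout=5.0):
    """Return (status, decoded body) for a GET request."""
    # An empty ProxyHandler ignores proxy environment variables,
    # which keeps the local demo independent of the host setup.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with opener.open(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.status, response.read().decode(charset)


def demo():
    """Serve PAGE on a free local port, fetch it once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch(f"http://127.0.0.1:{server.server_port}/")
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the status code and the page text. The third-party libraries in this list mostly wrap the same request/response cycle in a friendlier API.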
Asynchronous
Treq - An API similar to Requests (based on Twisted).
Aiohttp – asyncio HTTP client/server (PEP-3156).
Web crawler framework
Full-featured crawlers
Grab - Web crawler framework (based on pycurl/multicurl).
Scrapy - Web crawler framework (based on Twisted). Python 3 is supported since Scrapy 1.1.
Pyspider - A powerful crawler system.
Cola - A distributed crawler framework.
Other
Portia - Scrapy-based visual crawler.
Restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
Demiurge - A PyQuery-based crawler micro-framework.
HTML/XML parser
General
Lxml - Efficient HTML/XML processing library written in C. Supports XPath.
Cssselect - Queries the DOM tree with CSS selectors.
Pyquery - Queries the DOM tree with jQuery-like selectors.
BeautifulSoup - Slower HTML/XML processing library, implemented in pure Python.
Html5lib - Builds the DOM of HTML/XML documents according to the WHATWG specification, which all current browsers follow.
Feedparser - Parse RSS/ATOM feeds.
MarkupSafe - Implements safely escaped strings for XML/HTML/XHTML.
Xmltodict - A Python module that makes working with XML feel like working with JSON.
Xhtml2pdf - Convert HTML/CSS to PDF.
Untangle - Easily convert XML files to Python objects.
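For a crawler, the most common parsing task is pulling links out of a page. The sketch below shows the idea using html.parser from the standard library, which is not one of the libraries listed above; it is used here only so the example has no dependencies. LinkExtractor and extract_links are hypothetical names.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href or with an empty one.
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Libraries such as Lxml or Pyquery express the same extraction as a single XPath or selector query and cope better with badly broken markup.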
Clean up
Bleach - Clean up HTML (html5lib required).
Sanitize - Brings sanity to the world of messed-up data.
Text processing
Libraries for parsing and manipulating plain text.
General
Difflib - (Python Standard Library) Helps compute differences between sequences.
Levenshtein - Quickly calculate Levenshtein distance and string similarity.
Fuzzywuzzy - Fuzzy string matching.
Esmre - Regular expression accelerator.
Ftfy - Automatically makes Unicode text less broken and more consistent.
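To give a feel for what the comparison tools above do, here is a minimal sketch using Difflib from the standard library. The helper names similarity and closest are hypothetical; Levenshtein and Fuzzywuzzy offer faster or more flexible scoring for the same kind of task.

```python
import difflib


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(word, candidates, cutoff=0.6):
    """Return a list with the best match at or above the cutoff, or an empty list."""
    return difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
```

For example, closest("reqests", ["requests", "urllib3", "pycurl"]) picks out "requests", which is handy for de-duplicating near-identical titles or names collected by a crawler.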
Conversion
Unidecode - Converts Unicode text to ASCII.
Character Encoding
Uniout – Print readable characters instead of escaped strings.
Chardet - Character encoding detector compatible with Python 2 and 3.
Xpinyin - A library that converts Chinese characters to Pinyin.
Pangu.py - Inserts spaces between CJK and alphanumeric characters in text.
Slugify
Awesome-slugify - A Python slugify library that can preserve Unicode.
Python-slugify - A Python slugify library that converts Unicode to ASCII.
Unicode-slugify - A tool that will generate Unicode slugs.
Pytils - A simple tool for handling Russian strings (including pytils.translit.slugify).
General parsers
PLY - Python implementation of lex and yacc parsing tools.
Pyparsing - A generic framework for generating parsers.
Names of people
Python-nameparser - A module that parses human names into their individual components.
Phone numbers
Phonenumbers - Parse, format, store and validate international phone numbers.
User agent string
Python-user-agents - A browser user agent parser.
HTTP Agent Parser - Python HTTP user agent parser.
Processing specific file formats
Libraries for parsing and processing specific text formats.
General
Tablib - A module that exports data to formats such as XLS, CSV, JSON, YAML, and so on.
Textract - Extracts text from various files, such as Word, PowerPoint, PDF, etc.
Messytables - A tool for parsing messy tabular data.
Rows – A common data interface that supports many formats (currently CSV, HTML, XLS, TXT - more will be available in the future!).
Office
Python-docx - Reads, queries, and modifies Microsoft Word 2007/2008 docx files.
Xlwt / xlrd - Write and read data and formatting information in Excel files.
XlsxWriter - A Python module for creating Excel .xlsx files.
Xlwings - A BSD-licensed library that makes it easy to call Python in Excel and vice versa.
Openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
Marmir - Takes Python data structures and converts them to spreadsheets.
PDF
PDFMiner - A tool for extracting information from PDF documents.
PyPDF2 - A library that can split, merge, and convert PDF pages.
ReportLab - Allows fast creation of rich PDF documents.
Pdftables - Extracts tables directly from PDF files.
Markdown
Python-Markdown - A Python implementation of John Gruber's Markdown.
Mistune - The fastest, full-featured Markdown parser in pure Python.
Markdown2 - A fast Markdown implementation written entirely in Python.
YAML
PyYAML - A Python YAML parser.
CSS
Cssutils - A Python CSS library.
ATOM/RSS
Feedparser - A generic feed parser.
SQL
Sqlparse - A non-validating SQL statement parser.
HTTP
Http-parser - HTTP request/response message parser implemented in C.
Microformats
Opengraph - A Python module that parses Open Graph protocol tags.
Portable executables
Pefile - A multi-platform module for parsing and processing portable executable (i.e., PE) files.
PSD
Psd-tools - Reads Adobe Photoshop PSD files into Python data structures.
Natural language processing
Libraries for processing human language.
NLTK - A leading platform for building Python programs that work with human language data.
Pattern - A web mining module for Python. It includes natural language processing tools, machine learning, and more.
TextBlob - Provides a consistent API for common natural language processing tasks. It stands on the shoulders of NLTK and Pattern.
Jieba – Chinese word segmentation tool.
SnowNLP - Chinese text processing library.
Loso - Another Chinese word segmentation library.
Genius – Chinese word segmentation based on conditional random fields.
Langid.py - Stand-alone language identification system.
Korean - A library for Korean morphology.
Pymorphy2 - Russian morphological analyzer (POS tagger + inflection engine).
PyPLN - A distributed natural language processing pipeline written in Python. The project aims to provide an easy way to use NLTK to process large corpora through a web interface.
Browser Automation and Simulation
Selenium - Automates real browsers (Chrome, Firefox, Opera, IE).
Ghost.py - Wrapper around PyQt's WebKit (requires PyQt).
Spynner - Wrapper around PyQt's WebKit (requires PyQt).
Splinter - Generic API for browser emulators (Selenium WebDriver, Django client, Zope).
Multiprocessing
Threading - Threading in the Python standard library. Effective for I/O-bound tasks, but of little use for CPU-bound tasks because of the Python GIL.
Multiprocessing - Standard Python library for running multiple processes.
Celery - Asynchronous task queue/job queue based on distributed messaging.
Concurrent-futures - Provides a high-level interface for asynchronously executing callables.
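The note above about threads suiting I/O-bound work can be sketched with Concurrent-futures. In this example fake_download is a hypothetical stand-in that sleeps instead of touching the network, and download_all is likewise an illustrative name; a real crawler would put its HTTP call where the sleep is.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Pretend to download a URL; the sleep stands in for network latency."""
    time.sleep(0.05)
    return url, len(url)


def download_all(urls, max_workers=8):
    """Run fake_download over all URLs in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return list(pool.map(fake_download, urls))
```

Because each worker spends its time waiting rather than computing, the GIL is released during the wait and the downloads overlap. For CPU-bound work, Multiprocessing is the usual alternative.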
Asynchronous
Asynchronous network programming libraries
Asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines, and tasks.
Twisted - Event-driven network engine framework.
Tornado - A network framework and an asynchronous network library.
Pulsar - Python event-driven concurrency framework.
Diesel - Greenlet-based event I/O framework for Python.
Gevent - A coroutine-based Python networking library that uses greenlet.
Eventlet - Asynchronous framework with WSGI support.
Tomorrow - Magic decorator syntax for asynchronous code.
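As a minimal sketch of the coroutine style these libraries share, the example below uses Asyncio to run several simulated requests on one event loop. fake_fetch, crawl, and run are hypothetical names, and asyncio.sleep stands in for real network I/O; asyncio.run requires Python 3.7 or later.

```python
import asyncio


async def fake_fetch(name, delay):
    """Simulate one request; awaiting lets other coroutines run meanwhile."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def crawl(jobs):
    """Run all jobs concurrently; gather returns results in input order."""
    return await asyncio.gather(*(fake_fetch(name, delay) for name, delay in jobs))


def run(jobs):
    """Synchronous entry point that drives the event loop to completion."""
    return asyncio.run(crawl(jobs))
```

With jobs [("a", 0.02), ("b", 0.01)] the second coroutine finishes first, yet the result list still follows the order in which the jobs were passed in.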
Queue
Celery - Asynchronous task queue/job queue based on distributed messaging.
Huey - Small multi-threaded task queue.
Mrq - Mr. Queue: a distributed worker task queue in Python using Redis and gevent.
RQ - Redis-based lightweight task queue manager.
Simpleq - A simple, infinitely scalable, Amazon SQS-based queue.
Python-gearman – Gearman's Python API.
Cloud computing
Picloud - Executes Python code in the cloud.
Dominoup.com - Executes R, Python, and MATLAB code in the cloud.
E-mail parsing libraries
Flanker - E-mail address and MIME parsing library.
Talon - Mailgun library for extracting quotations and signatures from messages.
URL and network address operations
Libraries for parsing and modifying URLs and network addresses.
URL
Furl - A small Python library that simplifies manipulation of URLs.
Purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
Urllib.parse - Splits a Uniform Resource Locator (URL) into components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a "relative URL" to an absolute URL given a "base URL".
Tldextract - Accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
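The Urllib.parse operations described above (splitting, recombining, and resolving relative URLs) can be sketched as follows. The helper names absolutize and strip_fragment are hypothetical; the calls themselves are from the standard library.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def absolutize(base_url, link):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base_url, link)


def strip_fragment(url):
    """Drop the #fragment so that two links to the same page compare equal."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment=""))
```

A crawler would typically apply both to every extracted link before adding it to its queue, so that the same page is not fetched twice under different spellings.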
Network address
Netaddr - Python library for displaying and manipulating network addresses.
Web content extraction
Libraries for extracting web page content.
HTML page text and metadata
Newspaper - News extraction, article extraction and content curation using Python.
Html2text - Converts HTML into Markdown-formatted text.
Python-goose - HTML content/article extractor.
Lassie - User-friendly web content retrieval tool.
Micawber - A small library that extracts rich content from URLs.
Sumy - A module that automatically summarizes text files and HTML pages.
Haul - an extensible image crawler.
Python-readability – Fast Python interface to the arc90 readability tool.
Scrapely - A library that extracts structured data from HTML pages. Given example web pages and the data to be extracted, Scrapely builds a parser for all similar pages.
Video
Youtube-dl - A small command line program to download videos from YouTube.
You-get - Downloader for YouTube/Youku/Niconico videos, written in Python 3.
Wiki
WikiTeam - A tool for downloading and saving wikis.
WebSocket
Libraries for working with WebSocket.
Crossbar - Open source application messaging router (WebSocket and WAMP for Python, built on Autobahn).
AutobahnPython - Open source Python implementation of the WebSocket and WAMP protocols.
WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 and PyPy.
DNS resolution
Dnsyo - Checks your DNS against more than 1500 DNS servers worldwide.
Pycares - Interface to c-ares, a C library that performs DNS requests and name resolution asynchronously.
Computer vision
OpenCV - Open Source Computer Vision Library.
SimpleCV - A concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
Mahotas - Fast computer vision algorithms (implemented entirely in C++) that operate on numpy arrays.
Proxy server
Shadowsocks - A fast tunnel proxy that helps you get through firewalls (TCP and UDP support, TFO, multi-user and graceful restart, destination IP blacklist).
Tproxy - A simple TCP routing proxy (layer 7) built on gevent; routing is configured in Python.
Other lists of Python tools
Awesome-python
Pycrumbs
Python-github-projects
Python_reference
Pythonidae