Python libraries for web crawling and data processing

This list contains Python libraries for web crawling and data processing.

Network

General

Urllib - Network library (stdlib).

Requests - Network library.

Grab - network library (based on pycurl).

Pycurl - Network library (libcurl bindings).

Urllib3 - Python HTTP library with thread-safe connection pooling, file post support, and a user-friendly API.

Httplib2 - Network library.

RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

MechanicalSoup - A Python library for automating interaction with websites.

Mechanize - A stateful, programmable Web browsing library.

Socket - Low-level networking interface (stdlib).

Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages.

Hyper – Python HTTP/2 client.

PySocks - An updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement for the socket module.

Asynchronous

Treq - An API similar to Requests (based on Twisted).

Aiohttp - HTTP client/server for asyncio (PEP 3156).

Web crawler frameworks

Full-featured crawlers

Grab - Web crawler framework (based on pycurl/multicurl).

Scrapy - Web crawler framework (based on Twisted). Current releases support Python 3.

Pyspider - A powerful spider system.

Cola - A distributed crawler framework.

Other

Portia - Scrapy-based visual crawler.

Restkit - An HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.

Demiurge - A PyQuery-based crawler micro-framework.

HTML/XML parsers

General

Lxml - An efficient HTML/XML processing library written in C. Supports XPath.

Cssselect - Works with the DOM tree using CSS selectors.

Pyquery - Works with the DOM tree using jQuery-like selectors.

BeautifulSoup - A slower HTML/XML processing library, implemented in pure Python.

Html5lib - Builds the DOM of HTML/XML documents according to the WHATWG specification, which all current browsers follow.

Feedparser - Parse RSS/ATOM feeds.

MarkupSafe - Safely escaped strings for XML/HTML/XHTML.

Xmltodict - A Python module that makes working with XML feel like working with JSON.

Xhtml2pdf - Convert HTML/CSS to PDF.

Untangle - Easily convert XML files to Python objects.

Sanitizing

Bleach - Clean up HTML (html5lib required).

Sanitize - Brings sanity to the world of messy data.

Text processing

Libraries for parsing and manipulating plain text.

General

Difflib - (Python standard library) Helpers for computing differences between sequences.

Levenshtein - Quickly calculate Levenshtein distance and string similarity.

Fuzzywuzzy - Fuzzy string match.

Esmre - Regular expression accelerator.

Ftfy - Automatically fixes broken Unicode text and makes it more consistent.

Conversion
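
As a rough illustration of the kind of comparison Difflib offers, the sketch below scores the similarity of two strings and picks the closest match from a list. It uses only the standard library. The helper names (similarity, closest) and the sample strings are assumptions made for this example.

```python
import difflib


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(word, candidates):
    """Return at most one candidate that is at least 60% similar to word."""
    return difflib.get_close_matches(word, candidates, n=1, cutoff=0.6)


# Sample data for illustration only.
libraries = ["requests", "urllib3", "httplib2"]

print(similarity("requests", "requests"))
print(closest("reqests", libraries))
```

Levenshtein and Fuzzywuzzy cover similar ground with different scoring and, in the case of Levenshtein, a C implementation.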

Unidecode - Converts Unicode text to ASCII.

Character Encoding

Uniout – Print readable characters instead of escaped strings.

Chardet - Character encoding detector, compatible with Python 2 and 3.

Xpinyin – A library that converts Chinese characters to Pinyin.

Pangu.py - Adds spacing between CJK and alphanumeric characters in text.

Slugify

Awesome-slugify - A Python slugify library that can preserve Unicode.

Python-slugify - A Python slugify library that converts Unicode to ASCII.

Unicode-slugify - A tool that will generate Unicode slugs.

Pytils - A simple tool for handling Russian strings (including pytils.translit.slugify).

General parsers

PLY - Python implementation of lex and yacc parsing tools.

Pyparsing - A general-purpose framework for building parsers.

Human names

Python-nameparser - Parses human names into their individual components.

Phone numbers

Phonenumbers - Parse, format, store, and validate international phone numbers.

User agent strings

Python-user-agents - A parser for browser user agent strings.

HTTP Agent Parser - HTTP user agent parser for Python.

Processing specific file formats

Libraries for parsing and processing specific text formats.

General

Tablib - A module that exports data to formats such as XLS, CSV, JSON, YAML, and so on.

Textract - Extracts text from various files, such as Word, PowerPoint, PDF, etc.

Messytables - A tool for parsing messy tabular data.

Rows - A common data interface that supports many formats (currently CSV, HTML, XLS, and TXT, with more planned).

Office

Python-docx - Read, query, and modify Microsoft Word 2007/2008 docx files.

Xlwt / xlrd - Write and read data and formatting information in Excel files.

XlsxWriter - A Python module for creating Excel .xlsx files.

Xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.

Openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

Marmir - Takes Python data structures and turns them into spreadsheets.

PDF

PDFMiner - A tool for extracting information from PDF documents.

PyPDF2 - A library that can split, merge, and transform PDF pages.

ReportLab - Allows fast creation of rich PDF documents.

Pdftables - Extracts tables directly from PDF files.

Markdown

Python-Markdown - A Python implementation of John Gruber's Markdown.

Mistune - A fast, full-featured Markdown parser in pure Python.

Markdown2 - A fast Markdown implementation written entirely in Python.

YAML

PyYAML - A Python YAML parser.

CSS

Cssutils - A Python CSS library.

ATOM/RSS

Feedparser - A generic feed parser.

SQL

Sqlparse - A non-validating SQL parser.

HTTP

Http-parser - HTTP request/response message parser implemented in C.

Microformats

Opengraph - A Python module for parsing Open Graph protocol tags.

Portable executables

Pefile - A multi-platform module for parsing and working with Portable Executable (PE) files.

PSD

Psd-tools - Reads Adobe Photoshop PSD files into Python data structures.

Natural language processing

Libraries for working with human language.

NLTK - A leading platform for building Python programs that work with human language data.

Pattern - A web mining module for Python. It includes tools for natural language processing, machine learning, and more.

TextBlob - Provides a consistent API for common natural language processing tasks. It is built on top of NLTK and Pattern.

Jieba – Chinese word segmentation tool.

SnowNLP - Chinese text processing library.

Loso - Another Chinese word segmentation library.

Genius – Chinese word segmentation based on conditional random fields.

Langid.py - Standalone language identification system.

Korean - A library for Korean morphology.

Pymorphy2 - Russian morphological analyzer (POS tagger and inflection engine).

PyPLN - A distributed natural language processing pipeline written in Python. The goal of the project is to provide an easy way to use NLTK to process large corpora through a web interface.

Browser Automation and Simulation

Selenium - Automates real browsers (Chrome, Firefox, Opera, IE).

Ghost.py - Wrapper around PyQt's WebKit (requires PyQt).

Spynner - Wrapper around PyQt's WebKit (requires PyQt).

Splinter - A generic API for browser emulators (Selenium WebDriver, Django client, Zope).

Multiprocessing

Threading - Thread-based concurrency from the Python standard library. Effective for I/O-bound tasks, but of little use for CPU-bound tasks because of the Python GIL.

Multiprocessing - Standard library module for running multiple processes.

Celery - Asynchronous task queue/job queue based on distributed message passing.

Concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
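
To make the I/O-bound case concrete, here is a minimal sketch that runs several simulated downloads in a thread pool through concurrent.futures. The function fake_fetch and the example.com URLs are hypothetical stand-ins: the function sleeps instead of touching the network, so the snippet runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a network request: sleeps instead of doing I/O."""
    time.sleep(0.1)  # the GIL is released while a thread sleeps or waits on I/O
    return url, len(url)


def fetch_all(urls, max_workers=5):
    """Run fake_fetch over urls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


# Illustrative URLs only; nothing is actually requested.
urls = ["http://example.com/page/%d" % i for i in range(5)]
results = fetch_all(urls)
print(results)
```

For CPU-bound work, the same interface is available through ProcessPoolExecutor, which uses processes rather than threads and so is not limited by the GIL.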

Asynchronous

Asynchronous network programming libraries.

Asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines, and tasks.

Twisted - Event-driven network engine framework.

Tornado - A network framework and an asynchronous network library.

Pulsar - Python event-driven concurrency framework.

Diesel - Greenlet-based event I/O framework for Python.

Gevent - A coroutine-based Python networking library that uses greenlet.

Eventlet - Asynchronous framework with WSGI support.

Tomorrow - Magic decorator syntax for asynchronous code.
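
The event loop, coroutine, and task model that Asyncio provides can be sketched with the standard library alone. The example below assumes Python 3.7 or later (for asyncio.run). The coroutine fake_fetch and the URLS list are hypothetical: the coroutine awaits a short sleep in place of real network I/O.

```python
import asyncio

# Illustrative URLs only; nothing is actually requested.
URLS = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]


async def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for an HTTP request: awaits a sleep, then returns the URL."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return url


async def main():
    # gather() schedules all coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in URLS))


pages = asyncio.run(main())
print(pages)
```

Because the three waits overlap, the whole run takes roughly one delay rather than three.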

Queues

Celery - Asynchronous task queue/job queue based on distributed message passing.

Huey - Small multi-threaded task queue.

Mrq - Mr. Queue, a distributed worker task queue in Python using Redis and Gevent.

RQ - Redis-based lightweight task queue manager.

Simpleq - A simple, infinitely scalable, Amazon SQS-based queue.

Python-gearman - Python API for Gearman.

Cloud computing

Picloud - Executes Python code in the cloud.

Dominoup.com - Executes R, Python, and MATLAB code in the cloud.

Email

Email parsing libraries.

Flanker - Email address and MIME parsing library.

Talon - Mailgun library for extracting quotations and signatures from messages.

URL and network address manipulation

Libraries for parsing and modifying URLs and network addresses.

URL

Furl - A small Python library that simplifies manipulation of URLs.

Purl - A simple, immutable URL class with a clean API for inspection and manipulation.

Urllib.parse - Splits a Uniform Resource Locator (URL) into components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a "relative URL" into an absolute URL given a "base URL".

Tldextract - Accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

Network addresses
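
The three operations attributed to Urllib.parse above (splitting, recombining, and resolving a relative URL against a base) can be sketched as follows. The example.com URLs are placeholders chosen for this illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Placeholder URL for illustration.
url = "https://example.com:8080/docs/index.html?lang=en#intro"

# 1. Split a URL into its components.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# 2. Combine the components back into a URL string.
rebuilt = urlunsplit(parts)

# 3. Resolve a relative URL against a base URL.
absolute = urljoin("https://example.com/docs/index.html", "../images/logo.png")
print(absolute)
```

Furl and Purl wrap the same ideas in object-oriented APIs, while Tldextract handles the domain-suffix splitting that Urllib.parse does not attempt.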

Netaddr - Python library for displaying and manipulating network addresses.

Web content extraction

Libraries for extracting web page content.

HTML page text and metadata

Newspaper - News extraction, article extraction and content curation using Python.

Html2text - Converts HTML into Markdown-formatted text.

Python-goose - HTML content/article extractor.

Lassie - A user-friendly web content retrieval tool.

Micawber - A small library that extracts rich content from URLs.

Sumy - A module that automatically summarizes text files and HTML pages.

Haul - An extensible image crawler.

Python-readability – Fast Python interface to the arc90 readability tool.

Scrapely - A library that extracts structured data from HTML pages. Given some example web pages and the data to be extracted from them, Scrapely constructs a parser for all similar pages.

Video

Youtube-dl - A small command line program to download videos from YouTube.

You-get - Video downloader for YouTube, Youku, and Niconico, written in Python 3.

Wiki

WikiTeam - A tool for downloading and saving wikis.

WebSocket

Libraries for working with WebSocket.

Crossbar - Open source application messaging router (a Python implementation of WebSocket and WAMP for Autobahn).

AutobahnPython - Open source Python implementation of the WebSocket and WAMP protocols.

WebSocket-for-Python - WebSocket client and server library for Python 2, Python 3, and PyPy.

DNS resolution

Dnsyo - Checks your DNS against more than 1,500 DNS servers worldwide.

Pycares - Interface to c-ares, a C library that performs DNS requests and name resolution asynchronously.

Computer vision

OpenCV - Open Source Computer Vision Library.

SimpleCV - A concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).

Mahotas - Fast computer vision algorithms (implemented entirely in C++) that operate on numpy arrays.

Proxy server

Shadowsocks - A fast tunnel proxy that helps you bypass firewalls (supports TCP and UDP, TCP Fast Open, multiple users, graceful restart, and destination IP blacklists).

Tproxy - A simple TCP routing proxy (Layer 7), built on Gevent and configurable in Python.

Other lists of Python tools

Awesome-python

Pycrumbs

Python-github-projects

Python_reference

Pythonidae
