
Data Analysis Case Study || Used-Car Price, Brand, and Sales Analysis

Date: 2025-12-31 23:47:30 Source: compiled from the web Author: site editor Comments: 0


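01 Used-Car Project

Project overview

Learning objectives

Practice using a web crawler
Practice data analysis
Get an introduction to machine learning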

Data source

Used-car data scraped from Renrenche (renrenche.com)
The crawler is built with the Scrapy framework
Crawler start page: https://www.renrenche.com/tl/car.shtml

Requirements

Scrape the data
Price analysis
Sales volume and market share analysis
Price-range probability analysis
Data modeling

Environment versions

scrapy==2.11.0

pandas==2.1.1

numpy==1.26.0

matplotlib==3.8.0

scikit-learn==1.3.1

scipy==1.11.3

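Setting up the crawler environment

Install the Scrapy framework

pip install scrapy==2.11.0

Create the Scrapy project

scrapy startproject ershouche

Create the spider file

cd ershouche  # enter the project root directory
scrapy genspider car www.renrenche.com  # create the spider file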

Page analysis

Data to collect

Used-car brand (brand)
Used-car price (price)
Used-car tags (tag)
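The three fields above can be thought of as one record per listing. The sketch below groups them in a plain Python dataclass; the name CarItem and the sample values are assumptions made for illustration and do not come from the original project, which yields plain dicts.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class CarItem:
    """Hypothetical container for one scraped listing."""
    brand: str                                    # used-car brand
    price: float                                  # listed price
    tag: list[str] = field(default_factory=list)  # listing tags


# Sample values are made up for illustration
item = CarItem(brand="Volkswagen", price=8.5, tag=["low-mileage", "one-owner"])

# asdict() turns the record into a plain dict, the shape a spider would yield
record = asdict(item)
print(record)
```

Declaring the fields in one place makes it harder for the spider and the pipeline to drift apart on key names such as tag versus tags.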

Data scraping: writing the spider

Single-page crawl

import scrapy
from lxml import etree


class CarSpider(scrapy.Spider):
    name = "car"
    allowed_domains = ["www.renrenche.com"]
    # start_urls must be a list, not a string
    start_urls = ["https://www.renrenche.com/bj/ershouche/"]
    # Throttle requests through Scrapy instead of calling time.sleep(),
    # which would block the whole crawler
    custom_settings = {"DOWNLOAD_DELAY": 5}

    # The method must be named start_requests, not start_request
    def start_requests(self):
        # This is a plain GET request, so scrapy.Request is enough;
        # FormRequest is typically used for POST requests
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):
        html = etree.HTML(response.text)
        data = {}
        # Brand names found on the page (deduplicated)
        brand = html.xpath("//div[@class='info--desc']/h2/span/font/text()")
        data['brand'] = set(brand)
        # Tags: split each text node on whitespace
        tags = html.xpath("//div[@class='info--desc']/h2/span/text()")
        data['tag'] = [tag.split() for tag in tags]
        # Prices found on the page (deduplicated)
        price = html.xpath("//div[@class='info--price']/b/text()")
        data['price'] = set(price)
        yield data

Multi-page crawl

Rewrite the car.py file

import scrapy
from lxml import etree


class CarSpider(scrapy.Spider):
    name = "car"
    allowed_domains = ["www.renrenche.com"]
    # start_urls must be a list, not a string
    start_urls = ["https://www.renrenche.com/bj/ershouche/"]
    # Throttle requests through Scrapy instead of calling time.sleep(),
    # which would block the whole crawler
    custom_settings = {"DOWNLOAD_DELAY": 5}

    # The method must be named start_requests, not start_request
    def start_requests(self):
        # Request 50 listing pages by appending a page suffix (pn0 ... pn49)
        for i in range(50):
            yield scrapy.Request(url=self.start_urls[0] + f'pn{i}',
                                 callback=self.parse)

    def parse(self, response):
        html = etree.HTML(response.text)
        data = {}
        brand = html.xpath("//div[@class='info--desc']/h2/span/font/text()")
        if brand:
            data['brand'] = set(brand)
        tags = html.xpath("//div[@class='info--desc']/h2/span/text()")
        if tags:
            data['tag'] = [tag.split() for tag in tags]
        price = html.xpath("//div[@class='info--price']/b/text()")
        if price:
            data['price'] = set(price)
        # Hand the collected fields to the item pipeline;
        # skip pages where nothing matched
        if data:
            yield data

Saving the data

Rewrite the pipeline file

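import csv


class ErshoucheCSVPipeline:
    def __init__(self):
        # Open the CSV file in append mode and create the writer
        self.file = open('../data/ershouche.csv', 'a', newline="", encoding='utf-8')
        self.writer = csv.writer(self.file)
        # Write the header row
        self.writer.writerow(['brand', 'tags', 'price'])

    # Save each item to the CSV file
    def process_item(self, item, spider):
        # Write one row per item, matching the three header columns;
        # .get() tolerates pages where a field was not found
        self.writer.writerow([item.get('brand'), item.get('tag'), item.get('price')])
        # Return the item so that any later pipeline can still process it
        return item

    def close_spider(self, spider):
        # Close the file
        self.file.close()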
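A pipeline class only runs once it is registered in the project's settings.py. The fragment below is a sketch that assumes the default layout produced by scrapy startproject ershouche, so the module path ershouche.pipelines is an assumption.

```python
# settings.py (fragment)
# Register the CSV pipeline. The integer (conventionally 0-1000) sets the
# order in which pipelines run, lower values first.
ITEM_PIPELINES = {
    "ershouche.pipelines.ErshoucheCSVPipeline": 300,
}
```

With this in place, running scrapy crawl car from the project root starts the spider and passes every yielded item through the pipeline.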

Reading the data

import pandas as pd
from matplotlib import pyplot as plt

# Configure matplotlib so that Chinese labels render correctly
plt.rcParams['font.family'] = 'Heiti TC'  # example: a Heiti font; replace with a font name installed on your system
# Render the minus sign correctly with a non-default font
plt.rcParams['axes.unicode_minus'] = False

# Load the data
df = pd.read_csv('./data/data.csv')
# Preview the data
df.head()

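Data cleaning

Check for missing values

# Data cleaning: handle missing values, outliers, and duplicates
# Check for missing values
df.info()

The output shows that the data contains no null values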
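The cleaning comment also names duplicates and outliers, which df.info() does not reveal. The following is a minimal sketch on a toy DataFrame: the column names follow the article, the values are made up, and the 1.5 × IQR rule is one common outlier convention chosen here for illustration, not something the original specifies.

```python
import pandas as pd

# Toy data standing in for the scraped listings (values are made up)
toy = pd.DataFrame({
    "brand": ["A", "A", "B", "B", "C", "C"],
    "price": [8.0, 8.0, 9.0, 10.0, 11.0, 200.0],
})

# Missing values per column
missing = toy.isnull().sum()

# Fully duplicated rows: count them, then drop them
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

# Keep only prices inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, q3 = deduped["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = deduped["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = deduped[in_range]

print(missing.to_dict(), n_duplicates, len(cleaned))
```

Whether an extreme price is an error or a genuinely expensive car is a judgment call, so it is worth inspecting the flagged rows before dropping them.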

Data analysis

Price analysis: the 10 brands with the highest average price

# Group by brand and compute the mean price
data_mean = df.groupby('brand')['price'].mean()
#%%
# Sort in descending order
data_mean.sort_values(ascending=False)
# Take the first ten rows
data_mean.sort_values(ascending=False).head(10)
# Visualize the result
data_mean.sort_values(ascending=False).head(10).plot(kind='bar')

Top-10 brand sales and sales share analysis

# Number of listings for the top-10 brands
# df['brand'].value_counts().sort_values(ascending=False).head(10)
amount_top = df['brand'].value_counts(sort=True).head(10)
#%%
# Bar chart of the top-10 brands
amount_top.plot(kind="bar", title='Top-10 brand sales and sales share analysis')
# Equivalent call using matplotlib directly:
# plt.bar(amount_top.index, amount_top.values)
#%%
# Pie chart of each brand's share
amount_top.plot(kind='pie', autopct='%.2f%%', title='Top-10 brand sales share')
# Equivalent call using matplotlib directly:
# plt.pie(amount_top, labels=amount_top.index, autopct='%.2f%%')
#%%
# Switch the plt style
# List the available styles
# plt.style.available
# Apply a style
# plt.style.use('seaborn-v0_8-notebook')
#%%
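Note that a pie chart drawn from the top-10 counts shows each brand's share within the top 10 only. If the share relative to all listings is wanted, value_counts(normalize=True) returns it directly. The sketch below uses a made-up brand column and keeps the top 3 instead of the top 10 to stay short.

```python
import pandas as pd

# Toy brand column (values are made up): A x4, B x3, C x2, D x1
brands = pd.Series(["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"], name="brand")

# Share of each brand among ALL listings, as a fraction between 0 and 1
share_all = brands.value_counts(normalize=True)

# Keep the top 3 and lump the remainder into "other" so the shares sum to 1
top = share_all.head(3)
other = 1.0 - top.sum()

print(top.round(3).to_dict(), round(other, 3))
```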

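Price-range probability density for a single brand

The probability density function (PDF) is a key concept in probability theory and the basis for describing the distribution of a continuous random variable.

A PDF describes how the probability of a random variable is spread over its possible values. The density at a single point is not itself a probability; the probability that the variable falls in an interval, which can be chosen freely, is the area under the density curve over that interval.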
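To make the difference between a density and a probability concrete, here is a minimal sketch using the standard library's NormalDist. The mean of 10 and standard deviation of 3 are hypothetical values, not figures from the scraped data.

```python
from statistics import NormalDist

# Hypothetical price distribution: mean 10, standard deviation 3
prices = NormalDist(mu=10, sigma=3)

# Density at a single point: a height of the curve, not a probability
density_at_mean = prices.pdf(10)

# Probability of the interval [7, 13]: the area under the density curve,
# computed as the difference of the cumulative distribution function
p_interval = prices.cdf(13) - prices.cdf(7)

print(f"density at the mean: {density_at_mean:.4f}")
print(f"P(7 <= X <= 13):     {p_interval:.4f}")
```

The interval spans one standard deviation on each side of the mean, so the probability comes out at roughly 68%, the familiar figure for a normal distribution.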

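# Filter the data (大众 = Volkswagen)
df_dazhong = df[df['brand'] == '大众']
df_dazhong.head(5)
#%%
# Compare the price histogram of Volkswagen cars with a normal density
from scipy.stats import norm
# norm.pdf(x values, mean, standard deviation)
# Number of price bins
num_bins = 20
# Draw the histogram
# density controls whether the histogram is normalized. It defaults to False;
# with density=True the vertical axis shows probability density, not sample counts
n, bins, patches = plt.hist(df_dazhong['price'], num_bins, density=True)
# Mean
dazhong_mean = df_dazhong['price'].mean()
# Standard deviation
dazhong_std = df_dazhong['price'].std()
y = norm.pdf(bins, dazhong_mean, dazhong_std)
# Overlay the density curve on the histogram
plt.plot(bins, y, 'r--')
# Title
plt.title('Price probability density of used Volkswagen cars')
# Axis labels
plt.xlabel('Volkswagen')
plt.ylabel('Probability density')
#%%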

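Feature engineering

Feature engineering is an important concept in machine learning. It refers to improving model performance by selecting, processing, transforming, and creating features (feature variables) when building a model.

Features are numeric representations of the attributes or variables in the data. They are what a machine learning model is trained on for prediction or classification tasks.

Tag data preprocessing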

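# Keep the rows that have tags
dataset = df[df['tags'].notnull()]
dataset.head(5)
#%%
# Collect every tag into one list; the tags of a row are separated by '_'
tag_list = []
dataset['tags'].apply(lambda x: tag_list.extend(x.split('_')))
# Inspect the result
tag_list
#%%
# Deduplicate the tag list
tag_list = list(set(tag_list))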

Tag feature processing

# Create an empty DataFrame with one column per tag
tag_df = pd.DataFrame(columns=tag_list)
tag_df.head(5)
#%%
# Concatenate the tag columns onto the original data
dt_df = pd.concat([dataset, tag_df])
dt_df.head(5)
#%%
# Fill the tag columns with 0
dt_df[tag_list] = dt_df[tag_list].fillna(0)
dt_df.head()
#%%
# Set the matching tag columns to 1 according to the row's tags value
def set_tag_status(values):
    # Tags of the current row
    tags = values['tags'].split('_')
    # Mark every tag of the row as present
    for tag in tags:
        values[tag] = 1
    return values
#%%
# Process each row in turn
new_dt_df = dt_df.apply(set_tag_status, axis=1)
new_dt_df = new_dt_df.drop('tags', axis=1)
new_dt_df.head(5)
#%%

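Brand one-hot encoding

One-hot encoding, also known as one-bit-effective encoding, is a technique that converts categorical data into a form that machine learning algorithms can work with.

Its main purpose is to represent categorical features (such as color, country, or gender) as binary vectors that a model can use.

Each category value is converted into a binary vector whose length equals the number of categories; exactly one element is set to 1 (the "hot" one) and all others are 0.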
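The idea can be shown without any library. The sketch below is a plain-Python illustration with made-up brand values; the helper name one_hot is hypothetical and this is not how pandas implements get_dummies.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of value."""
    return [1 if value == category else 0 for category in categories]


# Made-up brand values; sorting gives every category a fixed position
brands = ["BMW", "Audi", "Toyota", "Audi"]
categories = sorted(set(brands))  # ['Audi', 'BMW', 'Toyota']

encoded = [one_hot(brand, categories) for brand in brands]
for brand, vector in zip(brands, encoded):
    print(brand, vector)
```

Three distinct brands give vectors of length 3, and every vector sums to 1 because each car has exactly one brand.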

one_hot_brand = pd.get_dummies(new_dt_df['brand'], dtype=int)
#%%
# Merge the brand one-hot columns with the original data
all_colums_df = pd.merge(new_dt_df, one_hot_brand, left_index=True, right_index=True)
all_colums_df.head(5)
#%%
rs_df = all_colums_df.drop('brand', axis=1)
#%%
rs_df
#%%

Data modeling

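# Import the required modules
import numpy as np
# Dataset splitting
from sklearn.model_selection import train_test_split
# Model: gradient boosting regression trees
from sklearn.ensemble import GradientBoostingRegressor
# Evaluation metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
#%%
# Prepare the data
# Features: every column except the price, selected by name so that
# the label can never leak into the features
x = rs_df.drop('price', axis=1).values
# Label: the price
y = rs_df['price']
y
#%%
# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# x: sample features
# y: labels
# test_size: fraction of the data held out as the test set
# random_state: random seed, so that the split is the same on every run
#%%
# Build the model
# n_estimators sets the number of trees
gbdt = GradientBoostingRegressor(n_estimators=70)
#%%
# Train the model
gbdt.fit(X_train, Y_train)
# Predict on the test set
pred = gbdt.predict(X_test)
#%%
# Evaluate the model
# Mean squared error: smaller is better, meaning predictions are close to the true values
print("MSE:", mean_squared_error(Y_test, pred))
# Mean absolute error: smaller is better
print('MAE:', mean_absolute_error(Y_test, pred))
# Root mean squared error: smaller is better, expressed in the same unit as the price
print('RMSE:', np.sqrt(mean_squared_error(Y_test, pred)))
# R2 (coefficient of determination): larger is better, 1 is a perfect fit,
# 0 is no better than always predicting the mean, and it can be negative
print('R2', r2_score(Y_test, pred))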

Conclusion: the metrics above indicate that the model does not fit the data very well.
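Reading that conclusion off the output requires knowing what each metric measures. As a reference, the four metrics can be computed by hand as in the sketch below; the function name regression_metrics is hypothetical, the numbers are made up, and this is an illustration of the formulas, not scikit-learn's implementation.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE and R2 from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n    # mean squared error
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    rmse = sqrt(mse)                        # same unit as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                # 1 is perfect, 0 matches the mean
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "R2": r2}


# Made-up true prices and predictions
metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(metrics)
```

Because R2 compares the model against always predicting the mean price, a value near 0 means the features add little information, which is one way to read a "not very good" result.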

